Tutorial: Regression

Point predictions from conditional density estimation

Partition Trees estimate the conditional density \(p(y \mid x)\) as a piecewise-constant function. For regression, the point prediction is the posterior mean of that density — equivalent to the conditional expectation \(\mathbb{E}[y \mid x]\).
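To make the posterior-mean computation concrete, here is a minimal NumPy sketch (the bin edges and density values are invented for illustration, not taken from the library): for a piecewise-constant density, the conditional expectation is the probability-mass-weighted average of the bin midpoints.

```python
import numpy as np

# A toy piecewise-constant density on [0, 3) with three equal-width bins.
edges = np.array([0.0, 1.0, 2.0, 3.0])    # bin boundaries
density = np.array([0.2, 0.6, 0.2])       # constant density value per bin

widths = np.diff(edges)                   # bin widths
mass = density * widths                   # probability mass per bin
mass = mass / mass.sum()                  # normalize (guards against rounding)
midpoints = (edges[:-1] + edges[1:]) / 2  # E[y | y in bin] for a uniform bin

posterior_mean = np.sum(mass * midpoints)
print(posterior_mean)  # 1.5 for this symmetric density
```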

Note: Want full predictive distributions?

If you need prediction intervals, quantiles, or the full PDF/CDF, see the Probabilistic Regression tutorial, which uses the partition_tree.skpro interface.

Setup

from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error, r2_score
import numpy as np

X, y = fetch_california_housing(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

1. Baselines — Decision Tree & Random Forest

from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor

cart = DecisionTreeRegressor(random_state=42)
cart.fit(X_train, y_train)
y_pred_cart = cart.predict(X_test)

print("=== DecisionTreeRegressor ===")
print(f"MAE  : {mean_absolute_error(y_test, y_pred_cart):.4f}")
print(f"RMSE : {np.sqrt(((y_test - y_pred_cart) ** 2).mean()):.4f}")
print(f"R²   : {r2_score(y_test, y_pred_cart):.4f}")
=== DecisionTreeRegressor ===
MAE  : 0.4547
RMSE : 0.7037
R²   : 0.6221

rf = RandomForestRegressor(n_estimators=50, random_state=42)
rf.fit(X_train, y_train)
y_pred_rf = rf.predict(X_test)

print("=== RandomForestRegressor ===")
print(f"MAE  : {mean_absolute_error(y_test, y_pred_rf):.4f}")
print(f"RMSE : {np.sqrt(((y_test - y_pred_rf) ** 2).mean()):.4f}")
print(f"R²   : {r2_score(y_test, y_pred_rf):.4f}")
=== RandomForestRegressor ===
MAE  : 0.3303
RMSE : 0.5072
R²   : 0.8037

2. Single Partition Tree

from partition_tree.sklearn import PartitionTreeRegressor

reg = PartitionTreeRegressor(random_state=42)
reg.fit(X_train, y_train)
y_pred_reg = reg.predict(X_test)

print("=== PartitionTreeRegressor ===")
print(f"MAE  : {mean_absolute_error(y_test, y_pred_reg):.4f}")
print(f"RMSE : {np.sqrt(((y_test - y_pred_reg) ** 2).mean()):.4f}")
print(f"R²   : {r2_score(y_test, y_pred_reg):.4f}")
=== PartitionTreeRegressor ===
MAE  : 0.3876
RMSE : 0.5787
R²   : 0.7445

Comparison with Baselines

import pandas as pd

pd.DataFrame({
    "Model": ["DecisionTree (CART)", "RandomForest", "PartitionTree"],
    "MAE": [
        mean_absolute_error(y_test, y_pred_cart),
        mean_absolute_error(y_test, y_pred_rf),
        mean_absolute_error(y_test, y_pred_reg),
    ],
    "RMSE": [
        np.sqrt(((y_test - y_pred_cart) ** 2).mean()),
        np.sqrt(((y_test - y_pred_rf) ** 2).mean()),
        np.sqrt(((y_test - y_pred_reg) ** 2).mean()),
    ],
    "R²": [
        r2_score(y_test, y_pred_cart),
        r2_score(y_test, y_pred_rf),
        r2_score(y_test, y_pred_reg),
    ],
}).round(4)
Model MAE RMSE R²
0 DecisionTree (CART) 0.4547 0.7037 0.6221
1 RandomForest 0.3303 0.5072 0.8037
2 PartitionTree 0.3876 0.5787 0.7445

3. Partition Forest (Ensemble)

PartitionForestRegressor averages the conditional densities of multiple trees, then reports the posterior mean — similar in spirit to a Random Forest but built on the Partition Tree density framework.
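The mixture-then-mean idea can be sketched in plain NumPy. The toy per-tree densities below share one y-grid for simplicity; this is an illustration of the averaging step, not the library's internal representation.

```python
import numpy as np

# Two toy per-tree conditional densities on a shared y-grid over [0, 4).
edges = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
tree_densities = np.array([
    [0.4, 0.4, 0.1, 0.1],  # tree 1
    [0.1, 0.3, 0.3, 0.3],  # tree 2
])

# Mixture density: equal-weight average across trees.
mixture = tree_densities.mean(axis=0)

# Posterior mean of the mixture, as in the single-tree case.
widths = np.diff(edges)
mass = mixture * widths
mass = mass / mass.sum()
midpoints = (edges[:-1] + edges[1:]) / 2
forest_mean = np.sum(mass * midpoints)
print(forest_mean)  # 1.85 for these two toy densities
```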

from partition_tree.sklearn import PartitionForestRegressor

forest_reg = PartitionForestRegressor(
    n_estimators=50,
    random_state=42,
    min_volume_fraction=0.1,
    min_samples_xy=0,
)
forest_reg.fit(X_train, y_train)
y_pred_forest = forest_reg.predict(X_test)

print("=== PartitionForestRegressor ===")
print(f"MAE  : {mean_absolute_error(y_test, y_pred_forest):.4f}")
print(f"RMSE : {np.sqrt(((y_test - y_pred_forest) ** 2).mean()):.4f}")
print(f"R²   : {r2_score(y_test, y_pred_forest):.4f}")
=== PartitionForestRegressor ===
MAE  : 0.3384
RMSE : 0.5078
R²   : 0.8032

Full Comparison

pd.DataFrame({
    "Model": ["DecisionTree (CART)", "RandomForest", "PartitionTree", "PartitionForest"],
    "MAE": [
        mean_absolute_error(y_test, y_pred_cart),
        mean_absolute_error(y_test, y_pred_rf),
        mean_absolute_error(y_test, y_pred_reg),
        mean_absolute_error(y_test, y_pred_forest),
    ],
    "RMSE": [
        np.sqrt(((y_test - y_pred_cart) ** 2).mean()),
        np.sqrt(((y_test - y_pred_rf) ** 2).mean()),
        np.sqrt(((y_test - y_pred_reg) ** 2).mean()),
        np.sqrt(((y_test - y_pred_forest) ** 2).mean()),
    ],
    "R²": [
        r2_score(y_test, y_pred_cart),
        r2_score(y_test, y_pred_rf),
        r2_score(y_test, y_pred_reg),
        r2_score(y_test, y_pred_forest),
    ],
}).round(4)
Model MAE RMSE R²
0 DecisionTree (CART) 0.4547 0.7037 0.6221
1 RandomForest 0.3303 0.5072 0.8037
2 PartitionTree 0.3876 0.5787 0.7445
3 PartitionForest 0.3384 0.5078 0.8032

4. Cross-Validation

from sklearn.model_selection import cross_val_score

scores = cross_val_score(
    PartitionTreeRegressor(random_state=42),
    X,
    y,
    cv=5,
    scoring="r2",
)
print(f"CV R²: {scores.mean():.4f} ± {scores.std():.4f}")
CV R²: 0.5384 ± 0.1111

5. Key Hyperparameters

Parameter                     Default  Description
max_leaves                    101      Maximum number of leaves
max_depth                     5        Maximum tree depth
min_samples_split             2.0      Min samples needed to attempt a split
min_gain                      0.0      Min gain required to accept a split
min_volume_fraction           0.0      Min fraction of root \(Y\)-volume for a leaf
boundaries_expansion_factor   0.1      Padding for the outcome bounding box
n_estimators                  100      Number of trees (forest only)
max_samples                   0.8      Bootstrap fraction (forest only)
max_features                  0.8      Feature subsampling fraction (forest only)
random_state                  42       Random seed (forest only)
Tip

For regression, increasing max_leaves and max_depth improves fit but risks overfitting. Use cross-validation (e.g., cross_val_score with scoring="r2") to find the right balance.
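One way to run that search is GridSearchCV. The sketch below uses scikit-learn's DecisionTreeRegressor and synthetic data so it runs standalone; for PartitionTreeRegressor the same pattern applies, with max_leaves and max_depth in param_grid instead.

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor

# Toy data so this snippet is self-contained.
X_demo, y_demo = make_regression(n_samples=300, n_features=4, noise=10.0, random_state=0)

# For PartitionTreeRegressor, the grid would hold max_leaves / max_depth instead.
grid = GridSearchCV(
    DecisionTreeRegressor(random_state=0),
    param_grid={"max_depth": [2, 4, 6, 8]},
    scoring="r2",
    cv=5,
)
grid.fit(X_demo, y_demo)
print(grid.best_params_, round(grid.best_score_, 4))
```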

Tip

min_samples_split controls the minimum number of samples in a node before the tree even considers splitting it. Setting it to a value like 5.0 or 10.0 is a simple but effective regularizer for noisy datasets.
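scikit-learn's trees expose the same knob, so the regularizing effect is easy to see on a noisy toy dataset (DecisionTreeRegressor is used here only as a stand-in):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X_noisy = rng.uniform(-1.0, 1.0, size=(200, 1))
y_noisy = np.sin(3.0 * X_noisy[:, 0]) + rng.normal(0.0, 0.5, size=200)

# Default min_samples_split=2 grows the tree until leaves are (nearly) pure.
loose = DecisionTreeRegressor(random_state=0).fit(X_noisy, y_noisy)
# Requiring 10 samples per split prunes away many noise-chasing splits.
tight = DecisionTreeRegressor(min_samples_split=10, random_state=0).fit(X_noisy, y_noisy)

print(loose.tree_.node_count, tight.tree_.node_count)  # the regularized tree is smaller
```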

6. Input Formats

The estimators accept:

  • NumPy arrays — standard (n_samples, n_features) float arrays.
  • Pandas DataFrames — column names are preserved.
  • Multi-output — pass a 2-D y array or DataFrame with multiple columns.
  • Missing values — NaN values are supported (allow_nan = True).
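For illustration, here is what those inputs can look like; the column names and values below are invented, and no model is fit.

```python
import numpy as np
import pandas as pd

# A feature table with one missing value (allowed since allow_nan = True).
X_df = pd.DataFrame({
    "feat_a": [8.3, 7.2, np.nan],
    "feat_b": [41.0, 21.0, 52.0],
})

# A multi-output target: a 2-D array (a multi-column DataFrame works the same way).
y_multi = np.array([[4.5, 0.9], [3.6, 0.7], [3.5, 0.8]])

print(X_df.shape, y_multi.shape)  # (3, 2) (3, 2)
```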