Mastering Regression Calibration: From Theoretical Basics to Advanced Methods

1. Introduction to Calibration

In regression, we often care about more than a single point estimate (typically the conditional mean): we also want to quantify the uncertainty associated with that prediction. Calibration is the process of ensuring that our uncertainty estimates (such as prediction intervals) are “honest”, i.e., that the stated confidence level matches the frequency actually observed.

What is a Prediction Interval?

A prediction interval for a given target \(y\) at a confidence level \((1 - \alpha)\) is an interval \([L, U]\) such that:

\[P(y \in [L, U]) \geq 1 - \alpha\]

If we report a 90% prediction interval (\(\alpha = 0.1\)), we expect the true value to fall within the bounds in 90% of future cases.
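For intuition, empirical coverage is just the fraction of targets that fall inside their intervals. A minimal sketch with synthetic standard-normal data, using the known central 90% interval (±1.645) as a stand-in for model-produced bounds:

```python
import numpy as np

rng = np.random.default_rng(0)
y = rng.normal(size=10_000)  # synthetic "true" targets

# Stand-in interval: the central 90% of a standard normal
lower, upper = -1.645, 1.645

# Empirical coverage: fraction of targets inside [lower, upper]
coverage = np.mean((y >= lower) & (y <= upper))
print(f"Empirical coverage: {coverage:.3f}")
```

The same check is applied to every calibration method later in this notebook.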

The Need for Calibration

Most machine learning models (Boosted Trees, Neural Networks, etc.) are optimized to minimize a loss function (like MSE) but do not inherently produce calibrated probability distributions or intervals. Calibration methods take a pre-trained model and “fine-tune” its uncertainty estimates using a held-out Calibration Set.

In this tutorial, we will explore two families of calibration methods:

  1. Perpetual-native methods: Leverage internal statistics from the PerpetualBooster (leaf fold weights).

  2. MAPIE methods: Model-agnostic conformal prediction methods that work with any regressor.

[ ]:
import warnings

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
from mapie.regression import (
    ConformalizedQuantileRegressor,
    CrossConformalRegressor,
    SplitConformalRegressor,
)
from perpetual import PerpetualBooster
from sklearn.datasets import fetch_california_housing
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

warnings.filterwarnings("ignore")
sns.set_theme(style="whitegrid")
print("Libraries imported successfully.")

2. Dataset Preparation

We use the California Housing dataset, split into three disjoint sets:

  • Train Set: Used to train the base model.

  • Calibration Set: Used to learn the residuals or interval scaling.

  • Test Set: Used for final evaluation of coverage and width.

[ ]:
# Load data
data = fetch_california_housing(as_frame=True)
X, y = data.data, data.target

# Split into Train (60%), Calibration (20%), and Test (20%)
X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
X_train, X_cal, y_train, y_cal = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=42
)

print(f"Train size:       {len(X_train)}")
print(f"Calibration size: {len(X_cal)}")
print(f"Test size:        {len(X_test)}")

3. Training the Base Model

We train a PerpetualBooster on the training set.

[ ]:
model = PerpetualBooster(objective="SquaredLoss", budget=1.0)
model.fit(X_train, y_train)

test_preds = model.predict(X_test)
rmse = np.sqrt(mean_squared_error(y_test, test_preds))
print(f"Base Model RMSE: {rmse:.4f}")

4. Perpetual Calibration Methods

Perpetual provides four algorithms to derive prediction intervals from the learned trees.

4.1 Theory

  • Conformal: A traditional split-conformal approach where absolute residuals on the calibration set determine a quantile correction.

  • MinMax: Each leaf stores the min/max weight across folds. The interval is inferred from the sum of these ranges across trees.

  • GRP (Global Relative Position): A more refined version of MinMax using the relative position of the target within weight spans.

  • Weight Variance: Estimates uncertainty based on the standard deviation of fold weights in reached leaves.
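The split-conformal recipe behind the Conformal option can be sketched generically (plain NumPy, not Perpetual’s internal implementation): compute absolute residuals on the calibration set, take a finite-sample-corrected quantile, and pad the point predictions symmetrically.

```python
import numpy as np

def split_conformal_interval(preds_cal, y_cal, preds_test, alpha=0.1):
    """Symmetric split-conformal intervals from calibration residuals."""
    residuals = np.abs(np.asarray(y_cal) - np.asarray(preds_cal))
    n = len(residuals)
    # Finite-sample correction: use ceil((n+1)(1-alpha))/n instead of (1-alpha)
    q_level = min(1.0, np.ceil((n + 1) * (1 - alpha)) / n)
    q_hat = np.quantile(residuals, q_level)
    return preds_test - q_hat, preds_test + q_hat

# Toy check: a model predicting 0 for targets drawn from N(0, 1)
rng = np.random.default_rng(42)
y_cal_toy = rng.normal(size=500)
y_test_toy = rng.normal(size=5000)
lo, hi = split_conformal_interval(np.zeros(500), y_cal_toy, np.zeros(5000))
print(f"Coverage on toy data: {np.mean((y_test_toy >= lo) & (y_test_toy <= hi)):.3f}")
```

Because a single residual quantile is added to every prediction, this variant yields constant-width intervals.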

[ ]:
methods = ["Conformal", "MinMax", "GRP", "WeightVariance"]
alpha = 0.1  # Goal: 90% coverage
results = []
method_data = {}  # To store intervals for viz

for method in methods:
    print(f"Running Perpetual {method}...")
    m = PerpetualBooster(objective="SquaredLoss", budget=0.5, save_node_stats=True)
    m.fit(X_train, y_train)

    if method == "Conformal":
        m.calibrate_conformal(X_train, y_train, X_cal, y_cal, alpha=[alpha])
    else:
        m.calibrate(X_cal, y_cal, alpha=[alpha], method=method)

    ints = m.predict_intervals(X_test)
    lower = ints[str(alpha)][:, 0]
    upper = ints[str(alpha)][:, 1]

    coverage = np.mean((y_test >= lower) & (y_test <= upper))
    width = np.mean(upper - lower)
    results.append(
        {
            "Library": "Perpetual",
            "Method": method,
            "Coverage": coverage,
            "Avg Width": width,
        }
    )
    method_data[f"Perp_{method}"] = (lower, upper)

5. MAPIE Calibration Methods

MAPIE implements variants of conformal prediction.

5.1 Overview

  • Split: Pads predictions by a quantile of residuals on a held-out set; interval width is constant across samples.

  • Jackknife / CV: Resampling methods providing tighter bounds at the cost of training time.

  • CQR (Conformalized Quantile Regression): Conformalizes a quantile regression model to produce adaptive interval widths.
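The CQR adjustment step can be sketched generically (plain NumPy; the raw quantile predictions would come from any quantile regressor, here hypothetical constants): conformity scores measure how far each calibration target lands outside its raw quantile band, and their quantile widens (or shrinks) the band to restore the target coverage.

```python
import numpy as np

def cqr_adjust(q_lo_cal, q_hi_cal, y_cal, q_lo_test, q_hi_test, alpha=0.1):
    """Conformalize raw quantile predictions, CQR-style."""
    # Score > 0 when y falls outside [q_lo, q_hi]; < 0 when safely inside
    scores = np.maximum(q_lo_cal - y_cal, y_cal - q_hi_cal)
    n = len(scores)
    q_level = min(1.0, np.ceil((n + 1) * (1 - alpha)) / n)
    margin = np.quantile(scores, q_level)
    return q_lo_test - margin, q_hi_test + margin

# Toy check: deliberately too-narrow raw quantiles for N(0, 1) targets
rng = np.random.default_rng(7)
y_cal_toy = rng.normal(size=1000)
lo, hi = cqr_adjust(
    np.full(1000, -1.2), np.full(1000, 1.2), y_cal_toy,
    np.full(5000, -1.2), np.full(5000, 1.2),
)
y_test_toy = rng.normal(size=5000)
print(f"Coverage after adjustment: {np.mean((y_test_toy >= lo) & (y_test_toy <= hi)):.3f}")
```

With per-sample quantile predictions (instead of the constants above), the adjusted band inherits the regressor’s adaptivity, which is what makes CQR widths vary across inputs.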

[ ]:
from sklearn.ensemble import GradientBoostingRegressor, HistGradientBoostingRegressor

base_est = GradientBoostingRegressor(random_state=42)
target_confidence = 1 - alpha

mapie_configs = [
    ("Split", "split", None),
    ("Jackknife+", "plus", 20),  # Use high-k CV instead of LOO for speed
    ("Jackknife-minmax", "minmax", 20),  # on large datasets (12k samples)
    ("CV+", "plus", 5),
    ("CV-minmax", "minmax", 5),
]

for name, method, cv_strategy in mapie_configs:
    print(f"Running MAPIE {name}...")
    if method == "split":
        base_est.fit(X_train, y_train)
        # In 1.3.0, SplitConformalRegressor takes confidence_level and prefit=True
        mapie = SplitConformalRegressor(
            base_est, confidence_level=target_confidence, prefit=True
        )
        mapie.conformalize(X_cal, y_cal)
    else:
        # In 1.3.0, CrossConformalRegressor uses fit_conformalize
        mapie = CrossConformalRegressor(
            base_est, confidence_level=target_confidence, method=method, cv=cv_strategy
        )
        mapie.fit_conformalize(X_rest, y_rest)

    _, y_pis = mapie.predict_interval(X_test)
    # Shape is (n_samples, 2, n_confidence_levels)
    lower, upper = y_pis[:, 0, 0], y_pis[:, 1, 0]

    coverage = np.mean((y_test >= lower) & (y_test <= upper))
    width = np.mean(upper - lower)
    results.append(
        {"Library": "MAPIE", "Method": name, "Coverage": coverage, "Avg Width": width}
    )
    method_data[f"MAPIE_{name}"] = (lower, upper)

print("Running MAPIE CQR...")
cqr_est = HistGradientBoostingRegressor(loss="quantile", quantile=0.5, random_state=42)
mapie_cqr = ConformalizedQuantileRegressor(cqr_est, confidence_level=target_confidence)
mapie_cqr.fit(X_train, y_train).conformalize(X_cal, y_cal)
_, y_pis = mapie_cqr.predict_interval(X_test)
lower, upper = y_pis[:, 0, 0], y_pis[:, 1, 0]
coverage = np.mean((y_test >= lower) & (y_test <= upper))
width = np.mean(upper - lower)
results.append(
    {"Library": "MAPIE", "Method": "CQR", "Coverage": coverage, "Avg Width": width}
)
method_data["MAPIE_CQR"] = (lower, upper)

6. Comprehensive Result Comparison

We compare all methods side-by-side. The key metrics are:

  1. Coverage: Should be very close to 0.90.

  2. Avg Width: Smaller is better (tighter intervals at the same confidence level).

[ ]:
df_res = pd.DataFrame(results)
df_res["Coverage Error"] = np.abs(df_res["Coverage"] - (1 - alpha))
df_res.sort_values(by=["Library", "Avg Width"], ascending=True, inplace=True)
display(df_res)

7. Global Visualization

Let’s visualize the tradeoff between interval width and coverage.

[ ]:
plt.figure(figsize=(14, 7))
sns.scatterplot(
    data=df_res, x="Avg Width", y="Coverage", hue="Library", style="Method", s=200
)
plt.axhline(1 - alpha, color="red", linestyle="--", label="Target Coverage (90%)")
plt.title("Calibration Benchmarking: Coverage vs. Efficiency (Width)")
plt.ylabel("Observed Coverage")
plt.xlabel("Average Interval Width (lower is better)")
plt.grid(True)
plt.legend(bbox_to_anchor=(1.05, 1), loc="upper left")
plt.show()

8. Analyzing Interval Width Distributions (Adaptive vs. Fixed)

One of the hallmarks of a good calibration method is its ability to produce adaptive intervals.

  • Fixed-width methods: produce the same uncertainty regardless of the input sample.

  • Adaptive methods: identify regions of high uncertainty and widen the bounds locally.

[ ]:
width_data = []
for name, (low, high) in method_data.items():
    widths = high - low
    for w in widths:
        width_data.append(
            {
                "Method": name,
                "Width": w,
                "Library": "Perpetual" if "Perp" in name else "MAPIE",
            }
        )

df_widths = pd.DataFrame(width_data)

plt.figure(figsize=(15, 8))
sns.boxplot(data=df_widths, x="Method", y="Width", hue="Library")
plt.xticks(rotation=45)
plt.title("Distribution of Prediction Interval Widths Across Methods")
plt.ylabel("Interval Width")
plt.show()

9. Comparison of Calibration Approaches

Perpetual’s calibration methods offer distinct advantages depending on the use case:

1. Perpetual Conformal vs. MAPIE CQR

  • Perpetual Conformal is conceptually similar to MAPIE CQR (Conformalized Quantile Regression). Both aim to provide adaptive intervals that vary based on the predicted uncertainty.

  • Both methods typically require a calibration set to determine the final scaling or quantile adjustments necessary to guarantee the desired coverage level.

2. Advantages of Perpetual’s Internal Methods (MinMax, GRP, WeightVariance)

  • No Model Re-training: Unlike methods that might require training multiple models (like Jackknife/CV) or specific quantile regressors, Perpetual’s internal methods (MinMax, GRP, WeightVariance) leverage statistics collected during the initial training pass (with save_node_stats=True).

  • Efficiency: Because the necessary information (fold weights/ranges) is already part of the trained model structure, calibrating these methods is extremely fast, involving only lightweight calculations on the calibration set.

  • Separate Calibration Set: It is important to note that all methods, including Perpetual’s internal ones, still require a separate held-out calibration set to strictly validate coverage guarantees. While you save on training time/resources, you must reserve data for calibration to ensure valid prediction intervals.