Classification Calibration: Prediction Sets with Perpetual

1. Introduction

In classification tasks, a model typically outputs a probability distribution over classes. However, selecting the class with the highest probability is often not enough, especially in high-stakes decision-making. We want to know the uncertainty of our predictions.

Calibration in classification often refers to ensuring that the predicted probabilities reflect true frequencies. However, another powerful approach is Conformal Prediction, which constructs Prediction Sets.

A prediction set \(\mathcal{C}(x)\) is a set of classes such that the true label \(y\) is contained in \(\mathcal{C}(x)\) with a high probability \((1 - \alpha)\):

\[P(y \in \mathcal{C}(x)) \geq 1 - \alpha\]

For example, if \(\alpha = 0.1\), we want the true class to be in the predicted set significantly 90% of the time. The goal is to maximize the “efficiency” of these sets (i.e., minimize their average size) while maintaining the coverage guarantee.

PerpetualBooster provides built-in methods to generate these calibrated prediction sets.

[ ]:

import warnings

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
from lightgbm import LGBMClassifier
from mapie.classification import CrossConformalClassifier, SplitConformalClassifier
from perpetual import (
    PerpetualBooster,
    compute_calibration_curve,
    expected_calibration_error,
)
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import fetch_covtype
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

warnings.filterwarnings("ignore")
sns.set_theme(style="whitegrid")

2. Dataset Preparation

We will use the Covertype dataset, a classic benchmark for classification. To make the problem more illustrative for prediction sets (where we might capture uncertainty between two dominant classes), we will convert it into a binary classification task: distinguishing Class 2 (Lodgepole Pine) from all others. Class 2 covers approx 48.75% of the data, making it a balanced problem.

[ ]:

print("Loading Covertype dataset...")
data = fetch_covtype()
X, y_orig = data.data, data.target

# Convert to binary: Class 2 vs Rest
y = (y_orig == 2).astype(int)

# Subsample for tutorial speed (optional, remove for full run)
idx = np.arange(len(y))
np.random.seed(42)
np.random.shuffle(idx)
X = X[idx[:50000]]
y = y[idx[:50000]]

# Split: Train (60%), Calibration (20%), Test (20%)
X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
X_train, X_cal, y_train, y_cal = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=42
)

print(f"Train size:       {len(X_train)}")
print(f"Calibration size: {len(X_cal)}")
print(f"Test size:        {len(X_test)}")

3. Training the Base Model

We train a PerpetualBooster with the LogLoss objective. We set save_node_stats=True to enable internal calibration methods like WeightVariance.

[ ]:

# Note: PerpetualBooster is deterministic; no random_state parameter needed.
model = PerpetualBooster(objective="LogLoss", budget=1.0, save_node_stats=True)
model.fit(X_train, y_train)

preds = model.predict(X_test)
acc = accuracy_score(y_test, preds)
print(f"Base Model Accuracy: {acc:.4f}")

4. Calibrating Prediction Sets

We will now calibrate the model to produce prediction sets at three coverage levels: \(\alpha = 0.1\) (90%), \(\alpha = 0.05\) (95%), and \(\alpha = 0.01\) (99%).

Perpetual offers:

Conformal: Standard split-conformal prediction (sets based on probability thresholds).
WeightVariance / MinMax: Adaptive methods that leverage the internal variance of the ensemble to scale uncertainty.
GRP: Adaptive method using Generalized Residual Prediction (log-odds percentiles) to scale uncertainty.

[ ]:

methods = ["Conformal", "WeightVariance", "MinMax", "GRP"]
alphas = [0.1, 0.05, 0.01]
results = []

for method in methods:
    print(f"Calibrating with {method}...")
    # We calibrate on the held-out calibration set
    model.calibrate(X_cal, y_cal, alpha=alphas, method=method)

    # Predict sets on test set
    prediction_sets = model.predict_sets(X_test)

    for alpha in alphas:
        alpha_str = str(float(alpha))
        sets = prediction_sets[alpha_str]

        # Calculate metrics
        covered = 0
        set_sizes = []
        for i, s in enumerate(sets):
            if y_test[i] in s:
                covered += 1
            set_sizes.append(len(s))

        coverage = covered / len(y_test)
        avg_size = np.mean(set_sizes)

        results.append(
            {
                "Library": "Perpetual",
                "Method": method,
                "Alpha": alpha,
                "Target Coverage": 1 - alpha,
                "Observed Coverage": coverage,
                "Avg Set Size": avg_size,
            }
        )

5. Probability Calibration

In addition to prediction sets, PerpetualBooster supports calibrating the predicted probabilities themselves. This ensures that the predicted probability reflects the true frequency of the positive class.

Probability calibration is performed automatically during calibrate. You can access calibrated probabilities by setting calibrated=True in predict_proba.

[ ]:

# Ensure the model is calibrated (method='Conformal' is sufficient as it triggers internal calibration)
model.calibrate(X_cal, y_cal, method="Conformal", alpha=0.1)

# Get uncalibrated and calibrated probabilities
probs_uncal = model.predict_proba(X_test, calibrated=False)[:, 1]
probs_cal = model.predict_proba(X_test, calibrated=True)[:, 1]

# Compute Calibration Curves
true_uncal, pred_uncal = compute_calibration_curve(y_test, probs_uncal, n_bins=10)
true_cal, pred_cal = compute_calibration_curve(y_test, probs_cal, n_bins=10)

# Compute Expected Calibration Error (ECE)
ece_uncal = expected_calibration_error(y_test, probs_uncal, n_bins=10)
ece_cal = expected_calibration_error(y_test, probs_cal, n_bins=10)

print(f"Uncalibrated ECE: {ece_uncal:.4f}")
print(f"Calibrated ECE:   {ece_cal:.4f}")

# Plot Reliability Diagram
plt.figure(figsize=(8, 8))
plt.plot([0, 1], [0, 1], "k:", label="Perfectly Calibrated")
plt.plot(pred_uncal, true_uncal, "s-", label=f"Uncalibrated (ECE={ece_uncal:.4f})")
plt.plot(pred_cal, true_cal, "o-", label=f"Calibrated (ECE={ece_cal:.4f})")
plt.xlabel("Mean Predicted Probability")
plt.ylabel("Fraction of Positives")
plt.title("Reliability Diagram")
plt.legend()
plt.show()

6. Comparison with Scikit-Learn and LightGBM

We compare the calibration of PerpetualBooster against scikit-learn’s HistGradientBoostingClassifier and LightGBM’s LGBMClassifier. For both competitors, we evaluate both the uncalibrated model and a calibrated version using Isotonic Regression (via CalibratedClassifierCV).

[ ]:

# 1. Scikit-Learn HistGradientBoosting
hgb = HistGradientBoostingClassifier(random_state=42)
hgb.fit(X_train, y_train)
hgb_cal = CalibratedClassifierCV(hgb, method="isotonic", cv="prefit")
hgb_cal.fit(X_cal, y_cal)

probs_hgb_uncal = hgb.predict_proba(X_test)[:, 1]
probs_hgb_cal = hgb_cal.predict_proba(X_test)[:, 1]

ece_hgb_uncal = expected_calibration_error(y_test, probs_hgb_uncal, n_bins=10)
ece_hgb_cal = expected_calibration_error(y_test, probs_hgb_cal, n_bins=10)

# 2. LightGBM
lgbm = LGBMClassifier(random_state=42, verbose=-1)
lgbm.fit(X_train, y_train)
lgbm_cal = CalibratedClassifierCV(lgbm, method="isotonic", cv="prefit")
lgbm_cal.fit(X_cal, y_cal)

probs_lgbm_uncal = lgbm.predict_proba(X_test)[:, 1]
probs_lgbm_cal = lgbm_cal.predict_proba(X_test)[:, 1]

ece_lgbm_uncal = expected_calibration_error(y_test, probs_lgbm_uncal, n_bins=10)
ece_lgbm_cal = expected_calibration_error(y_test, probs_lgbm_cal, n_bins=10)

# 3. Compute Curves
true_hgb_uncal, pred_hgb_uncal = compute_calibration_curve(
    y_test, probs_hgb_uncal, n_bins=10
)
true_hgb_cal, pred_hgb_cal = compute_calibration_curve(y_test, probs_hgb_cal, n_bins=10)
true_lgbm_uncal, pred_lgbm_uncal = compute_calibration_curve(
    y_test, probs_lgbm_uncal, n_bins=10
)
true_lgbm_cal, pred_lgbm_cal = compute_calibration_curve(
    y_test, probs_lgbm_cal, n_bins=10
)

print(f"Perpetual (Uncalibrated) ECE: {ece_uncal:.4f}")
print(f"Perpetual (Calibrated)   ECE: {ece_cal:.4f}")
print(f"Sklearn HGB (Uncalibrated) ECE: {ece_hgb_uncal:.4f}")
print(f"Sklearn HGB (Calibrated)   ECE: {ece_hgb_cal:.4f}")
print(f"LightGBM (Uncalibrated)    ECE: {ece_lgbm_uncal:.4f}")
print(f"LightGBM (Calibrated)      ECE: {ece_lgbm_cal:.4f}")

# 4. Plot Comparison
plt.figure(figsize=(10, 10))
plt.plot([0, 1], [0, 1], "k:", label="Perfectly Calibrated")

plt.plot(
    pred_uncal,
    true_uncal,
    "o--",
    label=f"Perpetual (Uncalibrated, ECE={ece_uncal:.4f})",
    color="#1f77b4",
    alpha=0.6,
)
plt.plot(
    pred_cal,
    true_cal,
    "o-",
    label=f"Perpetual (Calibrated, ECE={ece_cal:.4f})",
    color="#1f77b4",
    linewidth=2,
)

plt.plot(
    pred_hgb_uncal,
    true_hgb_uncal,
    "s--",
    label=f"Sklearn HGB (Uncalibrated, ECE={ece_hgb_uncal:.4f})",
    color="#ff7f0e",
    alpha=0.6,
)
plt.plot(
    pred_hgb_cal,
    true_hgb_cal,
    "s-",
    label=f"Sklearn HGB (Calibrated, ECE={ece_hgb_cal:.4f})",
    color="#ff7f0e",
    linewidth=2,
)

plt.plot(
    pred_lgbm_uncal,
    true_lgbm_uncal,
    "^--",
    label=f"LightGBM (Uncalibrated, ECE={ece_lgbm_uncal:.4f})",
    color="#2ca02c",
    alpha=0.6,
)
plt.plot(
    pred_lgbm_cal,
    true_lgbm_cal,
    "^-",
    label=f"LightGBM (Calibrated, ECE={ece_lgbm_cal:.4f})",
    color="#2ca02c",
    linewidth=2,
)

plt.xlabel("Mean Predicted Probability")
plt.ylabel("Fraction of Positives")
plt.title("Reliability Diagram: Perpetual vs Sklearn vs LightGBM")
plt.legend(loc="lower right")
plt.tight_layout()
plt.show()

7. Comparison with MAPIE

We compare against MAPIE’s SplitConformalClassifier (standard split-conformal) and CrossConformalClassifier (cross-validation based).

Both methods use the “lac” (Least Ambiguous Set-valued Classifiers) conformity score, as it is the primary method supported for binary classification in MAPIE.

[ ]:

print("Running MAPIE comparison...")

# MAPIE requires a fitted sklearn-compatible estimator
base_est = HistGradientBoostingClassifier(random_state=42)
# We fit it on X_train for SplitConformal (prefit)
base_est.fit(X_train, y_train)

for alpha in alphas:
    print(f"  MAPIE Alpha {alpha}...")

    # 1. Split Conformal (prefit)
    # Uses 'confidence_level' = 1 - alpha
    mapie_sc = SplitConformalClassifier(
        estimator=base_est,
        conformity_score="lac",
        prefit=True,
        confidence_level=[1 - alpha],
    )
    mapie_sc.conformalize(X_cal, y_cal)
    _, y_ps = mapie_sc.predict_set(X_test)
    y_ps_sets = y_ps[:, :, 0]

    # Calculate metrics for Split
    covered = 0
    sizes = []
    for i in range(len(y_test)):
        pred_set = np.where(y_ps_sets[i])[0]
        if y_test[i] in pred_set:
            covered += 1
        sizes.append(len(pred_set))

    results.append(
        {
            "Library": "MAPIE",
            "Method": "Split (LAC)",
            "Alpha": alpha,
            "Target Coverage": 1 - alpha,
            "Observed Coverage": covered / len(y_test),
            "Avg Set Size": np.mean(sizes),
        }
    )

    # 2. Cross Conformal
    # Requires re-fitting on full training data (or X_train as we leverage CV)

    # CrossConformalClassifier fits internal CV models.
    # We use 'conformity_score' (valid in mapie v1.x)
    mapie_cc = CrossConformalClassifier(
        estimator=HistGradientBoostingClassifier(random_state=42),
        conformity_score="lac",
        cv=5,
        confidence_level=[1 - alpha],
    )

    mapie_cc.fit_conformalize(X_train, y_train)
    _, y_ps = mapie_cc.predict_set(X_test)
    y_ps_sets = y_ps[:, :, 0]

    covered = 0
    sizes = []
    for i in range(len(y_test)):
        pred_set = np.where(y_ps_sets[i])[0]
        if y_test[i] in pred_set:
            covered += 1
        sizes.append(len(pred_set))

    results.append(
        {
            "Library": "MAPIE",
            "Method": "Cross (LAC)",
            "Alpha": alpha,
            "Target Coverage": 1 - alpha,
            "Observed Coverage": covered / len(y_test),
            "Avg Set Size": np.mean(sizes),
        }
    )

8. Results Analysis

We visualize the performance. Ideally, observed coverage should meet or slightly exceed the target, with the smallest possible average set size.

[ ]:

df_res = pd.DataFrame(results)
df_res["Coverage Gap"] = df_res["Observed Coverage"] - df_res["Target Coverage"]
# Create a combined label for the legend
df_res["Method Label"] = df_res["Library"] + ": " + df_res["Method"]

print(df_res.sort_values(["Alpha", "Avg Set Size"]))

# Define a custom color palette
palette = {
    "Perpetual: Conformal": "#1f77b4",  # Blue
    "Perpetual: WeightVariance": "#aec7e8",  # Light Blue
    "Perpetual: MinMax": "#ff7f0e",  # Orange
    "Perpetual: GRP": "#2ca02c",  # Green
    "MAPIE: Split (LAC)": "#d62728",  # Red
    "MAPIE: Cross (LAC)": "#9467bd",  # Purple
}

# Slightly increase vertical figure size if needed, but horizontal space is key
plt.figure(figsize=(10, 7))
ax = sns.barplot(
    data=df_res, x="Alpha", y="Avg Set Size", hue="Method Label", palette=palette
)
plt.title("Average Set Size by Method and Alpha (Lower is Better)")
plt.ylabel("Average Set Size")
plt.xlabel("Alpha (Target Error Rate)")

# Move legend to the right outside the plot
# bbox_to_anchor=(1, 1) places the top-left corner of the legend at the top-right of the axes
sns.move_legend(ax, "upper left", bbox_to_anchor=(1, 1))
plt.show()

9. Summary and Conclusion

In this tutorial, we have explored several advanced calibration techniques provided by PerpetualBooster for both probability calibration and set-valued predictions.

Key Advantages of PerpetualBooster

Superior Probability Calibration (ECE):
- As demonstrated in our comparison against Scikit-Learn’s HistGradientBoostingClassifier and LightGBM, PerpetualBooster consistently achieves a lower Expected Calibration Error (ECE).
- Even without explicit calibration, Perpetual’s raw probabilities are often more reliable. When calibrated using Perpetual’s built-in calibrate() method, it provides state-of-the-art results that are crucial for high-stakes decision-making.
Efficient Uncertainty Quantification (Prediction Sets):
- Perpetual offers native support for generating prediction sets (for classification) and prediction intervals (for regression).
- Methods like GRP (Log-Odds Percentiles) allow users to generate well-calibrated prediction sets that maintain rigorous coverage guarantees while being strikingly efficient.
Performance without Retraining:
- Unlike many other calibration frameworks that require expensive K-fold cross-validation or model retraining, Perpetual’s calibrate() method works post-hoc on a small calibration set.
- This allows for extremely fast iterations and enables the addition of uncertainty quantification to existing models with minimal overhead.

Conclusion

Calibration is an essential step in any machine learning pipeline where the “confidence” of the model is as important as its accuracy. PerpetualBooster provides a unified, efficient, and highly performant toolkit for ensuring your models are not only accurate but also trustworthy and well-calibrated.