Heterogeneous Treatment Effects with Meta-Learners

In many real-world applications, the treatment effect is not constant — it varies across subpopulations. Estimating these Heterogeneous Treatment Effects (HTE) is critical for personalized decision-making.
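Formally, the object of interest is the Conditional Average Treatment Effect (CATE). In standard potential-outcomes notation (a textbook definition, not specific to any library):

$$\tau(x) = \mathbb{E}\big[\,Y(1) - Y(0) \mid X = x\,\big],$$

where $Y(1)$ and $Y(0)$ are the potential outcomes with and without treatment and $X$ is the covariate vector. The meta-learners below all produce estimates $\hat{\tau}(x)$ by composing ordinary supervised models.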

This tutorial walks through a complete HTE estimation workflow using the Hillstrom E-Mail Marketing dataset:

  1. Data preparation — encoding, train/test split, and exploratory analysis.

  2. CATE estimation — comparing five meta-learners (S, T, X, DR, R-Learner).

  3. Feature importance — understanding which covariates drive treatment heterogeneity.

  4. Subgroup analysis — discovering who benefits most from treatment.

  5. Model selection — using AUUC and Qini to choose the best learner.

Dataset: Kevin Hillstrom’s MineThatData E-Mail Analytics Challenge (64,000 customers randomly assigned to an e-mail campaign or control group).

[ ]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from perpetual.causal_metrics import (
    auuc,
    cumulative_gain_curve,
    qini_coefficient,
    qini_curve,
)
from perpetual.meta_learners import DRLearner, SLearner, TLearner, XLearner
from perpetual.uplift import UpliftBooster
from sklearn.datasets import fetch_openml
from sklearn.model_selection import train_test_split

1. Data Preparation

We fetch the Hillstrom dataset from OpenML. The original experiment has three segments (Men’s e-mail, Women’s e-mail, No e-mail). We collapse the two e-mail groups into a single treatment indicator.

[ ]:
dataset = fetch_openml(data_id=41473, as_frame=True, parser="auto")
df = dataset.frame

# Binary treatment: any e-mail vs. control
df["treatment"] = (df["segment"] != "No E-Mail").astype(int)

# We use "conversion" (purchase) as the outcome
y = df["conversion"].astype(int).values
w = df["treatment"].values

features = [
    "recency",
    "history_segment",
    "history",
    "mens",
    "womens",
    "zip_code",
    "newbie",
    "channel",
]
X = df[features].copy()

# Mark categoricals so Perpetual handles them natively
for col in ["history_segment", "zip_code", "channel"]:
    X[col] = X[col].astype("category")

print(f"Samples: {len(df):,}  |  Treatment rate: {w.mean():.2%}")
print(f"Outcome (purchase) rate: {y.mean():.2%}")
df.head()
[ ]:
X_train, X_test, w_train, w_test, y_train, y_test = train_test_split(
    X, w, y, test_size=0.3, random_state=42, stratify=w
)
print(f"Train: {len(X_train):,}  |  Test: {len(X_test):,}")

1.1 Exploratory: Average Treatment Effect (ATE)

Before looking for heterogeneity, let's confirm an overall treatment effect exists. Because treatment was randomized, a simple difference in means is an unbiased estimate of the ATE.

[ ]:
ate = y_train[w_train == 1].mean() - y_train[w_train == 0].mean()
print(f"Naive ATE (difference in means): {ate:.4f}")
print(f"  Treated purchase rate:  {y_train[w_train == 1].mean():.4f}")
print(f"  Control purchase rate:  {y_train[w_train == 0].mean():.4f}")

2. CATE Estimation with Five Meta-Learners

We fit all five estimators (the four meta-learner classes plus the R-Learner, which is exposed through UpliftBooster) and collect their CATE predictions on the held-out test set.

[ ]:
learners = {
    "S-Learner": SLearner(budget=0.2),
    "T-Learner": TLearner(budget=0.2),
    "X-Learner": XLearner(budget=0.2),
    "DR-Learner": DRLearner(budget=0.2, clip=0.01),
}

cate_preds = {}
for name, learner in learners.items():
    learner.fit(X_train, w_train, y_train)
    cate_preds[name] = learner.predict(X_test)
    print(f"{name:12s}  avg CATE = {cate_preds[name].mean():+.5f}")

# R-Learner via UpliftBooster
ub = UpliftBooster(outcome_budget=0.1, propensity_budget=0.01, effect_budget=0.1)
ub.fit(X_train, w_train, y_train)
cate_preds["R-Learner"] = ub.predict(X_test)
print(f"{'R-Learner':12s}  avg CATE = {cate_preds['R-Learner'].mean():+.5f}")

3. Feature Importance — What Drives Heterogeneity?

After fitting, each meta-learner exposes a feature_importances_ attribute, which shows which covariates explain the most variation in the estimated treatment effect.

[ ]:
fig, axes = plt.subplots(2, 2, figsize=(12, 8), sharey=False)

for ax, (name, learner) in zip(axes.flat, learners.items()):
    importances = learner.feature_importances_
    if importances is not None:
        idx = np.argsort(importances)
        ax.barh(np.array(features)[idx], importances[idx])
        ax.set_title(f"{name} Feature Importances")
    else:
        ax.text(0.5, 0.5, "Not available", ha="center")

plt.tight_layout()
plt.show()

4. Subgroup Analysis

We partition the test set into quintiles of predicted CATE (using the DR-Learner) and compare the observed uplift (difference in means) within each group.

[ ]:
# Use DR-Learner CATE scores
tau_hat = cate_preds["DR-Learner"]

# Quintile bins
quantiles = np.quantile(tau_hat, [0.2, 0.4, 0.6, 0.8])
bins = np.digitize(tau_hat, quantiles)

rows = []
for q in range(5):
    mask = bins == q
    n_q = mask.sum()
    y_t = y_test[mask & (w_test == 1)]
    y_c = y_test[mask & (w_test == 0)]
    obs_uplift = y_t.mean() - y_c.mean() if len(y_t) > 0 and len(y_c) > 0 else np.nan
    rows.append(
        {
            "Quintile": q + 1,
            "n": n_q,
            "Avg Predicted CATE": tau_hat[mask].mean(),
            "Observed Uplift": obs_uplift,
        }
    )

subgroup_df = pd.DataFrame(rows)
print(subgroup_df.to_string(index=False))
[ ]:
fig, ax = plt.subplots(figsize=(8, 4))
x = subgroup_df["Quintile"]
width = 0.35
ax.bar(x - width / 2, subgroup_df["Avg Predicted CATE"], width, label="Predicted CATE")
ax.bar(x + width / 2, subgroup_df["Observed Uplift"], width, label="Observed Uplift")
ax.set_xlabel("CATE Quintile")
ax.set_ylabel("Effect")
ax.set_title("Predicted vs. Observed Uplift by Quintile")
ax.legend()
ax.axhline(0, color="grey", linewidth=0.5)
plt.tight_layout()
plt.show()

5. Model Selection with AUUC and Qini

When ground-truth CATE is unavailable, AUUC (Area Under the Uplift Curve) and the Qini Coefficient are the standard metrics for ranking meta-learner performance.
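One common construction (conventions vary slightly across libraries, so treat this as a sketch): sort customers by predicted CATE, and at each population fraction $\phi$ compute the cumulative gain

$$g(\phi) = \left(\frac{Y_T(\phi)}{N_T(\phi)} - \frac{Y_C(\phi)}{N_C(\phi)}\right)\big(N_T(\phi) + N_C(\phi)\big),$$

where $Y_T(\phi), Y_C(\phi)$ are cumulative conversions and $N_T(\phi), N_C(\phi)$ cumulative counts in the treated and control arms among the top $\phi$. AUUC is the area under this curve; a good CATE ranking places high-uplift customers first and bows the curve upward.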

[ ]:
# Uplift curves
plt.figure(figsize=(10, 5))
for name, scores in cate_preds.items():
    fracs, gains = cumulative_gain_curve(y_test, w_test, scores)
    plt.plot(fracs, gains, label=name)

plt.plot([0, 1], [0, 0], "k--", label="Random")
plt.title("Cumulative Uplift Gain — Model Comparison")
plt.xlabel("Fraction of Population (sorted by predicted CATE)")
plt.ylabel("Cumulative Gain")
plt.legend()
plt.tight_layout()
plt.show()
[ ]:
# Summary table
rows = []
for name, scores in cate_preds.items():
    a = auuc(y_test, w_test, scores, normalize=True)
    q = qini_coefficient(y_test, w_test, scores)
    rows.append({"Learner": name, "AUUC (norm)": f"{a:+.4f}", "Qini": f"{q:+.4f}"})

results = pd.DataFrame(rows)
print(results.to_string(index=False))

5.1 Qini Curves

The Qini curve is a variant of the uplift curve that corrects for the relative sizes of the treatment and control groups, so it stays meaningful even when the two arms are unevenly represented at a given targeting depth.
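In one standard formulation (due to Radcliffe; again, exact conventions differ by library), the Qini value at fraction $\phi$ rescales control conversions by the ratio of treated to control counts:

$$Q(\phi) = Y_T(\phi) - Y_C(\phi)\,\frac{N_T(\phi)}{N_C(\phi)},$$

so an excess of treated (or control) customers early in the ranking does not distort the curve.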

[ ]:
plt.figure(figsize=(10, 5))
for name, scores in cate_preds.items():
    fracs, qvals = qini_curve(y_test, w_test, scores)
    plt.plot(fracs, qvals, label=name)

plt.plot([0, 1], [0, 0], "k--", label="Random")
plt.title("Qini Curves — Model Comparison")
plt.xlabel("Fraction of Population")
plt.ylabel("Qini Value")
plt.legend()
plt.tight_layout()
plt.show()

Key Takeaways

| Concept | Insight |
| --- | --- |
| S-Learner | Simplest; can underfit heterogeneity because treatment enters as just one feature. |
| T-Learner | Fits separate treated/control models; may overfit when the groups differ in size. |
| X-Learner | Uses cross-imputation of treatment effects; a good choice when groups are unbalanced. |
| DR-Learner | Doubly robust: stays consistent if either the outcome model or the propensity model is correctly specified. |
| R-Learner | Directly targets the CATE via residual-on-residual regression; often strong on RCT data. |
| AUUC / Qini | Use these to compare learners when ground-truth CATE is not observed. |
| Subgroup analysis | Validate that predicted heterogeneity aligns with observed uplift within quintiles. |