Heterogeneous Treatment Effects with Meta-Learners
In many real-world applications, the treatment effect is not constant — it varies across subpopulations. Estimating these Heterogeneous Treatment Effects (HTE) is critical for personalized decision-making.
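Formally, the target is the Conditional Average Treatment Effect (CATE). In potential-outcomes notation (the standard definition, stated here for reference):

$$\tau(x) = \mathbb{E}\left[\,Y(1) - Y(0) \mid X = x\,\right]$$

where $Y(1)$ and $Y(0)$ are the potential outcomes with and without treatment, and $X$ is the covariate vector.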
This tutorial walks through a complete HTE estimation workflow using the Hillstrom E-Mail Marketing dataset:
1. Data preparation: encoding, train/test split, and exploratory analysis.
2. CATE estimation: comparing five meta-learners (S, T, X, DR, R-Learner).
3. Feature importance: understanding which covariates drive treatment heterogeneity.
4. Subgroup analysis: discovering who benefits most from treatment.
5. Model selection: using AUUC and Qini to choose the best learner.
Dataset: Kevin Hillstrom’s MineThatData E-Mail Analytics Challenge (64,000 customers randomly assigned to an e-mail campaign or control group).
[ ]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from perpetual.causal_metrics import (
    auuc,
    cumulative_gain_curve,
    qini_coefficient,
    qini_curve,
)
from perpetual.meta_learners import DRLearner, SLearner, TLearner, XLearner
from perpetual.uplift import UpliftBooster
from sklearn.datasets import fetch_openml
from sklearn.model_selection import train_test_split
1. Data Preparation
We fetch the Hillstrom dataset from OpenML. The original experiment has three segments (Men’s e-mail, Women’s e-mail, No e-mail). We collapse the two e-mail groups into a single treatment indicator.
[ ]:
dataset = fetch_openml(data_id=41473, as_frame=True, parser="auto")
df = dataset.frame
# Binary treatment: any e-mail vs. control
df["treatment"] = (df["segment"] != "No E-Mail").astype(int)
# We use "conversion" (purchase) as the outcome
y = df["conversion"].astype(int).values
w = df["treatment"].values
features = [
    "recency",
    "history_segment",
    "history",
    "mens",
    "womens",
    "zip_code",
    "newbie",
    "channel",
]
X = df[features].copy()
# Mark categoricals so Perpetual handles them natively
for col in ["history_segment", "zip_code", "channel"]:
    X[col] = X[col].astype("category")
print(f"Samples: {len(df):,} | Treatment rate: {w.mean():.2%}")
print(f"Outcome (purchase) rate: {y.mean():.2%}")
df.head()
[ ]:
X_train, X_test, w_train, w_test, y_train, y_test = train_test_split(
    X, w, y, test_size=0.3, random_state=42, stratify=w
)
print(f"Train: {len(X_train):,} | Test: {len(X_test):,}")
1.1 Exploratory: Average Treatment Effect (ATE)
Before looking for heterogeneity, let’s confirm an overall treatment effect exists.
[ ]:
ate = y_train[w_train == 1].mean() - y_train[w_train == 0].mean()
print(f"Naive ATE (difference in means): {ate:.4f}")
print(f" Treated purchase rate: {y_train[w_train == 1].mean():.4f}")
print(f" Control purchase rate: {y_train[w_train == 0].mean():.4f}")
2. CATE Estimation with Five Meta-Learners
We fit all five estimators (four meta-learner classes, plus the R-Learner via UpliftBooster) and collect their CATE predictions on held-out data.
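For orientation, the estimands are the standard ones from the meta-learner literature (with $\hat{\mu}$ denoting fitted outcome models, $\hat{m}$ a marginal outcome model, and $\hat{e}$ a propensity model):

$$\hat{\tau}_{S}(x) = \hat{\mu}(x, 1) - \hat{\mu}(x, 0), \qquad \hat{\tau}_{T}(x) = \hat{\mu}_{1}(x) - \hat{\mu}_{0}(x)$$

The X-Learner additionally imputes unit-level effects from each arm and blends the two resulting models with propensity weights; the DR-Learner regresses a doubly robust pseudo-outcome on $X$; and the R-Learner minimizes the residual-on-residual loss

$$\sum_{i} \Big[ \big(Y_i - \hat{m}(X_i)\big) - \tau(X_i)\,\big(W_i - \hat{e}(X_i)\big) \Big]^{2}.$$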
[ ]:
learners = {
    "S-Learner": SLearner(budget=0.2),
    "T-Learner": TLearner(budget=0.2),
    "X-Learner": XLearner(budget=0.2),
    "DR-Learner": DRLearner(budget=0.2, clip=0.01),
}
cate_preds = {}
for name, learner in learners.items():
    learner.fit(X_train, w_train, y_train)
    cate_preds[name] = learner.predict(X_test)
    print(f"{name:12s} avg CATE = {cate_preds[name].mean():+.5f}")
# R-Learner via UpliftBooster
ub = UpliftBooster(outcome_budget=0.1, propensity_budget=0.01, effect_budget=0.1)
ub.fit(X_train, w_train, y_train)
cate_preds["R-Learner"] = ub.predict(X_test)
print(f"{'R-Learner':12s} avg CATE = {cate_preds['R-Learner'].mean():+.5f}")
3. Feature Importance — What Drives Heterogeneity?
After fitting, meta-learners expose `feature_importances_`. These scores show which covariates the underlying models rely on most; for learners that fit an explicit effect model, they are a useful proxy for which covariates drive treatment heterogeneity.
[ ]:
fig, axes = plt.subplots(2, 2, figsize=(12, 8), sharey=False)
for ax, (name, learner) in zip(axes.flat, learners.items()):
    importances = learner.feature_importances_
    if importances is not None:
        idx = np.argsort(importances)
        ax.barh(np.array(features)[idx], importances[idx])
        ax.set_title(f"{name} Feature Importances")
    else:
        ax.text(0.5, 0.5, "Not available", ha="center")
plt.tight_layout()
plt.show()
4. Subgroup Analysis
We partition the test set into quintiles of predicted CATE (using the DR-Learner) and compare the observed uplift (difference in means) within each group.
[ ]:
# Use DR-Learner CATE scores
tau_hat = cate_preds["DR-Learner"]
# Quintile bins
quantiles = np.quantile(tau_hat, [0.2, 0.4, 0.6, 0.8])
bins = np.digitize(tau_hat, quantiles)
rows = []
for q in range(5):
    mask = bins == q
    n_q = mask.sum()
    y_t = y_test[mask & (w_test == 1)]
    y_c = y_test[mask & (w_test == 0)]
    obs_uplift = y_t.mean() - y_c.mean() if len(y_t) > 0 and len(y_c) > 0 else np.nan
    rows.append(
        {
            "Quintile": q + 1,
            "n": n_q,
            "Avg Predicted CATE": tau_hat[mask].mean(),
            "Observed Uplift": obs_uplift,
        }
    )
subgroup_df = pd.DataFrame(rows)
print(subgroup_df.to_string(index=False))
[ ]:
fig, ax = plt.subplots(figsize=(8, 4))
x = subgroup_df["Quintile"]
width = 0.35
ax.bar(x - width / 2, subgroup_df["Avg Predicted CATE"], width, label="Predicted CATE")
ax.bar(x + width / 2, subgroup_df["Observed Uplift"], width, label="Observed Uplift")
ax.set_xlabel("CATE Quintile")
ax.set_ylabel("Effect")
ax.set_title("Predicted vs. Observed Uplift by Quintile")
ax.legend()
ax.axhline(0, color="grey", linewidth=0.5)
plt.tight_layout()
plt.show()
5. Model Selection with AUUC and Qini
When ground-truth CATE is unavailable, AUUC (Area Under the Uplift Curve) and the Qini Coefficient are the standard metrics for ranking meta-learner performance.
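A common convention (implementations differ in scaling, so treat this as a reference sketch rather than the exact formula behind `cumulative_gain_curve`): the cumulative gain at fraction $\phi$ of the population, sorted by predicted CATE in descending order, is

$$g(\phi) = \big(\bar{y}_t(\phi) - \bar{y}_c(\phi)\big)\,\phi\,N,$$

where $\bar{y}_t(\phi)$ and $\bar{y}_c(\phi)$ are the mean outcomes among treated and control units within the top $\phi$ fraction and $N$ is the total population size. AUUC is the area under this curve; a good model ranks high-uplift customers first, lifting the curve above the random baseline.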
[ ]:
# Uplift curves
plt.figure(figsize=(10, 5))
for name, scores in cate_preds.items():
    fracs, gains = cumulative_gain_curve(y_test, w_test, scores)
    plt.plot(fracs, gains, label=name)
plt.plot([0, 1], [0, 0], "k--", label="Random")
plt.title("Cumulative Uplift Gain — Model Comparison")
plt.xlabel("Fraction of Population (sorted by predicted CATE)")
plt.ylabel("Cumulative Gain")
plt.legend()
plt.tight_layout()
plt.show()
[ ]:
# Summary table
rows = []
for name, scores in cate_preds.items():
    a = auuc(y_test, w_test, scores, normalize=True)
    q = qini_coefficient(y_test, w_test, scores)
    rows.append({"Learner": name, "AUUC (norm)": f"{a:+.4f}", "Qini": f"{q:+.4f}"})
results = pd.DataFrame(rows)
print(results.to_string(index=False))
5.1 Qini Curves
The Qini curve generalises the uplift curve by weighting for treatment/control group sizes.
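In the usual (Radcliffe) formulation, the Qini value at fraction $\phi$ is

$$Q(\phi) = Y_t(\phi) - Y_c(\phi)\,\frac{N_t(\phi)}{N_c(\phi)},$$

where $Y_t(\phi)$ and $Y_c(\phi)$ are the cumulative counts of positive outcomes and $N_t(\phi)$, $N_c(\phi)$ the treated/control counts within the top $\phi$ fraction; rescaling the control term makes the two arms comparable at every cutoff. The exact scaling used by `qini_curve` may differ, so this is a reference definition rather than the library's formula.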
[ ]:
plt.figure(figsize=(10, 5))
for name, scores in cate_preds.items():
    fracs, qvals = qini_curve(y_test, w_test, scores)
    plt.plot(fracs, qvals, label=name)
plt.plot([0, 1], [0, 0], "k--", label="Random")
plt.title("Qini Curves — Model Comparison")
plt.xlabel("Fraction of Population")
plt.ylabel("Qini Value")
plt.legend()
plt.tight_layout()
plt.show()
Key Takeaways
| Concept | Insight |
|---|---|
| S-Learner | Simplest; can underfit heterogeneity because treatment is just one feature. |
| T-Learner | Fits separate models per arm; may overfit when treatment/control groups differ in size. |
| X-Learner | Uses cross-imputation; good when treatment groups are unbalanced. |
| DR-Learner | Doubly robust: consistent if either the outcome model or the propensity model is correctly specified. |
| R-Learner | Directly optimises CATE via residual-on-residual regression; often best for RCTs. |
| AUUC / Qini | Use these to compare learners when ground-truth CATE is not observed. |
| Subgroup analysis | Validate that predicted heterogeneity aligns with observed uplift within quintiles. |