Heterogeneous Treatment Effects with Meta-Learners

In many real-world applications, the treatment effect is not constant — it varies across subpopulations. Estimating these Heterogeneous Treatment Effects (HTE) is critical for personalized decision-making.
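Formally, the object of interest is the Conditional Average Treatment Effect (CATE). In standard potential-outcomes notation (a textbook definition, not specific to any library):

$$\tau(x) = \mathbb{E}\big[\,Y(1) - Y(0) \mid X = x\,\big],$$

where $Y(1)$ and $Y(0)$ are the potential outcomes with and without treatment and $X$ is the covariate vector. The meta-learners below all produce estimates $\hat{\tau}(x)$ by composing ordinary supervised models.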

This tutorial walks through a complete HTE estimation workflow using the Hillstrom E-Mail Marketing dataset:

  1. Data preparation — encoding, train/test split, and exploratory analysis.

  2. CATE estimation — comparing five meta-learners (S, T, X, DR, R-Learner).

  3. Feature importance — understanding which covariates drive treatment heterogeneity.

  4. Subgroup analysis — discovering who benefits most from treatment.

  5. Model selection — using AUUC and Qini to choose the best learner.

Dataset: Kevin Hillstrom’s MineThatData E-Mail Analytics Challenge (64,000 customers randomly assigned to an e-mail campaign or control group).

[ ]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from perpetual.causal_metrics import (
    auuc,
    cumulative_gain_curve,
    qini_coefficient,
    qini_curve,
)
from perpetual.meta_learners import DRLearner, SLearner, TLearner, XLearner
from perpetual.uplift import UpliftBooster
from sklearn.datasets import fetch_openml
from sklearn.model_selection import train_test_split

1. Data Preparation

We fetch the Hillstrom dataset from OpenML. The original experiment has three segments (Men’s e-mail, Women’s e-mail, No e-mail). We collapse the two e-mail groups into a single treatment indicator.

[ ]:
dataset = fetch_openml(data_id=41473, as_frame=True, parser="auto")
df = dataset.frame

# Binary treatment: any e-mail vs. control
df["treatment"] = (df["segment"] != "No E-Mail").astype(int)

# We use "conversion" (purchase) as the outcome
y = df["conversion"].astype(int).values
w = df["treatment"].values

features = [
    "recency",
    "history_segment",
    "history",
    "mens",
    "womens",
    "zip_code",
    "newbie",
    "channel",
]
X = df[features].copy()

# Mark categoricals so Perpetual handles them natively
for col in ["history_segment", "zip_code", "channel"]:
    X[col] = X[col].astype("category")

print(f"Samples: {len(df):,}  |  Treatment rate: {w.mean():.2%}")
print(f"Outcome (purchase) rate: {y.mean():.2%}")
df.head()
[ ]:
X_train, X_test, w_train, w_test, y_train, y_test = train_test_split(
    X, w, y, test_size=0.3, random_state=42, stratify=w
)
print(f"Train: {len(X_train):,}  |  Test: {len(X_test):,}")

1.1 Exploratory: Average Treatment Effect (ATE)

Before looking for heterogeneity, let's confirm an overall treatment effect exists. Because treatment was randomized, a simple difference in means is an unbiased estimate of the ATE.

[ ]:
ate = y_train[w_train == 1].mean() - y_train[w_train == 0].mean()
print(f"Naive ATE (difference in means): {ate:.4f}")
print(f"  Treated purchase rate:  {y_train[w_train == 1].mean():.4f}")
print(f"  Control purchase rate:  {y_train[w_train == 0].mean():.4f}")

2. CATE Estimation with Five Meta-Learners

We fit all five estimators (the four meta-learner classes plus the R-Learner, which is exposed through UpliftBooster) and collect their CATE predictions on the held-out test set.

[ ]:
learners = {
    "S-Learner": SLearner(budget=0.2),
    "T-Learner": TLearner(budget=0.2),
    "X-Learner": XLearner(budget=0.2),
    "DR-Learner": DRLearner(budget=0.2, clip=0.01),
}

cate_preds = {}
for name, learner in learners.items():
    learner.fit(X_train, w_train, y_train)
    cate_preds[name] = learner.predict(X_test)
    print(f"{name:12s}  avg CATE = {cate_preds[name].mean():+.5f}")

# R-Learner via UpliftBooster
ub = UpliftBooster(outcome_budget=0.1, propensity_budget=0.01, effect_budget=0.1)
ub.fit(X_train, w_train, y_train)
cate_preds["R-Learner"] = ub.predict(X_test)
print(f"{'R-Learner':12s}  avg CATE = {cate_preds['R-Learner'].mean():+.5f}")

3. Feature Importance — What Drives Heterogeneity?

After fitting, each meta-learner exposes a feature_importances_ attribute, which shows which covariates explain the most variation in the estimated treatment effect.

[ ]:
fig, axes = plt.subplots(2, 2, figsize=(12, 8), sharey=False)

for ax, (name, learner) in zip(axes.flat, learners.items()):
    importances = learner.feature_importances_
    if importances is not None:
        idx = np.argsort(importances)
        ax.barh(np.array(features)[idx], importances[idx])
        ax.set_title(f"{name} Feature Importances")
    else:
        ax.text(0.5, 0.5, "Not available", ha="center")

plt.tight_layout()
plt.show()

4. Subgroup Analysis

We partition the test set into quintiles of predicted CATE (using the DR-Learner) and compare the observed uplift (difference in means) within each group.

[ ]:
# Use DR-Learner CATE scores
tau_hat = cate_preds["DR-Learner"]

# Quintile bins
quantiles = np.quantile(tau_hat, [0.2, 0.4, 0.6, 0.8])
bins = np.digitize(tau_hat, quantiles)

rows = []
for q in range(5):
    mask = bins == q
    n_q = mask.sum()
    y_t = y_test[mask & (w_test == 1)]
    y_c = y_test[mask & (w_test == 0)]
    obs_uplift = y_t.mean() - y_c.mean() if len(y_t) > 0 and len(y_c) > 0 else np.nan
    rows.append(
        {
            "Quintile": q + 1,
            "n": n_q,
            "Avg Predicted CATE": tau_hat[mask].mean(),
            "Observed Uplift": obs_uplift,
        }
    )

subgroup_df = pd.DataFrame(rows)
print(subgroup_df.to_string(index=False))
[ ]:
fig, ax = plt.subplots(figsize=(8, 4))
x = subgroup_df["Quintile"]
width = 0.35
ax.bar(x - width / 2, subgroup_df["Avg Predicted CATE"], width, label="Predicted CATE")
ax.bar(x + width / 2, subgroup_df["Observed Uplift"], width, label="Observed Uplift")
ax.set_xlabel("CATE Quintile")
ax.set_ylabel("Effect")
ax.set_title("Predicted vs. Observed Uplift by Quintile")
ax.legend()
ax.axhline(0, color="grey", linewidth=0.5)
plt.tight_layout()
plt.show()

5. Model Selection with AUUC and Qini

When ground-truth CATE is unavailable, AUUC (Area Under the Uplift Curve) and the Qini Coefficient are the standard metrics for ranking meta-learner performance.
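One common construction (conventions vary slightly across libraries, so treat this as a sketch): sort customers by predicted CATE, and at each population fraction $\phi$ compute the cumulative gain

$$g(\phi) = \left(\frac{Y_T(\phi)}{N_T(\phi)} - \frac{Y_C(\phi)}{N_C(\phi)}\right)\big(N_T(\phi) + N_C(\phi)\big),$$

where $Y_T(\phi), Y_C(\phi)$ are cumulative conversions and $N_T(\phi), N_C(\phi)$ cumulative counts in the treated and control arms among the top $\phi$. AUUC is the area under this curve; a good CATE ranking places high-uplift customers first and bows the curve upward.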

[ ]:
# Uplift curves
plt.figure(figsize=(10, 5))
for name, scores in cate_preds.items():
    fracs, gains = cumulative_gain_curve(y_test, w_test, scores)
    plt.plot(fracs, gains, label=name)

plt.plot([0, 1], [0, 0], "k--", label="Random")
plt.title("Cumulative Uplift Gain — Model Comparison")
plt.xlabel("Fraction of Population (sorted by predicted CATE)")
plt.ylabel("Cumulative Gain")
plt.legend()
plt.tight_layout()
plt.show()
[ ]:
# Summary table
rows = []
for name, scores in cate_preds.items():
    a = auuc(y_test, w_test, scores, normalize=True)
    q = qini_coefficient(y_test, w_test, scores)
    rows.append({"Learner": name, "AUUC (norm)": f"{a:+.4f}", "Qini": f"{q:+.4f}"})

results = pd.DataFrame(rows)
print(results.to_string(index=False))

5.1 Qini Curves

The Qini curve is a variant of the uplift curve that corrects for the relative sizes of the treatment and control groups, so it stays meaningful even when the two arms are unevenly represented at a given targeting depth.
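In one standard formulation (due to Radcliffe; again, exact conventions differ by library), the Qini value at fraction $\phi$ rescales control conversions by the ratio of treated to control counts:

$$Q(\phi) = Y_T(\phi) - Y_C(\phi)\,\frac{N_T(\phi)}{N_C(\phi)},$$

so an excess of treated (or control) customers early in the ranking does not distort the curve.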

[ ]:
plt.figure(figsize=(10, 5))
for name, scores in cate_preds.items():
    fracs, qvals = qini_curve(y_test, w_test, scores)
    plt.plot(fracs, qvals, label=name)

plt.plot([0, 1], [0, 0], "k--", label="Random")
plt.title("Qini Curves — Model Comparison")
plt.xlabel("Fraction of Population")
plt.ylabel("Qini Value")
plt.legend()
plt.tight_layout()
plt.show()

Key Takeaways

| Concept | Insight |
| --- | --- |
| S-Learner | Simplest; can underfit heterogeneity because treatment enters as just one feature. |
| T-Learner | Fits separate treated/control models; may overfit when the groups differ in size. |
| X-Learner | Uses cross-imputation of treatment effects; a good choice when groups are unbalanced. |
| DR-Learner | Doubly robust: stays consistent if either the outcome model or the propensity model is correctly specified. |
| R-Learner | Directly targets the CATE via residual-on-residual regression; often strong on RCT data. |
| AUUC / Qini | Use these to compare learners when ground-truth CATE is not observed. |
| Subgroup analysis | Validate that predicted heterogeneity aligns with observed uplift within quintiles. |