{ "cells": [ { "cell_type": "markdown", "id": "0", "metadata": {}, "source": [ "# Heterogeneous Treatment Effects with Meta-Learners\n", "\n", "In many real-world applications, the treatment effect is not constant — it varies across subpopulations. Estimating these **Heterogeneous Treatment Effects (HTE)** is critical for personalized decision-making.\n", "\n", "This tutorial walks through a complete HTE estimation workflow using the **Hillstrom E-Mail Marketing** dataset:\n", "\n", "1. **Data preparation** — encoding, train/test split, and exploratory analysis.\n", "2. **CATE estimation** — comparing five meta-learners (S, T, X, DR, R-Learner).\n", "3. **Feature importance** — understanding which covariates drive treatment heterogeneity.\n", "4. **Subgroup analysis** — discovering who benefits most from treatment.\n", "5. **Model selection** — using AUUC and Qini to choose the best learner.\n", "\n", "> **Dataset:** Kevin Hillstrom's MineThatData E-Mail Analytics Challenge (64,000 customers randomly assigned to an e-mail campaign or control group)." ] }, { "cell_type": "code", "execution_count": null, "id": "1", "metadata": {}, "outputs": [], "source": [ "import matplotlib.pyplot as plt\n", "import numpy as np\n", "import pandas as pd\n", "from perpetual.causal_metrics import (\n", " auuc,\n", " cumulative_gain_curve,\n", " qini_coefficient,\n", " qini_curve,\n", ")\n", "from perpetual.meta_learners import DRLearner, SLearner, TLearner, XLearner\n", "from perpetual.uplift import UpliftBooster\n", "from sklearn.datasets import fetch_openml\n", "from sklearn.model_selection import train_test_split" ] }, { "cell_type": "markdown", "id": "2", "metadata": {}, "source": [ "## 1. Data Preparation\n", "\n", "We fetch the **Hillstrom** dataset from OpenML. The original experiment has three segments (Men's e-mail, Women's e-mail, No e-mail). We collapse the two e-mail groups into a single treatment indicator." 
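] }, { "cell_type": "markdown", "id": "2a", "metadata": {}, "source": [ "The collapse of the three arms into one binary flag can be sketched on a toy frame first (synthetic rows; the segment spellings below are assumed to match the dataset's):" ] }, { "cell_type": "code", "execution_count": null, "id": "2b", "metadata": {}, "outputs": [], "source": [ "toy = pd.DataFrame(\n", "    {\"segment\": [\"Mens E-Mail\", \"No E-Mail\", \"Womens E-Mail\", \"No E-Mail\"]}\n", ")\n", "\n", "# Any e-mail arm -> 1, control -> 0\n", "toy[\"treatment\"] = (toy[\"segment\"] != \"No E-Mail\").astype(int)\n", "print(toy[\"treatment\"].tolist())  # [1, 0, 1, 0]" ] }, { "cell_type": "markdown", "id": "2c", "metadata": {}, "source": [ "The next cell applies the same one-liner to the full frame."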
] }, { "cell_type": "code", "execution_count": null, "id": "3", "metadata": {}, "outputs": [], "source": [ "dataset = fetch_openml(data_id=41473, as_frame=True, parser=\"auto\")\n", "df = dataset.frame\n", "\n", "# Binary treatment: any e-mail vs. control\n", "df[\"treatment\"] = (df[\"segment\"] != \"No E-Mail\").astype(int)\n", "\n", "# We use \"conversion\" (purchase) as the outcome\n", "y = df[\"conversion\"].astype(int).values\n", "w = df[\"treatment\"].values\n", "\n", "features = [\n", " \"recency\",\n", " \"history_segment\",\n", " \"history\",\n", " \"mens\",\n", " \"womens\",\n", " \"zip_code\",\n", " \"newbie\",\n", " \"channel\",\n", "]\n", "X = df[features].copy()\n", "\n", "# Mark categoricals so Perpetual handles them natively\n", "for col in [\"history_segment\", \"zip_code\", \"channel\"]:\n", " X[col] = X[col].astype(\"category\")\n", "\n", "print(f\"Samples: {len(df):,} | Treatment rate: {w.mean():.2%}\")\n", "print(f\"Outcome (purchase) rate: {y.mean():.2%}\")\n", "df.head()" ] }, { "cell_type": "code", "execution_count": null, "id": "4", "metadata": {}, "outputs": [], "source": [ "X_train, X_test, w_train, w_test, y_train, y_test = train_test_split(\n", " X, w, y, test_size=0.3, random_state=42, stratify=w\n", ")\n", "print(f\"Train: {len(X_train):,} | Test: {len(X_test):,}\")" ] }, { "cell_type": "markdown", "id": "5", "metadata": {}, "source": [ "### 1.1 Exploratory: Average Treatment Effect (ATE)\n", "\n", "Before looking for heterogeneity, let's confirm an overall treatment effect exists." 
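] }, { "cell_type": "markdown", "id": "5a", "metadata": {}, "source": [ "A point estimate alone says little, so it is worth attaching a normal-approximation standard error to the difference in means. The sketch below runs on synthetic arrays so it is self-contained; in this notebook `y_train` and `w_train` would take their place:" ] }, { "cell_type": "code", "execution_count": null, "id": "5b", "metadata": {}, "outputs": [], "source": [ "rng = np.random.default_rng(0)\n", "w_demo = rng.integers(0, 2, size=10_000)\n", "y_demo = rng.binomial(1, 0.10 + 0.02 * w_demo)  # true uplift of 2 points\n", "\n", "y_t, y_c = y_demo[w_demo == 1], y_demo[w_demo == 0]\n", "ate_demo = y_t.mean() - y_c.mean()\n", "# Standard error of a difference of two independent sample means\n", "se = np.sqrt(y_t.var(ddof=1) / len(y_t) + y_c.var(ddof=1) / len(y_c))\n", "print(f\"ATE = {ate_demo:.4f} +/- {1.96 * se:.4f} (95% CI)\")" ] }, { "cell_type": "markdown", "id": "5c", "metadata": {}, "source": [ "If the interval excludes zero, there is an average effect worth decomposing into subgroups."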
] }, { "cell_type": "code", "execution_count": null, "id": "6", "metadata": {}, "outputs": [], "source": [ "ate = y_train[w_train == 1].mean() - y_train[w_train == 0].mean()\n", "print(f\"Naive ATE (difference in means): {ate:.4f}\")\n", "print(f\" Treated purchase rate: {y_train[w_train == 1].mean():.4f}\")\n", "print(f\" Control purchase rate: {y_train[w_train == 0].mean():.4f}\")" ] }, { "cell_type": "markdown", "id": "7", "metadata": {}, "source": [ "## 2. CATE Estimation with Five Meta-Learners\n", "\n", "We fit all five available estimators and collect their CATE predictions on held-out data." ] }, { "cell_type": "code", "execution_count": null, "id": "8", "metadata": {}, "outputs": [], "source": [ "learners = {\n", " \"S-Learner\": SLearner(budget=0.2),\n", " \"T-Learner\": TLearner(budget=0.2),\n", " \"X-Learner\": XLearner(budget=0.2),\n", " \"DR-Learner\": DRLearner(budget=0.2, clip=0.01),\n", "}\n", "\n", "cate_preds = {}\n", "for name, learner in learners.items():\n", " learner.fit(X_train, w_train, y_train)\n", " cate_preds[name] = learner.predict(X_test)\n", " print(f\"{name:12s} avg CATE = {cate_preds[name].mean():+.5f}\")\n", "\n", "# R-Learner via UpliftBooster\n", "ub = UpliftBooster(outcome_budget=0.1, propensity_budget=0.01, effect_budget=0.1)\n", "ub.fit(X_train, w_train, y_train)\n", "cate_preds[\"R-Learner\"] = ub.predict(X_test)\n", "print(f\"{'R-Learner':12s} avg CATE = {cate_preds['R-Learner'].mean():+.5f}\")" ] }, { "cell_type": "markdown", "id": "9", "metadata": {}, "source": [ "## 3. Feature Importance — What Drives Heterogeneity?\n", "\n", "After fitting, meta-learners expose `feature_importances_`, which shows which covariates explain the most variation in the treatment effect." 
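] }, { "cell_type": "markdown", "id": "9a", "metadata": {}, "source": [ "A cheap complementary probe is to correlate the predicted CATE scores with each numeric covariate. The sketch below uses synthetic stand-ins (`X_demo`, `tau_demo`) so it runs on its own; in this notebook `X_test` and a learner's predictions would play those roles:" ] }, { "cell_type": "code", "execution_count": null, "id": "9b", "metadata": {}, "outputs": [], "source": [ "rng = np.random.default_rng(1)\n", "X_demo = pd.DataFrame(\n", "    {\n", "        \"recency\": rng.integers(1, 13, size=500),\n", "        \"history\": rng.gamma(2.0, 100.0, size=500),\n", "    }\n", ")\n", "# Synthetic CATE that varies with recency only\n", "tau_demo = 0.03 - 0.002 * X_demo[\"recency\"] + rng.normal(0, 0.005, size=500)\n", "\n", "corr = X_demo.corrwith(tau_demo).abs().sort_values(ascending=False)\n", "print(corr)  # recency should rank first" ] }, { "cell_type": "markdown", "id": "9c", "metadata": {}, "source": [ "A large absolute correlation flags a driver of heterogeneity, though it misses purely non-linear or interaction effects, which tree-based importances capture better."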
] }, { "cell_type": "code", "execution_count": null, "id": "10", "metadata": {}, "outputs": [], "source": [ "fig, axes = plt.subplots(2, 2, figsize=(12, 8), sharey=False)\n", "\n", "for ax, (name, learner) in zip(axes.flat, learners.items()):\n", " importances = learner.feature_importances_\n", " if importances is not None:\n", " idx = np.argsort(importances)\n", " ax.barh(np.array(features)[idx], importances[idx])\n", " ax.set_title(f\"{name} Feature Importances\")\n", " else:\n", " ax.text(0.5, 0.5, \"Not available\", ha=\"center\")\n", "\n", "plt.tight_layout()\n", "plt.show()" ] }, { "cell_type": "markdown", "id": "11", "metadata": {}, "source": [ "## 4. Subgroup Analysis\n", "\n", "We partition the test set into quintiles of predicted CATE (using the DR-Learner) and compare the **observed uplift** (difference in means) within each group." ] }, { "cell_type": "code", "execution_count": null, "id": "12", "metadata": {}, "outputs": [], "source": [ "# Use DR-Learner CATE scores\n", "tau_hat = cate_preds[\"DR-Learner\"]\n", "\n", "# Quintile bins\n", "quantiles = np.quantile(tau_hat, [0.2, 0.4, 0.6, 0.8])\n", "bins = np.digitize(tau_hat, quantiles)\n", "\n", "rows = []\n", "for q in range(5):\n", " mask = bins == q\n", " n_q = mask.sum()\n", " y_t = y_test[mask & (w_test == 1)]\n", " y_c = y_test[mask & (w_test == 0)]\n", " obs_uplift = y_t.mean() - y_c.mean() if len(y_t) > 0 and len(y_c) > 0 else np.nan\n", " rows.append(\n", " {\n", " \"Quintile\": q + 1,\n", " \"n\": n_q,\n", " \"Avg Predicted CATE\": tau_hat[mask].mean(),\n", " \"Observed Uplift\": obs_uplift,\n", " }\n", " )\n", "\n", "subgroup_df = pd.DataFrame(rows)\n", "print(subgroup_df.to_string(index=False))" ] }, { "cell_type": "code", "execution_count": null, "id": "13", "metadata": {}, "outputs": [], "source": [ "fig, ax = plt.subplots(figsize=(8, 4))\n", "x = subgroup_df[\"Quintile\"]\n", "width = 0.35\n", "ax.bar(x - width / 2, subgroup_df[\"Avg Predicted CATE\"], width, label=\"Predicted 
CATE\")\n", "ax.bar(x + width / 2, subgroup_df[\"Observed Uplift\"], width, label=\"Observed Uplift\")\n", "ax.set_xlabel(\"CATE Quintile\")\n", "ax.set_ylabel(\"Effect\")\n", "ax.set_title(\"Predicted vs. Observed Uplift by Quintile\")\n", "ax.legend()\n", "ax.axhline(0, color=\"grey\", linewidth=0.5)\n", "plt.tight_layout()\n", "plt.show()" ] }, { "cell_type": "markdown", "id": "14", "metadata": {}, "source": [ "## 5. Model Selection with AUUC and Qini\n", "\n", "When ground-truth CATE is unavailable, **AUUC** (Area Under the Uplift Curve) and the **Qini Coefficient** are the standard metrics for ranking meta-learner performance." ] }, { "cell_type": "code", "execution_count": null, "id": "15", "metadata": {}, "outputs": [], "source": [ "# Uplift curves\n", "plt.figure(figsize=(10, 5))\n", "for name, scores in cate_preds.items():\n", " fracs, gains = cumulative_gain_curve(y_test, w_test, scores)\n", " plt.plot(fracs, gains, label=name)\n", "\n", "plt.plot([0, 1], [0, 0], \"k--\", label=\"Random\")\n", "plt.title(\"Cumulative Uplift Gain — Model Comparison\")\n", "plt.xlabel(\"Fraction of Population (sorted by predicted CATE)\")\n", "plt.ylabel(\"Cumulative Gain\")\n", "plt.legend()\n", "plt.tight_layout()\n", "plt.show()" ] }, { "cell_type": "code", "execution_count": null, "id": "16", "metadata": {}, "outputs": [], "source": [ "# Summary table\n", "rows = []\n", "for name, scores in cate_preds.items():\n", " a = auuc(y_test, w_test, scores, normalize=True)\n", " q = qini_coefficient(y_test, w_test, scores)\n", " rows.append({\"Learner\": name, \"AUUC (norm)\": f\"{a:+.4f}\", \"Qini\": f\"{q:+.4f}\"})\n", "\n", "results = pd.DataFrame(rows)\n", "print(results.to_string(index=False))" ] }, { "cell_type": "markdown", "id": "17", "metadata": {}, "source": [ "### 5.1 Qini Curves\n", "\n", "The Qini curve generalises the uplift curve: control successes are rescaled by the ratio of treated to control counts at each cutoff, so arms of unequal size remain directly comparable."
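] }, { "cell_type": "markdown", "id": "17a", "metadata": {}, "source": [ "The reweighting can be made concrete by computing one Qini value by hand at a single cutoff. This follows the textbook definition on synthetic data and is not necessarily the exact convention `qini_curve` implements:" ] }, { "cell_type": "code", "execution_count": null, "id": "17b", "metadata": {}, "outputs": [], "source": [ "rng = np.random.default_rng(2)\n", "n = 1_000\n", "w_demo = rng.integers(0, 2, size=n)\n", "score = rng.normal(size=n)  # stands in for predicted CATE\n", "y_demo = rng.binomial(1, 0.10 + 0.05 * w_demo * (score > 0))\n", "\n", "# Rank by score and keep the top 30% of the population\n", "top = np.argsort(-score)[: int(0.3 * n)]\n", "n_t = int((w_demo[top] == 1).sum())\n", "n_c = int((w_demo[top] == 0).sum())\n", "succ_t = int(y_demo[top][w_demo[top] == 1].sum())\n", "succ_c = int(y_demo[top][w_demo[top] == 0].sum())\n", "\n", "# Scale control successes by n_t / n_c so unequal arms compare fairly\n", "qini_at_30 = succ_t - succ_c * n_t / n_c\n", "print(f\"top 30%: {n_t} treated, {n_c} control, Qini value = {qini_at_30:.1f}\")" ] }, { "cell_type": "markdown", "id": "17c", "metadata": {}, "source": [ "Sweeping the cutoff from 0% to 100% traces out the curves plotted below."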
] }, { "cell_type": "code", "execution_count": null, "id": "18", "metadata": {}, "outputs": [], "source": [ "plt.figure(figsize=(10, 5))\n", "for name, scores in cate_preds.items():\n", " fracs, qvals = qini_curve(y_test, w_test, scores)\n", " plt.plot(fracs, qvals, label=name)\n", "\n", "plt.plot([0, 1], [0, 0], \"k--\", label=\"Random\")\n", "plt.title(\"Qini Curves — Model Comparison\")\n", "plt.xlabel(\"Fraction of Population\")\n", "plt.ylabel(\"Qini Value\")\n", "plt.legend()\n", "plt.tight_layout()\n", "plt.show()" ] }, { "cell_type": "markdown", "id": "19", "metadata": {}, "source": [ "## Key Takeaways\n", "\n", "| Concept | Insight |\n", "|---|---|\n", "| **S-Learner** | Simplest; can underfit heterogeneity because treatment is just one feature. |\n", "| **T-Learner** | Fits a separate model per arm; estimates become noisy when either arm is small. |\n", "| **X-Learner** | Uses cross-imputation; good when treatment groups are unbalanced. |\n", "| **DR-Learner** | Doubly robust: consistent if either the outcome model or the propensity model is correctly specified. |\n", "| **R-Learner** | Directly optimises CATE via residual-on-residual regression; works well in RCTs, where the propensity is known. |\n", "| **AUUC / Qini** | Use these to compare learners when ground-truth CATE is not observed. |\n", "| **Subgroup analysis** | Validate that predicted heterogeneity aligns with observed uplift within quintiles. |" ] } ], "metadata": { "language_info": { "name": "python" } }, "nbformat": 4, "nbformat_minor": 5 }