{ "cells": [ { "cell_type": "markdown", "id": "0", "metadata": {}, "source": [ "# Heterogeneous Treatment Effects with Meta-Learners\n", "\n", "In many real-world applications, the treatment effect is not constant — it varies across subpopulations. Estimating these **Heterogeneous Treatment Effects (HTE)** is critical for personalized decision-making.\n", "\n", "This tutorial walks through a complete HTE estimation workflow using the **Hillstrom E-Mail Marketing** dataset:\n", "\n", "1. **Data preparation** — encoding, train/test split, and exploratory analysis.\n", "2. **CATE estimation** — comparing five meta-learners (S, T, X, DR, R-Learner).\n", "3. **Feature importance** — understanding which covariates drive treatment heterogeneity.\n", "4. **Subgroup analysis** — discovering who benefits most from treatment.\n", "5. **Model selection** — using AUUC and Qini to choose the best learner.\n", "\n", "> **Dataset:** Kevin Hillstrom's MineThatData E-Mail Analytics Challenge (64,000 customers randomly assigned to an e-mail campaign or control group)." ] }, { "cell_type": "code", "execution_count": null, "id": "1", "metadata": {}, "outputs": [], "source": [ "import matplotlib.pyplot as plt\n", "import numpy as np\n", "import pandas as pd\n", "from perpetual.causal_metrics import (\n", " auuc,\n", " cumulative_gain_curve,\n", " qini_coefficient,\n", " qini_curve,\n", ")\n", "from perpetual.meta_learners import DRLearner, SLearner, TLearner, XLearner\n", "from perpetual.uplift import UpliftBooster\n", "from sklearn.datasets import fetch_openml\n", "from sklearn.model_selection import train_test_split" ] }, { "cell_type": "markdown", "id": "2", "metadata": {}, "source": [ "## 1. Data Preparation\n", "\n", "We fetch the **Hillstrom** dataset from OpenML. The original experiment has three segments (Men's e-mail, Women's e-mail, No e-mail). We collapse the two e-mail groups into a single treatment indicator." 
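] }, { "cell_type": "markdown", "id": "2a", "metadata": {}, "source": [ "The collapse of the three arms into one binary flag can be sketched on a toy frame first (synthetic rows; the segment spellings below are assumed to match the dataset's):" ] }, { "cell_type": "code", "execution_count": null, "id": "2b", "metadata": {}, "outputs": [], "source": [ "toy = pd.DataFrame(\n", "    {\"segment\": [\"Mens E-Mail\", \"No E-Mail\", \"Womens E-Mail\", \"No E-Mail\"]}\n", ")\n", "\n", "# Any e-mail arm -> 1, control -> 0\n", "toy[\"treatment\"] = (toy[\"segment\"] != \"No E-Mail\").astype(int)\n", "print(toy[\"treatment\"].tolist())  # [1, 0, 1, 0]" ] }, { "cell_type": "markdown", "id": "2c", "metadata": {}, "source": [ "The next cell applies the same one-liner to the full frame."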
] }, { "cell_type": "code", "execution_count": null, "id": "3", "metadata": {}, "outputs": [], "source": [ "dataset = fetch_openml(data_id=41473, as_frame=True, parser=\"auto\")\n", "df = dataset.frame\n", "\n", "# Binary treatment: any e-mail vs. control\n", "df[\"treatment\"] = (df[\"segment\"] != \"No E-Mail\").astype(int)\n", "\n", "# We use \"conversion\" (purchase) as the outcome\n", "y = df[\"conversion\"].astype(int).values\n", "w = df[\"treatment\"].values\n", "\n", "features = [\n", " \"recency\",\n", " \"history_segment\",\n", " \"history\",\n", " \"mens\",\n", " \"womens\",\n", " \"zip_code\",\n", " \"newbie\",\n", " \"channel\",\n", "]\n", "X = df[features].copy()\n", "\n", "# Mark categoricals so Perpetual handles them natively\n", "for col in [\"history_segment\", \"zip_code\", \"channel\"]:\n", " X[col] = X[col].astype(\"category\")\n", "\n", "print(f\"Samples: {len(df):,} | Treatment rate: {w.mean():.2%}\")\n", "print(f\"Outcome (purchase) rate: {y.mean():.2%}\")\n", "df.head()" ] }, { "cell_type": "code", "execution_count": null, "id": "4", "metadata": {}, "outputs": [], "source": [ "X_train, X_test, w_train, w_test, y_train, y_test = train_test_split(\n", " X, w, y, test_size=0.3, random_state=42, stratify=w\n", ")\n", "print(f\"Train: {len(X_train):,} | Test: {len(X_test):,}\")" ] }, { "cell_type": "markdown", "id": "5", "metadata": {}, "source": [ "### 1.1 Exploratory: Average Treatment Effect (ATE)\n", "\n", "Before looking for heterogeneity, let's confirm an overall treatment effect exists." 
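] }, { "cell_type": "markdown", "id": "5a", "metadata": {}, "source": [ "A point estimate alone says little, so it is worth attaching a normal-approximation standard error to the difference in means. The sketch below runs on synthetic arrays so it is self-contained; in this notebook `y_train` and `w_train` would take their place:" ] }, { "cell_type": "code", "execution_count": null, "id": "5b", "metadata": {}, "outputs": [], "source": [ "rng = np.random.default_rng(0)\n", "w_demo = rng.integers(0, 2, size=10_000)\n", "y_demo = rng.binomial(1, 0.10 + 0.02 * w_demo)  # true uplift of 2 points\n", "\n", "y_t, y_c = y_demo[w_demo == 1], y_demo[w_demo == 0]\n", "ate_demo = y_t.mean() - y_c.mean()\n", "# Standard error of a difference of two independent sample means\n", "se = np.sqrt(y_t.var(ddof=1) / len(y_t) + y_c.var(ddof=1) / len(y_c))\n", "print(f\"ATE = {ate_demo:.4f} +/- {1.96 * se:.4f} (95% CI)\")" ] }, { "cell_type": "markdown", "id": "5c", "metadata": {}, "source": [ "If the interval excludes zero, there is an average effect worth decomposing into subgroups."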
] }, { "cell_type": "code", "execution_count": null, "id": "6", "metadata": {}, "outputs": [], "source": [ "ate = y_train[w_train == 1].mean() - y_train[w_train == 0].mean()\n", "print(f\"Naive ATE (difference in means): {ate:.4f}\")\n", "print(f\" Treated purchase rate: {y_train[w_train == 1].mean():.4f}\")\n", "print(f\" Control purchase rate: {y_train[w_train == 0].mean():.4f}\")" ] }, { "cell_type": "markdown", "id": "7", "metadata": {}, "source": [ "## 2. CATE Estimation with Five Meta-Learners\n", "\n", "We fit all five available estimators and collect their CATE predictions on held-out data." ] }, { "cell_type": "code", "execution_count": null, "id": "8", "metadata": {}, "outputs": [], "source": [ "learners = {\n", " \"S-Learner\": SLearner(budget=0.2),\n", " \"T-Learner\": TLearner(budget=0.2),\n", " \"X-Learner\": XLearner(budget=0.2),\n", " \"DR-Learner\": DRLearner(budget=0.2, clip=0.01),\n", "}\n", "\n", "cate_preds = {}\n", "for name, learner in learners.items():\n", " learner.fit(X_train, w_train, y_train)\n", " cate_preds[name] = learner.predict(X_test)\n", " print(f\"{name:12s} avg CATE = {cate_preds[name].mean():+.5f}\")\n", "\n", "# R-Learner via UpliftBooster\n", "ub = UpliftBooster(outcome_budget=0.1, propensity_budget=0.01, effect_budget=0.1)\n", "ub.fit(X_train, w_train, y_train)\n", "cate_preds[\"R-Learner\"] = ub.predict(X_test)\n", "print(f\"{'R-Learner':12s} avg CATE = {cate_preds['R-Learner'].mean():+.5f}\")" ] }, { "cell_type": "markdown", "id": "9", "metadata": {}, "source": [ "## 3. Feature Importance — What Drives Heterogeneity?\n", "\n", "After fitting, meta-learners expose `feature_importances_`, which shows which covariates explain the most variation in the treatment effect." 
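] }, { "cell_type": "markdown", "id": "9a", "metadata": {}, "source": [ "A cheap complementary probe is to correlate the predicted CATE scores with each numeric covariate. The sketch below uses synthetic stand-ins (`X_demo`, `tau_demo`) so it runs on its own; in this notebook `X_test` and a learner's predictions would play those roles:" ] }, { "cell_type": "code", "execution_count": null, "id": "9b", "metadata": {}, "outputs": [], "source": [ "rng = np.random.default_rng(1)\n", "X_demo = pd.DataFrame(\n", "    {\n", "        \"recency\": rng.integers(1, 13, size=500),\n", "        \"history\": rng.gamma(2.0, 100.0, size=500),\n", "    }\n", ")\n", "# Synthetic CATE that varies with recency only\n", "tau_demo = 0.03 - 0.002 * X_demo[\"recency\"] + rng.normal(0, 0.005, size=500)\n", "\n", "corr = X_demo.corrwith(tau_demo).abs().sort_values(ascending=False)\n", "print(corr)  # recency should rank first" ] }, { "cell_type": "markdown", "id": "9c", "metadata": {}, "source": [ "A large absolute correlation flags a driver of heterogeneity, though it misses purely non-linear or interaction effects, which tree-based importances capture better."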
] }, { "cell_type": "code", "execution_count": null, "id": "10", "metadata": {}, "outputs": [], "source": [ "fig, axes = plt.subplots(2, 2, figsize=(12, 8), sharey=False)\n", "\n", "for ax, (name, learner) in zip(axes.flat, learners.items()):\n", " importances = learner.feature_importances_\n", " if importances is not None:\n", " idx = np.argsort(importances)\n", " ax.barh(np.array(features)[idx], importances[idx])\n", " ax.set_title(f\"{name} Feature Importances\")\n", " else:\n", " ax.text(0.5, 0.5, \"Not available\", ha=\"center\")\n", "\n", "plt.tight_layout()\n", "plt.show()" ] }, { "cell_type": "markdown", "id": "11", "metadata": {}, "source": [ "## 4. Subgroup Analysis\n", "\n", "We partition the test set into quintiles of predicted CATE (using the DR-Learner) and compare the **observed uplift** (difference in means) within each group." ] }, { "cell_type": "code", "execution_count": null, "id": "12", "metadata": {}, "outputs": [], "source": [ "# Use DR-Learner CATE scores\n", "tau_hat = cate_preds[\"DR-Learner\"]\n", "\n", "# Quintile bins\n", "quantiles = np.quantile(tau_hat, [0.2, 0.4, 0.6, 0.8])\n", "bins = np.digitize(tau_hat, quantiles)\n", "\n", "rows = []\n", "for q in range(5):\n", " mask = bins == q\n", " n_q = mask.sum()\n", " y_t = y_test[mask & (w_test == 1)]\n", " y_c = y_test[mask & (w_test == 0)]\n", " obs_uplift = y_t.mean() - y_c.mean() if len(y_t) > 0 and len(y_c) > 0 else np.nan\n", " rows.append(\n", " {\n", " \"Quintile\": q + 1,\n", " \"n\": n_q,\n", " \"Avg Predicted CATE\": tau_hat[mask].mean(),\n", " \"Observed Uplift\": obs_uplift,\n", " }\n", " )\n", "\n", "subgroup_df = pd.DataFrame(rows)\n", "print(subgroup_df.to_string(index=False))" ] }, { "cell_type": "code", "execution_count": null, "id": "13", "metadata": {}, "outputs": [], "source": [ "fig, ax = plt.subplots(figsize=(8, 4))\n", "x = subgroup_df[\"Quintile\"]\n", "width = 0.35\n", "ax.bar(x - width / 2, subgroup_df[\"Avg Predicted CATE\"], width, label=\"Predicted 
CATE\")\n", "ax.bar(x + width / 2, subgroup_df[\"Observed Uplift\"], width, label=\"Observed Uplift\")\n", "ax.set_xlabel(\"CATE Quintile\")\n", "ax.set_ylabel(\"Effect\")\n", "ax.set_title(\"Predicted vs. Observed Uplift by Quintile\")\n", "ax.legend()\n", "ax.axhline(0, color=\"grey\", linewidth=0.5)\n", "plt.tight_layout()\n", "plt.show()" ] }, { "cell_type": "markdown", "id": "14", "metadata": {}, "source": [ "## 5. Model Selection with AUUC and Qini\n", "\n", "When ground-truth CATE is unavailable, **AUUC** (Area Under the Uplift Curve) and the **Qini Coefficient** are the standard metrics for ranking meta-learner performance." ] }, { "cell_type": "code", "execution_count": null, "id": "15", "metadata": {}, "outputs": [], "source": [ "# Uplift curves\n", "plt.figure(figsize=(10, 5))\n", "for name, scores in cate_preds.items():\n", " fracs, gains = cumulative_gain_curve(y_test, w_test, scores)\n", " plt.plot(fracs, gains, label=name)\n", "\n", "plt.plot([0, 1], [0, 0], \"k--\", label=\"Random\")\n", "plt.title(\"Cumulative Uplift Gain — Model Comparison\")\n", "plt.xlabel(\"Fraction of Population (sorted by predicted CATE)\")\n", "plt.ylabel(\"Cumulative Gain\")\n", "plt.legend()\n", "plt.tight_layout()\n", "plt.show()" ] }, { "cell_type": "code", "execution_count": null, "id": "16", "metadata": {}, "outputs": [], "source": [ "# Summary table\n", "rows = []\n", "for name, scores in cate_preds.items():\n", " a = auuc(y_test, w_test, scores, normalize=True)\n", " q = qini_coefficient(y_test, w_test, scores)\n", " rows.append({\"Learner\": name, \"AUUC (norm)\": f\"{a:+.4f}\", \"Qini\": f\"{q:+.4f}\"})\n", "\n", "results = pd.DataFrame(rows)\n", "print(results.to_string(index=False))" ] }, { "cell_type": "markdown", "id": "17", "metadata": {}, "source": [ "### 5.1 Qini Curves\n", "\n", "The Qini curve generalises the uplift curve: control successes are rescaled by the ratio of treated to control counts at each cutoff, so arms of unequal size remain directly comparable."
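] }, { "cell_type": "markdown", "id": "17a", "metadata": {}, "source": [ "The reweighting can be made concrete by computing one Qini value by hand at a single cutoff. This follows the textbook definition on synthetic data and is not necessarily the exact convention `qini_curve` implements:" ] }, { "cell_type": "code", "execution_count": null, "id": "17b", "metadata": {}, "outputs": [], "source": [ "rng = np.random.default_rng(2)\n", "n = 1_000\n", "w_demo = rng.integers(0, 2, size=n)\n", "score = rng.normal(size=n)  # stands in for predicted CATE\n", "y_demo = rng.binomial(1, 0.10 + 0.05 * w_demo * (score > 0))\n", "\n", "# Rank by score and keep the top 30% of the population\n", "top = np.argsort(-score)[: int(0.3 * n)]\n", "n_t = int((w_demo[top] == 1).sum())\n", "n_c = int((w_demo[top] == 0).sum())\n", "succ_t = int(y_demo[top][w_demo[top] == 1].sum())\n", "succ_c = int(y_demo[top][w_demo[top] == 0].sum())\n", "\n", "# Scale control successes by n_t / n_c so unequal arms compare fairly\n", "qini_at_30 = succ_t - succ_c * n_t / n_c\n", "print(f\"top 30%: {n_t} treated, {n_c} control, Qini value = {qini_at_30:.1f}\")" ] }, { "cell_type": "markdown", "id": "17c", "metadata": {}, "source": [ "Sweeping the cutoff from 0% to 100% traces out the curves plotted below."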
] }, { "cell_type": "code", "execution_count": null, "id": "18", "metadata": {}, "outputs": [], "source": [ "plt.figure(figsize=(10, 5))\n", "for name, scores in cate_preds.items():\n", " fracs, qvals = qini_curve(y_test, w_test, scores)\n", " plt.plot(fracs, qvals, label=name)\n", "\n", "plt.plot([0, 1], [0, 0], \"k--\", label=\"Random\")\n", "plt.title(\"Qini Curves — Model Comparison\")\n", "plt.xlabel(\"Fraction of Population\")\n", "plt.ylabel(\"Qini Value\")\n", "plt.legend()\n", "plt.tight_layout()\n", "plt.show()" ] }, { "cell_type": "markdown", "id": "19", "metadata": {}, "source": [ "## Key Takeaways\n", "\n", "| Concept | Insight |\n", "|---|---|\n", "| **S-Learner** | Simplest; can underfit heterogeneity because treatment is just one feature. |\n", "| **T-Learner** | Fits a separate model per arm; estimates become noisy when either arm is small. |\n", "| **X-Learner** | Uses cross-imputation; good when treatment groups are unbalanced. |\n", "| **DR-Learner** | Doubly robust: consistent if either the outcome model or the propensity model is correctly specified. |\n", "| **R-Learner** | Directly optimises CATE via residual-on-residual regression; works well in RCTs, where the propensity is known. |\n", "| **AUUC / Qini** | Use these to compare learners when ground-truth CATE is not observed. |\n", "| **Subgroup analysis** | Validate that predicted heterogeneity aligns with observed uplift within quintiles. |" ] } ], "metadata": { "language_info": { "name": "python" } }, "nbformat": 4, "nbformat_minor": 5 }