{ "cells": [ { "cell_type": "markdown", "id": "0", "metadata": {}, "source": [ "# Customer Retention: Uplift Modeling for Churn Prevention\n", "\n", "Customer churn is a critical problem in subscription businesses (telecom, SaaS, banking). A common intervention is a **retention offer** (discount, loyalty reward, personal call). However, not every customer benefits equally from such an offer:\n", "\n", "- **Persuadables** — would churn without the offer but stay if treated. *Target these.*\n", "- **Sure Things** — will stay regardless. *Waste of budget.*\n", "- **Lost Causes** — will churn regardless. *Waste of budget.*\n", "- **Sleeping Dogs** — will stay without contact but churn if contacted. *Avoid these!*\n", "\n", "Uplift modeling identifies the **persuadables** by estimating the Conditional Average Treatment Effect (CATE) of the retention intervention on each customer.\n", "\n", "This tutorial demonstrates a full **churn uplift** pipeline:\n", "\n", "1. Simulate a retention campaign dataset.\n", "2. Estimate CATE with S-Learner, T-Learner, X-Learner, DR-Learner, and R-Learner.\n", "3. Build a targeting policy and measure incremental revenue.\n", "4. Evaluate with uplift curves, AUUC, and Qini.\n", "\n", "> **Note:** We use the **Bank Marketing** dataset from UCI/OpenML as a realistic customer base and simulate a retention RCT on top of it." ] }, { "cell_type": "code", "execution_count": null, "id": "1", "metadata": {}, "outputs": [], "source": [ "import matplotlib.pyplot as plt\n", "import numpy as np\n", "import pandas as pd\n", "from perpetual.causal_metrics import auuc, cumulative_gain_curve, qini_coefficient\n", "from perpetual.meta_learners import DRLearner, SLearner, TLearner, XLearner\n", "from perpetual.uplift import UpliftBooster\n", "from sklearn.datasets import fetch_openml\n", "from sklearn.model_selection import train_test_split" ] }, { "cell_type": "markdown", "id": "2", "metadata": {}, "source": [ "## 1. Prepare a Churn Retention Dataset\n", "\n", "We use the **Bank Marketing** dataset (OpenML ID 1461) which records whether clients subscribed to a term deposit after a marketing campaign. We re-frame this as a churn-prevention scenario:\n", "\n", "- **Outcome $Y$**: whether the customer was *retained* (subscribed).\n", "- **Treatment $W$**: whether the customer received a targeted retention call (simulated RCT).\n", "\n", "Since the original dataset is observational, we construct a clean RCT by randomly assigning treatment and simulating heterogeneous response." 
] }, { "cell_type": "code", "execution_count": null, "id": "3", "metadata": {}, "outputs": [], "source": [ "data = fetch_openml(data_id=1461, as_frame=True, parser=\"auto\")\n", "df = data.frame\n", "print(f\"Raw samples: {len(df):,}\")\n", "df.head()" ] }, { "cell_type": "code", "execution_count": null, "id": "4", "metadata": {}, "outputs": [], "source": [ "# Select features (drop the original outcome and campaign-related cols)\n", "feature_cols = [\n", " \"age\",\n", " \"job\",\n", " \"marital\",\n", " \"education\",\n", " \"default\",\n", " \"balance\",\n", " \"housing\",\n", " \"loan\",\n", "]\n", "X = df[feature_cols].copy()\n", "\n", "# Mark categoricals\n", "for c in X.select_dtypes(include=[\"object\", \"category\"]).columns:\n", " X[c] = X[c].astype(\"category\")\n", "\n", "n = len(X)\n", "rng = np.random.default_rng(42)\n", "\n", "# --- Simulate an RCT ---\n", "# Random treatment assignment (50/50)\n", "w = rng.binomial(1, 0.5, size=n)\n", "\n", "# Baseline churn probability (higher for young, low-balance customers)\n", "age_norm = (df[\"age\"].astype(float).values - 30) / 30\n", "balance_norm = (df[\"balance\"].astype(float).values - 1000) / 5000\n", "base_logit = -0.5 - 0.3 * age_norm + 0.2 * balance_norm\n", "\n", "# Heterogeneous treatment effect:\n", "# - Young customers (age < 35) respond well to retention offers\n", "# - Customers with housing loans also respond positively\n", "# - High-balance customers don't need the offer (\"sure things\")\n", "has_housing = (df[\"housing\"] == \"yes\").astype(float).values\n", "is_young = (df[\"age\"].astype(float).values < 35).astype(float)\n", "tau = 0.15 * is_young + 0.10 * has_housing - 0.08 * (balance_norm > 0.5)\n", "\n", "# Generate outcome\n", "prob_retain = 1 / (1 + np.exp(-(base_logit + tau * w)))\n", "y = rng.binomial(1, prob_retain)\n", "\n", "print(f\"Treatment rate: {w.mean():.2%}\")\n", "print(f\"Retention rate (treated): {y[w == 1].mean():.2%}\")\n", "print(f\"Retention rate (control): {y[w == 0].mean():.2%}\")\n", "print(f\"True ATE: {tau.mean():.4f}\")" ] }, { "cell_type": "code", "execution_count": null, "id": "5", "metadata": {}, "outputs": [], "source": [ "X_train, X_test, w_train, w_test, y_train, y_test, tau_test = train_test_split(\n", " X, w, y, tau, test_size=0.3, random_state=42\n", ")\n", "print(f\"Train: {len(X_train):,} | Test: {len(X_test):,}\")" ] }, { "cell_type": "markdown", "id": "6", "metadata": {}, "source": [ "## 2. Estimate CATE with Multiple Learners" ] }, { "cell_type": "code", "execution_count": null, "id": "7", "metadata": {}, "outputs": [], "source": [ "learners = {\n", " \"S-Learner\": SLearner(budget=0.3),\n", " \"T-Learner\": TLearner(budget=0.3),\n", " \"X-Learner\": XLearner(budget=0.3),\n", " \"DR-Learner\": DRLearner(budget=0.3, clip=0.01),\n", "}\n", "\n", "cate_preds = {}\n", "for name, learner in learners.items():\n", " learner.fit(X_train, w_train, y_train)\n", " cate_preds[name] = learner.predict(X_test)\n", "\n", "# R-Learner\n", "rl = UpliftBooster(outcome_budget=0.1, propensity_budget=0.01, effect_budget=0.1)\n", "rl.fit(X_train, w_train, y_train)\n", "cate_preds[\"R-Learner\"] = rl.predict(X_test)\n", "\n", "for name, tau_hat in cate_preds.items():\n", " corr = np.corrcoef(tau_test, tau_hat)[0, 1]\n", " rmse = np.sqrt(np.mean((tau_test - tau_hat) ** 2))\n", " print(\n", " f\"{name:12s} avg CATE = {tau_hat.mean():+.4f} \"\n", " f\"corr(true) = {corr:.3f} RMSE = {rmse:.4f}\"\n", " )" ] }, { "cell_type": "markdown", "id": "8", "metadata": {}, "source": [ "## 3. 
{ "cell_type": "markdown", "id": "8", "metadata": {}, "source": [ "## 3. Targeting Policy — Who Should Receive the Offer?\n", "\n", "A **targeting policy** assigns the offer to customers whose predicted uplift exceeds a threshold. With a contact cost $c$ and a value $v$ per retained customer, treating a customer is profitable only when $\\hat{\\tau}(x) \\cdot v > c$, i.e. when $\\hat{\\tau}(x) > c / v$.\n", "\n", "We compare several policies:\n", "\n", "1. **Treat nobody** (control baseline)\n", "2. **Treat everybody** (blanket policy)\n", "3. **Treat the top-k% by predicted CATE** (uplift-based targeting)\n", "4. **Treat if $\\hat{\\tau}(x) > 0$**, plus the cost-aware variant $\\hat{\\tau}(x) > c / v$" ] }, { "cell_type": "code", "execution_count": null, "id": "9", "metadata": {}, "outputs": [], "source": [ "# Use DR-Learner for targeting\n", "tau_hat = cate_preds[\"DR-Learner\"]\n", "\n", "# Offer cost: $10 per contacted customer\n", "# Revenue from a retained customer: $100\n", "OFFER_COST = 10\n", "RETAIN_VALUE = 100\n", "\n", "\n", "def evaluate_policy(treat_mask, tau_true):\n", "    \"\"\"Estimate incremental outcomes under a targeting policy.\n", "\n", "    Because this is a simulation, we score each policy against the known\n", "    true CATE rather than re-estimating uplift from the RCT observations.\n", "    \"\"\"\n", "    n_treated = treat_mask.sum()\n", "    expected_uplift = tau_true[treat_mask].mean() if n_treated > 0 else 0.0\n", "    incremental_retentions = expected_uplift * n_treated\n", "    # Incremental revenue = value of extra retentions minus contact costs\n", "    revenue = incremental_retentions * RETAIN_VALUE - n_treated * OFFER_COST\n", "    return {\n", "        \"n_targeted\": n_treated,\n", "        \"pct_targeted\": n_treated / len(tau_true),\n", "        \"avg_true_cate\": expected_uplift,\n", "        \"est_incremental_revenue\": revenue,\n", "    }\n", "\n", "\n", "policies = {\n", "    \"Treat nobody\": np.zeros(len(X_test), dtype=bool),\n", "    \"Treat all\": np.ones(len(X_test), dtype=bool),\n", "    \"Top 30%\": tau_hat >= np.percentile(tau_hat, 70),\n", "    \"Top 50%\": tau_hat >= np.percentile(tau_hat, 50),\n", "    \"CATE > 0\": tau_hat > 0,\n", "    \"CATE > cost/value\": tau_hat > OFFER_COST / RETAIN_VALUE,\n", "}\n", "\n", "rows = []\n", "for name, mask in policies.items():\n", "    res = evaluate_policy(mask, tau_test)\n", "    rows.append({\"Policy\": name, **res})\n", "\n", "policy_df = pd.DataFrame(rows)\n", "print(policy_df.to_string(index=False, float_format=\"%.3f\"))" ] }, { "cell_type": "code", "execution_count": null, "id": "10", "metadata": {}, "outputs": [], "source": [ "fig, ax = plt.subplots(figsize=(8, 4))\n", "ax.barh(policy_df[\"Policy\"], policy_df[\"est_incremental_revenue\"])\n", "ax.set_xlabel(\"Estimated Incremental Revenue ($)\")\n", "ax.set_title(\"Revenue by Targeting Policy\")\n", "ax.axvline(0, color=\"grey\", linewidth=0.5)\n", "plt.tight_layout()\n", "plt.show()" ] }, { "cell_type": "markdown", "id": "11", "metadata": {}, "source": [ "## 4. Evaluation: Uplift Curves and Metrics\n", "\n", "Uplift models are judged by how well they rank customers from most to least persuadable. The **cumulative gain curve** tracks the incremental outcome accumulated as an increasing fraction of customers is targeted (highest scores first); **AUUC** is the area under that curve, and the **Qini coefficient** summarizes the area between the Qini curve and a random-targeting baseline. A model-free sanity check of the ranking follows below." ] },
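{ "cell_type": "markdown", "id": "20", "metadata": {}, "source": [ "Before reading off the library metrics, a useful model-free check is to bucket the test set by predicted score and compare observed retention between treated and control customers within each bucket; a well-ranked model shows empirical uplift rising from the lowest to the highest bucket. The sketch below bins into quintiles and relies only on `pandas` plus the arrays already defined above." ] }, { "cell_type": "code", "execution_count": null, "id": "21", "metadata": {}, "outputs": [], "source": [ "# Model-free sanity check: empirical uplift by predicted-score quintile\n", "scores = cate_preds[\"DR-Learner\"]\n", "check = pd.DataFrame(\n", "    {\n", "        \"bin\": pd.qcut(scores, 5, labels=False, duplicates=\"drop\"),\n", "        \"y\": y_test,\n", "        \"w\": w_test,\n", "    }\n", ")\n", "\n", "# Mean retention per (bin, treatment) cell, then treated minus control\n", "rates = check.groupby([\"bin\", \"w\"])[\"y\"].mean().unstack()\n", "rates[\"empirical_uplift\"] = rates[1] - rates[0]\n", "print(rates.to_string(float_format=\"%.4f\"))" ] },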
Predicted CATE\")\n", "axes[1].legend()\n", "\n", "plt.tight_layout()\n", "plt.show()" ] }, { "cell_type": "code", "execution_count": null, "id": "13", "metadata": {}, "outputs": [], "source": [ "# Summary metrics\n", "print(\n", " f\"{'Learner':12s} {'AUUC (norm)':>12} {'Qini':>8} {'Corr(true)':>12} {'RMSE':>8}\"\n", ")\n", "print(\"-\" * 60)\n", "for name, tau_hat in cate_preds.items():\n", " a = auuc(y_test, w_test, tau_hat, normalize=True)\n", " q = qini_coefficient(y_test, w_test, tau_hat)\n", " corr = np.corrcoef(tau_test, tau_hat)[0, 1]\n", " rmse = np.sqrt(np.mean((tau_test - tau_hat) ** 2))\n", " print(f\"{name:12s} {a:>+12.4f} {q:>+8.4f} {corr:>12.4f} {rmse:>8.4f}\")" ] }, { "cell_type": "markdown", "id": "14", "metadata": {}, "source": [ "## 5. Deep Dive: Who Are the Persuadables?\n", "\n", "We segment the test set by predicted CATE decile and inspect the demographic profile of the top segment." ] }, { "cell_type": "code", "execution_count": null, "id": "15", "metadata": {}, "outputs": [], "source": [ "tau_hat_dr = cate_preds[\"DR-Learner\"]\n", "decile = pd.qcut(tau_hat_dr, 10, labels=False, duplicates=\"drop\")\n", "\n", "test_df = X_test.copy()\n", "test_df[\"cate_hat\"] = tau_hat_dr\n", "test_df[\"true_cate\"] = tau_test\n", "test_df[\"decile\"] = decile\n", "\n", "agg = test_df.groupby(\"decile\").agg(\n", " avg_cate_hat=(\"cate_hat\", \"mean\"),\n", " avg_true_cate=(\"true_cate\", \"mean\"),\n", " avg_age=(\"age\", lambda x: x.astype(float).mean()),\n", " avg_balance=(\"balance\", lambda x: x.astype(float).mean()),\n", " n=(\"cate_hat\", \"count\"),\n", ")\n", "print(agg.to_string(float_format=\"%.4f\"))" ] }, { "cell_type": "code", "execution_count": null, "id": "16", "metadata": {}, "outputs": [], "source": [ "# Profile the top decile\n", "top_decile = test_df[test_df[\"decile\"] == test_df[\"decile\"].max()]\n", "print(f\"Top Decile Profile (n={len(top_decile)})\")\n", "print(f\" Avg predicted CATE: {top_decile['cate_hat'].mean():.4f}\")\n", "print(f\" Avg true CATE: {top_decile['true_cate'].mean():.4f}\")\n", "print(f\" Avg age: {top_decile['age'].astype(float).mean():.1f}\")\n", "if \"job\" in top_decile.columns:\n", " print(f\" Top jobs: {top_decile['job'].value_counts().head(3).to_dict()}\")\n", "if \"housing\" in top_decile.columns:\n", " print(f\" Housing loan: {(top_decile['housing'] == 'yes').mean():.1%}\")" ] }, { "cell_type": "markdown", "id": "17", "metadata": {}, "source": [ "## Key Takeaways\n", "\n", "| Insight | Details |\n", "|---|---|\n", "| **Not everyone benefits from treatment** | Blanket campaigns waste budget on sure-things and sleeping dogs. |\n", "| **Uplift-based targeting improves ROI** | Targeting the top persuadables yields higher incremental revenue than treating everyone. |\n", "| **Multiple learners, one winner** | Comparing S/T/X/DR/R-Learners on AUUC and Qini helps select the best model for your data. |\n", "| **Subgroup profiling** | Decile analysis reveals the demographic characteristics of persuadable customers. |\n", "| **Business integration** | CATE estimates + cost/revenue parameters → actionable targeting thresholds. |" ] } ], "metadata": { "language_info": { "name": "python" } }, "nbformat": 4, "nbformat_minor": 5 }