{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Uplift Modeling and Causal Inference\n", "\n", "Uplift modeling estimates the Conditional Average Treatment Effect (CATE): the incremental impact of an action (the \"treatment\") on an individual's outcome, given that individual's features.\n", "\n", "In this tutorial, we will use the **Hillstrom (MineThatData)** dataset, a standard benchmark in marketing analytics, to demonstrate how to use Perpetual's causal inference tools:\n", "* `UpliftBooster` (R-Learner)\n", "* `SLearner`\n", "* `TLearner`\n", "* `XLearner`\n", "* `DRLearner` (Doubly Robust)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import matplotlib.pyplot as plt\n", "from perpetual.causal_metrics import auuc, cumulative_gain_curve, qini_coefficient\n", "from perpetual.meta_learners import DRLearner, SLearner, TLearner, XLearner\n", "from perpetual.uplift import UpliftBooster\n", "from sklearn.datasets import fetch_openml\n", "from sklearn.model_selection import train_test_split" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 1. Load the Dataset\n", "\n", "The Hillstrom dataset contains 64,000 customers who were randomly assigned to one of three groups:\n", "1. E-mail for Men's merchandise.\n", "2. E-mail for Women's merchandise.\n", "3. No e-mail (control group).\n", "\n", "We will simplify this to a binary case: any e-mail vs. no e-mail."
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(\"Fetching dataset...\")\n", "dataset = fetch_openml(data_id=41473, as_frame=True, parser=\"auto\")\n", "df = dataset.frame\n", "df.head()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Preprocessing\n", "# Binary treatment indicator: 1 for any e-mail segment, 0 for the control group.\n", "# Compare case-insensitively so the control label matches regardless of capitalization.\n", "df[\"treatment\"] = (df[\"segment\"].astype(str).str.lower() != \"no e-mail\").astype(int)\n", "\n", "# Target variable: `visit` (binary); a binary purchase outcome could be used instead\n", "y = df[\"visit\"].astype(int)\n", "w = df[\"treatment\"].astype(int)\n", "\n", "# Select features\n", "features = [\n", "    \"recency\",\n", "    \"history_segment\",\n", "    \"history\",\n", "    \"mens\",\n", "    \"womens\",\n", "    \"zip_code\",\n", "    \"newbie\",\n", "    \"channel\",\n", "]\n", "X = df[features].copy()\n", "\n", "# Cast string-valued columns to the pandas categorical dtype so Perpetual handles them as categorical features\n", "for col in [\"history_segment\", \"zip_code\", \"channel\"]:\n", "    X[col] = X[col].astype(\"category\")\n", "\n", "X_train, X_test, w_train, w_test, y_train, y_test = train_test_split(\n", "    X, w, y, test_size=0.3, random_state=42\n", ")\n", "print(f\"Training set size: {X_train.shape[0]}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 2. R-Learner (UpliftBooster)\n", "\n", "The `UpliftBooster` uses the **R-Learner** meta-algorithm. It first fits nuisance models for the outcome and the treatment propensity, then fits the treatment effect by minimizing a residual-on-residual loss. Removing the variation explained by the propensity model makes the effect estimate less sensitive to selection bias."
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Initialize and fit UpliftBooster\n", "ub = UpliftBooster(outcome_budget=0.1, propensity_budget=0.01, effect_budget=0.1)\n", "ub.fit(X_train, w_train, y_train)\n", "\n", "# Predicted Treatment Effect\n", "uplift_r = ub.predict(X_test)\n", "print(f\"Average Predicted Uplift (R-Learner): {uplift_r.mean():.4f}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 2.1 Domain Knowledge: Interaction Constraints\n", "\n", "Perpetual allows you to enforce **Interaction Constraints**. This is useful when you know (from domain expertise) that certain features should only interact with each other, or should not interact at all.\n", "\n", "For example, we might want to allow interactions only within a specific set of features." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Enforce that 'recency' and 'history' can interact, but other features cannot interact with them\n", "# Feature indices in 'features' list: 0: recency, 2: history\n", "interaction_constraints = [[0, 2]]\n", "ub_constrained = UpliftBooster(\n", " outcome_budget=0.1,\n", " propensity_budget=0.01,\n", " effect_budget=0.1,\n", " interaction_constraints=interaction_constraints,\n", ")\n", "ub_constrained.fit(X_train, w_train, y_train)\n", "\n", "uplift_constrained = ub_constrained.predict(X_test)\n", "print(f\"Average Uplift (Constrained): {uplift_constrained.mean():.4f}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 3. Comparing with Meta-Learners\n", "\n", "Meta-learners are algorithms that decompose the causal problem into one or more supervised learning problems." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# S-Learner: Single model with treatment as feature\n", "sl = SLearner(budget=0.2)\n", "sl.fit(X_train, w_train, y_train)\n", "uplift_s = sl.predict(X_test)\n", "\n", "# T-Learner: Two models (one per treatment group)\n", "tl = TLearner(budget=0.2)\n", "tl.fit(X_train, w_train, y_train)\n", "uplift_t = tl.predict(X_test)\n", "\n", "# X-Learner: Two-stage learner with imputation\n", "xl = XLearner(budget=0.2)\n", "xl.fit(X_train, w_train, y_train)\n", "uplift_x = xl.predict(X_test)\n", "\n", "# DR-Learner: Doubly Robust / AIPW\n", "dr = DRLearner(budget=0.2, clip=0.01)\n", "dr.fit(X_train, w_train, y_train)\n", "uplift_dr = dr.predict(X_test)\n", "\n", "print(f\"Avg Uplift S: {uplift_s.mean():.4f}\")\n", "print(f\"Avg Uplift T: {uplift_t.mean():.4f}\")\n", "print(f\"Avg Uplift X: {uplift_x.mean():.4f}\")\n", "print(f\"Avg Uplift DR: {uplift_dr.mean():.4f}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 4. Evaluation: Uplift Curve\n", "\n", "Each customer is observed under only one condition (treated or control), so the \"ground truth\" individual effect is never known. We therefore evaluate the models with the Cumulative Gain (Uplift) curve, which sorts customers by predicted uplift and tracks the gain accumulated as more of the population is targeted."
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# --- Uplift Gain Curves ---\n", "plt.figure(figsize=(10, 6))\n", "for label, scores in [\n", " (\"R-Learner\", uplift_r),\n", " (\"X-Learner\", uplift_x),\n", " (\"DR-Learner\", uplift_dr),\n", "]:\n", " fracs, gains = cumulative_gain_curve(y_test, w_test, scores)\n", " plt.plot(fracs, gains, label=label)\n", "\n", "plt.plot([0, 1], [0, 0], \"k--\", label=\"Random\")\n", "plt.title(\"Cumulative Uplift Gain — Hillstrom Dataset\")\n", "plt.xlabel(\"Population % Sorted by Predicted Uplift\")\n", "plt.ylabel(\"Cumulative Gain\")\n", "plt.legend()\n", "plt.show()\n", "\n", "# --- AUUC & Qini ---\n", "for label, scores in [\n", " (\"R-Learner\", uplift_r),\n", " (\"S-Learner\", uplift_s),\n", " (\"T-Learner\", uplift_t),\n", " (\"X-Learner\", uplift_x),\n", " (\"DR-Learner\", uplift_dr),\n", "]:\n", " a = auuc(y_test, w_test, scores, normalize=True)\n", " q = qini_coefficient(y_test, w_test, scores)\n", " print(f\"{label:12s} AUUC={a:+.4f} Qini={q:+.4f}\")" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.11.9" } }, "nbformat": 4, "nbformat_minor": 2 }