{ "cells": [ { "cell_type": "markdown", "id": "0", "metadata": {}, "source": [ "# Double Machine Learning: Estimating the Gender Wage Gap\n", "\n", "Double/Debiased Machine Learning (DML) is a modern causal inference method\n", "introduced by Chernozhukov et al. (2018) for estimating treatment effects in\n", "the presence of high-dimensional confounders.\n", "\n", "The **partial-linear model** is:\n", "\n", "$$Y = \\theta(X) \\cdot W + g(X) + \\epsilon$$\n", "\n", "where $\\theta(X)$ is the heterogeneous treatment effect we want to learn.\n", "\n", "In this tutorial we use the **CPS 1985** wages dataset to estimate the\n", "causal effect of gender on wages, controlling for education, experience,\n", "and other confounders.\n", "\n", "Perpetual's `DMLEstimator` handles cross-fitting automatically and uses a\n", "custom DML objective (mirroring the Rust `DMLObjective`) for the final\n", "effect model." ] }, { "cell_type": "code", "execution_count": null, "id": "1", "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "import pandas as pd\n", "from perpetual.dml import DMLEstimator\n", "from sklearn.datasets import fetch_openml\n", "from sklearn.model_selection import train_test_split" ] }, { "cell_type": "markdown", "id": "2", "metadata": {}, "source": [ "## 1. Load the CPS 1985 Wages Dataset\n", "\n", "The Current Population Survey (CPS) 1985 dataset contains information\n", "about workers' wages, education, experience, and demographics." ] }, { "cell_type": "code", "execution_count": null, "id": "3", "metadata": {}, "outputs": [], "source": [ "print(\"Fetching CPS 1985 Wages dataset...\")\n", "data = fetch_openml(data_id=534, as_frame=True, parser=\"auto\")\n", "df = data.frame\n", "print(f\"Shape: {df.shape}\")\n", "df.head()" ] }, { "cell_type": "markdown", "id": "4", "metadata": {}, "source": [ "## 2. Prepare Features\n", "\n", "We encode the treatment (gender) as binary and prepare covariates." ] }, { "cell_type": "code", "execution_count": null, "id": "5", "metadata": {}, "outputs": [], "source": [ "# Encode categorical variables\n", "df_encoded = pd.get_dummies(\n", " df,\n", " columns=[\"sex\", \"marr\", \"union\", \"race\", \"south\", \"smsa\", \"sector\"],\n", " drop_first=True,\n", " dtype=float,\n", ")\n", "\n", "# Treatment: being female (1 = female, 0 = male)\n", "w = (\n", " df_encoded[\"sex_female\"].values\n", " if \"sex_female\" in df_encoded.columns\n", " else (\n", " 1.0 - df_encoded[\"sex_male\"].values\n", " if \"sex_male\" in df_encoded.columns\n", " else df[\"sex\"].map({\"female\": 1, \"male\": 0}).values\n", " )\n", ")\n", "\n", "# Outcome: log wage (log transform for normality)\n", "y = np.log1p(df_encoded[\"wage\"].values)\n", "\n", "# Covariates: everything except wage and the treatment column\n", "drop_cols = [c for c in df_encoded.columns if c in [\"wage\", \"sex_female\", \"sex_male\"]]\n", "X = df_encoded.drop(columns=drop_cols).values.astype(float)\n", "feature_names = [c for c in df_encoded.columns if c not in drop_cols]\n", "\n", "print(f\"X shape: {X.shape}, Treatment mean: {w.mean():.2f}\")" ] }, { "cell_type": "markdown", "id": "6", "metadata": {}, "source": [ "## 3. 
Train/Test Split" ] }, { "cell_type": "code", "execution_count": null, "id": "7", "metadata": {}, "outputs": [], "source": [ "X_train, X_test, w_train, w_test, y_train, y_test = train_test_split(\n", " X, w, y, test_size=0.3, random_state=42\n", ")\n", "print(f\"Train: {X_train.shape[0]}, Test: {X_test.shape[0]}\")" ] }, { "cell_type": "markdown", "id": "8", "metadata": {}, "source": [ "## 4. Fit the DML Estimator\n", "\n", "The `DMLEstimator` performs cross-fitting internally:\n", "1. Fits an **outcome nuisance** model $g(X) \\approx E[Y|X]$ on each fold.\n", "2. Fits a **treatment nuisance** model $m(X) \\approx E[W|X]$ on each fold.\n", "3. Computes orthogonalized residuals $\\tilde{Y}$ and $\\tilde{W}$.\n", "4. Fits the **effect model** using a DML-specific custom objective." ] }, { "cell_type": "code", "execution_count": null, "id": "9", "metadata": {}, "outputs": [], "source": [ "dml = DMLEstimator(budget=0.5, n_folds=3)\n", "dml.fit(X_train, w_train, y_train)\n", "print(\"DML model fitted.\")" ] }, { "cell_type": "markdown", "id": "10", "metadata": {}, "source": [ "## 5. Estimate Heterogeneous Treatment Effects\n", "\n", "The predicted CATE represents how much the wage (in log scale) changes\n", "due to being female, for each individual." ] }, { "cell_type": "code", "execution_count": null, "id": "11", "metadata": {}, "outputs": [], "source": [ "cate_test = dml.predict(X_test)\n", "\n", "print(f\"Average Treatment Effect (ATE): {cate_test.mean():.4f}\")\n", "print(f\" (in wage terms: {np.expm1(cate_test.mean()):.2%} change)\")\n", "print(f\"Median CATE: {np.median(cate_test):.4f}\")\n", "print(f\"Std of CATE: {cate_test.std():.4f}\")\n", "print(f\"Range: [{cate_test.min():.4f}, {cate_test.max():.4f}]\")" ] }, { "cell_type": "markdown", "id": "12", "metadata": {}, "source": [ "## 6. Feature Importance\n", "\n", "Which features drive heterogeneity in the gender wage gap?" ] }, { "cell_type": "code", "execution_count": null, "id": "13", "metadata": {}, "outputs": [], "source": [ "importances = dml.feature_importances_\n", "top_k = 10\n", "top_idx = np.argsort(importances)[::-1][:top_k]\n", "\n", "print(f\"\\nTop {top_k} features driving CATE heterogeneity:\")\n", "for rank, idx in enumerate(top_idx, 1):\n", " print(f\" {rank}. {feature_names[idx]:25s} importance={importances[idx]:.4f}\")" ] }, { "cell_type": "markdown", "id": "14", "metadata": {}, "source": [ "## 7. Compare with Naive Estimate\n", "\n", "A naive comparison of means ignores confounders. DML accounts for\n", "differences in education, experience, sector, etc." ] }, { "cell_type": "code", "execution_count": null, "id": "15", "metadata": {}, "outputs": [], "source": [ "naive_ate = y_test[w_test == 1].mean() - y_test[w_test == 0].mean()\n", "dml_ate = cate_test.mean()\n", "\n", "print(f\"Naive ATE (difference in means): {naive_ate:.4f}\")\n", "print(f\"DML ATE (cross-fitted): {dml_ate:.4f}\")\n", "print(\"\\nThe DML estimate accounts for confounders like education and experience.\")" ] }, { "cell_type": "markdown", "id": "16", "metadata": {}, "source": [ "## 8. Subgroup Analysis\n", "\n", "Examine how the treatment effect varies across subgroups." 
{ "cell_type": "code", "execution_count": null, "id": "17", "metadata": {}, "outputs": [], "source": [ "# Split the test set at the median CATE\n", "median_cate = np.median(cate_test)\n", "high_effect = cate_test >= median_cate\n", "low_effect = cate_test < median_cate\n", "\n", "# Note: if the CATE is negative (a wage penalty for women), a more negative\n", "# value corresponds to a larger estimated gap.\n", "print(\n", " f\"Above-median CATE subgroup: mean CATE = {cate_test[high_effect].mean():.4f}\"\n", ")\n", "print(f\"Below-median CATE subgroup: mean CATE = {cate_test[low_effect].mean():.4f}\")" ] },
{ "cell_type": "markdown", "id": "18", "metadata": {}, "source": [ "## Summary\n", "\n", "In this tutorial we:\n", "\n", "1. Used **DMLEstimator** with real-world CPS wage data.\n", "2. Estimated the **heterogeneous causal effect** of gender on wages.\n", "3. Leveraged cross-fitting to avoid overfitting the nuisance models.\n", "4. Identified which features drive **variation** in the wage gap.\n", "5. Compared the DML estimate with a naive difference-in-means.\n", "\n", "### References\n", "\n", "- Chernozhukov, V. et al. (2018). *Double/Debiased Machine Learning for\n", " Treatment and Structural Parameters*. The Econometrics Journal, 21(1), C1–C68." ] } ], "metadata": { "language_info": { "name": "python" } }, "nbformat": 4, "nbformat_minor": 5 }