{ "cells": [ { "cell_type": "markdown", "id": "0", "metadata": {}, "source": [ "# Double Machine Learning: Estimating the Gender Wage Gap\n", "\n", "Double/Debiased Machine Learning (DML) is a modern causal inference method\n", "introduced by Chernozhukov et al. (2018) for estimating treatment effects in\n", "the presence of high-dimensional confounders.\n", "\n", "The **partial-linear model** is:\n", "\n", "$$Y = \\theta(X) \\cdot W + g(X) + \\epsilon$$\n", "\n", "where $\\theta(X)$ is the heterogeneous treatment effect we want to learn.\n", "\n", "In this tutorial we use the **CPS 1985** wages dataset to estimate the\n", "causal effect of gender on wages, controlling for education, experience,\n", "and other confounders.\n", "\n", "Perpetual's `DMLEstimator` handles cross-fitting automatically and uses a\n", "custom DML objective (mirroring the Rust `DMLObjective`) for the final\n", "effect model." ] }, { "cell_type": "code", "execution_count": null, "id": "1", "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "import pandas as pd\n", "from perpetual.dml import DMLEstimator\n", "from sklearn.datasets import fetch_openml\n", "from sklearn.model_selection import train_test_split" ] }, { "cell_type": "markdown", "id": "2", "metadata": {}, "source": [ "## 1. Load the CPS 1985 Wages Dataset\n", "\n", "The Current Population Survey (CPS) 1985 dataset contains information\n", "about workers' wages, education, experience, and demographics." ] }, { "cell_type": "code", "execution_count": null, "id": "3", "metadata": {}, "outputs": [], "source": [ "print(\"Fetching CPS 1985 Wages dataset...\")\n", "data = fetch_openml(data_id=534, as_frame=True, parser=\"auto\")\n", "df = data.frame\n", "print(f\"Shape: {df.shape}\")\n", "df.head()" ] }, { "cell_type": "markdown", "id": "4", "metadata": {}, "source": [ "## 2. Prepare Features\n", "\n", "We encode the treatment (gender) as binary and prepare covariates." ] }, { "cell_type": "code", "execution_count": null, "id": "5", "metadata": {}, "outputs": [], "source": [ "# Encode categorical variables\n", "df_encoded = pd.get_dummies(\n", " df,\n", " columns=[\"sex\", \"marr\", \"union\", \"race\", \"south\", \"smsa\", \"sector\"],\n", " drop_first=True,\n", " dtype=float,\n", ")\n", "\n", "# Treatment: being female (1 = female, 0 = male)\n", "w = (\n", " df_encoded[\"sex_female\"].values\n", " if \"sex_female\" in df_encoded.columns\n", " else (\n", " 1.0 - df_encoded[\"sex_male\"].values\n", " if \"sex_male\" in df_encoded.columns\n", " else df[\"sex\"].map({\"female\": 1, \"male\": 0}).values\n", " )\n", ")\n", "\n", "# Outcome: log wage (log transform for normality)\n", "y = np.log1p(df_encoded[\"wage\"].values)\n", "\n", "# Covariates: everything except wage and the treatment column\n", "drop_cols = [c for c in df_encoded.columns if c in [\"wage\", \"sex_female\", \"sex_male\"]]\n", "X = df_encoded.drop(columns=drop_cols).values.astype(float)\n", "feature_names = [c for c in df_encoded.columns if c not in drop_cols]\n", "\n", "print(f\"X shape: {X.shape}, Treatment mean: {w.mean():.2f}\")" ] }, { "cell_type": "markdown", "id": "6", "metadata": {}, "source": [ "## 3. 
Train/Test Split" ] }, { "cell_type": "code", "execution_count": null, "id": "7", "metadata": {}, "outputs": [], "source": [ "X_train, X_test, w_train, w_test, y_train, y_test = train_test_split(\n", " X, w, y, test_size=0.3, random_state=42\n", ")\n", "print(f\"Train: {X_train.shape[0]}, Test: {X_test.shape[0]}\")" ] }, { "cell_type": "markdown", "id": "8", "metadata": {}, "source": [ "## 4. Fit the DML Estimator\n", "\n", "The `DMLEstimator` performs cross-fitting internally:\n", "1. Fits an **outcome nuisance** model $g(X) \\approx E[Y|X]$ on each fold.\n", "2. Fits a **treatment nuisance** model $m(X) \\approx E[W|X]$ on each fold.\n", "3. Computes orthogonalized residuals $\\tilde{Y}$ and $\\tilde{W}$.\n", "4. Fits the **effect model** using a DML-specific custom objective." ] }, { "cell_type": "code", "execution_count": null, "id": "9", "metadata": {}, "outputs": [], "source": [ "dml = DMLEstimator(budget=0.5, n_folds=3)\n", "dml.fit(X_train, w_train, y_train)\n", "print(\"DML model fitted.\")" ] }, { "cell_type": "markdown", "id": "10", "metadata": {}, "source": [ "## 5. Estimate Heterogeneous Treatment Effects\n", "\n", "The predicted CATE represents how much the wage (in log scale) changes\n", "due to being female, for each individual." ] }, { "cell_type": "code", "execution_count": null, "id": "11", "metadata": {}, "outputs": [], "source": [ "cate_test = dml.predict(X_test)\n", "\n", "print(f\"Average Treatment Effect (ATE): {cate_test.mean():.4f}\")\n", "print(f\" (in wage terms: {np.expm1(cate_test.mean()):.2%} change)\")\n", "print(f\"Median CATE: {np.median(cate_test):.4f}\")\n", "print(f\"Std of CATE: {cate_test.std():.4f}\")\n", "print(f\"Range: [{cate_test.min():.4f}, {cate_test.max():.4f}]\")" ] }, { "cell_type": "markdown", "id": "12", "metadata": {}, "source": [ "## 6. Feature Importance\n", "\n", "Which features drive heterogeneity in the gender wage gap?" ] }, { "cell_type": "code", "execution_count": null, "id": "13", "metadata": {}, "outputs": [], "source": [ "importances = dml.feature_importances_\n", "top_k = 10\n", "top_idx = np.argsort(importances)[::-1][:top_k]\n", "\n", "print(f\"\\nTop {top_k} features driving CATE heterogeneity:\")\n", "for rank, idx in enumerate(top_idx, 1):\n", " print(f\" {rank}. {feature_names[idx]:25s} importance={importances[idx]:.4f}\")" ] }, { "cell_type": "markdown", "id": "14", "metadata": {}, "source": [ "## 7. Compare with Naive Estimate\n", "\n", "A naive comparison of means ignores confounders. DML accounts for\n", "differences in education, experience, sector, etc." ] }, { "cell_type": "code", "execution_count": null, "id": "15", "metadata": {}, "outputs": [], "source": [ "naive_ate = y_test[w_test == 1].mean() - y_test[w_test == 0].mean()\n", "dml_ate = cate_test.mean()\n", "\n", "print(f\"Naive ATE (difference in means): {naive_ate:.4f}\")\n", "print(f\"DML ATE (cross-fitted): {dml_ate:.4f}\")\n", "print(\"\\nThe DML estimate accounts for confounders like education and experience.\")" ] }, { "cell_type": "markdown", "id": "16", "metadata": {}, "source": [ "## 8. Subgroup Analysis\n", "\n", "Examine how the treatment effect varies across subgroups." 
{ "cell_type": "code", "execution_count": null, "id": "17", "metadata": {}, "outputs": [], "source": [ "# Split the test set at the median CATE\n", "median_cate = np.median(cate_test)\n", "high_effect = cate_test >= median_cate\n", "low_effect = cate_test < median_cate\n", "\n", "# Note: if the CATE is negative (a wage penalty for women), a more negative\n", "# value corresponds to a larger estimated gap.\n", "print(\n", " f\"Above-median CATE subgroup: mean CATE = {cate_test[high_effect].mean():.4f}\"\n", ")\n", "print(f\"Below-median CATE subgroup: mean CATE = {cate_test[low_effect].mean():.4f}\")" ] },
{ "cell_type": "markdown", "id": "18", "metadata": {}, "source": [ "## Summary\n", "\n", "In this tutorial we:\n", "\n", "1. Used **DMLEstimator** with real-world CPS wage data.\n", "2. Estimated the **heterogeneous causal effect** of gender on wages.\n", "3. Leveraged cross-fitting to avoid overfitting the nuisance models.\n", "4. Identified which features drive **variation** in the wage gap.\n", "5. Compared the DML estimate with a naive difference-in-means.\n", "\n", "### References\n", "\n", "- Chernozhukov, V. et al. (2018). *Double/Debiased Machine Learning for\n", " Treatment and Structural Parameters*. The Econometrics Journal, 21(1), C1–C68." ] } ], "metadata": { "language_info": { "name": "python" } }, "nbformat": 4, "nbformat_minor": 5 }