{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Instrumental Variables (Boosted IV)\n",
    "\n",
    "Instrumental variables (IV) are a tool in causal inference for estimating causal effects when there is unobserved confounding between the treatment $W$ and the outcome $Y$.\n",
    "\n",
    "In this tutorial, we will use the **Card (1995)** dataset to estimate the causal effect of education on earnings. The problem is that factors such as \"ability\" are unobserved and affect both education and earnings, which confounds a direct comparison. Card proposed using \"proximity to college\" as an **instrument** ($Z$), assuming that it shifts education but affects earnings only through education (the exclusion restriction).\n",
    "\n",
    "Perpetual's `BraidedBooster` implements a boosted 2-Stage Least Squares (2SLS) approach."
   ]
  },
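  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "For reference, classical linear 2SLS can be written as two regressions. This is a textbook sketch rather than a description of `BraidedBooster` internals: the coefficient symbols $\\pi$ and $\\beta$ and the error terms $v$ and $\\varepsilon$ are notation introduced here, and we assume the boosted variant replaces each linear regression with a gradient-boosted model.\n",
    "\n",
    "$$\n",
    "\\begin{aligned}\n",
    "\\text{Stage 1:}\\quad & W = \\pi_0 + \\pi_Z Z + \\pi_X^\\top X + v, \\qquad \\hat{W} = \\hat{\\pi}_0 + \\hat{\\pi}_Z Z + \\hat{\\pi}_X^\\top X \\\\\n",
    "\\text{Stage 2:}\\quad & Y = \\beta_0 + \\beta_W \\hat{W} + \\beta_X^\\top X + \\varepsilon\n",
    "\\end{aligned}\n",
    "$$\n",
    "\n",
    "Because $\\hat{W}$ is built only from $Z$ and $X$, it is uncorrelated with the unobserved confounders as long as the instrument is. Under the IV assumptions and a constant linear effect, $\\beta_W$ can then be read as the causal effect of one additional year of education on log wage."
   ]
  },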
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "from perpetual import PerpetualBooster\n",
    "from perpetual.iv import BraidedBooster\n",
    "from sklearn.datasets import fetch_openml\n",
    "from sklearn.model_selection import train_test_split"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 1. Load the Dataset\n",
    "\n",
    "We fetch the Card 1995 dataset from OpenML."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print(\"Fetching Card 1995 dataset...\")\n",
    "# Data ID for Education and Earnings (Card 1995)\n",
    "data = fetch_openml(data_id=44321, as_frame=True, parser=\"auto\")\n",
    "df = data.frame\n",
    "df.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Preprocessing\n",
    "y = df[\"lwage\"].values  # Outcome: Log wage\n",
    "w = df[\"educ\"].values  # Treatment: Years of education\n",
    "z = df[\"nearc4\"].values.astype(int)  # Instrument: Proximity to 4-year college\n",
    "\n",
    "# Covariates\n",
    "covariates = [\n",
    "    \"exper\",\n",
    "    \"expersq\",\n",
    "    \"black\",\n",
    "    \"south\",\n",
    "    \"smsa\",\n",
    "    \"reg661\",\n",
    "    \"reg662\",\n",
    "    \"reg663\",\n",
    "    \"reg664\",\n",
    "    \"reg665\",\n",
    "    \"reg666\",\n",
    "    \"reg667\",\n",
    "    \"reg668\",\n",
    "    \"smsa66\",\n",
    "]\n",
    "X = df[covariates].copy()\n",
    "\n",
    "X_train, X_test, z_train, z_test, y_train, y_test, w_train, w_test = train_test_split(\n",
    "    X, z, y, w, test_size=0.2, random_state=42\n",
    ")\n",
    "\n",
    "print(f\"Dataset shape: {df.shape}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 2. Naive Model vs. IV Model\n",
    "\n",
    "First, let's see why a naive model might be biased. We'll fit a standard `PerpetualBooster` on $X$ and $W$ directly."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "naive = PerpetualBooster(budget=0.1)\n",
    "# Combine X and W for naive fit\n",
    "X_naive = np.column_stack([X_train, w_train])\n",
    "naive.fit(X_naive, y_train)\n",
    "\n",
    "# Estimate effect: Average change in y if education increases by 1 year\n",
    "X_test_base = np.column_stack([X_test, w_test])\n",
    "X_test_plus = np.column_stack([X_test, w_test + 1])\n",
    "naive_effect = (naive.predict(X_test_plus) - naive.predict(X_test_base)).mean()\n",
    "print(f\"Naive estimated effect of 1 year of education: {naive_effect:.4f}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 3. BraidedBooster (IV)\n",
    "\n",
    "The `BraidedBooster` uses the instrument to find the variation in education that is uncorrelated with the unobserved confounders."
   ]
  },
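  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To build intuition, consider the simplest case of a single binary instrument and no covariates. The IV estimate then reduces to the Wald ratio below. This is the textbook estimator, shown only as a sketch; it is not claimed to be what `BraidedBooster` computes internally.\n",
    "\n",
    "$$\n",
    "\\hat{\\beta}_{\\text{Wald}} = \\frac{\\mathbb{E}[Y \\mid Z = 1] - \\mathbb{E}[Y \\mid Z = 0]}{\\mathbb{E}[W \\mid Z = 1] - \\mathbb{E}[W \\mid Z = 0]}\n",
    "$$\n",
    "\n",
    "The numerator is the difference in mean log wage between people who grew up near a college and those who did not; the denominator is the corresponding difference in mean years of education. If the denominator is close to zero (a weak instrument), the ratio becomes unstable."
   ]
  },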
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Initialize and fit IV model\n",
    "iv_model = BraidedBooster(stage1_budget=0.1, stage2_budget=0.1)\n",
    "\n",
    "# X: covariates, Z: instruments, y: outcome, w: treatment\n",
    "# Z can be a matrix if you have multiple instruments\n",
    "Z_train = z_train.reshape(-1, 1)\n",
    "Z_test = z_test.reshape(-1, 1)\n",
    "\n",
    "iv_model.fit(X_train, Z_train, y_train, w_train)\n",
    "\n",
    "# Predict causal effect\n",
    "# We compare counterfactual predictions at w and w+1\n",
    "y_pred_base = iv_model.predict(X_test, w_counterfactual=w_test)\n",
    "y_pred_plus = iv_model.predict(X_test, w_counterfactual=w_test + 1)\n",
    "causal_effect = (y_pred_plus - y_pred_base).mean()\n",
    "\n",
    "print(f\"IV estimated causal effect of 1 year of education: {causal_effect:.4f}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 3.1 Advanced: Interaction Constraints\n",
    "\n",
    "Just like the base booster, `BraidedBooster` supports interaction constraints. This can be crucial in IV models to prevent the stage 2 model from leveraging spurious interactions between covariates and the predicted treatment.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Example: Allow only 'exper' (0) and 'black' (2) to interact\n",
    "interaction_constraints = [[0, 2]]\n",
    "iv_constrained = BraidedBooster(\n",
    "    stage1_budget=0.1,\n",
    "    stage2_budget=0.1,\n",
    "    interaction_constraints=interaction_constraints,\n",
    ")\n",
    "iv_constrained.fit(X_train, Z_train.reshape(-1, 1), y_train, w_train)\n",
    "iv_effect_constrained = (\n",
    "    iv_constrained.predict(X_test, w_test + 1) - iv_constrained.predict(X_test, w_test)\n",
    ").mean()\n",
    "print(f\"Constrained IV effect: {iv_effect_constrained:.4f}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Interpretation\n",
    "\n",
    "If the IV estimate is significantly different from the naive estimate, it suggests the presence of endogeneity (confounding). In many economic studies, the IV estimate for education is actually higher than the OLS/Naive estimate, suggesting that those who are most affected by the instrument (proximity) might have higher returns to schooling."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.11.9"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}