Instrumental Variables (Boosted IV)

Instrumental variables (IV) are a powerful tool in causal inference for estimating the effect of a treatment \(W\) on an outcome \(Y\) in the presence of unobserved confounding between them.

In this tutorial, we use the Card (1995) dataset to estimate the causal effect of education on earnings. The difficulty is that factors like “ability” are unobserved and affect both education and earnings (confounding). Card proposed using proximity to a 4-year college as an instrument (\(Z\)), assuming it affects how much education people obtain but has no direct effect on their earnings.

Perpetual’s BraidedBooster implements a boosted 2-Stage Least Squares (2SLS) approach.
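
Classical 2SLS proceeds in two stages: stage 1 regresses the treatment on the covariates and the instrument, and stage 2 regresses the outcome on the covariates and the stage-1 fitted treatment:

\[
\hat{W} = \hat{f}(X, Z), \qquad Y \approx g(X, \hat{W}).
\]

In the classical setting both \(f\) and \(g\) are linear; in the boosted variant they are gradient-boosted models (a hand-rolled sketch of this idea appears further below for intuition).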

[ ]:
import numpy as np
from perpetual import PerpetualBooster
from perpetual.iv import BraidedBooster
from sklearn.datasets import fetch_openml
from sklearn.model_selection import train_test_split

1. Load the Dataset

We fetch the Card (1995) dataset from OpenML.

[ ]:
print("Fetching Card 1995 dataset...")
# Data ID for Education and Earnings (Card 1995)
data = fetch_openml(data_id=44321, as_frame=True, parser="auto")
df = data.frame
df.head()
[ ]:
# Preprocessing
y = df["lwage"].values  # Outcome: Log wage
w = df["educ"].values  # Treatment: Years of education
z = df["nearc4"].values.astype(int)  # Instrument: Proximity to 4-year college

# Covariates
covariates = [
    "exper",
    "expersq",
    "black",
    "south",
    "smsa",
    "reg661",
    "reg662",
    "reg663",
    "reg664",
    "reg665",
    "reg666",
    "reg667",
    "reg668",
    "smsa66",
]
X = df[covariates].copy()

X_train, X_test, z_train, z_test, y_train, y_test, w_train, w_test = train_test_split(
    X, z, y, w, test_size=0.2, random_state=42
)

print(f"Dataset shape: {df.shape}")

2. Naive Model vs. IV Model

First, let’s see why a naive model might be biased. We’ll fit a standard PerpetualBooster on \(X\) and \(W\) directly.

[ ]:
naive = PerpetualBooster(budget=0.1)
# Combine X and W for naive fit
X_naive = np.column_stack([X_train, w_train])
naive.fit(X_naive, y_train)

# Estimate effect: Average change in y if education increases by 1 year
X_test_base = np.column_stack([X_test, w_test])
X_test_plus = np.column_stack([X_test, w_test + 1])
naive_effect = (naive.predict(X_test_plus) - naive.predict(X_test_base)).mean()
print(f"Naive estimated effect of 1 year of education: {naive_effect:.4f}")

3. BraidedBooster (IV)

The BraidedBooster uses the instrument to isolate the variation in education that is uncorrelated with the unobserved confounders.

[ ]:
# Initialize and fit IV model
iv_model = BraidedBooster(stage1_budget=0.1, stage2_budget=0.1)

# X: covariates, Z: instruments, y: outcome, w: treatment
# Z can be a matrix if you have multiple instruments
Z_train = z_train.reshape(-1, 1)
Z_test = z_test.reshape(-1, 1)

iv_model.fit(X_train, Z_train, y_train, w_train)

# Predict causal effect
# We compare counterfactual predictions at w and w+1
y_pred_base = iv_model.predict(X_test, w_counterfactual=w_test)
y_pred_plus = iv_model.predict(X_test, w_counterfactual=w_test + 1)
causal_effect = (y_pred_plus - y_pred_base).mean()

print(f"IV estimated causal effect of 1 year of education: {causal_effect:.4f}")

3.1 Advanced: Interaction Constraints

Just like the base booster, BraidedBooster supports interaction constraints. This can be crucial in IV models to prevent the stage 2 model from leveraging spurious interactions between covariates and the predicted treatment.

[ ]:
# Example: Allow only 'exper' (0) and 'black' (2) to interact
interaction_constraints = [[0, 2]]
iv_constrained = BraidedBooster(
    stage1_budget=0.1,
    stage2_budget=0.1,
    interaction_constraints=interaction_constraints,
)
iv_constrained.fit(X_train, Z_train, y_train, w_train)
iv_effect_constrained = (
    iv_constrained.predict(X_test, w_counterfactual=w_test + 1)
    - iv_constrained.predict(X_test, w_counterfactual=w_test)
).mean()
print(f"Constrained IV effect: {iv_effect_constrained:.4f}")

Interpretation

If the IV estimate differs substantially from the naive estimate, it points to endogeneity (confounding) in the naive model. In many economic studies, the IV estimate of the return to education is actually higher than the OLS/naive estimate; a common explanation is that IV recovers the effect for the people whose schooling is most affected by the instrument (proximity to college), and this group may have higher returns to schooling.
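
A direct comparison of the two numbers computed above makes the gap explicit (this assumes the naive and IV cells have been run):

[ ]:
# Put the naive and IV estimates side by side.
print(f"Naive effect estimate:   {naive_effect:.4f}")
print(f"IV effect estimate:      {causal_effect:.4f}")
print(f"Difference (IV - naive): {causal_effect - naive_effect:.4f}")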