Instrumental Variables (Boosted IV)

In this tutorial, we will use the Wine Quality dataset to estimate the causal effect of alcohol content on wine quality. The problem is that factors like “grape quality” are unobserved and affect both alcohol levels and wine quality (confounding). Purely for demonstration purposes, we will treat “sulphates” as an instrument (\(Z\)), assuming it affects alcohol content but has no effect on quality other than through alcohol.

This setup mirrors the classic Card (1995) study, which estimated the causal effect of education on earnings. There, unobserved “ability” confounds both education and earnings, and Card used “proximity to college” as an instrument, assuming it affects education but has no direct effect on earnings.
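For reference, the two standard IV assumptions, stated informally with \(U\) denoting the unobserved confounder (here, grape quality):

- Relevance: the instrument actually moves the treatment, \(\mathrm{Cov}(Z, W) \neq 0\).
- Exclusion: the instrument is independent of the confounder and affects the outcome only through the treatment, \(Z \perp U\).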

Perpetual’s BraidedBooster implements a boosted Control Function approach, which avoids the biased “Forbidden Regression” often found in naive boosted IV implementations.
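Concretely, a Control Function approach fits two stages (a sketch of the general idea; the exact internals of BraidedBooster may differ):

\[
W = h(X, Z) + V \qquad \text{(stage 1: model the treatment)}
\]

\[
Y = g(X, W, \hat{V}) + \varepsilon \qquad \text{(stage 2: include the stage-1 residual } \hat{V} = W - \hat{h}(X, Z)\text{)}
\]

The residual \(\hat{V}\) acts as a proxy for the unobserved confounder. The “Forbidden Regression” arises when one instead plugs the fitted treatment \(\hat{W} = \hat{h}(X, Z)\) directly into a nonlinear second stage: because \(E[g(W)] \neq g(E[W])\) for nonlinear \(g\), this yields inconsistent estimates.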

[ ]:
import numpy as np
from perpetual import PerpetualBooster
from perpetual.iv import BraidedBooster
from sklearn.datasets import fetch_openml
from sklearn.model_selection import train_test_split

1. Load the Dataset

We fetch the Wine Quality dataset from OpenML.

[ ]:
print("Fetching 'wine-quality-red' dataset for IV demo...")
# Use the red wine-quality dataset; the IV structure here is purely illustrative
data = fetch_openml(name="wine-quality-red", as_frame=True, parser="auto")
df = data.frame
print(df.columns)
print(df.head())
[ ]:
# Preprocessing for wine-quality IV demo
# We'll treat 'class' as the outcome, 'alcohol' as treatment, 'sulphates' as instrument, and the rest as covariates

y = df["class"].values  # Outcome: Wine quality score
w = df["alcohol"].values  # Treatment: Alcohol content
z = df["sulphates"].values  # Instrument: Sulphates

# Covariates (all except outcome, treatment, instrument)
covariates = [
    "fixed_acidity",
    "volatile_acidity",
    "citric_acid",
    "residual_sugar",
    "chlorides",
    "free_sulfur_dioxide",
    "total_sulfur_dioxide",
    "density",
    "pH",
]
X = df[covariates].copy()

X_train, X_test, z_train, z_test, y_train, y_test, w_train, w_test = train_test_split(
    X, z, y, w, test_size=0.2, random_state=42
)

print(f"Dataset shape: {df.shape}")

2. Naive Model vs. IV Model

First, let’s see why a naive model might be biased. We’ll fit a standard PerpetualBooster on \(X\) and \(W\) directly.

[ ]:
naive = PerpetualBooster(budget=0.1)
# Combine X and W for naive fit
X_naive = np.column_stack([X_train, w_train])
naive.fit(X_naive, y_train)

# Estimate effect: Average change in y if alcohol increases by 1 unit
X_test_base = np.column_stack([X_test, w_test.astype(float)])
X_test_plus = np.column_stack([X_test, (w_test.astype(float) + 1)])
# Ensure predictions are float for subtraction
pred_base = naive.predict(X_test_base).astype(float)
pred_plus = naive.predict(X_test_plus).astype(float)
naive_effect = (pred_plus - pred_base).mean()
print(f"Naive estimated effect of 1 unit alcohol: {naive_effect:.4f}")

3. BraidedBooster (IV)

The BraidedBooster uses the instrument to isolate the variation in alcohol content that is uncorrelated with the unobserved confounders.

[ ]:
# Initialize and fit IV model
iv_model = BraidedBooster(stage1_budget=0.1, stage2_budget=0.1)

# X: covariates, Z: instruments, y: outcome, w: treatment
# Z can be a matrix if you have multiple instruments
Z_train = z_train.reshape(-1, 1)
Z_test = z_test.reshape(-1, 1)

iv_model.fit(X_train, Z_train, y_train, w_train)

# Predict causal effect
# We compare counterfactual predictions at w and w+1
y_pred_base = iv_model.predict(X_test, w_counterfactual=w_test)
y_pred_plus = iv_model.predict(X_test, w_counterfactual=w_test + 1)
causal_effect = (y_pred_plus - y_pred_base).mean()

print(f"IV estimated causal effect of 1 unit alcohol: {causal_effect:.4f}")

3.1 Advanced: Interaction Constraints

Just like the base booster, BraidedBooster supports interaction constraints. This can be crucial in IV models to prevent the stage 2 model from leveraging spurious interactions between covariates and the predicted treatment.

[ ]:
# Example: Allow only 'fixed_acidity' (0) and 'citric_acid' (2) to interact
interaction_constraints = [[0, 2]]
iv_constrained = BraidedBooster(
    stage1_budget=0.1,
    stage2_budget=0.1,
    interaction_constraints=interaction_constraints,
)
iv_constrained.fit(X_train, Z_train, y_train, w_train)  # Z_train is already (n, 1)
iv_effect_constrained = (
    iv_constrained.predict(X_test, w_counterfactual=w_test + 1)
    - iv_constrained.predict(X_test, w_counterfactual=w_test)
).mean()
print(f"Constrained IV effect: {iv_effect_constrained:.4f}")

Interpretation

If the IV estimate is significantly different from the naive estimate, it suggests the presence of endogeneity (confounding). In our purely illustrative wine quality example, the IV estimate of the alcohol effect may differ from the naive estimate, pointing to unmeasured confounders such as grape quality or temperature. The same logic applies in economic studies of education: there, the IV estimate is often higher than the OLS/naive estimate, suggesting that those most affected by the instrument (e.g., proximity to college) may have higher returns to schooling.
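As a sanity check, the control-function logic can be demonstrated on synthetic data with a known causal effect, using plain NumPy linear regressions in place of the boosted stages (all coefficients and variable names below are made up for this simulation):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
u = rng.normal(size=n)  # unobserved confounder
z = rng.normal(size=n)  # instrument (independent of u)
w = 0.8 * z + u + rng.normal(size=n)  # treatment depends on instrument and confounder
y = 2.0 * w + 3.0 * u + rng.normal(size=n)  # true causal effect of w is 2.0

# Naive regression of y on w is biased upward because w is correlated with u
naive = np.cov(w, y)[0, 1] / np.var(w)

# Control function: stage 1 regresses w on z, stage 2 adds the stage-1 residual
gamma = np.cov(z, w)[0, 1] / np.var(z)
v = w - gamma * z  # stage-1 residual, a proxy for the confounder
Xcf = np.column_stack([w, v, np.ones(n)])
beta = np.linalg.lstsq(Xcf, y, rcond=None)[0]

print(f"true effect: 2.0, naive: {naive:.3f}, control function: {beta[0]:.3f}")
```

With this data-generating process the naive estimate converges to roughly 3.1, while the control-function coefficient on `w` recovers the true effect of 2.0; BraidedBooster applies the same two-stage idea with boosted, nonlinear stages.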