Double Machine Learning: Estimating the Gender Wage Gap
Double/Debiased Machine Learning (DML) is a modern causal inference method introduced by Chernozhukov et al. (2018) for estimating treatment effects in the presence of high-dimensional confounders.
The partially linear model is:
\[
Y = \theta(X)\,W + g(X) + \varepsilon, \qquad W = m(X) + \eta, \qquad E[\varepsilon \mid X, W] = E[\eta \mid X] = 0,
\]
where \(W\) is the treatment, \(g(X)\) and \(m(X)\) are nuisance functions of the covariates, and \(\theta(X)\) is the heterogeneous treatment effect we want to learn.
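To make the model concrete, here is a small simulation, a sketch with a constant effect \(\theta = 0.5\) and made-up nuisance functions, showing that regressing the outcome residual on the treatment residual recovers \(\theta\) when the true nuisances are known. DML replaces these oracle nuisances with cross-fitted ML estimates.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000
theta = 0.5                          # true (constant) treatment effect
X = rng.normal(size=(n, 3))
g = np.sin(X[:, 0]) + X[:, 1]        # outcome nuisance g(X)
m = 0.5 * X[:, 0]                    # treatment nuisance m(X) = E[W|X]
W = m + rng.normal(size=n)
Y = theta * W + g + rng.normal(size=n)

# Orthogonalize with the (here, known) nuisances, then regress residuals
y_res = Y - (theta * m + g)          # Y - E[Y|X]
w_res = W - m                        # W - E[W|X]
theta_hat = (w_res * y_res).sum() / (w_res ** 2).sum()
print(f"theta_hat = {theta_hat:.3f}")
```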
In this tutorial we use the CPS 1985 wages dataset to estimate the causal effect of gender on wages, controlling for education, experience, and other confounders.
Perpetual’s DMLEstimator handles cross-fitting automatically and uses a custom DML objective (mirroring the Rust DMLObjective) for the final effect model.
[ ]:
import numpy as np
import pandas as pd
from perpetual.dml import DMLEstimator
from sklearn.datasets import fetch_openml
from sklearn.model_selection import train_test_split
1. Load the CPS 1985 Wages Dataset
The Current Population Survey (CPS) 1985 dataset contains information about workers’ wages, education, experience, and demographics.
[ ]:
print("Fetching CPS 1985 Wages dataset...")
data = fetch_openml(data_id=534, as_frame=True, parser="auto")
df = data.frame
print(f"Shape: {df.shape}")
df.head()
2. Prepare Features
We encode the treatment (gender) as binary and prepare covariates.
[ ]:
# Encode categorical variables
df_encoded = pd.get_dummies(
df,
columns=["sex", "marr", "union", "race", "south", "smsa", "sector"],
drop_first=True,
dtype=float,
)
# Treatment: being female (1 = female, 0 = male)
w = (
df_encoded["sex_female"].values
if "sex_female" in df_encoded.columns
else (
1.0 - df_encoded["sex_male"].values
if "sex_male" in df_encoded.columns
else df["sex"].map({"female": 1, "male": 0}).values
)
)
# Outcome: log wage (log transform for normality)
y = np.log1p(df_encoded["wage"].values)
# Covariates: everything except wage and the treatment column
drop_cols = [c for c in df_encoded.columns if c in ["wage", "sex_female", "sex_male"]]
X = df_encoded.drop(columns=drop_cols).values.astype(float)
feature_names = [c for c in df_encoded.columns if c not in drop_cols]
print(f"X shape: {X.shape}, Treatment mean: {w.mean():.2f}")
3. Train/Test Split
[ ]:
X_train, X_test, w_train, w_test, y_train, y_test = train_test_split(
X, w, y, test_size=0.3, random_state=42
)
print(f"Train: {X_train.shape[0]}, Test: {X_test.shape[0]}")
4. Fit the DML Estimator
The DMLEstimator performs cross-fitting internally:
Fits an outcome nuisance model \(g(X) \approx E[Y|X]\) on each fold.
Fits a treatment nuisance model \(m(X) \approx E[W|X]\) on each fold.
Computes orthogonalized residuals \(\tilde{Y}\) and \(\tilde{W}\).
Fits the effect model using a DML-specific custom objective.
[ ]:
dml = DMLEstimator(budget=0.5, n_folds=3)
dml.fit(X_train, w_train, y_train)
print("DML model fitted.")
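What the estimator does internally can be sketched by hand. The following illustration, on synthetic data and using scikit-learn's RandomForestRegressor as a stand-in for Perpetual's nuisance models, reproduces the cross-fitting steps listed above: each fold's nuisances are trained out-of-fold, residuals are computed on the held-out fold, and the effect is estimated from residual-on-residual regression.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold

rng = np.random.default_rng(42)
n = 2000
X = rng.normal(size=(n, 5))
W = 0.5 * X[:, 0] + rng.normal(size=n)            # treatment with confounding
Y = 0.3 * W + np.cos(X[:, 0]) + X[:, 1] + rng.normal(size=n)

y_res = np.empty(n)
w_res = np.empty(n)
for train_idx, test_idx in KFold(n_splits=3, shuffle=True, random_state=0).split(X):
    g_hat = RandomForestRegressor(n_estimators=100, random_state=0)
    m_hat = RandomForestRegressor(n_estimators=100, random_state=0)
    g_hat.fit(X[train_idx], Y[train_idx])          # outcome nuisance E[Y|X]
    m_hat.fit(X[train_idx], W[train_idx])          # treatment nuisance E[W|X]
    # Residuals are always computed on the held-out fold
    y_res[test_idx] = Y[test_idx] - g_hat.predict(X[test_idx])
    w_res[test_idx] = W[test_idx] - m_hat.predict(X[test_idx])

theta_hat = (w_res * y_res).sum() / (w_res ** 2).sum()
print(f"Cross-fitted ATE estimate: {theta_hat:.3f} (true effect: 0.3)")
```

DMLEstimator additionally fits a full effect model \(\theta(X)\) with its custom objective rather than a single scalar, but the orthogonalization logic is the same.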
5. Estimate Heterogeneous Treatment Effects
The predicted CATE is, for each individual, the estimated change in log wage associated with being female, holding the covariates fixed.
[ ]:
cate_test = dml.predict(X_test)
print(f"Average Treatment Effect (ATE): {cate_test.mean():.4f}")
print(f" (in wage terms: {np.expm1(cate_test.mean()):.2%} change)")
print(f"Median CATE: {np.median(cate_test):.4f}")
print(f"Std of CATE: {cate_test.std():.4f}")
print(f"Range: [{cate_test.min():.4f}, {cate_test.max():.4f}]")
6. Feature Importance
Which features drive heterogeneity in the gender wage gap?
[ ]:
importances = dml.feature_importances_
top_k = 10
top_idx = np.argsort(importances)[::-1][:top_k]
print(f"\nTop {top_k} features driving CATE heterogeneity:")
for rank, idx in enumerate(top_idx, 1):
print(f" {rank}. {feature_names[idx]:25s} importance={importances[idx]:.4f}")
7. Compare with Naive Estimate
A naive comparison of means ignores confounders. DML accounts for differences in education, experience, sector, etc.
[ ]:
naive_ate = y_test[w_test == 1].mean() - y_test[w_test == 0].mean()
dml_ate = cate_test.mean()
print(f"Naive ATE (difference in means): {naive_ate:.4f}")
print(f"DML ATE (cross-fitted): {dml_ate:.4f}")
print("\nThe DML estimate accounts for confounders like education and experience.")
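As a middle ground between the naive difference in means and full DML, one can also adjust for confounders with a plain linear regression. A minimal synthetic example (with a single hypothetical confounder, here called educ) shows how selection into treatment biases the naive estimate while regression adjustment recovers the true effect:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(7)
n = 3000
educ = rng.normal(13, 2, size=n)                       # confounder
w = rng.binomial(1, 1 / (1 + np.exp(-(educ - 13))))    # treatment depends on educ
y = 0.2 * w + 0.1 * educ + rng.normal(scale=0.5, size=n)

naive = y[w == 1].mean() - y[w == 0].mean()            # biased upward by educ
adj = LinearRegression().fit(np.column_stack([w, educ]), y).coef_[0]
print(f"naive: {naive:.3f}, adjusted: {adj:.3f}, true: 0.200")
```

Linear adjustment only works when the confounding is linear; DML's appeal is that the nuisance models can be arbitrary ML learners.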
8. Subgroup Analysis
Examine how the treatment effect varies across subgroups.
[ ]:
# Split by median CATE
median_cate = np.median(cate_test)
high_effect = cate_test >= median_cate
low_effect = cate_test < median_cate
print(
f"Subgroup with higher wage gap: mean CATE = {cate_test[high_effect].mean():.4f}"
)
print(f"Subgroup with lower wage gap: mean CATE = {cate_test[low_effect].mean():.4f}")
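A median split is coarse; it is often more informative to profile the CATE against an observed covariate. A sketch with synthetic CATE values, assuming hypothetically that the gap narrows with years of education, binning into coarse education groups:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 1000
educ = rng.integers(8, 19, size=n).astype(float)       # years of education
# Synthetic CATEs: gap narrows (moves toward zero) with education
cate = -0.25 + 0.01 * (educ - 13) + rng.normal(scale=0.02, size=n)

# Mean CATE within coarse education bins
for lo, hi in [(8, 12), (12, 16), (16, 19)]:
    mask = (educ >= lo) & (educ < hi)
    print(f"educ in [{lo}, {hi}): mean CATE = {cate[mask].mean():+.3f} (n={mask.sum()})")
```

With the real data, one would bin on a column of X_test (via feature_names) and average cate_test within each bin.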
Summary
In this tutorial we:
Used DMLEstimator with real-world CPS wage data.
Estimated the heterogeneous causal effect of gender on wages.
Leveraged cross-fitting to avoid overfitting nuisance models.
Identified which features drive variation in the wage gap.
Compared the DML estimate with a naive difference-in-means.
References
Chernozhukov, V., Chetverikov, D., Demirer, M., Duflo, E., Hansen, C., Newey, W., & Robins, J. (2018). Double/Debiased Machine Learning for Treatment and Structural Parameters. The Econometrics Journal, 21(1), C1–C68.