Double Machine Learning: Estimating the Gender Wage Gap
Double/Debiased Machine Learning (DML) is a modern causal inference method introduced by Chernozhukov et al. (2018) for estimating treatment effects in the presence of high-dimensional confounders.
The partially linear model is:

\[
Y = \theta(X)\, W + g(X) + \varepsilon, \qquad E[\varepsilon \mid X, W] = 0,
\]

\[
W = m(X) + \eta, \qquad E[\eta \mid X] = 0,
\]

where \(Y\) is the outcome (log wage), \(W\) is the treatment indicator (gender), \(X\) are the covariates, and \(\theta(X)\) is the heterogeneous treatment effect we want to learn.
In this tutorial we use the CPS 1985 wages dataset to estimate the causal effect of gender on wages, controlling for education, experience, and other confounders.
Perpetual’s DMLEstimator handles cross-fitting automatically and uses a custom DML objective (mirroring the Rust DMLObjective) for the final effect model.
[ ]:
import numpy as np
import pandas as pd
from perpetual.dml import DMLEstimator
from sklearn.datasets import fetch_openml
from sklearn.model_selection import train_test_split
1. Load the CPS 1985 Wages Dataset
The Current Population Survey (CPS) 1985 dataset contains information about workers’ wages, education, experience, and demographics.
[ ]:
print("Fetching CPS 1985 Wages dataset...")
data = fetch_openml(data_id=534, as_frame=True, parser="auto")
df = data.frame
print(f"Shape: {df.shape}")
df.head()
2. Prepare Features
We encode the treatment (gender) as binary and prepare covariates.
[ ]:
# Encode categorical variables
df_encoded = pd.get_dummies(
    df,
    columns=["SOUTH", "SEX", "UNION", "RACE", "OCCUPATION", "SECTOR", "MARR"],
    drop_first=True,
    dtype=float,
)
# Treatment: being female (1 = female, 0 = male).
# get_dummies keeps the original column casing, so the dummy is named
# "SEX_female" or "SEX_male"; look it up case-insensitively.
cols_lower = {c.lower(): c for c in df_encoded.columns}
if "sex_female" in cols_lower:
    w = df_encoded[cols_lower["sex_female"]].values
elif "sex_male" in cols_lower:
    w = 1.0 - df_encoded[cols_lower["sex_male"]].values
else:
    w = df["SEX"].map({"female": 1, "male": 0}).values
w = np.asarray(w, dtype=float)

# Outcome: log wage (log1p transform for normality; also keeps zero wages finite)
y = np.log1p(df_encoded["WAGE"].values)

# Covariates: everything except the wage and the treatment dummy
# (matched case-insensitively so the treatment column is never leaked into X)
drop_cols = [c for c in df_encoded.columns if c.lower() in ("wage", "sex_female", "sex_male")]
X = df_encoded.drop(columns=drop_cols).values.astype(float)
feature_names = [c for c in df_encoded.columns if c not in drop_cols]
3. Train/Test Split
[ ]:
X_train, X_test, w_train, w_test, y_train, y_test = train_test_split(
    X, w, y, test_size=0.3, random_state=42
)
print(f"Train: {X_train.shape[0]}, Test: {X_test.shape[0]}")
print(f"w_train mean: {w_train.mean():.2f}, w_test mean: {w_test.mean():.2f}")
print(f"y_train mean: {y_train.mean():.2f}, y_test mean: {y_test.mean():.2f}")
4. Fit the DML Estimator
The DMLEstimator performs cross-fitting internally:
1. Fits an outcome nuisance model \(g(X) \approx E[Y|X]\) on each fold.
2. Fits a treatment nuisance model \(m(X) \approx E[W|X]\) on each fold.
3. Computes orthogonalized residuals \(\tilde{Y} = Y - g(X)\) and \(\tilde{W} = W - m(X)\).
4. Fits the effect model on the residuals using a DML-specific custom objective.
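The cross-fitting recipe above can be sketched by hand on synthetic data. This is a simplified, hypothetical version, not Perpetual's internals: scikit-learn random forests stand in for the nuisance models, and the final stage is plain residual-on-residual least squares rather than the custom DML objective.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold

rng = np.random.default_rng(0)
n = 2000
X_sim = rng.normal(size=(n, 5))
w_sim = (X_sim[:, 0] + rng.normal(size=n) > 0).astype(float)  # treatment depends on X
theta_true = 0.5                                              # constant true effect
y_sim = theta_true * w_sim + X_sim[:, 0] + np.sin(X_sim[:, 1]) + rng.normal(size=n)

y_res = np.empty(n)
w_res = np.empty(n)
for train, test in KFold(n_splits=5, shuffle=True, random_state=0).split(X_sim):
    # nuisance models are fit on the complement of each fold (cross-fitting),
    # then used to residualize the held-out fold only
    g = RandomForestRegressor(n_estimators=50, random_state=0).fit(X_sim[train], y_sim[train])
    m = RandomForestRegressor(n_estimators=50, random_state=0).fit(X_sim[train], w_sim[train])
    y_res[test] = y_sim[test] - g.predict(X_sim[test])  # Y - g(X)
    w_res[test] = w_sim[test] - m.predict(X_sim[test])  # W - m(X)

# final stage: residual-on-residual least squares recovers theta
theta_hat = (w_res @ y_res) / (w_res @ w_res)
print(f"estimated theta: {theta_hat:.3f} (true {theta_true})")
```

Because each observation is residualized by models that never saw it, overfitting in the nuisance stage does not contaminate the final effect estimate.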
[ ]:
dml = DMLEstimator(budget=2.0, n_folds=5)
dml.fit(X_train, w_train, y_train)
print("DML model fitted.")
5. Estimate Heterogeneous Treatment Effects
The predicted CATE represents how much the wage (on the log scale) changes for each individual due to being female.
[ ]:
cate_test = dml.predict(X_test)
print(f"Average Treatment Effect (ATE): {cate_test.mean():.4f}")
print(f"  (approximate wage change: {np.expm1(cate_test.mean()):.2%})")
print(f"Median CATE: {np.median(cate_test):.4f}")
print(f"Std of CATE: {cate_test.std():.4f}")
print(f"Range: [{cate_test.min():.4f}, {cate_test.max():.4f}]")
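A point estimate alone says nothing about sampling uncertainty. One rough way to attach a confidence interval is a percentile bootstrap over the predicted per-individual effects. This is a sketch on a synthetic stand-in for `cate_test`, and note that it only captures resampling of individuals, not the uncertainty of the fitted nuisance and effect models.

```python
import numpy as np

rng = np.random.default_rng(42)
# hypothetical stand-in for cate_test: per-individual effects on the log scale
cate = rng.normal(loc=-0.20, scale=0.05, size=160)

# percentile bootstrap over individuals for the average effect
boot = np.array([rng.choice(cate, size=cate.size, replace=True).mean()
                 for _ in range(2000)])
lo, hi = np.percentile(boot, [2.5, 97.5])
print(f"ATE: {cate.mean():.3f}, 95% bootstrap CI: [{lo:.3f}, {hi:.3f}]")
```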
6. Feature Importance
Which features drive heterogeneity in the gender wage gap?
[ ]:
importances = dml.feature_importances_
top_k = 10
top_idx = np.argsort(importances)[::-1][:top_k]
print(f"\nTop {top_k} features driving CATE heterogeneity:")
for rank, idx in enumerate(top_idx, 1):
    print(f"  {rank}. {feature_names[idx]:25s} importance={importances[idx]:.4f}")
7. Compare with Naive Estimate
A naive comparison of means ignores confounders. DML accounts for differences in education, experience, sector, etc.
[ ]:
naive_ate = y_test[w_test == 1].mean() - y_test[w_test == 0].mean()
dml_ate = cate_test.mean()
print(f"Naive ATE (difference in means): {naive_ate:.4f}")
print(f"DML ATE (cross-fitted): {dml_ate:.4f}")
print("\nThe DML estimate accounts for confounders like education and experience.")
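To see why the two numbers can differ, it helps to simulate a case where the confounding is known. In the toy data below, a single confounder (think of it as education) drives both treatment and outcome: the naive difference in means is badly biased, while residualizing both variables on the confounder, the same orthogonalization idea DML uses, recovers the true effect. Names and coefficients here are made up for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
n = 5000
educ = rng.normal(size=n)                              # confounder
w_c = (educ + rng.normal(size=n) > 0).astype(float)    # treatment correlated with educ
y_c = -0.2 * w_c + 0.5 * educ + rng.normal(scale=0.3, size=n)  # true effect: -0.2

# naive difference in means: absorbs the confounder's effect
naive = y_c[w_c == 1].mean() - y_c[w_c == 0].mean()

# adjust by residualizing both outcome and treatment on the confounder
Z = educ.reshape(-1, 1)
y_r = y_c - LinearRegression().fit(Z, y_c).predict(Z)
w_r = w_c - LinearRegression().fit(Z, w_c).predict(Z)
adjusted = (w_r @ y_r) / (w_r @ w_r)
print(f"naive: {naive:+.3f}   adjusted: {adjusted:+.3f}   true: -0.200")
```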
8. Subgroup Analysis
Examine how the treatment effect varies across subgroups.
[ ]:
# Split by median CATE
median_cate = np.median(cate_test)
high_effect = cate_test >= median_cate
low_effect = cate_test < median_cate
print(
    f"Subgroup with higher wage gap: mean CATE = {cate_test[high_effect].mean():.4f}"
)
print(f"Subgroup with lower wage gap: mean CATE = {cate_test[low_effect].mean():.4f}")
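Splitting on the median CATE only shows that the two halves differ by construction. It is usually more informative to summarize CATEs by an interpretable covariate. Here is a sketch with hypothetical values and a made-up binary covariate (think union membership):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(7)
# hypothetical per-individual effects and a binary covariate
union = rng.integers(0, 2, size=200)
cate_sim = -0.25 + 0.10 * union + rng.normal(scale=0.03, size=200)

# group mean, spread, and size of the effect within each covariate level
summary = (
    pd.DataFrame({"union": union, "cate": cate_sim})
    .groupby("union")["cate"]
    .agg(["mean", "std", "count"])
)
print(summary)
```

On the real data the same pattern applies: build a DataFrame from `cate_test` and the covariate of interest, then `groupby` on that covariate.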
[ ]:
# Linear regression comparison for treatment effect
from sklearn.linear_model import LinearRegression
# Fit model: y ~ w + X
X_lr = np.column_stack([w_train, X_train])
lr = LinearRegression()
lr.fit(X_lr, y_train)
# Coefficient for treatment (w)
treatment_coef = lr.coef_[0]
print(f"Linear regression treatment effect (log scale): {treatment_coef:.4f}")
print(f"  (approximate wage change: {np.expm1(treatment_coef):.2%})")
Comparison to Linear Regression and DML Advantages
The linear regression model estimates the average treatment effect (ATE) by fitting a single coefficient for the treatment (gender), assuming the effect is constant across all individuals. This approach is simple and interpretable, but it cannot capture heterogeneity in treatment effects or account for complex confounding.
Our Double Machine Learning (DML) implementation, by contrast, estimates heterogeneous treatment effects (CATE) for each individual, leveraging cross-fitting and flexible nuisance models. DML is robust to high-dimensional confounders and avoids overfitting by separating the estimation of nuisance functions from the effect model. This allows for more accurate and nuanced causal inference, especially when treatment effects vary across subgroups or covariate patterns.
Advantages of DML:
- Estimates individual-level (heterogeneous) treatment effects, not just a single average.
- Robust to high-dimensional confounders and flexible feature sets.
- Uses cross-fitting to reduce bias and overfitting.
- Provides feature importance for understanding drivers of effect heterogeneity.
- More reliable causal estimates in complex, real-world data.
Summary and References
This notebook demonstrated Double Machine Learning (DML) for estimating the gender wage gap using the CPS 1985 dataset. Key steps included:
- Using DML to estimate heterogeneous causal effects of gender on wages.
- Leveraging cross-fitting to avoid overfitting nuisance models.
- Comparing DML results to naive difference-in-means and linear regression.
- Identifying features that drive variation in the wage gap.
References:
Chernozhukov, V. et al. (2018). Double/Debiased Machine Learning for Treatment and Structural Parameters. The Econometrics Journal, 21(1).