Policy Learning: Optimal Treatment Assignment

Policy learning aims to find the optimal treatment rule \(\pi(X) \in \{0, 1\}\) that maximizes expected outcomes in a population.

Perpetual’s PolicyLearner implements the Athey & Wager (2021) framework using inverse propensity weighting (IPW) to transform the causal inference problem into a weighted classification task.
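
In the Athey & Wager formulation, each sample \(i\) receives a pseudo-outcome \(\hat{\Gamma}_i\) that estimates its individual treatment effect; the sign of \(\hat{\Gamma}_i\) supplies the classification label (treat vs. do not treat) and its magnitude the sample weight, so maximizing the empirical policy value \(\tfrac{1}{n}\sum_i (2\pi(X_i) - 1)\,\hat{\Gamma}_i\) reduces to weighted classification.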

Two modes are available:

  • IPW — standard inverse propensity weighting.

  • AIPW (doubly robust) — incorporates a baseline outcome model to reduce variance.

In this tutorial we use the Bank Marketing dataset to learn an optimal policy for targeting customers with a marketing campaign.

[ ]:
import numpy as np
import pandas as pd
from perpetual.causal_metrics import auuc, qini_coefficient
from perpetual.policy import PolicyLearner
from sklearn.datasets import fetch_openml
from sklearn.model_selection import train_test_split

1. Load the Bank Marketing Dataset

The Bank Marketing dataset records whether clients subscribed to a term deposit after being contacted in a marketing campaign. We treat the phone contact as the “treatment” and subscription as the “outcome”.

[ ]:
print("Fetching Bank Marketing dataset...")
data = fetch_openml(data_id=1461, as_frame=True, parser="auto")
df = data.frame
print(f"Shape: {df.shape}")
df.head()

2. Simulate an RCT

Since the original dataset is observational, we simulate a randomized controlled trial (RCT) by randomly assigning treatment. This gives us known propensity scores (0.5) for clean evaluation.

[ ]:
np.random.seed(42)

# Encode target. The OpenML copy encodes the classes as 1 = "no" and 2 = "yes",
# so only class 2 (or a literal "yes") counts as a positive outcome.
# Note: y_all is only used for its length below; the tutorial outcome is simulated.
y_col = df.columns[-1]
y_all = (
    (df[y_col].astype(str).str.strip().isin(["2", "yes", "Yes"]))
    .astype(float)
    .values
)

# Encode features
feature_cols = [c for c in df.columns if c != y_col]
df_feat = df[feature_cols].copy()
cat_cols = df_feat.select_dtypes(include=["category", "object"]).columns.tolist()
df_encoded = pd.get_dummies(df_feat, columns=cat_cols, drop_first=True, dtype=float)
X_all = df_encoded.values.astype(float)
feature_names = list(df_encoded.columns)

# Simulate RCT: random treatment with P(W=1)=0.5
w_all = np.random.binomial(1, 0.5, size=len(y_all)).astype(float)

# Simulate a heterogeneous treatment effect based on age and balance
# (fall back to the first two columns if the OpenML copy uses anonymized
# feature names such as V1, V2, ...).
age_idx = feature_names.index("age") if "age" in feature_names else 0
balance_idx = feature_names.index("balance") if "balance" in feature_names else 1

# Younger clients with higher balance benefit more from contact
age_norm = (X_all[:, age_idx] - X_all[:, age_idx].mean()) / (
    X_all[:, age_idx].std() + 1e-8
)
bal_norm = (X_all[:, balance_idx] - X_all[:, balance_idx].mean()) / (
    X_all[:, balance_idx].std() + 1e-8
)
true_cate = 0.1 - 0.05 * age_norm + 0.05 * bal_norm

# Outcome: base rate + treatment effect
base_rate = 0.12
prob_y = np.clip(base_rate + w_all * true_cate, 0.01, 0.99)
y_sim = np.random.binomial(1, prob_y).astype(float)

print(f"X shape: {X_all.shape}")
print(f"Treatment rate: {w_all.mean():.2%}")
print(f"Outcome rate: {y_sim.mean():.2%}")
print(f"True ATE: {true_cate.mean():.4f}")
[ ]:
X_train, X_test, w_train, w_test, y_train, y_test, cate_train, cate_test = (
    train_test_split(X_all, w_all, y_sim, true_cate, test_size=0.3, random_state=42)
)
print(f"Train: {X_train.shape[0]}, Test: {X_test.shape[0]}")

3. Learn an IPW Policy

In IPW mode, PolicyLearner computes pseudo-outcomes from the propensity scores and observed outcomes, then learns a policy that assigns treatment when the pseudo-outcome is positive.
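
For intuition, the IPW pseudo-outcome behind this reduction is \(\hat{\Gamma}_i = W_i Y_i / \hat{e}(X_i) - (1 - W_i) Y_i / (1 - \hat{e}(X_i))\); its sample mean is an unbiased estimate of the ATE when the propensity \(\hat{e}\) is known. The cell below computes this score by hand as a sanity check; it illustrates the signal the learner works from, not necessarily the exact quantity PolicyLearner computes internally.

[ ]:
# Illustration only: hand-computed IPW pseudo-outcome (propensity is 0.5 in our RCT).
p = 0.5
gamma_ipw = w_train * y_train / p - (1 - w_train) * y_train / (1 - p)
print(f"Mean pseudo-outcome (IPW estimate of ATE): {gamma_ipw.mean():.4f}")
print(f"True ATE on the training split:            {cate_train.mean():.4f}")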

[ ]:
# Since we simulated an RCT, we know the propensity is 0.5 for all samples.
prop_train = np.full(len(w_train), 0.5)

pl_ipw = PolicyLearner(budget=0.5, mode="ipw")
pl_ipw.fit(X_train, w_train, y_train, propensity=prop_train)

policy_ipw = pl_ipw.predict(X_test)
print(f"IPW Policy: treat {policy_ipw.mean():.2%} of the population")

4. Learn an AIPW (Doubly Robust) Policy

AIPW (augmented inverse propensity weighting) reduces variance by incorporating a baseline outcome model.
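
The standard doubly robust score adds outcome-model predictions \(\hat{\mu}_0(X)\) and \(\hat{\mu}_1(X)\) to the IPW score:

\[
\hat{\Gamma}_i^{\mathrm{AIPW}} = \hat{\mu}_1(X_i) - \hat{\mu}_0(X_i)
+ \frac{W_i\,\bigl(Y_i - \hat{\mu}_1(X_i)\bigr)}{\hat{e}(X_i)}
- \frac{(1 - W_i)\,\bigl(Y_i - \hat{\mu}_0(X_i)\bigr)}{1 - \hat{e}(X_i)}.
\]

The score stays consistent if either the propensity model or the outcome model is well specified, and the residualization typically lowers variance relative to plain IPW. (This is the textbook form from the Athey & Wager framework; the estimators PolicyLearner fits internally for \(\hat{\mu}_0\) and \(\hat{\mu}_1\) are handled by the library.)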

[ ]:
pl_aipw = PolicyLearner(budget=0.5, mode="aipw")
pl_aipw.fit(X_train, w_train, y_train, propensity=prop_train)

policy_aipw = pl_aipw.predict(X_test)
print(f"AIPW Policy: treat {policy_aipw.mean():.2%} of the population")

5. Evaluate Policies

We compare the learned policies against treat-everyone, treat-nobody, and random baselines, as well as an oracle that treats exactly the customers with a positive true CATE.
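
Because the outcomes were simulated, the true CATE \(\tau(X_i)\) is available on the test split, and the value of a policy relative to treating nobody is simply \(\mathbb{E}[\pi(X_i)\,\tau(X_i)]\), which the helper below computes.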

[ ]:
def policy_value(policy, true_cate):
    """Expected improvement from a targeting policy vs. no treatment."""
    return (policy * true_cate).mean()


# Baselines
treat_all = np.ones(len(cate_test))
treat_none = np.zeros(len(cate_test))
random_policy = np.random.binomial(1, 0.5, size=len(cate_test))

# Oracle: treat only when CATE > 0
oracle = (cate_test > 0).astype(int)

print(f"{'Policy':<25} {'Value':>10} {'Treat %':>10}")
print("-" * 48)
for name, pol in [
    ("Treat Everyone", treat_all),
    ("Treat Nobody", treat_none),
    ("Random (50%)", random_policy),
    ("IPW PolicyLearner", policy_ipw),
    ("AIPW PolicyLearner", policy_aipw),
    ("Oracle (true CATE>0)", oracle),
]:
    val = policy_value(pol, cate_test)
    pct = pol.mean()
    print(f"{name:<25} {val:>10.4f} {pct:>10.2%}")

6. Feature Importance

Which features matter most for the treatment-assignment decision?

[ ]:
importances = pl_aipw.feature_importances_
top_k = 10
top_idx = np.argsort(importances)[::-1][:top_k]

print(f"Top {top_k} features for policy assignment:")
for rank, idx in enumerate(top_idx, 1):
    print(f"  {rank}. {feature_names[idx]:25s}  importance={importances[idx]:.4f}")

7. Uplift Curve Evaluation

We can also use causal_metrics to evaluate the policy score as an uplift model, checking whether higher-scored individuals benefit more.
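
Roughly speaking, AUUC is the area under the cumulative uplift curve obtained by treating individuals in decreasing order of their score, and the Qini coefficient measures the gain over random targeting; the exact normalization depends on the conventions used in causal_metrics. A score that ranks true responders to treatment higher should beat the random baseline on both.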

[ ]:
# Use the continuous policy score as the uplift score
uplift_score = pl_aipw.predict_proba(X_test)

aipw_auuc = auuc(y_test, w_test, uplift_score)
aipw_qini = qini_coefficient(y_test, w_test, uplift_score)

# Compare with random
random_score = np.random.randn(len(y_test))
rand_auuc = auuc(y_test, w_test, random_score)

print(f"AIPW PolicyLearner — AUUC: {aipw_auuc:.4f}, Qini: {aipw_qini:.4f}")
print(f"Random baseline    — AUUC: {rand_auuc:.4f}")

Summary

In this tutorial we:

  1. Simulated an RCT with heterogeneous treatment effects on real-world Bank Marketing features.

  2. Trained IPW and AIPW policy learners to find optimal treatment rules.

  3. Compared learned policies against baselines and the oracle.

  4. Identified the most important features for treatment assignment.

  5. Evaluated the policy as an uplift model using AUUC and Qini.

References

  • Athey, S. & Wager, S. (2021). Policy Learning with Observational Data. Econometrica, 89(1), 133-161.