Scikit-Learn Interface: Classification, Regression & Ranking
Perpetual provides drop-in scikit-learn compatible estimators:
- PerpetualClassifier — classification with predict, predict_proba, and full sklearn ClassifierMixin support.
- PerpetualRegressor — regression with RegressorMixin support.
- PerpetualRanker — learning-to-rank.
These wrappers plug directly into sklearn pipelines, cross-validation, and model selection utilities.
In this tutorial we demonstrate all three: classification and regression on real-world datasets, and ranking on a synthetic search-relevance task.
[ ]:
import numpy as np
from perpetual.sklearn import PerpetualClassifier, PerpetualRanker, PerpetualRegressor
from sklearn.datasets import fetch_california_housing
from sklearn.metrics import accuracy_score, r2_score, roc_auc_score
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
Part 1: Classification with PerpetualClassifier
We use the Covertype dataset (forest cover type prediction) from sklearn — a large classification task with 581K samples, which we binarize to cover type 1 vs. the rest.
[ ]:
from sklearn.datasets import fetch_covtype
print("Fetching Cover Type dataset...")
cov = fetch_covtype(as_frame=True)
X_cov = cov.data.values.astype(float)
y_cov = (cov.target.values == 1).astype(float) # Binary: type 1 vs rest
X_tr, X_te, y_tr, y_te = train_test_split(
X_cov, y_cov, test_size=0.2, random_state=42, stratify=y_cov
)
print(f"Train: {X_tr.shape}, Test: {X_te.shape}")
print(f"Positive class rate: {y_cov.mean():.2%}")
[ ]:
clf = PerpetualClassifier(budget=0.5)
clf.fit(X_tr, y_tr)
y_pred = clf.predict(X_te)
y_prob = clf.predict_proba(X_te)[:, 1]
print(f"Accuracy: {accuracy_score(y_te, y_pred):.4f}")
print(f"AUC: {roc_auc_score(y_te, y_prob):.4f}")
Cross-Validation
Since PerpetualClassifier is a proper sklearn estimator, it works seamlessly with cross_val_score.
[ ]:
# Use a smaller subset for faster CV
X_sub, _, y_sub, _ = train_test_split(
X_cov, y_cov, train_size=10000, random_state=42, stratify=y_cov
)
scores = cross_val_score(
PerpetualClassifier(budget=0.4), X_sub, y_sub, cv=5, scoring="roc_auc"
)
print(f"5-Fold CV AUC: {scores.mean():.4f} ± {scores.std():.4f}")
Pipeline Integration
[ ]:
pipe = Pipeline(
[
("scaler", StandardScaler()),
("clf", PerpetualClassifier(budget=0.5)),
]
)
pipe.fit(X_tr[:5000], y_tr[:5000])
pipe_preds = pipe.predict(X_te[:1000])
print(f"Pipeline accuracy (subset): {accuracy_score(y_te[:1000], pipe_preds):.4f}")
Part 2: Regression with PerpetualRegressor
We use the California Housing dataset — a classic regression benchmark predicting median house value.
[ ]:
print("Fetching California Housing dataset...")
housing = fetch_california_housing(as_frame=True)
X_h = housing.data.values
y_h = housing.target.values
X_tr_h, X_te_h, y_tr_h, y_te_h = train_test_split(
X_h, y_h, test_size=0.2, random_state=42
)
print(f"Train: {X_tr_h.shape}, Test: {X_te_h.shape}")
[ ]:
reg = PerpetualRegressor(budget=0.5)
reg.fit(X_tr_h, y_tr_h)
y_pred_h = reg.predict(X_te_h)
print(f"R² Score: {r2_score(y_te_h, y_pred_h):.4f}")
print(f"RMSE: {np.sqrt(np.mean((y_te_h - y_pred_h) ** 2)):.4f}")
Cross-Validation for Regression
[ ]:
scores_r2 = cross_val_score(
PerpetualRegressor(budget=0.5), X_h, y_h, cv=5, scoring="r2"
)
print(f"5-Fold CV R²: {scores_r2.mean():.4f} ± {scores_r2.std():.4f}")
Part 3: Ranking with PerpetualRanker
Learning-to-rank models optimize the ordering of items within groups. We demonstrate with a synthetic search-relevance task.
[ ]:
np.random.seed(42)
n_queries = 200
docs_per_query = 20
n_features = 10
n_total = n_queries * docs_per_query
X_rank = np.random.randn(n_total, n_features)
# Relevance depends on first 3 features + noise
relevance = (
2 * X_rank[:, 0]
+ X_rank[:, 1]
- 0.5 * X_rank[:, 2]
+ np.random.randn(n_total) * 0.5
)
# Convert to rank labels 0-4
y_rank = np.clip(np.round(relevance + 2), 0, 4).astype(float)
# Group: each query has `docs_per_query` documents
qid = np.repeat(np.arange(n_queries), docs_per_query)
# Split by query
train_q = qid < 140
test_q = ~train_q
print(
f"Train queries: {np.unique(qid[train_q]).shape[0]}, "
f"Test queries: {np.unique(qid[test_q]).shape[0]}"
)
[ ]:
ranker = PerpetualRanker(budget=0.5)
ranker.fit(X_rank[train_q], y_rank[train_q], qid=qid[train_q])
scores_rank = ranker.predict(X_rank[test_q])
print(f"Predicted score range: [{scores_rank.min():.2f}, {scores_rank.max():.2f}]")
print(
f"Correlation with true relevance: {np.corrcoef(y_rank[test_q], scores_rank)[0, 1]:.4f}"
)
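To make the within-query ordering concrete, we can take a single test query, sort its documents by descending predicted score, and compare the top of the ranking with the true relevance labels (a quick illustrative check using only the arrays defined above):
[ ]:
# Order the documents of one test query by predicted score
q0 = np.unique(qid[test_q])[0]
q0_mask = qid[test_q] == q0
order = np.argsort(-scores_rank[q0_mask])  # descending by predicted score
print(f"Top 5 documents for query {q0}:")
print("Predicted scores:", np.round(scores_rank[q0_mask][order][:5], 2))
print("True relevance:  ", y_rank[test_q][q0_mask][order][:5])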
NDCG Evaluation
[ ]:
from sklearn.metrics import ndcg_score
# Evaluate per query, then average
test_qids = np.unique(qid[test_q])
ndcgs = []
for q in test_qids:
mask = qid[test_q] == q
if y_rank[test_q][mask].max() > 0: # Skip all-zero queries
ndcgs.append(
ndcg_score(
y_rank[test_q][mask].reshape(1, -1),
scores_rank[mask].reshape(1, -1),
k=10,
)
)
print(f"Mean NDCG@10: {np.mean(ndcgs):.4f}")
Summary
Estimator | Task | Dataset | Key Metric
--- | --- | --- | ---
PerpetualClassifier | Binary classification | Cover Type | AUC
PerpetualRegressor | Regression | California Housing | R²
PerpetualRanker | Learning-to-rank | Synthetic search | NDCG@10
All three estimators:
- Work seamlessly with cross_val_score, Pipeline, and other sklearn utilities.
- Inherit Perpetual’s self-generalizing behavior (no manual early stopping).
- Support budget for controlling model complexity (see the tuning sketch below).
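Because budget is a regular constructor parameter, it can in principle be tuned with sklearn's model selection utilities as well. The sketch below is illustrative and assumes PerpetualClassifier exposes budget through get_params/set_params like any standard sklearn estimator; it reuses the X_sub/y_sub subset from Part 1:
[ ]:
from sklearn.model_selection import GridSearchCV

# Illustrative sketch: treat `budget` as an ordinary hyperparameter.
# Assumes it is exposed via get_params/set_params, as for a standard sklearn estimator.
search = GridSearchCV(
    PerpetualClassifier(budget=0.5),
    param_grid={"budget": [0.3, 0.5, 1.0]},
    cv=3,
    scoring="roc_auc",
)
search.fit(X_sub, y_sub)
print(f"Best budget: {search.best_params_['budget']}, CV AUC: {search.best_score_:.4f}")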