Scikit-Learn Interface: Classification, Regression & Ranking

Perpetual provides drop-in scikit-learn compatible estimators:

  • PerpetualClassifier — classification with predict, predict_proba, and full sklearn ClassifierMixin support.

  • PerpetualRegressor — regression with RegressorMixin support.

  • PerpetualRanker — learning-to-rank, with query groups passed to fit via the qid argument.

These wrappers plug directly into sklearn pipelines, cross-validation, and model selection utilities.

In this tutorial we demonstrate all three: the classifier and regressor on real-world datasets, and the ranker on a synthetic search-relevance task.

[ ]:
import numpy as np
from perpetual.sklearn import PerpetualClassifier, PerpetualRanker, PerpetualRegressor
from sklearn.datasets import fetch_california_housing
from sklearn.metrics import accuracy_score, r2_score, roc_auc_score
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

Part 1: Classification with PerpetualClassifier

We use the Covertype dataset (forest cover type prediction) from sklearn, a large task with roughly 581K samples. The original problem has seven cover types; here we binarize it to cover type 1 versus the rest.

[ ]:
from sklearn.datasets import fetch_covtype

print("Fetching Cover Type dataset...")
cov = fetch_covtype(as_frame=True)
X_cov = cov.data.values.astype(float)
y_cov = (cov.target.values == 1).astype(float)  # Binary: type 1 vs rest

X_tr, X_te, y_tr, y_te = train_test_split(
    X_cov, y_cov, test_size=0.2, random_state=42, stratify=y_cov
)
print(f"Train: {X_tr.shape}, Test: {X_te.shape}")
print(f"Positive class rate: {y_cov.mean():.2%}")
[ ]:
clf = PerpetualClassifier(budget=0.5)
clf.fit(X_tr, y_tr)

y_pred = clf.predict(X_te)
y_prob = clf.predict_proba(X_te)[:, 1]

print(f"Accuracy: {accuracy_score(y_te, y_pred):.4f}")
print(f"AUC:      {roc_auc_score(y_te, y_prob):.4f}")
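
Because the wrapper advertises full ClassifierMixin support, the inherited score method (mean accuracy by default) is also available, and the probability matrix from predict_proba plugs into any probability-based sklearn metric. A small sketch (log_loss is used here purely as an illustration):

[ ]:
from sklearn.metrics import log_loss

# ClassifierMixin's default score() reports mean accuracy
print(f"clf.score accuracy: {clf.score(X_te, y_te):.4f}")

# The (n_samples, n_classes) probability matrix works with probability-based metrics
print(f"Log loss: {log_loss(y_te, clf.predict_proba(X_te)):.4f}")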

Cross-Validation

Since PerpetualClassifier is a proper sklearn estimator, it works seamlessly with cross_val_score.

[ ]:
# Use a smaller subset for faster CV
X_sub, _, y_sub, _ = train_test_split(
    X_cov, y_cov, train_size=10000, random_state=42, stratify=y_cov
)

scores = cross_val_score(
    PerpetualClassifier(budget=0.4), X_sub, y_sub, cv=5, scoring="roc_auc"
)
print(f"5-Fold CV AUC: {scores.mean():.4f} ± {scores.std():.4f}")

Pipeline Integration

[ ]:
pipe = Pipeline(
    [
        ("scaler", StandardScaler()),
        ("clf", PerpetualClassifier(budget=0.5)),
    ]
)
pipe.fit(X_tr[:5000], y_tr[:5000])
pipe_preds = pipe.predict(X_te[:1000])
print(f"Pipeline accuracy (subset): {accuracy_score(y_te[:1000], pipe_preds):.4f}")
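
Because the wrappers behave as proper sklearn estimators (they clone cleanly and expose their parameters), the Perpetual-specific budget parameter can also be tuned through sklearn's model selection utilities. A minimal sketch with GridSearchCV, reusing the 10K-sample subset from the cross-validation example; the grid values are arbitrary choices for illustration:

[ ]:
from sklearn.model_selection import GridSearchCV

# Tune the classifier's budget inside the pipeline
# (grid values are arbitrary and chosen only for illustration)
search = GridSearchCV(
    Pipeline(
        [
            ("scaler", StandardScaler()),
            ("clf", PerpetualClassifier(budget=0.5)),
        ]
    ),
    param_grid={"clf__budget": [0.3, 0.5, 1.0]},
    cv=3,
    scoring="roc_auc",
)
search.fit(X_sub, y_sub)
print(f"Best budget: {search.best_params_['clf__budget']}")
print(f"Best CV AUC: {search.best_score_:.4f}")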

Part 2: Regression with PerpetualRegressor

We use the California Housing dataset — a classic regression benchmark predicting median house value.

[ ]:
print("Fetching California Housing dataset...")
housing = fetch_california_housing(as_frame=True)
X_h = housing.data.values
y_h = housing.target.values

X_tr_h, X_te_h, y_tr_h, y_te_h = train_test_split(
    X_h, y_h, test_size=0.2, random_state=42
)
print(f"Train: {X_tr_h.shape}, Test: {X_te_h.shape}")
[ ]:
reg = PerpetualRegressor(budget=0.5)
reg.fit(X_tr_h, y_tr_h)

y_pred_h = reg.predict(X_te_h)
print(f"R² Score: {r2_score(y_te_h, y_pred_h):.4f}")
print(f"RMSE:     {np.sqrt(np.mean((y_te_h - y_pred_h) ** 2)):.4f}")

Cross-Validation for Regression

[ ]:
scores_r2 = cross_val_score(
    PerpetualRegressor(budget=0.5), X_h, y_h, cv=5, scoring="r2"
)
print(f"5-Fold CV R²: {scores_r2.mean():.4f} ± {scores_r2.std():.4f}")
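
The same call accepts any other sklearn scorer. For example, RMSE can be reported directly via the built-in neg_root_mean_squared_error scoring string (sklearn returns losses negated, so we flip the sign back):

[ ]:
# RMSE via sklearn's built-in scorer (negated by convention, hence the minus sign)
scores_rmse = -cross_val_score(
    PerpetualRegressor(budget=0.5), X_h, y_h, cv=5, scoring="neg_root_mean_squared_error"
)
print(f"5-Fold CV RMSE: {scores_rmse.mean():.4f} ± {scores_rmse.std():.4f}")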

Part 3: Ranking with PerpetualRanker

Learning-to-rank models optimize the ordering of items within groups (here, documents within queries). We demonstrate this on a synthetic search-relevance task.

[ ]:
np.random.seed(42)

n_queries = 200
docs_per_query = 20
n_features = 10
n_total = n_queries * docs_per_query

X_rank = np.random.randn(n_total, n_features)

# Relevance depends on first 3 features + noise
relevance = (
    2 * X_rank[:, 0]
    + X_rank[:, 1]
    - 0.5 * X_rank[:, 2]
    + np.random.randn(n_total) * 0.5
)
# Convert to rank labels 0-4
y_rank = np.clip(np.round(relevance + 2), 0, 4).astype(float)

# Group: each query has `docs_per_query` documents
qid = np.repeat(np.arange(n_queries), docs_per_query)

# Split by query
train_q = qid < 140
test_q = ~train_q

print(
    f"Train queries: {np.unique(qid[train_q]).shape[0]}, "
    f"Test queries: {np.unique(qid[test_q]).shape[0]}"
)
[ ]:
ranker = PerpetualRanker(budget=0.5)
ranker.fit(X_rank[train_q], y_rank[train_q], qid=qid[train_q])

scores_rank = ranker.predict(X_rank[test_q])
print(f"Predicted score range: [{scores_rank.min():.2f}, {scores_rank.max():.2f}]")
print(
    f"Correlation with true relevance: {np.corrcoef(y_rank[test_q], scores_rank)[0, 1]:.4f}"
)
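
The predicted scores matter only for ordering documents within a query. As a quick sanity check (a sketch using only the arrays defined above), we can sort one test query by predicted score and look at the true relevance of its top- and bottom-ranked documents:

[ ]:
# Take the first test query and sort its documents by predicted score (descending)
test_qid_values = qid[test_q]
first_q = test_qid_values[0]
q_mask = test_qid_values == first_q

order = np.argsort(-scores_rank[q_mask])
true_rel_sorted = y_rank[test_q][q_mask][order]
print(f"Query {first_q}: true relevance of top-5 predicted docs:    {true_rel_sorted[:5]}")
print(f"Query {first_q}: true relevance of bottom-5 predicted docs: {true_rel_sorted[-5:]}")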

NDCG Evaluation

[ ]:
from sklearn.metrics import ndcg_score

# Evaluate per query, then average
test_qids = np.unique(qid[test_q])
ndcgs = []
for q in test_qids:
    mask = qid[test_q] == q
    if y_rank[test_q][mask].max() > 0:  # Skip all-zero queries
        ndcgs.append(
            ndcg_score(
                y_rank[test_q][mask].reshape(1, -1),
                scores_rank[mask].reshape(1, -1),
                k=10,
            )
        )

print(f"Mean NDCG@10: {np.mean(ndcgs):.4f}")

Summary

Estimator            | Task                   | Dataset            | Key Metric
PerpetualClassifier  | Binary classification  | Cover Type         | AUC
PerpetualRegressor   | Regression             | California Housing | R²
PerpetualRanker      | Learning-to-rank       | Synthetic search   | NDCG@10

All three estimators:

  • Work seamlessly with cross_val_score, Pipeline, and other sklearn utilities.

  • Inherit Perpetual’s self-generalizing behavior (no manual early stopping).

  • Support budget for controlling model complexity (illustrated in the quick comparison below).
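
To illustrate the budget parameter, a quick sketch comparing two settings on the housing split from Part 2 (the values 0.25 and 1.0 are arbitrary choices; results depend on the dataset and Perpetual version):

[ ]:
# Compare a small and a large budget on the California Housing split
for b in (0.25, 1.0):
    reg_b = PerpetualRegressor(budget=b)
    reg_b.fit(X_tr_h, y_tr_h)
    print(f"budget={b}: test R² = {r2_score(y_te_h, reg_b.predict(X_te_h)):.4f}")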