{ "cells": [ { "cell_type": "markdown", "id": "0", "metadata": {}, "source": [ "# Scikit-Learn Interface: Classification, Regression & Ranking\n", "\n", "Perpetual provides drop-in scikit-learn compatible estimators:\n", "\n", "- `PerpetualClassifier` — classification with `predict`, `predict_proba`,\n", " and full sklearn `ClassifierMixin` support.\n", "- `PerpetualRegressor` — regression with `RegressorMixin` support.\n", "- `PerpetualRanker` — learning-to-rank.\n", "\n", "These wrappers plug directly into sklearn pipelines, cross-validation,\n", "and model selection utilities.\n", "\n", "In this tutorial we demonstrate all three on real-world datasets." ] }, { "cell_type": "code", "execution_count": null, "id": "1", "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "from perpetual.sklearn import PerpetualClassifier, PerpetualRanker, PerpetualRegressor\n", "from sklearn.datasets import fetch_california_housing\n", "from sklearn.metrics import accuracy_score, r2_score, roc_auc_score\n", "from sklearn.model_selection import cross_val_score, train_test_split\n", "from sklearn.pipeline import Pipeline\n", "from sklearn.preprocessing import StandardScaler" ] }, { "cell_type": "markdown", "id": "2", "metadata": {}, "source": [ "---\n", "## Part 1: Classification with PerpetualClassifier\n", "\n", "We use the **Covertype** dataset (forest cover type prediction) from\n", "sklearn — a large multi-class classification task with 581K samples." 
] }, { "cell_type": "code", "execution_count": null, "id": "3", "metadata": {}, "outputs": [], "source": [ "from sklearn.datasets import fetch_covtype\n", "\n", "print(\"Fetching Cover Type dataset...\")\n", "cov = fetch_covtype(as_frame=True)\n", "X_cov = cov.data.values.astype(float)\n", "y_cov = (cov.target.values == 1).astype(float) # Binary: type 1 vs rest\n", "\n", "X_tr, X_te, y_tr, y_te = train_test_split(\n", " X_cov, y_cov, test_size=0.2, random_state=42, stratify=y_cov\n", ")\n", "print(f\"Train: {X_tr.shape}, Test: {X_te.shape}\")\n", "print(f\"Positive class rate: {y_cov.mean():.2%}\")" ] }, { "cell_type": "code", "execution_count": null, "id": "4", "metadata": {}, "outputs": [], "source": [ "clf = PerpetualClassifier(budget=0.5)\n", "clf.fit(X_tr, y_tr)\n", "\n", "y_pred = clf.predict(X_te)\n", "y_prob = clf.predict_proba(X_te)[:, 1]\n", "\n", "print(f\"Accuracy: {accuracy_score(y_te, y_pred):.4f}\")\n", "print(f\"AUC: {roc_auc_score(y_te, y_prob):.4f}\")" ] }, { "cell_type": "markdown", "id": "5", "metadata": {}, "source": [ "### Cross-Validation\n", "\n", "Since `PerpetualClassifier` is a proper sklearn estimator, it works\n", "seamlessly with `cross_val_score`." 
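] }, { "cell_type": "markdown", "id": "5b", "metadata": {}, "source": [ "Under the hood, `cross_val_score` clones the estimator once per fold with\n", "`sklearn.base.clone`, which round-trips constructor parameters through\n", "`get_params`/`set_params`. A quick sanity check (a minimal sketch; it assumes\n", "only that `budget` is a constructor parameter, as in the cells above):" ] }, { "cell_type": "code", "execution_count": null, "id": "5c", "metadata": {}, "outputs": [], "source": [ "from sklearn.base import clone\n", "\n", "base = PerpetualClassifier(budget=0.4)\n", "cloned = clone(base)\n", "\n", "# clone returns a fresh, unfitted estimator with identical hyperparameters\n", "print(cloned is base)\n", "print(cloned.get_params()[\"budget\"])"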
] }, { "cell_type": "code", "execution_count": null, "id": "6", "metadata": {}, "outputs": [], "source": [ "# Use a smaller subset for faster CV\n", "X_sub, _, y_sub, _ = train_test_split(\n", " X_cov, y_cov, train_size=10000, random_state=42, stratify=y_cov\n", ")\n", "\n", "scores = cross_val_score(\n", " PerpetualClassifier(budget=0.4), X_sub, y_sub, cv=5, scoring=\"roc_auc\"\n", ")\n", "print(f\"5-Fold CV AUC: {scores.mean():.4f} ± {scores.std():.4f}\")" ] }, { "cell_type": "markdown", "id": "7", "metadata": {}, "source": [ "### Pipeline Integration" ] }, { "cell_type": "code", "execution_count": null, "id": "8", "metadata": {}, "outputs": [], "source": [ "pipe = Pipeline(\n", " [\n", " (\"scaler\", StandardScaler()),\n", " (\"clf\", PerpetualClassifier(budget=0.5)),\n", " ]\n", ")\n", "pipe.fit(X_tr[:5000], y_tr[:5000])\n", "pipe_preds = pipe.predict(X_te[:1000])\n", "print(f\"Pipeline accuracy (subset): {accuracy_score(y_te[:1000], pipe_preds):.4f}\")" ] }, { "cell_type": "markdown", "id": "9", "metadata": {}, "source": [ "---\n", "## Part 2: Regression with PerpetualRegressor\n", "\n", "We use the **California Housing** dataset — a classic regression\n", "benchmark predicting median house value." 
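] }, { "cell_type": "markdown", "id": "9b", "metadata": {}, "source": [ "Because `budget` is an ordinary constructor parameter, it can be tuned with\n", "`GridSearchCV` like any other hyperparameter. A minimal sketch on a toy\n", "regression problem (the candidate budget values are illustrative, not\n", "recommendations):" ] }, { "cell_type": "code", "execution_count": null, "id": "9c", "metadata": {}, "outputs": [], "source": [ "from sklearn.datasets import make_regression\n", "from sklearn.model_selection import GridSearchCV\n", "\n", "X_toy_r, y_toy_r = make_regression(n_samples=500, n_features=5, noise=10.0, random_state=0)\n", "\n", "search = GridSearchCV(\n", "    PerpetualRegressor(budget=0.5),\n", "    param_grid={\"budget\": [0.25, 0.5, 1.0]},\n", "    cv=3,\n", "    scoring=\"r2\",\n", ")\n", "search.fit(X_toy_r, y_toy_r)\n", "print(search.best_params_)"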
] }, { "cell_type": "code", "execution_count": null, "id": "10", "metadata": {}, "outputs": [], "source": [ "print(\"Fetching California Housing dataset...\")\n", "housing = fetch_california_housing(as_frame=True)\n", "X_h = housing.data.values\n", "y_h = housing.target.values\n", "\n", "X_tr_h, X_te_h, y_tr_h, y_te_h = train_test_split(\n", " X_h, y_h, test_size=0.2, random_state=42\n", ")\n", "print(f\"Train: {X_tr_h.shape}, Test: {X_te_h.shape}\")" ] }, { "cell_type": "code", "execution_count": null, "id": "11", "metadata": {}, "outputs": [], "source": [ "reg = PerpetualRegressor(budget=0.5)\n", "reg.fit(X_tr_h, y_tr_h)\n", "\n", "y_pred_h = reg.predict(X_te_h)\n", "print(f\"R² Score: {r2_score(y_te_h, y_pred_h):.4f}\")\n", "print(f\"RMSE: {np.sqrt(np.mean((y_te_h - y_pred_h) ** 2)):.4f}\")" ] }, { "cell_type": "markdown", "id": "12", "metadata": {}, "source": [ "### Cross-Validation for Regression" ] }, { "cell_type": "code", "execution_count": null, "id": "13", "metadata": {}, "outputs": [], "source": [ "scores_r2 = cross_val_score(\n", " PerpetualRegressor(budget=0.5), X_h, y_h, cv=5, scoring=\"r2\"\n", ")\n", "print(f\"5-Fold CV R²: {scores_r2.mean():.4f} ± {scores_r2.std():.4f}\")" ] }, { "cell_type": "markdown", "id": "14", "metadata": {}, "source": [ "---\n", "## Part 3: Ranking with PerpetualRanker\n", "\n", "Learning-to-rank models optimize the ordering of items within groups.\n", "We demonstrate with a synthetic search-relevance task." 
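] }, { "cell_type": "markdown", "id": "14b", "metadata": {}, "source": [ "The evaluation metric we use later, NDCG, rewards placing highly relevant\n", "items near the top: each item's relevance is discounted by the log of its\n", "position, and the sum is normalized by the score of the ideal ordering. A\n", "hand-rolled sketch for a single query (linear gains, matching sklearn's\n", "`ndcg_score` convention; the relevance and score values are made up):" ] }, { "cell_type": "code", "execution_count": null, "id": "14c", "metadata": {}, "outputs": [], "source": [ "def dcg(rel):\n", "    # Discounted cumulative gain with linear gains: sum(rel_i / log2(i + 1))\n", "    return np.sum(rel / np.log2(np.arange(2, len(rel) + 2)))\n", "\n", "rel_true = np.array([3.0, 2.0, 3.0, 0.0, 1.0])\n", "pred_scores = np.array([0.9, 0.8, 0.1, 0.4, 0.7])\n", "\n", "order = np.argsort(-pred_scores)  # predicted ranking, best first\n", "ideal = np.sort(rel_true)[::-1]   # best possible ranking\n", "print(f\"NDCG: {dcg(rel_true[order]) / dcg(ideal):.4f}\")"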
] }, { "cell_type": "code", "execution_count": null, "id": "15", "metadata": {}, "outputs": [], "source": [ "np.random.seed(42)\n", "\n", "n_queries = 200\n", "docs_per_query = 20\n", "n_features = 10\n", "n_total = n_queries * docs_per_query\n", "\n", "X_rank = np.random.randn(n_total, n_features)\n", "\n", "# Relevance depends on first 3 features + noise\n", "relevance = (\n", " 2 * X_rank[:, 0]\n", " + X_rank[:, 1]\n", " - 0.5 * X_rank[:, 2]\n", " + np.random.randn(n_total) * 0.5\n", ")\n", "# Convert to rank labels 0-4\n", "y_rank = np.clip(np.round(relevance + 2), 0, 4).astype(float)\n", "\n", "# Group: each query has `docs_per_query` documents\n", "qid = np.repeat(np.arange(n_queries), docs_per_query)\n", "\n", "# Split by query\n", "train_q = qid < 140\n", "test_q = ~train_q\n", "\n", "print(\n", " f\"Train queries: {np.unique(qid[train_q]).shape[0]}, \"\n", " f\"Test queries: {np.unique(qid[test_q]).shape[0]}\"\n", ")" ] }, { "cell_type": "code", "execution_count": null, "id": "16", "metadata": {}, "outputs": [], "source": [ "ranker = PerpetualRanker(budget=0.5)\n", "ranker.fit(X_rank[train_q], y_rank[train_q], qid=qid[train_q])\n", "\n", "scores_rank = ranker.predict(X_rank[test_q])\n", "print(f\"Predicted score range: [{scores_rank.min():.2f}, {scores_rank.max():.2f}]\")\n", "print(\n", " f\"Correlation with true relevance: {np.corrcoef(y_rank[test_q], scores_rank)[0, 1]:.4f}\"\n", ")" ] }, { "cell_type": "markdown", "id": "17", "metadata": {}, "source": [ "### NDCG Evaluation" ] }, { "cell_type": "code", "execution_count": null, "id": "18", "metadata": {}, "outputs": [], "source": [ "from sklearn.metrics import ndcg_score\n", "\n", "# Evaluate per query, then average\n", "test_qids = np.unique(qid[test_q])\n", "ndcgs = []\n", "for q in test_qids:\n", " mask = qid[test_q] == q\n", " if y_rank[test_q][mask].max() > 0: # Skip all-zero queries\n", " ndcgs.append(\n", " ndcg_score(\n", " y_rank[test_q][mask].reshape(1, -1),\n", " scores_rank[mask].reshape(1, 
-1),\n", " k=10,\n", " )\n", " )\n", "\n", "print(f\"Mean NDCG@10: {np.mean(ndcgs):.4f}\")" ] }, { "cell_type": "markdown", "id": "19", "metadata": {}, "source": [ "---\n", "## Summary\n", "\n", "| Estimator | Task | Dataset | Key Metric |\n", "|-----------|------|---------|------------|\n", "| `PerpetualClassifier` | Binary classification | Cover Type | AUC |\n", "| `PerpetualRegressor` | Regression | California Housing | R² |\n", "| `PerpetualRanker` | Learning-to-rank | Synthetic search | NDCG@10 |\n", "\n", "All three estimators:\n", "- Work seamlessly with `cross_val_score`, `Pipeline`, and other sklearn utilities.\n", "- Inherit Perpetual's **self-generalizing** behavior (no manual early stopping).\n", "- Support `budget` for controlling model complexity." ] } ], "metadata": { "language_info": { "name": "python" } }, "nbformat": 4, "nbformat_minor": 5 }