Perpetual
Python API Reference
The PerpetualBooster class is currently the only public-facing class in the package. It can be used to train gradient boosted decision tree ensembles with multiple objective functions.
PerpetualBooster
PerpetualBooster(*, objective: str = 'LogLoss', num_threads: Optional[int] = None, monotone_constraints: Union[Dict[Any, int], None] = None, force_children_to_bound_parent: bool = False, missing: float = np.nan, allow_missing_splits: bool = True, create_missing_branch: bool = False, terminate_missing_features: Optional[Iterable[Any]] = None, missing_node_treatment: str = 'None', log_iterations: int = 0, feature_importance_method: str = 'Gain', budget: Optional[float] = None, alpha: Optional[float] = None, reset: Optional[bool] = None, categorical_features: Union[Iterable[int], Iterable[str], str, None] = 'auto', timeout: Optional[float] = None, iteration_limit: Optional[int] = None, memory_limit: Optional[float] = None, stopping_rounds: Optional[int] = None, max_bin: int = 256, max_cat: int = 1000)
PerpetualBooster class, used to generate gradient boosted decision tree ensembles. The following parameters can also be specified in the fit method to override the values in the constructor: budget, alpha, reset, categorical_features, timeout, iteration_limit, memory_limit, and stopping_rounds.
Parameters:
- objective (str, default: 'LogLoss') – Learning objective function to be used for optimization. Valid options are "LogLoss" for logistic loss (classification), "SquaredLoss" for squared error (regression), and "QuantileLoss" for quantile error (regression). Defaults to "LogLoss".
- num_threads (int, default: None) – Number of threads to be used during training.
- monotone_constraints (Dict[Any, int], default: None) – Constraints used to enforce a specific relationship between a training feature and the target variable. Provide a dictionary whose keys are feature index values (if the model will be fit on a numpy array) or feature names (if it will be fit on a DataFrame), and whose values are -1, 1, or 0: -1 enforces a negative relationship, 1 a positive relationship, and 0 no specific relationship at all. Features not included in the mapping will not have any constraint applied. If None is passed, no constraints will be enforced on any variable. Defaults to None. (A combined sketch appears after the Example section below.)
- force_children_to_bound_parent (bool, default: False) – Setting this parameter to True will restrict child nodes so that they always contain the parent node inside of their range. Without this, it is possible for both the left and the right child to be greater than, or less than, the parent node. Defaults to False.
- missing (float, default: nan) – Value to consider missing when training and predicting with the booster. Defaults to np.nan.
- allow_missing_splits (bool, default: True) – Allow splits to be made such that all missing values go down one branch, and all non-missing values go down the other, if this results in the greatest reduction of loss. If False, splits will only be made on non-missing values. If create_missing_branch is set to True, setting this parameter to True allows the missing branch to be split further; if it is False, the missing branch will always be a terminal node. Defaults to True.
- create_missing_branch (bool, default: False) – An experimental parameter that, if True, will create a separate branch for missing values, producing a ternary tree; the missing node will be given the same weight value as the parent node. If this parameter is False, missing values will be sent down either the left or the right branch, producing a binary tree. Defaults to False.
- terminate_missing_features (Set[Any], default: None) – An optional iterable of features (either strings, or integer values specifying the feature indices if numpy arrays are used for fitting) for which the missing node will always be terminated, even if allow_missing_splits is set to True. This value is only valid if create_missing_branch is also True.
- missing_node_treatment (str, default: 'None') – Method for selecting the weight of the missing node, if create_missing_branch is set to True. Defaults to "None". Valid options are:
  - "None": Calculate missing node weight values without any constraints.
  - "AssignToParent": Assign the weight of the missing node to that of the parent.
  - "AverageLeafWeight": After training each tree, starting from the bottom of the tree, assign the missing node weight to the weighted average of the left and right child nodes, then assign the parent to the weighted average of its children. This is performed recursively up through the entire tree, as a post-processing step on each tree after it is built, and prior to updating the predictions used to train the next tree.
  - "AverageNodeWeight": Set the missing node weight equal to the weighted average weight of the left and right nodes.
- log_iterations (int, default: 0) – Info is logged while the model is being trained if this parameter is set to a value greater than 0 (see the Logging output section below).
- feature_importance_method (str, default: 'Gain') – The feature importance method type used to calculate the feature_importances_ attribute on the booster.
- budget (float, default: None) – A positive number for the fitting budget. Increasing this number will more likely result in more boosting rounds and increased predictive power. The default value is 1.0.
- alpha (float, default: None) – Only used in quantile regression.
- reset (bool, default: None) – Whether to reset the model or continue training.
- categorical_features (Union[Iterable[int], Iterable[str], str, None], default: 'auto') – The names or indices of categorical features. Defaults to 'auto', which detects Polars or Pandas categorical data types.
- timeout (float, default: None) – Optional fit timeout in seconds.
- iteration_limit (int, default: None) – Optional limit on the number of boosting rounds. The default is 1000 boosting rounds. In most cases, the algorithm automatically stops before hitting this limit. If you want to experiment with a very high budget (>2.0), you can also increase this limit.
- memory_limit (float, default: None) – Optional limit for memory allocation, in GB. If not set, memory will be allocated based on available memory and the algorithm's requirements.
- stopping_rounds (int, default: None) – Optional limit for auto stopping.
- max_bin (int, default: 256) – Maximum number of bins for feature discretization. Defaults to 256.
- max_cat (int, default: 1000) – Maximum number of unique categories for a categorical feature. Features with more categories will be treated as numerical. Defaults to 1000.
Raises:
- TypeError – Raised if an invalid dtype is passed.
Example
Once the booster has been initialized, it can be fit on a provided dataset and target. After fitting, the model can be used to predict on a dataset. In the case of this example, the predictions are the log odds of a given record being 1.
# Small example dataset
from seaborn import load_dataset
df = load_dataset("titanic")
X = df.select_dtypes("number").drop(columns=["survived"])
y = df["survived"]
# Initialize a booster with defaults.
from perpetual import PerpetualBooster
model = PerpetualBooster(objective="LogLoss")
model.fit(X, y)
# Predict on data
model.predict(X.head())
# array([-1.94919663, 2.25863229, 0.32963671, 2.48732194, -3.00371813])
# predict contributions
model.predict_contributions(X.head())
# array([[-0.63014213, 0.33880048, -0.16520798, -0.07798772, -0.85083578,
# -1.07720813],
# [ 1.05406709, 0.08825999, 0.21662544, -0.12083538, 0.35209258,
#            -1.07720813],
#        ...])
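The constructor options above can be combined. Below is a minimal sketch (not from the library's docs) that constrains the "age" column of the titanic frame and enables the experimental missing branch:
from perpetual import PerpetualBooster
model = PerpetualBooster(
    objective="LogLoss",
    monotone_constraints={"age": -1},           # enforce a negative relationship with the target
    create_missing_branch=True,                 # experimental: grow a third, missing-value branch
    missing_node_treatment="AverageLeafWeight",
    budget=1.5,                                 # spend more effort than the 1.0 default
)
model.fit(X, y)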
base_score
property
Base score of the model.
Returns:
- Union[float, Iterable[float]] – Base score(s) of the model.
number_of_trees
property
The number of trees in the model.
Returns:
- Union[int, Iterable[int]] – The total number of trees in the model.
fit
fit(X, y, sample_weight=None, budget: Optional[float] = None, alpha: Optional[float] = None, reset: Optional[bool] = None, categorical_features: Union[Iterable[int], Iterable[str], str, None] = 'auto', timeout: Optional[float] = None, iteration_limit: Optional[int] = None, memory_limit: Optional[float] = None, stopping_rounds: Optional[int] = None) -> Self
Fit the gradient booster on a provided dataset.
Parameters:
- X (FrameLike) – Either a Polars or Pandas DataFrame, or a 2 dimensional Numpy array.
- y (Union[FrameLike, ArrayLike]) – Either a Polars or Pandas DataFrame or Series, or a 1 or 2 dimensional Numpy array.
- sample_weight (Union[ArrayLike, None], default: None) – Instance weights to use when training the model. If None is passed, a weight of 1 will be used for every record. Defaults to None. (See the sketch after this list.)
- budget (float, default: None) – A positive number for the fitting budget. Increasing this number will more likely result in more boosting rounds and increased predictive power. Defaults to 1.0.
- alpha (float, default: None) – Only used in quantile regression.
- reset (bool, default: None) – Whether to reset the model or continue training.
- categorical_features (Union[Iterable[int], Iterable[str], str, None], default: 'auto') – The names or indices of categorical features. Defaults to 'auto', which detects Polars or Pandas categorical data types.
- timeout (float, default: None) – Optional fit timeout in seconds.
- iteration_limit (int, default: None) – Optional limit on the number of boosting rounds. The default is 1000 boosting rounds. In most cases, the algorithm automatically stops before hitting this limit. If you want to experiment with a very high budget (>2.0), you can also increase this limit.
- memory_limit (float, default: None) – Optional limit for memory allocation, in GB. If not set, memory will be allocated based on available memory and the algorithm's requirements.
- stopping_rounds (int, default: None) – Optional limit for auto stopping. Defaults to 3.
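Example
A minimal sketch of overriding constructor values at fit time and passing instance weights; the uniform weights here are illustrative only.
import numpy as np
weights = np.ones(len(y))  # any positive per-record weights work
model = PerpetualBooster(objective="LogLoss")
model.fit(X, y, sample_weight=weights, budget=0.5)  # budget overrides the constructor value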
predict
Predict with the fitted booster on new data.
Parameters:
- X (FrameLike) – Either a Polars or Pandas DataFrame, or a 2 dimensional Numpy array.
- parallel (Union[bool, None], default: None) – Optionally specify whether the predict function should run in parallel on multiple threads. If None is passed, the parallel attribute of the booster will be used. Defaults to None.
Returns:
- np.ndarray – Returns a numpy array of the predictions.
predict_proba
Predict probabilities with the fitted booster on new data.
Parameters:
- X (FrameLike) – Either a Polars or Pandas DataFrame, or a 2 dimensional Numpy array.
- parallel (Union[bool, None], default: None) – Optionally specify whether the predict function should run in parallel on multiple threads. If None is passed, the parallel attribute of the booster will be used. Defaults to None.
Returns:
- np.ndarray, shape (n_samples, n_classes) – Returns a numpy array of the class probabilities.
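Example
A short sketch using the binary LogLoss model fit above; for a two-class model the result has one column per class.
model.predict_proba(X.head()).shape
# (5, 2)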
predict_log_proba
Predict class log-probabilities with the fitted booster on new data.
Parameters:
- X (FrameLike) – Either a Polars or Pandas DataFrame, or a 2 dimensional Numpy array.
- parallel (Union[bool, None], default: None) – Optionally specify whether the predict function should run in parallel on multiple threads. If None is passed, the parallel attribute of the booster will be used. Defaults to None.
Returns:
- np.ndarray – Returns a numpy array of the class log-probabilities.
predict_contributions
Predict with the fitted booster on new data, returning the feature contribution matrix. The last column is the bias term.
Parameters:
- X (FrameLike) – Either a Pandas DataFrame, or a 2 dimensional Numpy array.
- method (str, default: 'Average') – Method to calculate the contributions. Available options are:
  - "Average": If this option is specified, the average internal node values are calculated.
  - "Shapley": Using this option will calculate contributions using the tree SHAP algorithm.
  - "Weight": This method will use the internal leaf weights to calculate the contributions. This is the same as what is described by Saabas.
  - "BranchDifference": This method will calculate contributions by subtracting the weight of the other non-missing branch from the weight of the node the record will travel down. This method does not have the property where the summed contributions equal the final prediction of the model.
  - "MidpointDifference": This method will calculate contributions by subtracting the midpoint between the right and left nodes, weighted by the cover of each node, from the weight of the node the record will travel down. This method does not have the property where the summed contributions equal the final prediction of the model.
  - "ModeDifference": This method will calculate contributions by subtracting the weight of the node with the largest cover (the mode node) from the weight of the node the record will travel down. This method does not have the property where the summed contributions equal the final prediction of the model.
  - "ProbabilityChange": This method is only valid when the objective type is set to "LogLoss". Contributions are calculated as the change in a record's probability of being 1 when moving from a parent node to a child node. The sum of the returned contributions matrix will be equal to the probability that a record will be 1. For example, given a model: model.predict_contributions(X, method="ProbabilityChange") == 1 / (1 + np.exp(-model.predict(X)))
- parallel (Union[bool, None], default: None) – Optionally specify whether the predict function should run in parallel on multiple threads. If None is passed, the parallel attribute of the booster will be used. Defaults to None. (A verification sketch follows the Returns section below.)
Returns:
- np.ndarray – Returns a numpy array of the feature contributions.
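Example
A sanity-check sketch for the default "Average" method, using the model fit above: the per-row contributions plus the bias term in the last column should sum to the raw predictions.
import numpy as np
contribs = model.predict_contributions(X.head())
np.allclose(contribs.sum(axis=1), model.predict(X.head()))
# True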
partial_dependence
partial_dependence(X, feature: Union[str, int], samples: Optional[int] = 100, exclude_missing: bool = True, percentile_bounds: Tuple[float, float] = (0.2, 0.98)) -> np.ndarray
Calculate the partial dependence values of a feature. For each unique value of the feature, this gives the estimate of the predicted value for that feature, with the effects of all features averaged out. This information gives an estimate of how a given feature impacts the model.
Parameters:
- X (FrameLike) – Either a Pandas DataFrame, or a 2 dimensional Numpy array. This should be the same data passed into the model's fit or predict, with the columns in the same order.
- feature (Union[str, int]) – The feature for which to calculate the partial dependence values. This can be the name of a column, if the provided X is a Pandas DataFrame, or the index of the feature.
- samples (Optional[int], default: 100) – Number of evenly spaced samples to select. If None is passed, all unique values will be used. Defaults to 100.
- exclude_missing (bool, default: True) – Should missing values be excluded from the feature? Defaults to True.
- percentile_bounds (Tuple[float, float], default: (0.2, 0.98)) – Lower and upper percentiles to use when selecting the samples. Defaults to (0.2, 0.98), capping the selected samples between the 20th and 98th percentiles.
Raises:
- ValueError – An error will be raised if the provided X parameter is not a Pandas DataFrame and a string is provided for the feature.
Returns:
- np.ndarray – A 2 dimensional numpy array, where the first column is the sorted unique values of the feature, and the second column is the partial dependence value for each feature value.
Example
This information can be plotted to visualize how a feature is used in the model, like so.
from seaborn import lineplot
import matplotlib.pyplot as plt
pd_values = model.partial_dependence(X=X, feature="age", samples=None)
fig = lineplot(x=pd_values[:,0], y=pd_values[:,1],)
plt.title("Partial Dependence Plot")
plt.xlabel("Age")
plt.ylabel("Log Odds")
We can see how this is impacted if a model is created where a specific constraint is applied to the feature using the monotone_constraints parameter.
model = PerpetualBooster(
objective="LogLoss",
monotone_constraints={"age": -1},
)
model.fit(X, y)
pd_values = model.partial_dependence(X=X, feature="age")
fig = lineplot(
x=pd_values[:, 0],
y=pd_values[:, 1],
)
plt.title("Partial Dependence Plot with Monotonicity")
plt.xlabel("Age")
plt.ylabel("Log Odds")
calculate_feature_importance
calculate_feature_importance(method: str = 'Gain', normalize: bool = True) -> Union[Dict[int, float], Dict[str, float]]
Feature importance values can be calculated with the calculate_feature_importance method. This function will return a dictionary of the features and their importance values. It should be noted that if a feature was never used for splitting, it will not be returned in the importance dictionary.
Parameters:
- method (str, default: 'Gain') – Variable importance method. Defaults to "Gain". Valid options are:
  - "Weight": The number of times a feature is used to split the data across all trees.
  - "Gain": The average split gain across all splits the feature is used in.
  - "Cover": The average coverage across all splits the feature is used in.
  - "TotalGain": The total gain across all splits the feature is used in.
  - "TotalCover": The total coverage across all splits the feature is used in.
- normalize (bool, default: True) – Should the importance be normalized to sum to 1? Defaults to True.
Returns:
- Union[Dict[int, float], Dict[str, float]] – Variable importance values, for features present in the model.
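Example
A minimal usage sketch with the model fit above; the keys and values shown are illustrative, not taken from a real run.
model.calculate_feature_importance(method="Gain")
# {'pclass': 0.31, 'age': 0.27, 'fare': 0.22, ...}  # illustrative values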
text_dump
Return all of the trees of the model in text form.
Returns:
-
List[str]
–List[str]: A list of strings, where each string is a text representation of the tree.
Example:
model.text_dump()[0]
# 0:[0 < 3] yes=1,no=2,missing=2,gain=91.50833,cover=209.388307
# 1:[4 < 13.7917] yes=3,no=4,missing=4,gain=28.185467,cover=94.00148
# 3:[1 < 18] yes=7,no=8,missing=8,gain=1.4576768,cover=22.090348
# 7:[1 < 17] yes=15,no=16,missing=16,gain=0.691266,cover=0.705011
# 15:leaf=-0.15120,cover=0.23500
# 16:leaf=0.154097,cover=0.470007
json_dump
Return the booster object as a string.
Returns:
- str – The booster dumped as a JSON object in string form.
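Example
Since the dump is a JSON string, it can be parsed with the standard library; a minimal sketch.
import json
booster_dict = json.loads(model.json_dump())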
load_booster
classmethod
Load a booster object that was saved with the save_booster method.
Parameters:
- path (str) – Path to the saved booster file.
Returns:
- PerpetualBooster – An initialized booster object.
save_booster
Save a booster object; the underlying representation is a JSON file.
Parameters:
- path (str) – Path to save the booster object.
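Example
A round-trip sketch: save the fitted booster and load it back. The file name is illustrative.
model.save_booster("model.json")
loaded = PerpetualBooster.load_booster("model.json")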
insert_metadata
Insert data into the model's metadata; this will be saved on the booster object.
Parameters:
- key (str) – Key to give the inserted value in the metadata.
- value (str) – String value to assign to the key.
get_metadata
Get the value associated with a given key in the booster's metadata.
Parameters:
- key (str) – Key of the item in the metadata.
Returns:
- str – Value associated with the provided key in the booster's metadata.
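Example
A sketch of storing and reading back a metadata entry; the key and value are illustrative.
model.insert_metadata("data_version", "titanic-v1")
model.get_metadata("data_version")
# 'titanic-v1'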
get_params
Get all of the parameters for the booster.
Parameters:
- deep (bool, default: True) – This argument does nothing and is simply here for scikit-learn compatibility. Defaults to True.
Returns:
- Dict[str, Any] – The parameters of the booster.
set_params
Set the parameters of the booster; this has the same effect as re-instantiating the booster.
Returns:
- PerpetualBooster – Booster with new parameters.
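Example
A minimal sketch, assuming set_params follows the scikit-learn keyword convention (an assumption, not confirmed by this page).
params = model.get_params()
model.set_params(budget=0.7)  # assumed keyword-style, as in scikit-learn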
get_node_lists
Return the tree structures representation as a list of python objects.
Parameters:
- map_features_names (bool, default: True) – Should the feature names be mapped to strings, if a Pandas DataFrame was used for fitting? Defaults to True.
Returns:
- List[List[Node]] – A list of lists, where each sub-list is a tree with all of its respective nodes.
Example
This can be run directly to get the tree structure as python objects.
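A minimal sketch, using the model fit above:
nodes = model.get_node_lists(map_features_names=True)
nodes[0][0]  # the root Node of the first tree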
trees_to_dataframe
Return the tree structure as a Polars or Pandas DataFrame object.
Returns:
- DataFrame – Trees in a Polars or Pandas DataFrame.
Example
This can be used directly to print out the tree structure as a dataframe. The Leaf values will have the "Gain" column replaced with the weight value.
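A minimal sketch of the call producing the frame below (only the first two rows are shown):
model.trees_to_dataframe().head(2)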
|   | Tree | Node | ID  | Feature | Split   | Yes | No  | Missing | Gain    | Cover   |
|---|------|------|-----|---------|---------|-----|-----|---------|---------|---------|
| 0 | 0    | 0    | 0-0 | pclass  | 3       | 0-1 | 0-2 | 0-2     | 91.5083 | 209.388 |
| 1 | 0    | 1    | 0-1 | fare    | 13.7917 | 0-3 | 0-4 | 0-4     | 28.1855 | 94.0015 |
Logging output
Info is logged while the model is being trained if the log_iterations parameter is set to a value greater than 0 while fitting the booster. The logs can be printed to stdout while training like so.
import logging
logging.basicConfig()
logging.getLogger().setLevel(logging.INFO)
model = PerpetualBooster(log_iterations=1)
model.fit(X, y)
# INFO:perpetual.perpetualbooster:Completed iteration 0 of 10
# INFO:perpetual.perpetualbooster:Completed iteration 1 of 10
# INFO:perpetual.perpetualbooster:Completed iteration 2 of 10
The log output can also be captured in a file using the logging.basicConfig() filename option, as sketched below.
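A minimal sketch; the file name is illustrative.
import logging
logging.basicConfig(filename="train.log", level=logging.INFO)
model = PerpetualBooster(log_iterations=1)
model.fit(X, y)
# training info is written to train.log instead of stdout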