stc_unicef_cpi.models package

Submodules

stc_unicef_cpi.models.inflated_vals_2stg module

class stc_unicef_cpi.models.inflated_vals_2stg.InflatedValsRegressor(classifier: ClassifierMixin, regressor: RegressorMixin)

Bases: BaseEstimator, RegressorMixin

A meta regressor for datasets with inflated values, i.e. the targets contain certain values with much higher frequency than others.

InflatedValsRegressor consists of a classifier and a regressor.

The classifier’s task is to find out whether the target is an inflated value or not.

The regressor’s task is to output a prediction whenever the classifier indicates that the target is not an inflated value (with the default inflated_vals of [0], this means a non-zero prediction).

The regressor is only trained on examples where the target is not an inflated value, so it does not have to model the inflated values itself.

At prediction time, the classifier is first asked whether the output should be one of the inflated values. Depending on the mode selected, either

(i) output that value, or

(ii) use the estimated class probabilities to weight the output.

If the target is not predicted to be an inflated value, then in case (i) ask the regressor for its prediction and output it.
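The two prediction modes described above can be sketched with plain scikit-learn estimators. This is a minimal illustration, not the package's implementation: the helper names two_stage_fit and two_stage_predict are hypothetical, a single inflated value of 0 is assumed, and the weighting formula for the weighted mode is one plausible reading of "use the estimated class probabilities to weight the output".

```python
import numpy as np
from sklearn.ensemble import ExtraTreesClassifier, ExtraTreesRegressor


def two_stage_fit(X, y, inflated_val=0.0):
    """Fit a classifier on 'is the target inflated?' and a regressor on the rest."""
    is_inflated = np.isclose(y, inflated_val)
    clf = ExtraTreesClassifier(n_estimators=50, random_state=0).fit(X, is_inflated)
    # The regressor only sees rows whose target is not the inflated value.
    reg = ExtraTreesRegressor(n_estimators=50, random_state=0).fit(
        X[~is_inflated], y[~is_inflated]
    )
    return clf, reg


def two_stage_predict(clf, reg, X, inflated_val=0.0, weighted=False):
    """Strict mode picks one branch per row; weighted mode blends both branches."""
    reg_pred = reg.predict(X)
    if weighted:
        # Probability that each row belongs to the "inflated" class (label True).
        p_inflated = clf.predict_proba(X)[:, list(clf.classes_).index(True)]
        return p_inflated * inflated_val + (1.0 - p_inflated) * reg_pred
    return np.where(clf.predict(X), inflated_val, reg_pred)


rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))
y = ((X[:, 0] > 0) & (X[:, 1] > 0)) * np.abs(X[:, 2] * X[:, 3] ** 2)

clf, reg = two_stage_fit(X, y)
strict = two_stage_predict(clf, reg, X)
soft = two_stage_predict(clf, reg, X, weighted=True)
```

In strict mode each row gets either the inflated value or the regressor's output. In the weighted sketch every row gets a probability-weighted blend of the two.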

Examples

import numpy as np
from sklearn.ensemble import ExtraTreesClassifier, ExtraTreesRegressor
from stc_unicef_cpi.models.inflated_vals_2stg import InflatedValsRegressor

np.random.seed(0)
X = np.random.randn(10000, 4)
y = ((X[:, 0] > 0) & (X[:, 1] > 0)) * np.abs(X[:, 2] * X[:, 3] ** 2)
z = InflatedValsRegressor(
    classifier=ExtraTreesClassifier(random_state=0),
    regressor=ExtraTreesRegressor(random_state=0),
)
z.fit(X, y)
# InflatedValsRegressor(classifier=ExtraTreesClassifier(random_state=0),
#                     regressor=ExtraTreesRegressor(random_state=0))
z.predict(X)[:5]
# array([4.91483294, 0.        , 0.        , 0.04941909, 0.        ])

Fit the model.

Parameters

X (Union[np.ndarray, pd.DataFrame]) – The training data in shape (n_samples, n_features).
y (Union[np.ndarray, pd.Series]) – The target values, 1-dimensional.
inflated_vals (Union[List[float], np.ndarray], optional) – Inflated values, defaults to [0]
sample_weight (Optional[np.ndarray], optional) – Individual weights for each sample, defaults to None

Raises

ValueError – If classifier is not a classifier or regressor is not a regressor.

Returns

Fitted regressor.

Return type

InflatedValsRegressor

get_cls_labels(y: np.ndarray | pd.Series, inflated_vals: list[float] | np.ndarray, init=True) → pd.Series

Get class labels of targets, y, according to inflated values passed

Parameters

y (Union[np.ndarray, pd.Series]) – Target values
inflated_vals (Union[List[float], np.ndarray]) – Inflated values
init (bool, optional) – Initialise, defaults to True

Returns

Class labels

Return type

pd.Series
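As a rough illustration of what labelling targets by inflated value could look like, the sketch below assigns class 0 to ordinary targets and class i + 1 to targets equal to the i-th inflated value. The helper name label_inflated and this particular encoding are assumptions. The actual label scheme and the role of the init flag are not specified here.

```python
import numpy as np
import pandas as pd


def label_inflated(y, inflated_vals):
    """Return 0 for ordinary targets and i + 1 for targets equal to inflated_vals[i]."""
    y = pd.Series(np.asarray(y, dtype=float))
    labels = pd.Series(0, index=y.index, dtype=int)
    for i, val in enumerate(inflated_vals):
        # Float-tolerant comparison, so 0.0 and 1e-17 count as the same value.
        labels.loc[np.isclose(y.to_numpy(), val)] = i + 1
    return labels


labels = label_inflated([0.0, 0.3, 1.0, 0.0, 0.7], inflated_vals=[0, 1])
print(labels.tolist())  # [1, 0, 2, 1, 0]
```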

predict(X: np.ndarray | pd.DataFrame, weighted: bool = False, allow_nan: bool = True) → np.ndarray

Make predictions.

Parameters

X (Union[np.ndarray, pd.DataFrame]) – Samples to get predictions of, shape (n_samples, n_features).
weighted (bool, optional) – Weight output, or use strict class predictions, defaults to False

Returns

The predicted values.

Return type

np.ndarray, shape (n_samples,)

stc_unicef_cpi.models.lgbm_baseline module

stc_unicef_cpi.models.lgbm_baseline.adjusted_rsquared(r2, n, p)
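The signature above carries no description. Assuming it computes the standard adjusted R² from a plain R² value r2, a sample count n and a predictor count p, the calculation would look like the sketch below. The name adjusted_r2 is hypothetical and the package function may differ.

```python
def adjusted_r2(r2: float, n: int, p: int) -> float:
    """Standard adjusted R^2: penalise R^2 for the number of predictors p."""
    if n - p - 1 <= 0:
        raise ValueError("need n > p + 1 samples for adjusted R^2 to be defined")
    return 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1)


# With 101 samples and 10 predictors, an R^2 of 0.8 shrinks to about 0.778.
example = adjusted_r2(0.8, n=101, p=10)
```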

stc_unicef_cpi.models.lgbm_baseline.basic_model_pipeline(model, num_scaler='robust', cat_encoder='ohe', imputer='simple')

Build a simple pipeline from the given sklearn-style model, along with suitable transformers. This is an alternative construction to the one below that takes advantage of sklearn functions.

Parameters

model (_type_) – _description_
num_scaler (str, optional) – _description_, defaults to ‘robust’
cat_encoder (str, optional) – _description_, defaults to ‘ohe’
imputer (str, optional) – _description_, defaults to ‘simple’

stc_unicef_cpi.models.lgbm_baseline.basic_preprocessor(X_train, model_type='lgb')

stc_unicef_cpi.models.lgbm_baseline.callback(study, trial)

stc_unicef_cpi.models.lgbm_baseline.flaml_multireg(X, Y, log_run=True, time_budget=60, scorer=<function <lambda>>)

stc_unicef_cpi.models.lgbm_baseline.get_card_split(df, cols, n=11)

Splits categorical columns into two lists based on cardinality (i.e. the number of unique values).

Parameters

df (Pandas DataFrame) – DataFrame from which the cardinality of the columns is calculated.
cols (list-like) – Categorical columns to list
n (int, optional) – The value of ‘n’ will be used to split columns, defaults to 11

Returns

card_low (list-like) – Columns with cardinality < n
card_high (list-like) – Columns with cardinality >= n
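The split described above can be sketched with pandas as follows. This is an illustrative reimplementation under the documented contract (cardinality < n goes to the first list, >= n to the second). The name split_by_cardinality and the toy DataFrame are made up for the example.

```python
import pandas as pd


def split_by_cardinality(df, cols, n=11):
    """Partition cols into (low, high) lists by number of unique values."""
    card_low, card_high = [], []
    for col in cols:
        # nunique() ignores NaN by default, so missing values do not add a category.
        (card_low if df[col].nunique() < n else card_high).append(col)
    return card_low, card_high


df = pd.DataFrame(
    {
        "region": list("abc") * 4,             # 3 unique values
        "id": [str(i) for i in range(12)],     # 12 unique values
    }
)
low, high = split_by_cardinality(df, ["region", "id"], n=11)
print(low, high)  # ['region'] ['id']
```

A split like this is commonly used to one-hot encode the low-cardinality columns and apply a more compact encoding to the rest.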

stc_unicef_cpi.models.lgbm_baseline.lgbmreg_optuna(X_train, X_test, y_train, y_test, log_run=True, target_name='test', logging_level=40, experiment_name='nga-cpi', tracking_uri='../models/mlruns')

Use Optuna / FLAML to train a tuned LGBMRegressor. NB: the target y is expected to be a vector, due to computational expense and the desire to log runs separately; if need be, run in a loop. Assumes feature engineering etc. has already been performed if desired.

See also the thoughts in various blog posts, e.g. https://towardsdatascience.com/kagglers-guide-to-lightgbm-hyperparameter-tuning-with-optuna-in-2021-ed048d9838b5

Or just directly from docs https://lightgbm.readthedocs.io/en/latest/Parameters-Tuning.html

Main params to target:

num_leaves (max. limit should be 2^(max_depth) according to docs) - number of decision points in a tree. Relatively easy to choose given max_depth, but expensive, so choose a conservative range, e.g. (20, 3000)
max_depth - number of levels. More levels make the model more complex and prone to overfitting; too few and it will underfit. Kaggle finds values of 3-12 work well for most datasets
min_data_in_leaf - minimum number of observations that fit the decision criteria of each leaf. Should be >100 for larger datasets, as this helps prevent overfitting
n_estimators - number of decision trees used. Larger will be slower but should be more accurate
learning_rate - step size parameter of gradient descent at each iteration, with typical values between 0.01 and 0.3, sometimes lower. The ideal setup with n_estimators is many trees with early stopping and a low learning rate
max_bin - default is already 255, likely to cause overfitting if increased
reg_alpha or reg_lambda - L1 / L2 regularisation. A good search range is usually (0, 100) for both
min_gain_to_split - a conservative search range is (0, 15), can help regularisation
bagging_fraction and feature_fraction - proportion of training samples (within (0, 1), needs bagging_freq set to an integer also) and proportion of features (also in (0, 1)), respectively, used to train each tree. Both can again help with overfitting
objective - the learning objective used, which can be custom (!)
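The ranges listed above could be turned into an Optuna-style search space roughly as follows. This is a sketch, not the function's actual search space: suggest_lgbm_params is a hypothetical helper, and the bounds for n_estimators, the upper bound for min_data_in_leaf, and the fraction ranges are assumptions, since the text gives no values for them. MidpointTrial is a stand-in so the example runs without Optuna installed. A real optuna.trial.Trial exposes the same suggest_int and suggest_float methods.

```python
def suggest_lgbm_params(trial):
    """Draw one LightGBM parameter set from the ranges discussed above."""
    max_depth = trial.suggest_int("max_depth", 3, 12)
    num_leaves = trial.suggest_int("num_leaves", 20, 3000, step=20)
    return {
        "max_depth": max_depth,
        # Clamp so that num_leaves never exceeds 2^(max_depth).
        "num_leaves": min(num_leaves, 2**max_depth),
        "min_data_in_leaf": trial.suggest_int("min_data_in_leaf", 100, 1000, step=100),  # upper bound assumed
        "n_estimators": trial.suggest_int("n_estimators", 100, 2000, step=100),  # range assumed
        "learning_rate": trial.suggest_float("learning_rate", 0.01, 0.3, log=True),
        "reg_alpha": trial.suggest_float("reg_alpha", 0.0, 100.0),
        "reg_lambda": trial.suggest_float("reg_lambda", 0.0, 100.0),
        "min_gain_to_split": trial.suggest_float("min_gain_to_split", 0.0, 15.0),
        "bagging_fraction": trial.suggest_float("bagging_fraction", 0.2, 0.95),  # range assumed
        "bagging_freq": 1,  # must be a positive integer for bagging_fraction to apply
        "feature_fraction": trial.suggest_float("feature_fraction", 0.2, 0.95),  # range assumed
    }


class MidpointTrial:
    """Stand-in for an Optuna trial: always returns (roughly) the middle of the range."""

    def suggest_int(self, name, low, high, step=1, log=False):
        return low + ((high - low) // (2 * step)) * step

    def suggest_float(self, name, low, high, step=None, log=False):
        return (low + high) / 2


params = suggest_lgbm_params(MidpointTrial())
```

Inside an Optuna objective, the returned dictionary would typically be unpacked into the model constructor and the validation score returned to the study.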

Additionally use MLflow to log the run unless specified not to

Parameters

X_train (_type_) – _description_
X_test (_type_) – _description_
y_train (_type_) – _description_
y_test (_type_) – _description_
log_run (bool, optional) – _description_, defaults to True
target_name (str, optional) – _description_, defaults to “test”
logging_level (_type_, optional) – _description_, defaults to optuna.logging.ERROR
experiment_name (str, optional) – _description_, defaults to “nga-cpi”

Returns

_description_

Return type

_type_

stc_unicef_cpi.models.lgbm_baseline.lgbmreg_optunaCV(X_train, X_test, y_train, y_test, log_run=True, target_name='test', logging_level=40, experiment_name='nga-cpi')

Use the Optuna default tuner CV instead of the definition above. It only optimises:

lambda_l1

lambda_l2

num_leaves

feature_fraction

bagging_fraction

bagging_freq

min_child_samples

Additionally use MLflow to log the run unless specified not to

Parameters

X_train (_type_) – _description_
X_test (_type_) – _description_
y_train (_type_) – _description_
y_test (_type_) – _description_
log_run (bool, optional) – _description_, defaults to True
target_name (str, optional) – _description_, defaults to “test”
logging_level (_type_, optional) – _description_, defaults to optuna.logging.ERROR
experiment_name (_type_, optional) – _description_, defaults to “nga-cpi”

Returns

_description_

Return type

_type_

stc_unicef_cpi.models.lgbm_baseline.objective(trial, X, y)

stc_unicef_cpi.models.lgbm_baseline.train_model(X_train, Y_train, X_test, Y_test, log_run=True, target_name='', model='lgb', experiment_name='nga-cpi')

Train baseline model

Parameters

X_train (_type_) – _description_
Y_train (_type_) – _description_
X_test (_type_) – _description_
Y_test (_type_) – _description_
log_run (bool, optional) – _description_, defaults to True
target_name (str, optional) – _description_, defaults to “”
model (str, optional) – _description_, defaults to “lgb”
experiment_name (str, optional) – _description_, defaults to “nga-cpi”

Returns

_description_

Return type

_type_

stc_unicef_cpi.models.mobnet_TL module

stc_unicef_cpi.models.predict_model module

stc_unicef_cpi.models.prediction_intervals module

stc_unicef_cpi.models.prediction_intervals.calibrate_prediction_intervals(pipeline_dir, pipeline_name, input_data, target_dim, mapie_dir)

Train MAPIE Regressor using train data

Parameters

pipeline_dir (str) – path to trained Pipeline instance
pipeline_name (str) – name of trained Pipeline instance
input_data (_type_) – Dataframe containing all data
target_dim (str) – dimension to predict
mapie_dir (str) – path for saving MapieRegressor instance

Returns

None

Return type

_type_

stc_unicef_cpi.models.prediction_intervals.predict_intervals(input_data, target_dim, mapie_dir, alpha=0.05, batch_size=10000, save_dir=None)

Get prediction intervals for all data using fitted MapieRegressor

Parameters

input_data (_type_) – Dataframe containing all data
target_dim (str) – dimension to predict
mapie_dir (str) – path to saved MapieRegressor instance
alpha (float, optional) – fraction of predictions tolerated to fall outside the intervals, default is 0.05
batch_size (int, optional) – batch size for processing, default is 10000
save_dir (str, optional) – path to save predictions csv

Returns

If save_dir is None, pandas Dataframe else None

Return type

_type_
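To make the role of alpha concrete, the sketch below builds intervals with split conformal prediction in plain NumPy: absolute residuals on a held-out calibration set give a half-width that should cover roughly 1 - alpha of new targets. This is a swapped-in illustration of the general idea, not the MAPIE configuration used by these functions. The helper split_conformal_interval and the synthetic data are assumptions.

```python
import math

import numpy as np


def split_conformal_interval(cal_pred, cal_true, new_pred, alpha=0.05):
    """Symmetric intervals around new_pred from calibration-set residuals."""
    scores = np.sort(np.abs(np.asarray(cal_true) - np.asarray(cal_pred)))
    n = scores.size
    # Finite-sample corrected rank of the (1 - alpha) quantile, capped at n.
    rank = min(n, math.ceil((n + 1) * (1 - alpha)))
    half_width = scores[rank - 1]
    new_pred = np.asarray(new_pred)
    return new_pred - half_width, new_pred + half_width


rng = np.random.default_rng(0)
signal = rng.uniform(0, 1, size=2000)
target = signal + rng.normal(scale=0.1, size=2000)

# Pretend a model predicts the noiseless signal; calibrate on the first half.
lower, upper = split_conformal_interval(
    cal_pred=signal[:1000], cal_true=target[:1000], new_pred=signal[1000:], alpha=0.05
)
coverage = np.mean((target[1000:] >= lower) & (target[1000:] <= upper))
```

With alpha=0.05 the empirical coverage on the second half should land near 0.95. A smaller alpha gives wider intervals.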

stc_unicef_cpi.models.train_model module