stc_unicef_cpi.models package

Submodules

stc_unicef_cpi.models.inflated_vals_2stg module

class stc_unicef_cpi.models.inflated_vals_2stg.InflatedValsRegressor(classifier: ClassifierMixin, regressor: RegressorMixin)

Bases: BaseEstimator, RegressorMixin

A meta regressor for datasets with inflated values, i.e. the targets contain certain values with much higher frequency than others.

InflatedValsRegressor consists of a classifier and a regressor.

  • The classifier’s task is to find out whether the target is an inflated value or not.

  • The regressor’s task is to output a prediction whenever the classifier indicates that the target is not one of the inflated values.

The regressor is only trained on examples where the target is not an inflated value, so it can focus on the non-inflated part of the target distribution.

At prediction time, the classifier is first asked whether the output should be one of the inflated values. Depending on the mode selected, the model either

  1. outputs that value, or

  2. uses the estimated class probabilities to weight the output.

In case 1, if the sample is not predicted to be an inflated value, the regressor is asked for its prediction, which is then output.
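The two prediction modes can be sketched as below. This is a minimal illustration of one plausible reading of the description, not the package's implementation: the helper name `combine_two_stage` and the layout of `proba` (one column per inflated value, followed by a final "not inflated" column) are assumptions made for the example.

```python
import numpy as np


def combine_two_stage(proba, inflated_vals, reg_pred, weighted=False):
    """Combine classifier probabilities and regressor output (hypothetical helper).

    proba: shape (n_samples, k + 1); columns 0..k-1 correspond to the k
        inflated values, the last column to "not an inflated value".
    inflated_vals: the k inflated values.
    reg_pred: regressor predictions, shape (n_samples,).
    """
    proba = np.asarray(proba, dtype=float)
    inflated_vals = np.asarray(inflated_vals, dtype=float)
    reg_pred = np.asarray(reg_pred, dtype=float)

    if weighted:
        # Mode 2: probability-weighted mix of inflated values and regressor output.
        return proba[:, :-1] @ inflated_vals + proba[:, -1] * reg_pred

    # Mode 1: strict class prediction, falling back to the regressor.
    labels = proba.argmax(axis=1)
    is_inflated = labels < len(inflated_vals)
    out = reg_pred.copy()
    out[is_inflated] = inflated_vals[labels[is_inflated]]
    return out


proba = [[0.9, 0.1], [0.2, 0.8]]
strict = combine_two_stage(proba, [0], [5.0, 3.0])
soft = combine_two_stage(proba, [0], [5.0, 3.0], weighted=True)
```

Here the first sample is assigned the inflated value 0 in strict mode, while in weighted mode its output is the regressor prediction scaled down by the probability of not being inflated.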

Examples

import numpy as np
from sklearn.ensemble import ExtraTreesClassifier, ExtraTreesRegressor

from stc_unicef_cpi.models.inflated_vals_2stg import InflatedValsRegressor

np.random.seed(0)
X = np.random.randn(10000, 4)
# Targets are exactly zero unless both of the first two features are positive
y = ((X[:, 0] > 0) & (X[:, 1] > 0)) * np.abs(X[:, 2] * X[:, 3] ** 2)
z = InflatedValsRegressor(
    classifier=ExtraTreesClassifier(random_state=0),
    regressor=ExtraTreesRegressor(random_state=0),
)
z.fit(X, y)
# InflatedValsRegressor(classifier=ExtraTreesClassifier(random_state=0),
#                     regressor=ExtraTreesRegressor(random_state=0))
z.predict(X)[:5]
# array([4.91483294, 0.        , 0.        , 0.04941909, 0.        ])

fit(X: np.ndarray | pd.DataFrame, y: np.ndarray | pd.Series, inflated_vals: list[float] | np.ndarray = [0], sample_weight: np.ndarray | None = None, allow_nan: bool = True, cls_fit_kwargs: dict | None = None, reg_fit_kwargs: dict | None = None) InflatedValsRegressor

Fit the model.

Parameters
  • X (Union[np.ndarray, pd.DataFrame]) – The training data in shape (n_samples, n_features).

  • y (Union[np.ndarray, pd.Series]) – The target values, 1-dimensional.

  • inflated_vals (Union[List[float], np.ndarray], optional) – Inflated values, defaults to [0]

  • sample_weight (Optional[np.ndarray], optional) – Individual weights for each sample, defaults to None

Raises

ValueError – If classifier is not a classifier or regressor is not a regressor.

Returns

Fitted regressor.

Return type

InflatedValsRegressor

get_cls_labels(y: np.ndarray | pd.Series, inflated_vals: list[float] | np.ndarray, init=True) pd.Series

Get class labels of targets, y, according to inflated values passed

Parameters
  • y (Union[np.ndarray, pd.Series]) – Target values

  • inflated_vals (Union[List[float], np.ndarray]) – Inflated values

  • init (bool, optional) – Initialise, defaults to True

Returns

Class labels

Return type

pd.Series
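As a rough illustration of what labelling targets by inflated value could look like, the sketch below assigns label i to targets equal to the i-th inflated value and a final label to everything else. The helper name `label_inflated` and this label encoding are assumptions for the example, and the `init` argument of `get_cls_labels` is not modelled.

```python
import numpy as np
import pandas as pd


def label_inflated(y, inflated_vals):
    """Label each target by the inflated value it matches (hypothetical helper).

    Targets equal to inflated_vals[i] get label i; all other targets get
    label len(inflated_vals).
    """
    y_arr = np.asarray(y, dtype=float)
    labels = np.full(len(y_arr), len(inflated_vals), dtype=int)
    for i, val in enumerate(inflated_vals):
        # isclose guards against tiny floating-point differences
        labels[np.isclose(y_arr, val)] = i
    return pd.Series(labels)


labels = label_inflated([0, 1.5, 100, 0, 2], inflated_vals=[0, 100])
```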

predict(X: np.ndarray | pd.DataFrame, weighted: bool = False, allow_nan: bool = True) np.ndarray

Make predictions.

Parameters
  • X (Union[np.ndarray, pd.DataFrame]) – Samples to get predictions of, shape (n_samples, n_features).

  • weighted (bool, optional) – Weight output, or use strict class predictions, defaults to False

Returns

The predicted values.

Return type

np.ndarray, shape (n_samples,)

stc_unicef_cpi.models.lgbm_baseline module

stc_unicef_cpi.models.lgbm_baseline.adjusted_rsquared(r2, n, p)
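
No description is given for adjusted_rsquared. Assuming it follows the standard definition of adjusted R² for n samples and p predictors, the computation would look like the sketch below; the function name `adjusted_r2_sketch` is hypothetical.

```python
def adjusted_r2_sketch(r2, n, p):
    """Standard adjusted R^2: penalise R^2 for the number of predictors.

    r2: ordinary coefficient of determination.
    n: number of samples.
    p: number of predictors (assumed to exclude the intercept).
    """
    if n - p - 1 <= 0:
        raise ValueError("need n > p + 1 for adjusted R^2 to be defined")
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)


adj = adjusted_r2_sketch(0.9, n=101, p=10)
```

The adjusted value is never larger than the raw R² when p is at least 1, which is the point of the correction.
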
stc_unicef_cpi.models.lgbm_baseline.basic_model_pipeline(model, num_scaler='robust', cat_encoder='ohe', imputer='simple')

Build a simple pipeline from a given sklearn-style model, along with suitable transformers. This is an alternative construction to the one below that takes advantage of sklearn functions.

Parameters
  • model (_type_) – _description_

  • num_scaler (str, optional) – _description_, defaults to ‘robust’

  • cat_encoder (str, optional) – _description_, defaults to ‘ohe’

  • imputer (str, optional) – _description_, defaults to ‘simple’
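
For orientation, a pipeline of this kind can be assembled from standard scikit-learn components. The sketch below assumes that 'robust', 'ohe' and 'simple' correspond to `RobustScaler`, `OneHotEncoder` and `SimpleImputer`; that mapping, the imputation strategies and the helper name `make_basic_pipeline` are guesses for illustration, not the package's code.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer, make_column_selector
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, RobustScaler


def make_basic_pipeline(model):
    """Impute + scale numeric columns, impute + one-hot encode the rest."""
    numeric = Pipeline([
        ("impute", SimpleImputer(strategy="median")),
        ("scale", RobustScaler()),
    ])
    categorical = Pipeline([
        ("impute", SimpleImputer(strategy="most_frequent")),
        ("encode", OneHotEncoder(handle_unknown="ignore")),
    ])
    preprocess = ColumnTransformer([
        ("num", numeric, make_column_selector(dtype_include=np.number)),
        ("cat", categorical, make_column_selector(dtype_exclude=np.number)),
    ])
    return Pipeline([("preprocess", preprocess), ("model", model)])


df = pd.DataFrame({"x": [1.0, 2.0, np.nan, 4.0], "c": ["a", "b", "a", "b"]})
y = [1.0, 2.0, 3.0, 4.0]
pipe = make_basic_pipeline(LinearRegression()).fit(df, y)
preds = pipe.predict(df)
```

Keeping the transformers inside the pipeline means they are fitted only on the training data passed to `fit`, which avoids leaking test-set statistics.
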

stc_unicef_cpi.models.lgbm_baseline.basic_preprocessor(X_train, model_type='lgb')
stc_unicef_cpi.models.lgbm_baseline.callback(study, trial)
stc_unicef_cpi.models.lgbm_baseline.flaml_multireg(X, Y, log_run=True, time_budget=60, scorer=<function <lambda>>)
stc_unicef_cpi.models.lgbm_baseline.get_card_split(df, cols, n=11)

Splits categorical columns into two lists based on cardinality (i.e. the number of unique values).

Parameters
  • df (pandas DataFrame) – DataFrame from which the cardinality of the columns is calculated.

  • cols (list-like) – Categorical columns to split.

  • n (int, optional) – Cardinality threshold used to split the columns, defaults to 11

Returns
  • card_low (list-like) – Columns with cardinality < n

  • card_high (list-like) – Columns with cardinality >= n
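
The split described above can be reproduced with `DataFrame.nunique`. This is a minimal sketch of the documented behaviour rather than the function's actual source; the name `split_by_cardinality` is hypothetical.

```python
import pandas as pd


def split_by_cardinality(df, cols, n=11):
    """Return (card_low, card_high): columns with < n and >= n unique values."""
    is_high = df[list(cols)].nunique() >= n
    card_low = is_high[~is_high].index.tolist()
    card_high = is_high[is_high].index.tolist()
    return card_low, card_high


df = pd.DataFrame({
    "region": ["north", "south"] * 6,        # 2 unique values
    "village": [f"v{i}" for i in range(12)],  # 12 unique values
})
card_low, card_high = split_by_cardinality(df, ["region", "village"])
```

A typical use is to one-hot encode the low-cardinality columns and apply a more compact encoding to the high-cardinality ones.
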

stc_unicef_cpi.models.lgbm_baseline.lgbmreg_optuna(X_train, X_test, y_train, y_test, log_run=True, target_name='test', logging_level=40, experiment_name='nga-cpi', tracking_uri='../models/mlruns')

Use optuna / FLAML to train a tuned LGBMRegressor. NB: the target y is expected to be a vector, due to computational expense and the desire to log runs separately; if need be, run in a loop. Feature engineering etc. is assumed to have already been performed, if desired.

See the thoughts in various blog posts, e.g. https://towardsdatascience.com/kagglers-guide-to-lightgbm-hyperparameter-tuning-with-optuna-in-2021-ed048d9838b5

or the docs directly: https://lightgbm.readthedocs.io/en/latest/Parameters-Tuning.html

Main params to target:

  • num_leaves (maximum should be 2^(max_depth) according to the docs) - number of leaves in each tree; relatively easy to choose given max_depth, but expensive, so choose a conservative range, e.g. (20, 3000)

  • max_depth - number of levels; more levels make the model more complex and prone to overfitting, too few and it will underfit. On Kaggle, values of 3-12 are found to work well for most datasets

  • min_data_in_leaf - minimum number of observations that fit the decision criteria of each leaf; should be >100 for larger datasets, as this helps prevent overfitting

  • n_estimators - number of decision trees used; larger will be slower but should be more accurate

  • learning_rate - step size parameter of gradient descent at each iteration, with typical values between 0.01 and 0.3, sometimes lower. The ideal setup together with n_estimators is many trees with early stopping and a low learning rate

  • max_bin - default is already 255; likely to cause overfitting if increased

  • reg_alpha or reg_lambda - L1 / L2 regularisation; a good search range is usually (0, 100) for both

  • min_gain_to_split - a conservative search range is (0, 15); can help with regularisation

  • bagging_fraction and feature_fraction - proportion of training samples (within (0, 1); needs bagging_freq set to an integer as well) and proportion of features (also in (0, 1)), respectively, used to train each tree. Both can again help with overfitting

  • objective - the learning objective used, which can be custom (!)
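
The ranges listed above could be turned into an Optuna search space along the following lines. This is a sketch only: the function name `suggest_lgbm_params` is hypothetical, and the bounds marked "assumed" are not stated in the text. It relies solely on the `suggest_int` and `suggest_float` methods of an Optuna trial, so it does not need to import Optuna itself.

```python
def suggest_lgbm_params(trial):
    """Sample one LightGBM configuration from the ranges discussed above."""
    return {
        "num_leaves": trial.suggest_int("num_leaves", 20, 3000, step=20),
        "max_depth": trial.suggest_int("max_depth", 3, 12),
        # upper bound of 1000 is assumed; the text only says ">100"
        "min_data_in_leaf": trial.suggest_int("min_data_in_leaf", 100, 1000, step=100),
        # range for n_estimators is assumed
        "n_estimators": trial.suggest_int("n_estimators", 100, 2000, step=100),
        "learning_rate": trial.suggest_float("learning_rate", 0.01, 0.3, log=True),
        "reg_alpha": trial.suggest_float("reg_alpha", 0.0, 100.0),
        "reg_lambda": trial.suggest_float("reg_lambda", 0.0, 100.0),
        "min_gain_to_split": trial.suggest_float("min_gain_to_split", 0.0, 15.0),
        # fractions must lie in (0, 1); the exact bounds here are assumed
        "bagging_fraction": trial.suggest_float("bagging_fraction", 0.2, 0.95),
        "bagging_freq": 1,  # bagging only takes effect with a positive frequency
        "feature_fraction": trial.suggest_float("feature_fraction", 0.2, 0.95),
    }
```

With Optuna installed, a function like this would typically be called at the top of an objective passed to `study.optimize`, with the returned dictionary unpacked into the model constructor.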

Additionally use MLflow to log the run unless specified not to

Parameters
  • X_train (_type_) – _description_

  • X_test (_type_) – _description_

  • y_train (_type_) – _description_

  • y_test (_type_) – _description_

  • log_run (bool, optional) – _description_, defaults to True

  • target_name (str, optional) – _description_, defaults to “test”

  • logging_level (_type_, optional) – _description_, defaults to optuna.logging.ERROR

  • experiment_name (str, optional) – _description_, defaults to “nga-cpi”

Returns:

_type_: _description_

stc_unicef_cpi.models.lgbm_baseline.lgbmreg_optunaCV(X_train, X_test, y_train, y_test, log_run=True, target_name='test', logging_level=40, experiment_name='nga-cpi')

Use Optuna's default tuner CV instead of the definition above; it only optimises:

  • lambda_l1

  • lambda_l2

  • num_leaves

  • feature_fraction

  • bagging_fraction

  • bagging_freq

  • min_child_samples

Additionally use MLflow to log the run unless specified not to

Parameters
  • X_train (_type_) – _description_

  • X_test (_type_) – _description_

  • y_train (_type_) – _description_

  • y_test (_type_) – _description_

  • log_run (bool, optional) – _description_, defaults to True

  • target_name (str, optional) – _description_, defaults to “test”

  • logging_level (_type_, optional) – _description_, defaults to optuna.logging.ERROR

  • experiment_name (_type_, optional) – _description_, defaults to “nga-cpi”

Returns:

_type_: _description_

stc_unicef_cpi.models.lgbm_baseline.objective(trial, X, y)
stc_unicef_cpi.models.lgbm_baseline.train_model(X_train, Y_train, X_test, Y_test, log_run=True, target_name='', model='lgb', experiment_name='nga-cpi')

Train baseline model

Parameters
  • X_train (_type_) – _description_

  • Y_train (_type_) – _description_

  • X_test (_type_) – _description_

  • Y_test (_type_) – _description_

  • log_run (bool, optional) – _description_, defaults to True

  • target_name (str, optional) – _description_, defaults to “”

  • model (str, optional) – _description_, defaults to “lgb”

  • experiment_name (str, optional) – _description_, defaults to “nga-cpi”

Returns

_description_

Return type

_type_

stc_unicef_cpi.models.mobnet_TL module

stc_unicef_cpi.models.predict_model module

stc_unicef_cpi.models.prediction_intervals module

stc_unicef_cpi.models.prediction_intervals.calibrate_prediction_intervals(pipeline_dir, pipeline_name, input_data, target_dim, mapie_dir)

Train a MAPIE regressor using the training data.

Parameters
  • pipeline_dir (str) – path to trained Pipeline instance

  • pipeline_name (str) – name of trained Pipeline instance

  • input_data (_type_) – Dataframe containing all data

  • target_dim (str) – dimension to predict

  • mapie_dir (str) – path for saving MapieRegressor instance

Returns

None

Return type

_type_

stc_unicef_cpi.models.prediction_intervals.predict_intervals(input_data, target_dim, mapie_dir, alpha=0.05, batch_size=10000, save_dir=None)

Get prediction intervals for all data using a fitted MapieRegressor.

Parameters
  • input_data (_type_) – Dataframe containing all data

  • target_dim (str) – dimension to predict

  • mapie_dir (str) – path to saved MapieRegressor instance

  • alpha (float, optional) – fraction of predictions tolerated outside the intervals, defaults to 0.05

  • batch_size (int, optional) – batch size for processing, defaults to 10000

  • save_dir (str, optional) – path to save predictions csv

Returns

If save_dir is None, a pandas Dataframe, else None

Return type

_type_
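
To make the role of alpha concrete: with alpha = 0.05, the intervals are meant to contain the true value for roughly 95% of samples. One way to check this on held-out data is to measure the empirical coverage, as in the sketch below. The helper `empirical_coverage` is hypothetical and independent of MAPIE; the numbers are toy values, not results from this package.

```python
import numpy as np


def empirical_coverage(y_true, lower, upper):
    """Fraction of true values that fall inside [lower, upper]."""
    y_true, lower, upper = (np.asarray(a, dtype=float) for a in (y_true, lower, upper))
    inside = (y_true >= lower) & (y_true <= upper)
    return float(inside.mean())


# Toy values: three of the four targets fall inside their interval.
coverage = empirical_coverage(
    y_true=[1.0, 2.0, 3.0, 4.0],
    lower=[0.0, 1.0, 3.5, 3.0],
    upper=[2.0, 3.0, 4.0, 5.0],
)
```

A coverage well below 1 - alpha on held-out data would suggest the intervals are too narrow for the chosen alpha.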

stc_unicef_cpi.models.train_model module