stc_unicef_cpi.models package
Submodules
stc_unicef_cpi.models.inflated_vals_2stg module
- class stc_unicef_cpi.models.inflated_vals_2stg.InflatedValsRegressor(classifier: ClassifierMixin, regressor: RegressorMixin)
Bases:
BaseEstimator, RegressorMixin
A meta regressor for datasets with inflated values, i.e. datasets whose targets contain certain values with much higher frequency than others.
InflatedValsRegressor consists of a classifier and a regressor.
The classifier’s task is to find out whether the target is an inflated value or not.
The regressor’s task is to output a prediction whenever the classifier indicates that the target is not an inflated value (with the default inflated_vals=[0], a non-zero prediction).
The regressor is only trained on examples where the target is not an inflated value, so it can concentrate on modelling the non-inflated part of the target distribution.
At prediction time, the classifier is first asked whether the output should be one of the inflated values. Depending on the mode selected, the model either
(i) outputs that value, or
(ii) uses the estimated class probabilities to weight the output.
In mode (i), if the classifier does not predict an inflated value, the regressor is asked for its prediction, which is then output.
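The two prediction modes can be illustrated with plain scikit-learn estimators. This is a minimal sketch of the general two-stage idea, not the class's actual implementation: it assumes a single inflated value (0), and the weighted formula shown is one plausible reading of mode (ii). All variable names are illustrative.

```python
import numpy as np
from sklearn.ensemble import ExtraTreesClassifier, ExtraTreesRegressor

rng = np.random.default_rng(0)
X = rng.standard_normal((2000, 4))
# Zero-inflated target: exactly 0 unless the first two features are both positive.
y = ((X[:, 0] > 0) & (X[:, 1] > 0)) * np.abs(X[:, 2] * X[:, 3] ** 2)

inflated_val = 0.0  # assumption: a single inflated value
is_inflated = (y == inflated_val).astype(int)  # 1 = inflated, 0 = anything else

# Stage 1: classify "inflated value" vs "anything else".
clf = ExtraTreesClassifier(random_state=0).fit(X, is_inflated)
# Stage 2: fit the regressor only on rows whose target is not inflated.
mask = is_inflated == 0
reg = ExtraTreesRegressor(random_state=0).fit(X[mask], y[mask])

# Mode (i), strict: the inflated value where the classifier predicts it,
# otherwise the regressor's prediction.
pred_inflated = clf.predict(X) == 1
strict = np.where(pred_inflated, inflated_val, reg.predict(X))

# Mode (ii), weighted (assumed form): mix both outputs by the class probability.
p_inflated = clf.predict_proba(X)[:, 1]  # column 1 corresponds to label 1
weighted = p_inflated * inflated_val + (1.0 - p_inflated) * reg.predict(X)
```

With several inflated values the classifier becomes multi-class, and the weighted output would sum over each inflated value times its class probability.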
Examples
    import numpy as np
    from sklearn.ensemble import ExtraTreesClassifier, ExtraTreesRegressor

    from stc_unicef_cpi.models.inflated_vals_2stg import InflatedValsRegressor

    np.random.seed(0)
    X = np.random.randn(10000, 4)
    y = ((X[:, 0] > 0) & (X[:, 1] > 0)) * np.abs(X[:, 2] * X[:, 3] ** 2)

    z = InflatedValsRegressor(
        classifier=ExtraTreesClassifier(random_state=0),
        regressor=ExtraTreesRegressor(random_state=0),
    )
    z.fit(X, y)
    # InflatedValsRegressor(classifier=ExtraTreesClassifier(random_state=0),
    #                       regressor=ExtraTreesRegressor(random_state=0))
    z.predict(X)[:5]
    # array([4.91483294, 0.        , 0.        , 0.04941909, 0.        ])
- fit(X: np.ndarray | pd.DataFrame, y: np.ndarray | pd.Series, inflated_vals: list[float] | np.ndarray = [0], sample_weight: np.ndarray | None = None, allow_nan: bool = True, cls_fit_kwargs: dict | None = None, reg_fit_kwargs: dict | None = None) InflatedValsRegressor
Fit the model.
- Parameters
X (Union[np.ndarray, pd.DataFrame]) – The training data in shape (n_samples, n_features).
y (Union[np.ndarray, pd.Series]) – The target values, 1-dimensional.
inflated_vals (Union[List[float], np.ndarray], optional) – Inflated values, defaults to [0]
sample_weight (Optional[np.ndarray], optional) – Individual weights for each sample, defaults to None
- Raises
ValueError – If classifier is not a classifier or regressor is not a regressor.
- Returns
Fitted regressor.
- Return type
InflatedValsRegressor
- get_cls_labels(y: np.ndarray | pd.Series, inflated_vals: list[float] | np.ndarray, init=True) pd.Series
Get class labels of targets, y, according to inflated values passed
- predict(X: np.ndarray | pd.DataFrame, weighted: bool = False, allow_nan: bool = True) np.ndarray
Make predictions.
- Parameters
X (Union[np.ndarray, pd.DataFrame]) – Samples to get predictions of, shape (n_samples, n_features).
weighted (bool, optional) – If True, weight the output by the estimated class probabilities; if False, use strict class predictions. Defaults to False
- Returns
The predicted values.
- Return type
np.ndarray, shape (n_samples,)
stc_unicef_cpi.models.lgbm_baseline module
- stc_unicef_cpi.models.lgbm_baseline.adjusted_rsquared(r2, n, p)
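This entry is undocumented. Going by the name and signature, it presumably computes the textbook adjusted R², where r2 is the ordinary R², n the number of samples and p the number of predictors. The sketch below shows that standard formula under that assumption; `adjusted_r2_sketch` is a hypothetical name and the package's own function may differ in details.

```python
def adjusted_r2_sketch(r2: float, n: int, p: int) -> float:
    """Textbook adjusted R^2 for n samples and p predictors."""
    if n - p - 1 <= 0:
        raise ValueError("adjusted R^2 needs n > p + 1")
    # Penalise R^2 for the number of predictors relative to the sample size.
    return 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1)


# R^2 of 0.80 from 101 samples and 10 predictors shrinks to roughly 0.78.
adj = adjusted_r2_sketch(0.80, n=101, p=10)
```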
- stc_unicef_cpi.models.lgbm_baseline.basic_model_pipeline(model, num_scaler='robust', cat_encoder='ohe', imputer='simple')
Build a simple pipeline from a given sklearn-style model, along with suitable transformers. This is an alternative construction to the one below that takes advantage of sklearn functions.
- Args:
model (_type_): _description_
num_scaler (str, optional): _description_. Defaults to ‘robust’.
cat_encoder (str, optional): _description_. Defaults to ‘ohe’.
imputer (str, optional): _description_. Defaults to ‘simple’.
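Since the arguments are undescribed, the following is only a guess at the kind of pipeline the defaults suggest. It assumes that 'robust', 'ohe' and 'simple' map to scikit-learn's RobustScaler, OneHotEncoder and SimpleImputer. The function name, column names and data are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import Ridge
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, RobustScaler


def make_pipeline_sketch(model, num_cols, cat_cols):
    """Impute + scale numeric columns, impute + one-hot categorical ones."""
    numeric = Pipeline([
        ("impute", SimpleImputer(strategy="median")),
        ("scale", RobustScaler()),
    ])
    categorical = Pipeline([
        ("impute", SimpleImputer(strategy="most_frequent")),
        ("encode", OneHotEncoder(handle_unknown="ignore")),
    ])
    preprocess = ColumnTransformer([
        ("num", numeric, num_cols),
        ("cat", categorical, cat_cols),
    ])
    return Pipeline([("preprocess", preprocess), ("model", model)])


# Toy data with one missing numeric value.
df = pd.DataFrame({
    "x1": [1.0, 2.0, np.nan, 4.0, 5.0, 6.0],
    "region": ["a", "b", "a", "b", "b", "a"],
})
y = np.array([1.0, 2.0, 2.5, 4.0, 5.0, 6.5])

pipe = make_pipeline_sketch(Ridge(), ["x1"], ["region"]).fit(df, y)
preds = pipe.predict(df)
```

Keeping the preprocessing inside the pipeline means the imputation and scaling statistics are learned from the training fold only when the pipeline is cross-validated.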
- stc_unicef_cpi.models.lgbm_baseline.basic_preprocessor(X_train, model_type='lgb')
- stc_unicef_cpi.models.lgbm_baseline.callback(study, trial)
- stc_unicef_cpi.models.lgbm_baseline.flaml_multireg(X, Y, log_run=True, time_budget=60, scorer=<function <lambda>>)
- stc_unicef_cpi.models.lgbm_baseline.get_card_split(df, cols, n=11)
Splits categorical columns into two lists based on cardinality (i.e. the number of unique values).
Parameters
- df : pandas DataFrame
DataFrame from which the cardinality of the columns is calculated.
- cols : list-like
Categorical columns to split.
- n : int, optional (default=11)
Cardinality threshold used to split the columns.
Returns
- card_low : list-like
Columns with cardinality < n
- card_high : list-like
Columns with cardinality >= n
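The documented behaviour can be reproduced in a few lines of pandas. This is a sketch of the described contract rather than the package's code; `card_split_sketch` and the example columns are hypothetical.

```python
import pandas as pd


def card_split_sketch(df, cols, n=11):
    """Split `cols` into (cardinality < n, cardinality >= n)."""
    counts = df[list(cols)].nunique()  # unique values per column, NaN excluded
    card_low = counts[counts < n].index.tolist()
    card_high = counts[counts >= n].index.tolist()
    return card_low, card_high


df = pd.DataFrame({
    "region": ["north", "south", "east"] * 4,  # 3 unique values
    "cell_id": [f"c{i}" for i in range(12)],   # 12 unique values
})
low, high = card_split_sketch(df, ["region", "cell_id"])
```

A split like this is typically used to one-hot encode the low-cardinality columns and apply a more compact encoding to the high-cardinality ones.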
- stc_unicef_cpi.models.lgbm_baseline.lgbmreg_optuna(X_train, X_test, y_train, y_test, log_run=True, target_name='test', logging_level=40, experiment_name='nga-cpi', tracking_uri='../models/mlruns')
Use Optuna / FLAML to train a tuned LGBMRegressor.
NB: the target y is expected to be a vector, due to computational expense and the desire to log runs separately. If need be, run in a loop over targets.
Feature engineering etc. is assumed to have been performed already, if desired.
Some tuning advice can be found in various blog posts, e.g. https://towardsdatascience.com/kagglers-guide-to-lightgbm-hyperparameter-tuning-with-optuna-in-2021-ed048d9838b5
or directly in the LightGBM docs: https://lightgbm.readthedocs.io/en/latest/Parameters-Tuning.html
Main params to target:
- num_leaves (max. limit should be 2^(max_depth) according to the docs) - maximum number of leaves per tree. Given max_depth it is relatively easy to choose, but expensive, so choose a conservative range, e.g. (20, 3000)
- max_depth - number of levels; more levels make the model more complex and prone to overfitting, too few and it will underfit. Values of 3-12 reportedly work well for most datasets on Kaggle
- min_data_in_leaf - minimum number of observations that fit the decision criteria of each leaf; should be >100 for larger datasets, as this helps prevent overfitting
- n_estimators - number of decision trees used; larger will be slower but should be more accurate
- learning_rate - step size of gradient descent at each iteration, with typical values between 0.01 and 0.3, sometimes lower. The ideal setup together with n_estimators is many trees with early stopping and a low learning rate
- max_bin - default is already 255, likely to cause overfitting if increased
- reg_alpha / reg_lambda - L1 / L2 regularisation; a good search range is usually (0, 100) for both
- min_gain_to_split - a conservative search range is (0, 15); can help regularisation
- bagging_fraction and feature_fraction - proportion of training samples (within (0, 1); needs bagging_freq set to a positive integer as well) and proportion of features (also in (0, 1)), respectively, used to train each tree. Both can again help with overfitting
- objective - the learning objective used, which can be custom
Additionally, MLflow is used to log the run unless log_run is False.
- Args:
X_train (_type_): _description_
X_test (_type_): _description_
y_train (_type_): _description_
y_test (_type_): _description_
log_run (bool, optional): _description_. Defaults to True.
target_name (str, optional): _description_. Defaults to “test”.
logging_level (_type_, optional): _description_. Defaults to optuna.logging.ERROR.
experiment_name (str, optional): _description_. Defaults to “nga-cpi”.
- Returns:
_type_: _description_
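The parameter notes above translate fairly directly into an Optuna-style search space. The sketch below is illustrative only and is not the search space used by lgbmreg_optuna. It assumes a trial object exposing Optuna's `suggest_int` and `suggest_float`. Ranges marked as illustrative are not from the notes above, and `suggest_lgbm_params` and `_StubTrial` are hypothetical names.

```python
def suggest_lgbm_params(trial):
    """Sample one LightGBM configuration following the notes above."""
    max_depth = trial.suggest_int("max_depth", 3, 12)
    return {
        "max_depth": max_depth,
        # Cap num_leaves at 2^max_depth, as the LightGBM docs advise.
        "num_leaves": trial.suggest_int("num_leaves", 2, min(3000, 2 ** max_depth)),
        "min_data_in_leaf": trial.suggest_int("min_data_in_leaf", 20, 300),  # illustrative
        "n_estimators": trial.suggest_int("n_estimators", 100, 2000),  # illustrative
        "learning_rate": trial.suggest_float("learning_rate", 0.01, 0.3, log=True),
        "reg_alpha": trial.suggest_float("reg_alpha", 0.0, 100.0),
        "reg_lambda": trial.suggest_float("reg_lambda", 0.0, 100.0),
        "min_gain_to_split": trial.suggest_float("min_gain_to_split", 0.0, 15.0),
        "bagging_fraction": trial.suggest_float("bagging_fraction", 0.2, 1.0),  # illustrative lower bound
        "bagging_freq": 1,  # must be a positive integer for bagging_fraction to apply
        "feature_fraction": trial.suggest_float("feature_fraction", 0.2, 1.0),  # illustrative lower bound
    }


class _StubTrial:
    """Throwaway stand-in for an Optuna trial: always returns the upper bound."""

    def suggest_int(self, name, low, high, **kwargs):
        return high

    def suggest_float(self, name, low, high, **kwargs):
        return high


params = suggest_lgbm_params(_StubTrial())
```

With Optuna installed, the same function would be called inside an objective that fits an LGBMRegressor with the sampled parameters and returns a validation score to the study.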
- stc_unicef_cpi.models.lgbm_baseline.lgbmreg_optunaCV(X_train, X_test, y_train, y_test, log_run=True, target_name='test', logging_level=40, experiment_name='nga-cpi')
Use Optuna's default tuner with CV instead of the definition above. It only optimises:
- lambda_l1
- lambda_l2
- num_leaves
- feature_fraction
- bagging_fraction
- bagging_freq
- min_child_samples
Additionally, MLflow is used to log the run unless log_run is False.
- Args:
X_train (_type_): _description_
X_test (_type_): _description_
y_train (_type_): _description_
y_test (_type_): _description_
log_run (bool, optional): _description_. Defaults to True.
target_name (str, optional): _description_. Defaults to “test”.
logging_level (_type_, optional): _description_. Defaults to optuna.logging.ERROR.
experiment_name (_type_, optional): _description_. Defaults to “nga-cpi”.
- Returns:
_type_: _description_
- stc_unicef_cpi.models.lgbm_baseline.objective(trial, X, y)
- stc_unicef_cpi.models.lgbm_baseline.train_model(X_train, Y_train, X_test, Y_test, log_run=True, target_name='', model='lgb', experiment_name='nga-cpi')
Train baseline model
- Parameters
X_train (_type_) – _description_
Y_train (_type_) – _description_
X_test (_type_) – _description_
Y_test (_type_) – _description_
log_run (bool, optional) – _description_, defaults to True
target_name (str, optional) – _description_, defaults to “”
model (str, optional) – _description_, defaults to “lgb”
experiment_name (str, optional) – _description_, defaults to “nga-cpi”
- Returns
_description_
- Return type
_type_
stc_unicef_cpi.models.mobnet_TL module
stc_unicef_cpi.models.predict_model module
stc_unicef_cpi.models.prediction_intervals module
- stc_unicef_cpi.models.prediction_intervals.calibrate_prediction_intervals(pipeline_dir, pipeline_name, input_data, target_dim, mapie_dir)
Train a MAPIE regressor using the training data.
- Parameters
pipeline_dir (str) – Path to the trained Pipeline instance.
pipeline_name (str) – Name of the trained Pipeline instance.
input_data (_type_) – DataFrame containing all data.
target_dim (str) – Dimension to predict.
mapie_dir (str) – Path for saving the MapieRegressor instance.
- Returns
None
- Return type
_type_
- stc_unicef_cpi.models.prediction_intervals.predict_intervals(input_data, target_dim, mapie_dir, alpha=0.05, batch_size=10000, save_dir=None)
Get prediction intervals for all data using a fitted MapieRegressor.
- Parameters
input_data (_type_) – DataFrame containing all data.
target_dim (str) – Dimension to predict.
mapie_dir (str) – Path to the saved MapieRegressor instance.
alpha (float, optional) – Fraction of predictions tolerated to fall outside the intervals, defaults to 0.05
batch_size (int, optional) – Batch size for processing, defaults to 10000
save_dir (str, optional) – Path to save the predictions CSV, defaults to None
- Returns
A pandas DataFrame if save_dir is None, otherwise None.
- Return type
_type_
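To make the role of alpha concrete, here is a NumPy-only sketch of split conformal prediction, the simplest member of the family of methods MAPIE implements. It is not MAPIE's code and does not claim to match the method these functions configure. The function name and the synthetic data are hypothetical.

```python
import numpy as np


def split_conformal_interval(cal_true, cal_pred, new_pred, alpha=0.05):
    """Symmetric intervals from absolute residuals on a held-out calibration set."""
    cal_true = np.asarray(cal_true, dtype=float)
    cal_pred = np.asarray(cal_pred, dtype=float)
    new_pred = np.asarray(new_pred, dtype=float)
    scores = np.sort(np.abs(cal_true - cal_pred))  # conformity scores
    n = scores.size
    k = int(np.ceil((n + 1) * (1 - alpha)))  # rank of the conformal quantile
    if k > n:
        raise ValueError("calibration set too small for this alpha")
    q = scores[k - 1]
    return new_pred - q, new_pred + q


# Synthetic stand-in for a fitted model: predictions plus Gaussian noise.
rng = np.random.default_rng(0)
cal_pred = rng.normal(size=500)
cal_true = cal_pred + rng.normal(scale=0.5, size=500)
new_pred = rng.normal(size=1000)
new_true = new_pred + rng.normal(scale=0.5, size=1000)

lower, upper = split_conformal_interval(cal_true, cal_pred, new_pred, alpha=0.05)
coverage = np.mean((new_true >= lower) & (new_true <= upper))  # close to 1 - alpha
```

A smaller alpha widens the intervals; with alpha=0.05, about 95% of new targets are expected to fall inside them, provided calibration and new data are exchangeable.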