stc_unicef_cpi.features package

Submodules

stc_unicef_cpi.features.autoencoder_features module

stc_unicef_cpi.features.autoencoder_features.check_autoencoder_reconstruction(trained_autoencoder, input_data)

Plot reconstructed images to check performance of autoencoder

Parameters

trained_autoencoder (_type_) – output of get_trained_autoencoder() or saved keras model
input_data (_type_) – image input of size (num of samples, width, height, bands)

stc_unicef_cpi.features.autoencoder_features.get_best_hyperparameters(input_data, random_state=0, validation_split=0.1, batch_size=[64, 128], learning_rate=[0.01, 0.005, 0.001], epochs=100, logdir='autoencoder', project_name='tune_model', es_patience=5)

Get the tuned hyperparameters for the model

Parameters

input_data (_type_) – image input of size (num of samples, width, height, bands)
random_state (int, optional) – random state, defaults to 0
validation_split (float, optional) – split ratio for model.fit(), defaults to 0.1
batch_size (list, optional) – list of batch sizes to check, defaults to [64, 128]
learning_rate (list, optional) – list of learning rates to check, defaults to [1e-2, 5e-3, 1e-3]
epochs (int, optional) – max epochs for training; specify value slightly higher than expected convergence, defaults to 100
logdir (str, optional) – directory for logging, defaults to “autoencoder”
project_name (str, optional) – name of project, defaults to “tune_model”
es_patience (int, optional) – patience for early stopping, defaults to 5

Returns

dictionary with best learning rate, batch size & epoch

Return type

dict
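The search space implied by the default arguments can be enumerated as in the sketch below. It only lists the candidate combinations of batch_size and learning_rate. It does not reproduce the tuner used by the package, and whether every combination is visited depends on a search strategy that is not documented here. `candidate_grid` is a hypothetical helper name, not part of stc_unicef_cpi.

```python
from itertools import product


def candidate_grid(batch_size=(64, 128), learning_rate=(0.01, 0.005, 0.001)):
    """Return every (batch_size, learning_rate) pair as a dict.

    Hypothetical helper: it mirrors the documented defaults only.
    """
    return [
        {"batch_size": b, "learning_rate": lr}
        for b, lr in product(batch_size, learning_rate)
    ]


grid = candidate_grid()
for candidate in grid:
    print(candidate)
```

With the defaults this yields 2 x 3 = 6 candidates, each of which would be trained for at most `epochs` epochs and cut short by early stopping after `es_patience` epochs without improvement.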

stc_unicef_cpi.features.autoencoder_features.get_encoded_features(trained_autoencoder_dir, model_name, hex_codes, tiff_files_dir, gpu, dim=16, batch_size=4096)

Get encoded features in batches

Parameters

trained_autoencoder_dir (_type_) – directory for saved keras model
model_name (_type_) – name of saved model inside trained_autoencoder_dir
hex_codes (numpy array) – H3 hexagons for which to generate predictions
tiff_files_dir (_type_) – file path for geotiffs
dim (int, optional) – dimension for images extracted from rasters, defaults to 16
batch_size (int, optional) – batch size for getting predictions, defaults to 4096

Returns

numpy array of size (len(hex_codes), 32)

Return type

numpy array
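Predicting "in batches" can be pictured as walking over hex_codes in consecutive index ranges of at most batch_size items. The sketch below shows only that index arithmetic. `batch_slices` is a hypothetical helper, and the way the package actually reads rasters and calls the model for each batch is not shown here.

```python
def batch_slices(n_items, batch_size=4096):
    """Yield (start, stop) index pairs that cover range(n_items) in order."""
    for start in range(0, n_items, batch_size):
        # The last batch may be shorter than batch_size.
        yield start, min(start + batch_size, n_items)


# Example: 10,000 hexagons with the documented default batch size.
slices = list(batch_slices(10_000, batch_size=4096))
print(slices)
```

Each slice would contribute one block of rows to the final array, so the row count stays equal to len(hex_codes).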

stc_unicef_cpi.features.autoencoder_features.get_encoding_metrics(original_data, encoded_features)

Get residual variance, AUC trustworthiness and AUC continuity for encodings

Parameters

original_data (_type_) – image input of size (n_samples, width, height, bands) or (n_samples, width*height*bands)
encoded_features (_type_) – encoded features of size (n_samples, 32) or (n_samples, 2, 2, 8)
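The two accepted shapes for each argument describe the same data, since 2 * 2 * 8 = 32. The NumPy sketch below shows the flattening. The image size (16 x 16) follows the documented default for dim, the band count of 3 is an assumption chosen for illustration, and whether the function flattens its inputs internally is also an assumption.

```python
import numpy as np

n_samples = 5

# Encoded features in their 4-D form, as produced by a convolutional encoder.
encoded_4d = np.arange(n_samples * 2 * 2 * 8, dtype=float).reshape(n_samples, 2, 2, 8)
encoded_2d = encoded_4d.reshape(n_samples, -1)  # (n_samples, 32)

# Original images: 16 x 16 pixels, 3 bands (band count is illustrative only).
original_4d = np.zeros((n_samples, 16, 16, 3))
original_2d = original_4d.reshape(n_samples, -1)  # (n_samples, width*height*bands)

print(encoded_2d.shape, original_2d.shape)
```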

stc_unicef_cpi.features.autoencoder_features.get_train_data(tiff_dir, hex_codes, dim=16)

Get training data for the autoencoder

Parameters

tiff_dir (str) – directory containing all country tiffs
hex_codes (_type_) – H3 hexagons for training set
dim (int, optional) – dimension for images extracted from rasters, defaults to 16

Returns

numpy array of size (len(hex_codes), 16, 16, total channels from tiff_dir)

Return type

numpy array

stc_unicef_cpi.features.autoencoder_features.get_trained_autoencoder(input_data, batch_size=128, epochs=100, learning_rate=0.001, save_dir=None, model_name=None)

Get the trained model using tuned hyperparameters

Parameters

input_data (_type_) – image input of size (num of samples, width, height, bands) - reshaped & imputed input of convert_tiffs_to_image_dataset
batch_size (int, optional) – batch size for training, defaults to 128
learning_rate (float, optional) – learning rate for Adam optimizer, defaults to 1e-3
epochs (int, optional) – number of epochs for training, defaults to 100
save_dir (str, optional) – directory for saving model, defaults to None
model_name (str, optional) – name of saved h5 model file, defaults to None

Returns

Keras sequential model if save_dir is None, otherwise None

Return type

_type_

stc_unicef_cpi.features.autoencoder_features.set_seed(random_state=0)

Set seed
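The source does not say which random number generators set_seed covers. The sketch below shows what a seeding helper of this kind commonly does, under that caveat. `set_seed_sketch` is a hypothetical name, and a real implementation for a Keras model would most likely also seed the deep learning framework, which is left out here.

```python
import os
import random


def set_seed_sketch(random_state=0):
    """Seed the generators available in this environment (illustrative only)."""
    # Changing PYTHONHASHSEED at runtime only affects subprocesses started later,
    # not the hashing of the interpreter that is already running.
    os.environ["PYTHONHASHSEED"] = str(random_state)
    random.seed(random_state)
    try:
        import numpy as np
    except ImportError:
        return  # NumPy is optional for this sketch
    np.random.seed(random_state)


set_seed_sketch(0)
first = [random.random() for _ in range(3)]
set_seed_sketch(0)
second = [random.random() for _ in range(3)]
print(first == second)
```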

stc_unicef_cpi.features.build_features module

stc_unicef_cpi.features.build_features.add_group_features(*dfs, join_on='')

TODO: fix for our data

Tuning with optuna in a notebook suggests that adding these features is indeed useful: CV performance is slightly better on the full augmented dataset, and similar on the subset of features selected using BorutaShap (below). However, when assessing generalisation performance on completely new data, the cross-validated tuned model seems to significantly overfit on the base data, while performing much better on both the fully augmented data and the subset of augmented data. The best generalised performance seems to be on the subselection of features, as one might expect.

Parameters

dfs (pd.DataFrame) – any number of pandas dataframes to join for group features
join_on (str, optional) – column for entity sets to join on, defaults to “”

Returns

_description_

Return type

_type_

stc_unicef_cpi.features.build_features.boruta_shap_ftr_select(X, y, base_model=LGBMRegressor(), plot=True, n_trials=100, sample=False, train_or_test='test', normalize=True, verbose=True, incl_tentative=True)

Simple wrapper around BorutaShap feature selection that also shows the feature plot (more interesting at this point)

Parameters

X (_type_) – _description_
y (_type_) – _description_
base_model (_type_, optional) – _description_, defaults to lgb.LGBMRegressor()
plot (bool, optional) – show feature importance plot, defaults to True
n_trials (int, optional) – _description_, defaults to 100
sample (bool, optional) – if true then a row-wise sample of the data will be used to calculate the feature importance values, defaults to False
train_or_test (str, optional) – decides whether the feature importance should be calculated on out-of-sample data; see the discussion at https://compstat-lmu.github.io/iml_methods_limitations/pfi-data.html#introduction-to-test-vs.training-data, defaults to “test”
normalize (bool, optional) – if true the importance values will be normalized using the z-score formula, defaults to True
verbose (bool, optional) – a flag indicator to print out all the rejected or accepted features, defaults to True
incl_tentative (bool, optional) – _description_, defaults to True

Returns

_description_

Return type

_type_
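The normalize option refers to the z-score formula, z = (x - mean) / std. The sketch below applies it to a made-up list of importance values. It uses the population standard deviation, which is an assumption: the source does not say which variant BorutaShap applies. `z_score` is a hypothetical helper.

```python
from statistics import mean, pstdev


def z_score(values):
    """Return (x - mean) / std for each value, using the population std."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        # All values identical: every z-score is zero by convention here.
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


importances = [0.10, 0.40, 0.25, 0.05]  # made-up values for illustration
normalized = z_score(importances)
print(normalized)
```

Normalizing does not change the ranking of the features. It only puts the importance values on a common scale with mean 0 and unit standard deviation.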

stc_unicef_cpi.features.get_autoencoder_features module

stc_unicef_cpi.features.get_autoencoder_features.copy_files(src, trg, word)

stc_unicef_cpi.features.get_autoencoder_features.retrieve_autoencoder_features(hex_codes, trained_autoencoder_dir, country, res, tiff_files_dir, gpu)

Predict autoencoder features

Parameters

hex_codes (_type_) – _description_
trained_autoencoder_dir (_type_) – _description_
country (_type_) – _description_
res (_type_) – _description_
tiff_files_dir (_type_) – _description_

Returns

_description_

Return type

_type_

stc_unicef_cpi.features.get_autoencoder_features.train_auto_encoder(hex_codes, read_dir, hyper_tunning, save_dir, country, res)

Train autoencoder model

Parameters

hex_codes (_type_) – _description_
read_dir (_type_) – _description_
hyper_tunning (_type_) – _description_
save_dir (_type_) – _description_
country (_type_) – _description_
res (_type_) – _description_

stc_unicef_cpi.features.resnet_pca module

stc_unicef_cpi.features.resnet_pca.get_features(img_arr, num_bands, pca_components, dim=32)

For each raster, get top PCA features using 2048 features from pretrained ResNet50

Inputs:

img_arr: output of convert_tiffs_to_image_dataset
num_bands: first output of numbands_from_tiffs
pca_components: number of components to reduce features to
dim: dimensions for ResNet50 input; should match shape[2] and shape[3] of img_arr

Outputs:

array of features of size (number of samples, num_features, pca_components), where number of samples is img_arr.shape[0], num_features is the number of 3-groupings obtained from the tiffs, and pca_components is the value specified as the function argument
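The reduction step described above, from 2048 ResNet50 features down to pca_components, can be sketched with plain NumPy. This is a minimal PCA via singular value decomposition applied to random stand-in data. It does not run ResNet50, and the package may well use a library PCA instead. `pca_reduce` is a hypothetical helper.

```python
import numpy as np


def pca_reduce(features, pca_components):
    """Project rows of `features` onto their top principal components."""
    # Centre each column, then take the right singular vectors as PCA axes.
    centered = features - features.mean(axis=0, keepdims=True)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:pca_components].T


rng = np.random.default_rng(0)
fake_resnet_features = rng.normal(size=(20, 2048))  # stand-in for ResNet50 output
reduced = pca_reduce(fake_resnet_features, pca_components=4)
print(reduced.shape)
```

Repeating this for each 3-grouping and stacking the results along a new axis would give the documented output layout of (number of samples, num_features, pca_components), although that stacking is an assumption about the implementation.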

stc_unicef_cpi.features.resnet_pca.numbands_from_tiffs(dir)

Get the name and number of bands for each tiff

Inputs: tiff directory (string)

Outputs: 2 equal size lists containing bands per tiff and tiff name