stc_unicef_cpi.features package

Submodules

stc_unicef_cpi.features.autoencoder_features module

stc_unicef_cpi.features.autoencoder_features.check_autoencoder_reconstruction(trained_autoencoder, input_data)

Plot reconstructed images to check performance of autoencoder

Parameters
  • trained_autoencoder (_type_) – output of get_trained_autoencoder() or saved keras model

  • input_data (_type_) – image input of size (num of samples, width, height, bands)

stc_unicef_cpi.features.autoencoder_features.get_best_hyperparameters(input_data, random_state=0, validation_split=0.1, batch_size=[64, 128], learning_rate=[0.01, 0.005, 0.001], epochs=100, logdir='autoencoder', project_name='tune_model', es_patience=5)

Get the tuned hyperparameters for the model

Parameters
  • input_data (_type_) – image input of size (num of samples, width, height, bands)

  • random_state (int, optional) – random state, defaults to 0

  • validation_split (float, optional) – split ratio for model.fit(), defaults to 0.1

  • batch_size (list, optional) – list of batch sizes to check, defaults to [64, 128]

  • learning_rate (list, optional) – list of learning rates to check, defaults to [1e-2, 5e-3, 1e-3]

  • epochs (int, optional) – max epochs for training; specify value slightly higher than expected convergence, defaults to 100

  • logdir (str, optional) – directory for logging, defaults to “autoencoder”

  • project_name (str, optional) – name of project, defaults to “tune_model”

  • es_patience (int, optional) – patience for early stopping, defaults to 5

Returns

dictionary with best learning rate, batch size & epoch

Return type

_type_
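The default search space is small enough to enumerate. A minimal sketch, assuming each batch size is paired with each learning rate; how the tuner actually traverses these candidates is not documented here, and `candidate_grid` is a hypothetical helper, not part of the package:

```python
from itertools import product

# Defaults copied from the signature of get_best_hyperparameters.
batch_size = [64, 128]
learning_rate = [0.01, 0.005, 0.001]


def candidate_grid(batch_sizes, learning_rates):
    """Hypothetical helper: list every (batch size, learning rate) pair."""
    return [
        {"batch_size": b, "learning_rate": lr}
        for b, lr in product(batch_sizes, learning_rates)
    ]


grid = candidate_grid(batch_size, learning_rate)
for combo in grid:
    print(combo)
```

With the defaults this gives six candidate combinations, which is worth keeping in mind when choosing epochs and es_patience.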

stc_unicef_cpi.features.autoencoder_features.get_encoded_features(trained_autoencoder_dir, model_name, hex_codes, tiff_files_dir, gpu, dim=16, batch_size=4096)

Get encoded features in batches

Parameters
  • trained_autoencoder_dir (_type_) – directory for saved keras model

  • model_name (_type_) – name of saved model inside trained_autoencoder_dir

  • hex_codes (_type_) – numpy array containing H3 hexagons for which to generate predictions

  • tiff_files_dir (_type_) – file path for geotiffs

  • dim (int, optional) – dimension for images extracted from rasters, defaults to 16

  • batch_size (int, optional) – batch size for getting predictions, defaults to 4096

Returns

numpy array of size (len(hex_codes), 32)

Return type

numpy array
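To illustrate what "in batches" means for the default batch_size, here is a minimal sketch of splitting a sequence of hexagon codes into consecutive chunks. `iter_batches` and the placeholder codes are assumptions made for illustration; the package's own batching logic may differ:

```python
def iter_batches(items, batch_size=4096):
    """Yield consecutive slices of `items`, each at most `batch_size` long."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


# Placeholder identifiers standing in for real H3 hexagon codes.
hex_codes = [f"hex_{i}" for i in range(10_000)]
batch_lengths = [len(batch) for batch in iter_batches(hex_codes)]
print(batch_lengths)
```

The last batch is simply shorter than the others, so no hexagon is dropped or padded.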

stc_unicef_cpi.features.autoencoder_features.get_encoding_metrics(original_data, encoded_features)

Get residual variance, AUC trustworthiness and AUC continuity for the encodings

Parameters
  • original_data (_type_) – image input of size (n_samples, width, height, bands) or (n_samples, width*height*bands)

  • encoded_features (_type_) – encoded features of size (n_samples, 32) or (n_samples, 2, 2, 8)
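The two accepted layouts for each argument hold the same number of values per sample. A small sketch of that arithmetic, where `flat_width` is a hypothetical helper and the 16 x 16 x 3 image shape is an arbitrary example:

```python
from math import prod


def flat_width(shape):
    """Hypothetical helper: values per sample once all non-sample axes are flattened."""
    return prod(shape[1:])


n_samples = 5
encoded_4d = (n_samples, 2, 2, 8)      # encoder output layout
original_4d = (n_samples, 16, 16, 3)   # 3 bands chosen only for illustration

encoded_width = flat_width(encoded_4d)
original_width = flat_width(original_4d)
print(encoded_width, original_width)
```

With NumPy arrays the equivalent flattening would be `array.reshape(n_samples, -1)`.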

stc_unicef_cpi.features.autoencoder_features.get_train_data(tiff_dir, hex_codes, dim=16)

Get training data for the autoencoder

Parameters
  • tiff_dir (str) – directory containing all country tiffs

  • hex_codes (_type_) – H3 hexagons for training set

  • dim (int, optional) – dimension for images extracted from rasters, defaults to 16

Returns

numpy array of size (len(hex_codes), 16, 16, total channels from tiff_dir)

Return type

numpy array

stc_unicef_cpi.features.autoencoder_features.get_trained_autoencoder(input_data, batch_size=128, epochs=100, learning_rate=0.001, save_dir=None, model_name=None)

Get the trained model using tuned hyperparameters

Parameters
  • input_data (_type_) – image input of size (num of samples, width, height, bands) - reshaped & imputed input of convert_tiffs_to_image_dataset

  • batch_size (int, optional) – batch size for training, defaults to 128

  • learning_rate (float, optional) – learning rate for the Adam optimizer, defaults to 1e-3

  • epochs (int, optional) – number of epochs for training, defaults to 100

  • save_dir (str, optional) – directory for saving model, defaults to None

  • model_name (str, optional) – name of saved h5 model file; defaults to None in the signature, documented as “autoencoder”

Returns

Keras sequential model if save_dir is None, otherwise None

Return type

_type_

stc_unicef_cpi.features.autoencoder_features.set_seed(random_state=0)

Set seed
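The body of set_seed is not shown in this reference. As an illustration of what seeding buys, the sketch below seeds only Python's standard-library generator; it is an assumption that the real function also seeds the numerical and deep-learning libraries used for training, and `set_seed_sketch` is a hypothetical stand-in:

```python
import random


def set_seed_sketch(random_state=0):
    """Illustrative stand-in: seed the standard-library generator only."""
    random.seed(random_state)


set_seed_sketch(0)
first = [random.random() for _ in range(3)]

set_seed_sketch(0)
second = [random.random() for _ in range(3)]

# Re-seeding with the same value reproduces the same sequence.
print(first == second)
```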

stc_unicef_cpi.features.build_features module

stc_unicef_cpi.features.build_features.add_group_features(*dfs, join_on='')

TODO: fix for our data

Tuning with Optuna in a notebook suggests that adding these features is useful: cross-validated performance is slightly better on the fully augmented dataset, and similar on the subset of features selected using BorutaShap (below). However, when assessing generalisation on completely new data, the cross-validated tuned model appears to overfit significantly on the base data, while performing much better on both the fully augmented data and the subset of augmented data. The best generalisation performance appears to be on the subselection of features, as one might expect.

Parameters
  • dfs (pd.DataFrame) – any number of pandas dataframes to join for group features

  • join_on (str, optional) – column for entity sets to join on, defaults to “”

Returns

_description_

Return type

_type_

stc_unicef_cpi.features.build_features.boruta_shap_ftr_select(X, y, base_model=LGBMRegressor(), plot=True, n_trials=100, sample=False, train_or_test='test', normalize=True, verbose=True, incl_tentative=True)

Simple wrapper around BorutaShap feature selection that also shows the feature plot (more interesting at this point)

Parameters
  • X (_type_) – _description_

  • y (_type_) – _description_

  • base_model (_type_, optional) – _description_, defaults to lgb.LGBMRegressor()

  • plot (bool, optional) – show feature importance plot, defaults to True

  • n_trials (int, optional) – _description_, defaults to 100

  • sample (bool, optional) – if true then a row-wise sample of the data will be used to calculate the feature importance values, defaults to False

  • train_or_test (str, optional) – decides whether the feature importance should be calculated on out-of-sample data; see the discussion at https://compstat-lmu.github.io/iml_methods_limitations/pfi-data.html#introduction-to-test-vs.training-data, defaults to “test”

  • normalize (bool, optional) – if true the importance values will be normalized using the z-score formula, defaults to True

  • verbose (bool, optional) – a flag indicator to print out all the rejected or accepted features, defaults to True

  • incl_tentative (bool, optional) – _description_, defaults to True

Returns

_description_

Return type

_type_
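The normalize option refers to the z-score formula, (value - mean) / standard deviation. A minimal sketch on made-up importance values, assuming the population standard deviation is used; `z_scores` is a hypothetical helper and not the BorutaShap API:

```python
from statistics import fmean, pstdev


def z_scores(values):
    """Standardize values to zero mean and unit (population) standard deviation."""
    mean = fmean(values)
    std = pstdev(values)
    if std == 0:
        raise ValueError("z-scores are undefined when all values are equal")
    return [(v - mean) / std for v in values]


# Made-up importance values, for illustration only.
importances = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
scores = z_scores(importances)
print(scores)
```

After normalization the values are expressed in standard deviations from the mean, which makes importances comparable across trials.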

stc_unicef_cpi.features.get_autoencoder_features module

stc_unicef_cpi.features.get_autoencoder_features.copy_files(src, trg, word)

stc_unicef_cpi.features.get_autoencoder_features.retrieve_autoencoder_features(hex_codes, trained_autoencoder_dir, country, res, tiff_files_dir, gpu)

Predict autoencoder features

Parameters
  • hex_codes (_type_) – _description_

  • trained_autoencoder_dir (_type_) – _description_

  • country (_type_) – _description_

  • res (_type_) – _description_

  • tiff_files_dir (_type_) – _description_

Returns

_description_

Return type

_type_

stc_unicef_cpi.features.get_autoencoder_features.train_auto_encoder(hex_codes, read_dir, hyper_tunning, save_dir, country, res)

Train autoencoder model

Parameters
  • hex_codes (_type_) – _description_

  • read_dir (_type_) – _description_

  • hyper_tunning (_type_) – _description_

  • save_dir (_type_) – _description_

  • country (_type_) – _description_

  • res (_type_) – _description_

stc_unicef_cpi.features.resnet_pca module

stc_unicef_cpi.features.resnet_pca.get_features(img_arr, num_bands, pca_components, dim=32)

For each raster, get top PCA features using 2048 features from pretrained ResNet50

Parameters
  • img_arr – output of convert_tiffs_to_image_dataset

  • num_bands – first output of numbands_from_tiffs

  • pca_components – number of components to reduce features to

  • dim – dimensions for ResNet50 input; should match shape[2] and shape[3] of img_arr, defaults to 32

Returns

array of features of size (number of samples: img_arr.shape[0], num_features: number of 3-groupings obtained from tiffs, pca_components: as specified in the function argument)

stc_unicef_cpi.features.resnet_pca.numbands_from_tiffs(dir)

Get the name and number of bands for each tiff

Parameters
  • dir (str) – tiff directory

Returns

two equal-size lists containing the bands per tiff and the tiff names