stc_unicef_cpi.features package
Submodules
stc_unicef_cpi.features.autoencoder_features module
- stc_unicef_cpi.features.autoencoder_features.check_autoencoder_reconstruction(trained_autoencoder, input_data)
Plot reconstructed images to check performance of autoencoder
- Parameters
trained_autoencoder (_type_) – output of get_trained_autoencoder() or saved keras model
input_data (_type_) – image input of size (num of samples, width, height, bands)
- stc_unicef_cpi.features.autoencoder_features.get_best_hyperparameters(input_data, random_state=0, validation_split=0.1, batch_size=[64, 128], learning_rate=[0.01, 0.005, 0.001], epochs=100, logdir='autoencoder', project_name='tune_model', es_patience=5)
Get the tuned hyperparameters for the model
- Parameters
input_data (_type_) – image input of size (num of samples, width, height, bands)
random_state (int, optional) – random state, defaults to 0
validation_split (float, optional) – split ratio for model.fit(), defaults to 0.1
batch_size (list, optional) – list of batch sizes to check, defaults to [64, 128]
learning_rate (list, optional) – list of learning rates to check, defaults to [1e-2, 5e-3, 1e-3]
epochs (int, optional) – max epochs for training; specify value slightly higher than expected convergence, defaults to 100
logdir (str, optional) – directory for logging, defaults to “autoencoder”
project_name (str, optional) – name of project, defaults to “tune_model”
es_patience (int, optional) – patience for early stopping, defaults to 5
- Returns
dictionary with the best learning rate, batch size and number of epochs
- Return type
dict
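The default search space above can be enumerated as a small grid. The following is a minimal sketch, not the package's tuner: `candidate_configs` is a hypothetical helper, and it assumes every combination of batch size and learning rate is a candidate. The actual search strategy behind get_best_hyperparameters is not documented here.

```python
from itertools import product

# Defaults copied from the get_best_hyperparameters signature.
BATCH_SIZES = (64, 128)
LEARNING_RATES = (0.01, 0.005, 0.001)


def candidate_configs(batch_sizes=BATCH_SIZES, learning_rates=LEARNING_RATES):
    """Hypothetical helper: list every (batch_size, learning_rate) pair."""
    return [
        {"batch_size": b, "learning_rate": lr}
        for b, lr in product(batch_sizes, learning_rates)
    ]


configs = candidate_configs()
print(len(configs))  # 2 batch sizes x 3 learning rates = 6 candidates
```

Each candidate would then be trained for at most `epochs` epochs, with early stopping controlled by `es_patience`.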
- stc_unicef_cpi.features.autoencoder_features.get_encoded_features(trained_autoencoder_dir, model_name, hex_codes, tiff_files_dir, gpu, dim=16, batch_size=4096)
Get encoded features in batches
- Parameters
trained_autoencoder_dir (_type_) – directory for saved keras model
model_name (_type_) – name of saved model inside trained_autoencoder_dir
hex_codes (_type_) – numpy array containing H3 hexagons for which to generate predictions
tiff_files_dir (_type_) – file path for geotiffs
dim (int, optional) – dimension for images extracted from rasters, defaults to 16
batch_size (int, optional) – batch size for getting predictions, defaults to 4096
- Returns
numpy array of size (len(hex_codes), 32)
- Return type
numpy array
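Predicting in batches keeps memory use bounded when hex_codes is large. A minimal sketch of the slicing, assuming consecutive, non-overlapping batches; `batch_slices` is a hypothetical name and the package may batch differently.

```python
def batch_slices(n_items, batch_size=4096):
    """Hypothetical helper: yield slices covering range(n_items) in order."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, n_items, batch_size):
        yield slice(start, min(start + batch_size, n_items))


# Placeholder identifiers, not real H3 codes.
hex_codes = [f"hex_{i}" for i in range(10_000)]
batches = [hex_codes[s] for s in batch_slices(len(hex_codes))]
print([len(b) for b in batches])  # [4096, 4096, 1808]
```

The last batch is simply shorter than the rest, so no padding is needed before the per-batch results are concatenated.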
- stc_unicef_cpi.features.autoencoder_features.get_encoding_metrics(original_data, encoded_features)
Get residual variance, AUC trustworthiness and AUC continuity for the encodings
- Parameters
original_data (_type_) – image input of size (n_samples, width, height, bands) or (n_samples, width*height*bands)
encoded_features (_type_) – encoded features of size (n_samples, 32) or (n_samples, 2, 2, 8)
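The two accepted shapes for each argument describe the same data, either flattened or not. A quick check of the arithmetic; `flattened_width` is a hypothetical helper, and the band count of 5 is an arbitrary example value.

```python
from math import prod


def flattened_width(shape):
    """Hypothetical helper: number of values per sample after flattening."""
    _n_samples, *rest = shape
    return prod(rest)


# Encoded features: (n_samples, 2, 2, 8) flattens to (n_samples, 32).
print(flattened_width((1000, 2, 2, 8)))    # 32
# Images: (n_samples, 16, 16, 5) flattens to (n_samples, 1280).
print(flattened_width((1000, 16, 16, 5)))  # 1280
```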
- stc_unicef_cpi.features.autoencoder_features.get_train_data(tiff_dir, hex_codes, dim=16)
Get training data for the autoencoder
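- Parameters
tiff_dir (_type_) – file path for geotiffs
hex_codes (_type_) – numpy array containing H3 hexagons
dim (int, optional) – dimension for images extracted from rasters, defaults to 16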
- Returns
numpy array of size (len(hex_codes), 16, 16, total channels from tiff_dir)
- Return type
numpy array
- stc_unicef_cpi.features.autoencoder_features.get_trained_autoencoder(input_data, batch_size=128, epochs=100, learning_rate=0.001, save_dir=None, model_name=None)
Get the trained model using tuned hyperparameters
- Parameters
input_data (_type_) – image input of size (num of samples, width, height, bands) - reshaped & imputed input of convert_tiffs_to_image_dataset
batch_size (int, optional) – batch size for training, defaults to 128
learning_rate (float, optional) – learning rate for Adam optimizer, defaults to 1e-3
epochs (int, optional) – number of epochs for training, defaults to 100
save_dir (str, optional) – directory for saving model, defaults to None
model_name (str, optional) – name of saved h5 model file, defaults to “autoencoder”
- Returns
Keras Sequential model if save_dir is None, otherwise None
- Return type
_type_
- stc_unicef_cpi.features.autoencoder_features.set_seed(random_state=0)
Set seed
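Seeding usually means fixing every random number generator the training code touches. A minimal sketch, assuming only the standard-library and NumPy generators; `set_seed_sketch` is a hypothetical name, and the real set_seed may also seed the deep learning backend, which is omitted here.

```python
import os
import random


def set_seed_sketch(random_state=0):
    """Hypothetical stand-in for set_seed: seed the common Python RNGs."""
    # Only affects hash randomisation in subprocesses started afterwards.
    os.environ["PYTHONHASHSEED"] = str(random_state)
    random.seed(random_state)
    try:
        import numpy as np  # optional: seed NumPy's global RNG if installed
        np.random.seed(random_state)
    except ImportError:
        pass


set_seed_sketch(0)
first = random.random()
set_seed_sketch(0)
second = random.random()
print(first == second)  # True: same seed, same sequence
```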
stc_unicef_cpi.features.build_features module
- stc_unicef_cpi.features.build_features.add_group_features(*dfs, join_on='')
TODO: fix for our data
Tuning with Optuna in a notebook suggests that adding these features is useful: cross-validated performance appears slightly better on the fully augmented dataset, and similar on the subset of features selected with BorutaShap (below). However, when generalisation is assessed on completely new data, the cross-validated tuned model seems to overfit significantly on the base data, while performing much better on both the fully augmented data and the augmented subset. The best generalisation performance seems to be on the selected subset of features, as one might expect.
- Parameters
dfs – any number of pandas dataframes to join for group features
join_on (str, optional) – column for entity sets to join on, defaults to “”
- Type
dfs: pd.DataFrame
- Returns
_description_
- Return type
_type_
- stc_unicef_cpi.features.build_features.boruta_shap_ftr_select(X, y, base_model=LGBMRegressor(), plot=True, n_trials=100, sample=False, train_or_test='test', normalize=True, verbose=True, incl_tentative=True)
Simple wrapper around BorutaShap feature selection that also shows the feature plot (more interesting at this point)
- Parameters
X (_type_) – _description_
y (_type_) – _description_
base_model (_type_, optional) – _description_, defaults to lgb.LGBMRegressor()
plot (bool, optional) – show feature importance plot, defaults to True
n_trials (int, optional) – _description_, defaults to 100
sample (bool, optional) – if true then a row-wise sample of the data will be used to calculate the feature importance values, defaults to False
train_or_test (str, optional) – whether feature importance is calculated on out-of-sample data; see the discussion at https://compstat-lmu.github.io/iml_methods_limitations/pfi-data.html#introduction-to-test-vs.training-data, defaults to “test”
normalize (bool, optional) – if true the importance values will be normalized using the z-score formula, defaults to True
verbose (bool, optional) – a flag indicator to print out all the rejected or accepted features, defaults to True
incl_tentative (bool, optional) – _description_, defaults to True
- Returns
_description_
- Return type
_type_
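The `normalize` option refers to the z-score, z = (x - mean) / std. A minimal sketch on made-up importance values, assuming the population standard deviation is used; `z_scores` is a hypothetical helper, not BorutaShap's own code.

```python
from statistics import fmean, pstdev


def z_scores(values):
    """Hypothetical helper: standardise values to mean 0 and unit variance."""
    mu = fmean(values)
    sigma = pstdev(values)  # population standard deviation (an assumption)
    if sigma == 0:
        raise ValueError("z-scores are undefined when all values are equal")
    return [(v - mu) / sigma for v in values]


importances = [0.10, 0.20, 0.30, 0.40]  # made-up importance values
scaled = z_scores(importances)
print([round(z, 3) for z in scaled])  # [-1.342, -0.447, 0.447, 1.342]
```

After scaling, importances from different trials are on a comparable scale, which is what makes comparing them meaningful.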
stc_unicef_cpi.features.get_autoencoder_features module
- stc_unicef_cpi.features.get_autoencoder_features.copy_files(src, trg, word)
- stc_unicef_cpi.features.get_autoencoder_features.retrieve_autoencoder_features(hex_codes, trained_autoencoder_dir, country, res, tiff_files_dir, gpu)
Predict autoencoder features
- Parameters
hex_codes (_type_) – _description_
trained_autoencoder_dir (_type_) – _description_
country (_type_) – _description_
res (_type_) – _description_
tiff_files_dir (_type_) – _description_
- Returns
_description_
- Return type
_type_
- stc_unicef_cpi.features.get_autoencoder_features.train_auto_encoder(hex_codes, read_dir, hyper_tunning, save_dir, country, res)
Train autoencoder model
- Parameters
hex_codes (_type_) – _description_
read_dir (_type_) – _description_
hyper_tunning (_type_) – _description_
save_dir (_type_) – _description_
country (_type_) – _description_
res (_type_) – _description_
stc_unicef_cpi.features.resnet_pca module
- stc_unicef_cpi.features.resnet_pca.get_features(img_arr, num_bands, pca_components, dim=32)
For each raster, get top PCA features using 2048 features from pretrained ResNet50
- Inputs:
img_arr: output of convert_tiffs_to_image_dataset
num_bands: first output of numbands_from_tiffs
pca_components: number of components to reduce features to
dim: dimensions for ResNet50 input; should match shape[2] and shape[3] of img_arr
- Outputs:
array of features of size (number of samples: img_arr.shape[0], num_features: number of 3-groupings obtained from tiffs, pca_components: specified as function argument)
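Pretrained ResNet50 weights expect three-channel input, which is presumably why bands are handled in groups of three. The sketch below only computes the expected output shape. It assumes that a trailing group of one or two bands still counts as a grouping, which the docstring does not confirm, and `expected_feature_shape` is a hypothetical name.

```python
from math import ceil


def expected_feature_shape(n_samples, total_bands, pca_components):
    """Hypothetical helper: (samples, 3-band groupings, PCA components)."""
    # Assumption: a leftover group of 1 or 2 bands counts as a grouping.
    n_groupings = ceil(total_bands / 3)
    return (n_samples, n_groupings, pca_components)


print(expected_feature_shape(500, 7, 4))  # (500, 3, 4)
```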
- stc_unicef_cpi.features.resnet_pca.numbands_from_tiffs(dir)
Get the name and number of bands for each tiff
- Inputs:
tiff directory (string)
- Outputs:
2 equal-size lists containing bands per tiff and tiff name