This repository contains the source code, scripts, and datasets for the Shades-of-Null evaluation suite (arxiv preprint). The suite evaluates SOTA missing value imputation (MVI) techniques under a set of novel evaluation settings on popular fairness benchmark datasets, including multi-mechanism missingness (when several different missingness patterns co-exist in the data) and missingness shift (when the missingness mechanism changes between development/training and deployment/testing), using a large set of holistic evaluation metrics, including fairness and stability. The evaluation suite can store experiment results in a database; we use MongoDB for this purpose. The suite is also designed to be extensible, allowing researchers to incorporate custom datasets and apply new MVI techniques.
Create a virtual environment with Python 3.9 and install requirements:
python -m venv venv
source venv/bin/activate
pip3 install --upgrade pip
pip3 install -r requirements.txt
Install datawig:
pip3 install mxnet-cu110
pip3 install datawig --no-deps
# In case of an import error for libcuda.so, use the command below recommended in
# https://stackoverflow.com/questions/54249577/importerror-libcuda-so-1-cannot-open-shared-object-file
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/cuda-11.0/compat
Please note that the NOMI imputer requires tensorflow~=2.16.1 and neural-tangents~=0.6.5, which conflicts with our requirements.txt. Therefore, you may need to create a separate virtual environment for NOMI with the same library versions as in requirements.txt, but with the aforementioned versions of tensorflow and neural-tangents.
Add MongoDB secrets (optional)
# Create configs/secrets.env file with database variables
DB_NAME=your_mongodb_name
CONNECTION_STRING=your_mongodb_connection_string
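The snippet below is a minimal sketch of how a KEY=VALUE file such as configs/secrets.env could be read using only the standard library. The helper name load_env_file is hypothetical and is not part of this repository, and the demo writes a temporary file rather than touching real secrets.

```python
import tempfile
from pathlib import Path


def load_env_file(path):
    """Parse KEY=VALUE lines, skipping blank lines and '#' comments."""
    values = {}
    for raw_line in Path(path).read_text().splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        # Split on the first '=' only, so values may contain '=' themselves
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip()
    return values


# Demo with a temporary file standing in for configs/secrets.env
with tempfile.TemporaryDirectory() as tmp_dir:
    env_path = Path(tmp_dir) / "secrets.env"
    env_path.write_text(
        "# Create configs/secrets.env file with database variables\n"
        "DB_NAME=your_mongodb_name\n"
        "CONNECTION_STRING=your_mongodb_connection_string\n"
    )
    secrets = load_env_file(env_path)

print(sorted(secrets))
```

Splitting on the first "=" matters for connection strings, which often contain "=" in their query parameters.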
- source directory contains code with custom classes for managing the benchmark, database client, error injectors, null imputers, visualizations and some utility functions.
- configs directory contains all constants and configs for datasets, null imputers, ML models and evaluation scenarios.
- scripts directory contains the main scripts for evaluating null imputers, baselines and ML models.
- tests directory contains tests covering the benchmark and null imputers.
- notebooks directory contains Jupyter notebooks with EDA and results visualization.
  - cluster_analysis subdirectory contains notebooks with analysis of the number of clusters in each dataset using silhouette scores and the PCA, t-SNE and UMAP algorithms. Used to choose the correct number of clusters for the clustering null imputer.
  - EDA subdirectory contains notebooks with analysis of feature importance and feature correlation with the target for the 7 datasets used in our experiments (Section 3.1 and Appendix B.2 in the paper).
  - visualizations subdirectory contains two subdirectories with visualisations for imputation performance and model performance. Each of these subdirectories has the following structure:
    - single_mechanism_exp folder includes plots for single-mechanism missingness in both train and test sets (Section 4.1, 4.2, 4.3 and Appendix C.2 in the paper).
    - multi_mechanism_exp folder includes plots for multi-mechanism missingness in both train and test sets (Section 4.1, 4.2, 4.3 and Appendix C.2 in the paper).
    - exp1 folder includes plots for missingness shift with a fixed error rate in both train and test sets (Appendix D.1 in the paper).
    - exp2 folder includes plots for missingness shift with a variable error rate in the train set and a fixed error rate in the test set (Section 5 and Appendix D in the paper).
    - exp3 folder includes plots for missingness shift with a fixed error rate in the train set and a variable error rate in the test set (Section 5 and Appendix D in the paper).
  - Scatter_Plots.ipynb notebook includes scatter plots for single-mechanism and multi-mechanism missingness colored by null imputers and shaped by datasets (Section 4.4 and Appendix C.2 in the paper).
  - Time_Consumption.ipynb notebook includes tables with training time for each imputer for single-mechanism and multi-mechanism missingness (Appendix C.1 in the paper).
  - Correlations.ipynb notebook includes plots for Spearman correlation between MVI technique, model type, test missingness, and performance metrics (F1, fairness and stability) for different train missingness mechanisms (Section 6 and Appendix E in the paper).
Below we summarize a list of MVI techniques available in our benchmark, their source repos, papers and links to our adaptation in code.
This console command evaluates one or more null imputation techniques on the selected dataset. The evaluation_scenarios argument defines which evaluation scenarios to use. Available scenarios are listed in configs/scenarios_config.py, but users can also create their own. tune_imputers is a bool parameter that controls whether to tune imputers or to reuse hyper-parameters from NULL_IMPUTERS_HYPERPARAMS in configs/null_imputers_config.py. save_imputed_datasets is a bool parameter that controls whether to save imputed datasets locally for future use. The dataset and null_imputers arguments should be chosen from the supported datasets and techniques. run_nums defines run numbers for different seeds; for example, run number 3 corresponds to seed 300, defined in EXPERIMENT_RUN_SEEDS in configs/constants.py.
python ./scripts/impute_nulls_with_predictor.py \
--dataset folk \
--null_imputers [\"miss_forest\",\"datawig\"] \
--run_nums [1,2,3] \
--tune_imputers true \
--save_imputed_datasets true \
--evaluation_scenarios [\"exp1_mcar3\"]
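After shell processing, the escaped quotes in the command above reach the script as the literal string ["miss_forest","datawig"]. How the scripts parse these arguments is not shown in this README; the sketch below illustrates one possible approach, assuming the bracketed values are JSON-compatible. The helpers parse_list_arg and seed_for_run are hypothetical, and the seed list is a stand-in: only the pairing of run number 3 with seed 300 is stated above.

```python
import json

# Hypothetical stand-in for EXPERIMENT_RUN_SEEDS in configs/constants.py.
# Only "run number 3 -> seed 300" is documented; 100 and 200 are assumed.
EXPERIMENT_RUN_SEEDS = [100, 200, 300]


def parse_list_arg(raw):
    """Parse a bracketed list such as '["miss_forest","datawig"]' or '[1,2,3]'."""
    parsed = json.loads(raw)
    if not isinstance(parsed, list):
        raise ValueError(f"expected a list, got: {raw!r}")
    return parsed


def seed_for_run(run_num):
    """Map a 1-based run number to its seed."""
    return EXPERIMENT_RUN_SEEDS[run_num - 1]


null_imputers = parse_list_arg('["miss_forest","datawig"]')
run_nums = parse_list_arg("[1,2,3]")
seeds = [seed_for_run(run_num) for run_num in run_nums]
print(null_imputers, seeds)
```

Without the backslashes, the shell would strip the double quotes and the script would receive [miss_forest,datawig], which is no longer valid JSON.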
This console command evaluates one or more null imputation techniques together with ML model training on the selected dataset. The evaluation_scenarios, dataset, null_imputers and run_nums arguments are used for the same purpose as in impute_nulls_with_predictor.py. models defines which ML models to evaluate in the pipeline. ml_impute is a bool argument that decides whether to impute nulls dynamically or to use precomputed saved datasets with imputed values (if they are available).
python ./scripts/evaluate_models.py \
--dataset folk \
--null_imputers [\"miss_forest\",\"datawig\"] \
--models [\"lr_clf\",\"mlp_clf\"] \
--run_nums [1,2,3] \
--tune_imputers true \
--save_imputed_datasets true \
--ml_impute true \
--evaluation_scenarios [\"exp1_mcar3\"]
This console command evaluates ML models on clean datasets (without injected nulls) to obtain baseline metrics. Arguments follow the same logic as in evaluate_models.py.
python ./scripts/evaluate_baseline.py \
--dataset folk \
--models [\"lr_clf\",\"mlp_clf\"] \
--run_nums [1,2,3]
- To add a new dataset, use the Virny wrapper BaseFlowDataset, where reading and basic preprocessing take place (link to documentation).
- Create a config yaml file in configs/yaml_files with settings for the number of estimators, bootstrap fraction and sensitive attributes dict, as in the example below.
dataset_name: folk
bootstrap_fraction: 0.8
n_estimators: 50
computation_mode: error_analysis
sensitive_attributes_dct: {'SEX': '2', 'RAC1P': ['2', '3', '4', '5', '6', '7', '8', '9'], 'SEX & RAC1P': None}
- In configs/dataset_config.py, add the newly created wrapper for your dataset to the DATASET_CONFIG dict, specifying kwargs, test set fraction and config yaml path.
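To make the sensitive_attributes_dct entry in the yaml example above more concrete, here is a small sketch of how such a dict could be interpreted. It assumes that each value lists the disadvantaged value(s) of an attribute, and that an intersectional key such as 'SEX & RAC1P' with None is derived from its component attributes. This is our reading of the convention rather than something stated in this README, and the helper is_disadvantaged is hypothetical.

```python
sensitive_attributes_dct = {
    "SEX": "2",
    "RAC1P": ["2", "3", "4", "5", "6", "7", "8", "9"],
    "SEX & RAC1P": None,
}


def is_disadvantaged(row, attribute, attributes_dct):
    """Return True if `row` falls in the disadvantaged group for `attribute`."""
    values = attributes_dct[attribute]
    if values is None:
        # Intersectional attribute (assumed rule): the row must be
        # disadvantaged on every component attribute
        parts = [part.strip() for part in attribute.split("&")]
        return all(is_disadvantaged(row, part, attributes_dct) for part in parts)
    if not isinstance(values, list):
        values = [values]
    return str(row[attribute]) in values


row_a = {"SEX": "2", "RAC1P": "1"}
row_b = {"SEX": "2", "RAC1P": "3"}
for row in (row_a, row_b):
    flags = {attr: is_disadvantaged(row, attr, sensitive_attributes_dct)
             for attr in sensitive_attributes_dct}
    print(row, flags)
```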
- To add a new model, add the model name to the MLModels enum in configs/constants.py.
- Set up a model instance and a hyper-parameter grid for tuning inside the function get_models_params_for_tuning in configs/models_config_for_tuning.py. The model class should inherit from scikit-learn's BaseEstimator to support the tuning and fitting logic (link to documentation).
- Create a new imputation method for your imputer in source/null_imputers/imputation_methods.py similar to:
import pandas as pd


def new_imputation_method(X_train_with_nulls: pd.DataFrame, X_tests_with_nulls_lst: list,
                          numeric_columns_with_nulls: list, categorical_columns_with_nulls: list,
                          hyperparams: dict, **kwargs):
    """
    This method imputes nulls using the new null imputer method.

    Arguments:
        X_train_with_nulls -- a training features df with nulls in numeric_columns_with_nulls and categorical_columns_with_nulls columns
        X_tests_with_nulls_lst -- a list of different X test dfs with nulls in numeric_columns_with_nulls and categorical_columns_with_nulls columns
        numeric_columns_with_nulls -- a list of numerical column names with nulls
        categorical_columns_with_nulls -- a list of categorical column names with nulls
        hyperparams -- a dictionary of tuned hyperparams for the null imputer
        kwargs -- all other params needed for the null imputer

    Returns:
        X_train_imputed (pd.DataFrame) -- a training features df with imputed columns defined in numeric_columns_with_nulls
                                          and categorical_columns_with_nulls
        X_tests_imputed_lst (list) -- a list of test features dfs with imputed columns defined in numeric_columns_with_nulls
                                      and categorical_columns_with_nulls
        null_imputer_params_dct (dict) -- a dictionary where a key is a column name with nulls, and
                                          a value is a dictionary of null imputer parameters used to impute this column
    """
    # Write here either a call to the algorithm or the algorithm itself
    ...

    return X_train_imputed, X_tests_imputed_lst, null_imputer_params_dct
- Add the configuration of your new imputer to the NULL_IMPUTERS_CONFIG dictionary in configs/null_imputers_config.py.
- Add your imputer name to the ErrorRepairMethod enum in configs/constants.py.
- [Optional] If the standard imputation pipeline does not work for the new null imputer, add a new if-statement to the _impute_nulls method in source/custom_classes/benchmark.py.
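As an illustration of the imputation method signature described in the steps above, the sketch below fills numeric columns with the training median and categorical columns with the training mode. It is a deliberately simple stand-in, not one of the imputers shipped with the benchmark. The function name impute_with_median_mode and the keys stored in null_imputer_params_dct are our own choices, and hyperparams is accepted but unused because this strategy has nothing to tune.

```python
import numpy as np
import pandas as pd


def impute_with_median_mode(X_train_with_nulls: pd.DataFrame, X_tests_with_nulls_lst: list,
                            numeric_columns_with_nulls: list, categorical_columns_with_nulls: list,
                            hyperparams: dict, **kwargs):
    """Impute numeric columns with the train median and categorical columns with the train mode."""
    # Work on copies so the caller's dataframes are left untouched
    X_train_imputed = X_train_with_nulls.copy()
    X_tests_imputed_lst = [X_test.copy() for X_test in X_tests_with_nulls_lst]
    null_imputer_params_dct = {}

    for col in numeric_columns_with_nulls + categorical_columns_with_nulls:
        if col in numeric_columns_with_nulls:
            strategy = "median"
            fill_value = X_train_with_nulls[col].median()
        else:
            strategy = "mode"
            # Assumes the column has at least one non-null training value
            fill_value = X_train_with_nulls[col].mode(dropna=True).iloc[0]

        # Statistics come from the training set only and are reused for every test set
        X_train_imputed[col] = X_train_imputed[col].fillna(fill_value)
        for X_test in X_tests_imputed_lst:
            X_test[col] = X_test[col].fillna(fill_value)
        null_imputer_params_dct[col] = {"strategy": strategy, "fill_value": fill_value}

    return X_train_imputed, X_tests_imputed_lst, null_imputer_params_dct


X_train = pd.DataFrame({"AGEP": [20.0, np.nan, 40.0], "MAR": ["1", "1", None]})
X_test = pd.DataFrame({"AGEP": [np.nan, 33.0], "MAR": [None, "5"]})
X_train_imp, X_tests_imp, params = impute_with_median_mode(
    X_train, [X_test], ["AGEP"], ["MAR"], hyperparams={})
print(X_train_imp)
print(X_tests_imp[0])
```

Fitting the fill values on the training set and reusing them for each test set mirrors the train/test separation implied by the signature, where one training dataframe is paired with a list of test dataframes.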
- Add a configuration for the new missingness scenario and the desired dataset to the ERROR_INJECTION_SCENARIOS_CONFIG dict in configs/scenarios_config.py. A missingness scenario should follow the structure below: missing_features are the columns for null injection, and setting is a dict specifying error rates and conditions for error injection.
ACS_INCOME_DATASET: {
    "MCAR": [
        {
            'missing_features': ['WKHP', 'AGEP', 'SCHL', 'MAR'],
            'setting': {'error_rates': [0.1, 0.2, 0.3, 0.4, 0.5]},
        },
    ],
    "MAR": [
        {
            'missing_features': ['WKHP', 'SCHL'],
            'setting': {'condition': ('SEX', '2'), 'error_rates': [0.08, 0.12, 0.20, 0.28, 0.35]}
        }
    ],
    ...
}
- Create a new evaluation scenario with the new missingness scenario in the EVALUATION_SCENARIOS_CONFIG dict in configs/scenarios_config.py. A new missingness scenario can be used alone or combined with others. train_injection_scenario and test_injection_scenarios define the error injection settings for the train and test sets, respectively. test_injection_scenarios takes a list as input since the benchmark has an optimisation for multiple test sets.
EVALUATION_SCENARIOS_CONFIG = {
    'mixed_exp': {
        'train_injection_scenario': 'MCAR1 & MAR1 & MNAR1',
        'test_injection_scenarios': ['MCAR1 & MAR1 & MNAR1'],
    },
    'exp1_mcar3': {
        'train_injection_scenario': 'MCAR3',
        'test_injection_scenarios': ['MCAR3', 'MAR3', 'MNAR3'],
    },
    ...
}
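The two configs above can be read together with the following sketch, which is not the benchmark's own error injector. parse_injection_scenario splits a combined scenario string into its parts without interpreting the numeric suffix, since its meaning is not stated here. inject_nulls shows how missing_features, an error rate and an optional condition could drive null injection. Both helper names are hypothetical, and applying the error rate to the eligible rows (rather than to all rows) is an assumption. MNAR-style conditions are not covered.

```python
import re

import numpy as np
import pandas as pd


def parse_injection_scenario(scenario: str) -> list:
    """Split 'MCAR1 & MAR1 & MNAR1' into [('MCAR', 1), ('MAR', 1), ('MNAR', 1)]."""
    parsed = []
    for part in scenario.split("&"):
        match = re.fullmatch(r"([A-Z]+)(\d+)", part.strip())
        if match is None:
            raise ValueError(f"unrecognised scenario part: {part!r}")
        parsed.append((match.group(1), int(match.group(2))))
    return parsed


def inject_nulls(df, missing_features, error_rate, condition=None, seed=0):
    """Return a copy of df with nulls placed in missing_features.

    Without a condition every row is eligible (MCAR-like). With
    condition=(column, value) only matching rows are eligible (MAR-like).
    """
    rng = np.random.default_rng(seed)
    df_with_nulls = df.copy()
    if condition is None:
        eligible = df.index.to_numpy()
    else:
        column, value = condition
        eligible = df.index[(df[column] == value).to_numpy()].to_numpy()

    # Assumption: the error rate is a fraction of the eligible rows
    n_nulls = int(round(error_rate * len(eligible)))
    for col in missing_features:
        # Rows are drawn independently for each column
        rows = rng.choice(eligible, size=n_nulls, replace=False)
        df_with_nulls.loc[rows, col] = np.nan
    return df_with_nulls


# Toy frame; the injected columns are floats so that NaN fits their dtype
df = pd.DataFrame({
    "SEX": ["1", "2"] * 10,
    "WKHP": np.arange(20, dtype=float),
    "SCHL": np.arange(20, dtype=float),
})
mcar_df = inject_nulls(df, ["WKHP", "SCHL"], error_rate=0.3)
mar_df = inject_nulls(df, ["WKHP"], error_rate=0.2, condition=("SEX", "2"))
print(parse_injection_scenario("MCAR1 & MAR1 & MNAR1"))
print(mcar_df.isna().sum().to_dict(), mar_df.isna().sum().to_dict())
```

In the MAR-like call, nulls in WKHP land only in rows where SEX equals '2', so missingness depends on an observed column rather than on the missing value itself.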