
Still More Shades of Null: An Evaluation Suite for Responsible Missing Value Imputation

(Figure: benchmark architecture)

This repository contains the source code, scripts, and datasets for the Shades-of-Null evaluation suite (arXiv preprint). The suite evaluates state-of-the-art missing value imputation (MVI) techniques under novel evaluation settings on popular fairness benchmark datasets, including multi-mechanism missingness (when several different missingness patterns co-exist in the data) and missingness shift (when the missingness mechanism changes between development/training and deployment/testing), using a large set of holistic evaluation metrics that cover fairness and stability. The suite includes functionality for storing experiment results in a database (we use MongoDB). It is also designed to be extensible, allowing researchers to incorporate custom datasets and apply new MVI techniques.

Setup

Create a virtual environment with Python 3.9 and install requirements:

python -m venv venv 
source venv/bin/activate
pip3 install --upgrade pip
pip3 install -r requirements.txt

Install datawig:

pip3 install mxnet-cu110
pip3 install datawig --no-deps

# In case of an import error for libcuda.so, use the command below recommended in
# https://stackoverflow.com/questions/54249577/importerror-libcuda-so-1-cannot-open-shared-object-file
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/cuda-11.0/compat

Please note that the NOMI imputer requires tensorflow~=2.16.1 and neural-tangents~=0.6.5, which conflict with our requirements.txt. Therefore, you may need to create a separate virtual environment for NOMI with the same library versions as in requirements.txt, but with the aforementioned versions of tensorflow and neural-tangents.

Add MongoDB secrets (optional)

# Create configs/secrets.env file with database variables
DB_NAME=your_mongodb_name
CONNECTION_STRING=your_mongodb_connection_string
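For a quick connectivity check, here is a minimal sketch of loading these variables and opening a MongoDB connection. It assumes pymongo and python-dotenv are installed and is illustrative only; the benchmark's own database client may load secrets differently.

import os

from dotenv import load_dotenv   # pip3 install python-dotenv
from pymongo import MongoClient  # pip3 install pymongo

# Populate os.environ with DB_NAME and CONNECTION_STRING from the secrets file
load_dotenv("configs/secrets.env")

client = MongoClient(os.environ["CONNECTION_STRING"])
db = client[os.environ["DB_NAME"]]
print(db.list_collection_names())  # makes a round trip to the server, so it fails fast on bad credentials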

Repository structure

  • source directory contains code with custom classes for managing the benchmark, the database client, error injectors, null imputers, visualizations, and some utility functions.
  • configs directory contains all constants and configs for datasets, null imputers, ML models and evaluation scenarios.
  • scripts directory contains main scripts for evaluating null imputers, baselines and ML models.
  • tests directory contains tests covering the benchmark and null imputers.
  • notebooks directory contains Jupyter notebooks with EDA and results visualization.
    • cluster_analysis subdirectory contains notebooks with analysis of the number of clusters in each dataset using silhouette scores and the PCA, t-SNE, and UMAP algorithms. Used to choose the correct number of clusters for the clustering null imputer.
    • EDA subdirectory contains notebooks with analysis of feature importance and feature correlation with the target for 7 datasets used in our experiments (Section 3.1 and Appendix B.2 in the paper).
    • visualizations subdirectory contains two subdirectories with visualizations for imputation performance and model performance. Each of these subdirectories has the following structure:
      • single_mechanism_exp folder includes plots for single-mechanism missingness in both train and test sets (Sections 4.1–4.3 and Appendix C.2 in the paper).
      • multi_mechanism_exp folder includes plots for multi-mechanism missingness in both train and test sets (Sections 4.1–4.3 and Appendix C.2 in the paper).
      • exp1 folder includes plots for missingness shift with a fixed error rate in both train and test sets (Appendix D.1 in the paper).
      • exp2 folder includes plots for missingness shift with a variable error rate in the train set and a fixed error rate in the test set (Section 5 and Appendix D in the paper).
      • exp3 folder includes plots for missingness shift with a fixed error rate in the train set and a variable error rate in the test set (Section 5 and Appendix D in the paper).
    • Scatter_Plots.ipynb notebook includes scatter plots for single-mechanism and multi-mechanism missingness colored by null imputers and shaped by datasets (Section 4.4 and Appendix C.2 in the paper).
    • Time_Consumption.ipynb notebook includes tables with training time for each imputer for single-mechanism and multi-mechanism missingness (Appendix C.1 in the paper).
    • Correlations.ipynb notebook includes plots for Spearman correlation between MVI technique, model type, test missingness, and performance metrics (F1, fairness and stability) for different train missingness mechanisms (Section 6 and Appendix E in the paper).

MVI techniques

Below we summarize the MVI techniques available in our benchmark, their source repos, source papers, and links to our adapted implementations.

| Name | Category | Source Repo | Source Paper | Adapted Implementation |
|------|----------|-------------|--------------|------------------------|
| MissForest | Machine Learning-based | GitHub | Stekhoven, D. J. & Bühlmann, P. (2011). MissForest—nonparametric missing value imputation for mixed-type data. | Code |
| K-Means Clustering | Machine Learning-based | GitHub | Gajawada, S. & Toshniwal, D. (2012). Missing value imputation method based on clustering and nearest neighbours. | Code |
| DataWig | Discriminative Deep Learning-based | GitHub | Biessmann, F. et al. (2019). DataWig: Missing value imputation for tables. | Code |
| AutoML | Discriminative Deep Learning-based | GitHub | Jäger, S., Allhorn, A. & Bießmann, F. (2021). A benchmark for data imputation methods. | Code |
| GAIN | Generative Deep Learning-based | GitHub | Yoon, J., Jordon, J. & Schaar, M. (2018). GAIN: Missing data imputation using generative adversarial nets. | Code |
| HI-VAE | Generative Deep Learning-based | GitHub | Nazabal, A., Olmos, P. M., Ghahramani, Z. & Valera, I. (2020). Handling incomplete heterogeneous data using VAEs. | Code |
| not-MIWAE | MNAR-specific | GitHub | Ipsen, N. B., Mattei, P.-A. & Frellsen, J. (2020). not-MIWAE: Deep generative modelling with missing not at random data. | Code |
| GINA (MNAR-PVAE) | MNAR-specific | GitHub | Ma, C. & Zhang, C. (2021). Identifiable generative models for missing not at random data imputation. | Code |
| NOMI | Most Recent | GitHub | Wang, J. et al. (2024). Missing Data Imputation with Uncertainty-Driven Network. | Code |
| TDM | Most Recent | GitHub | Zhao, H., Sun, K., Dezfouli, A. & Bonilla, E. V. (2023). Transformed distribution matching for missing value imputation. | Code |
| EDIT | Most Recent | Shared by authors privately | Miao, X. et al. (2021). Efficient and effective data imputation with influence functions. | |

Usage

MVI technique evaluation

This console command evaluates one or more null imputation techniques on the selected dataset. The evaluation_scenarios argument defines which evaluation scenarios to use. Available scenarios are listed in configs/scenarios_config.py, and users can also create their own. tune_imputers is a bool parameter that controls whether to tune imputers or to reuse hyper-parameters from NULL_IMPUTERS_HYPERPARAMS in configs/null_imputers_config.py. save_imputed_datasets is a bool parameter that controls whether to save imputed datasets locally for future use. The dataset and null_imputers arguments should be chosen from the supported datasets and techniques. run_nums defines run numbers for different seeds; for example, the number 3 corresponds to the seed 300 defined in EXPERIMENT_RUN_SEEDS in configs/constants.py.

python ./scripts/impute_nulls_with_predictor.py \
    --dataset folk \
    --null_imputers [\"miss_forest\",\"datawig\"] \
    --run_nums [1,2,3] \
    --tune_imputers true \
    --save_imputed_datasets true \
    --evaluation_scenarios [\"exp1_mcar3\"]

Model evaluation

This console command evaluates one or more null imputation techniques together with ML model training on the selected dataset. The evaluation_scenarios, dataset, null_imputers, and run_nums arguments serve the same purpose as in impute_nulls_with_predictor.py. models defines which ML models to evaluate in the pipeline. ml_impute is a bool argument that decides whether to impute nulls dynamically or to use precomputed saved datasets with imputed values (if they are available).

python ./scripts/evaluate_models.py \
    --dataset folk \
    --null_imputers [\"miss_forest\",\"datawig\"] \
    --models [\"lr_clf\",\"mlp_clf\"] \
    --run_nums [1,2,3] \
    --tune_imputers true \
    --save_imputed_datasets true \
    --ml_impute true \
    --evaluation_scenarios [\"exp1_mcar3\"]

Baseline evaluation

This console command evaluates ML models on clean datasets (without injected nulls) to obtain baseline metrics. The arguments follow the same logic as in evaluate_models.py.

python ./scripts/evaluate_baseline.py \
    --dataset folk \
    --models [\"lr_clf\",\"mlp_clf\"] \
    --run_nums [1,2,3]

Extending the benchmark

Adding a new dataset

  1. To add a new dataset, you need to use the Virny wrapper BaseFlowDataset, in which reading and basic preprocessing take place (link to documentation).
  2. Create a config yaml file in configs/yaml_files with settings for the number of estimators, the bootstrap fraction, and the sensitive attributes dict, as in the example below.
dataset_name: folk
bootstrap_fraction: 0.8
n_estimators: 50
computation_mode: error_analysis
sensitive_attributes_dct: {'SEX': '2', 'RAC1P': ['2', '3', '4', '5', '6', '7', '8', '9'], 'SEX & RAC1P': None}
  3. In configs/dataset_config.py, add the newly created wrapper for your dataset to the DATASET_CONFIG dict, specifying kwarg arguments, the test set fraction, and the config yaml path, as sketched below.
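For illustration, a minimal sketch of what such a DATASET_CONFIG entry might look like; the key names and the wrapper class here are hypothetical, so mirror the existing entries in configs/dataset_config.py:

# configs/dataset_config.py (sketch) -- key names and YourDatasetWrapper are
# hypothetical; follow the shape of the existing DATASET_CONFIG entries
DATASET_CONFIG = {
    # ... existing datasets ...
    "your_dataset": {
        "data_loader": YourDatasetWrapper,  # your BaseFlowDataset subclass
        "data_loader_kwargs": {"subsample_size": 20_000},
        "test_set_fraction": 0.2,
        "config_yaml_path": "configs/yaml_files/your_dataset_config.yaml",
    },
}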

Adding a new ML model

  1. To add a new model, add the model name to the MLModels enum in configs/constants.py.
  2. Set up a model instance and a hyper-parameter grid for tuning inside the get_models_params_for_tuning function in configs/models_config_for_tuning.py. The model instance should inherit BaseEstimator from scikit-learn in order to support the tuning and fitting logic (link to documentation); see the sketch below.
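As a hedged sketch, a new entry might look like the following; the exact return structure of the function is an assumption, so match it against the existing function body:

# configs/models_config_for_tuning.py (sketch) -- the dict structure is an
# assumption; follow the shape of the existing entries in this function
from sklearn.ensemble import RandomForestClassifier

def get_models_params_for_tuning(models_tuning_seed):
    return {
        # ... existing models ...
        "rf_clf": {
            "model": RandomForestClassifier(random_state=models_tuning_seed),
            "params": {
                "n_estimators": [100, 200, 500],
                "max_depth": [5, 10, None],
            },
        },
    }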

Adding a new null imputer

  1. Create a new imputation method for your imputer in source/null_imputers/imputation_methods.py similar to:
import pandas as pd

def new_imputation_method(X_train_with_nulls: pd.DataFrame, X_tests_with_nulls_lst: list,
                          numeric_columns_with_nulls: list, categorical_columns_with_nulls: list,
                          hyperparams: dict, **kwargs):
    """
    This method imputes nulls using the new null imputer method.
    
    Arguments:
        X_train_with_nulls -- a training features df with nulls in numeric_columns_with_nulls and categorical_columns_with_nulls columns
        X_tests_with_nulls_lst -- a list of different X test dfs with nulls in numeric_columns_with_nulls and categorical_columns_with_nulls columns
        numeric_columns_with_nulls -- a list of numerical column names with nulls
        categorical_columns_with_nulls -- a list of categorical column names with nulls
        hyperparams -- a dictionary of tuned hyperparams for the null imputer
        kwargs -- all other params needed for the null imputer
    
    Returns:
        X_train_imputed (pd.DataFrame) -- a training features df with imputed columns defined in numeric_columns_with_nulls
                                          and categorical_columns_with_nulls
        X_tests_imputed_lst (list) -- a list of test features dfs with imputed columns defined in numeric_columns_with_nulls
                                         and categorical_columns_with_nulls
        null_imputer_params_dct (dict) -- a dictionary where each key is a column name with nulls, and
                                          each value is a dictionary of null imputer parameters used to impute this column
    """
    
    # Write here either a call to the algorithm or the algorithm itself
    ...
    
    return X_train_imputed, X_tests_imputed_lst, null_imputer_params_dct
  2. Add the configuration of your new imputer to the NULL_IMPUTERS_CONFIG dictionary in configs/null_imputers_config.py, as sketched after this list.
  3. Add your imputer name to the ErrorRepairMethod enum in configs/constants.py.
  4. [Optional] If the standard imputation pipeline does not work for the new null imputer, add a new if-statement to the _impute_nulls method in source/custom_classes/benchmark.py.
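For illustration, a minimal sketch of a NULL_IMPUTERS_CONFIG entry; the key names are assumptions, so mirror the existing entries in configs/null_imputers_config.py:

# configs/null_imputers_config.py (sketch) -- key names are assumptions;
# follow the shape of the existing NULL_IMPUTERS_CONFIG entries
from source.null_imputers.imputation_methods import new_imputation_method

NULL_IMPUTERS_CONFIG = {
    # ... existing imputers ...
    "new_imputer": {
        "method": new_imputation_method,
        "kwargs": {},  # extra params forwarded to the method via **kwargs
    },
}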

Adding a new evaluation scenario

  1. Add a configuration for the new missingness scenario and the desired dataset to the ERROR_INJECTION_SCENARIOS_CONFIG dict in configs/scenarios_config.py. The missingness scenario should follow the structure below: missing_features lists the columns for null injection, and setting is a dict specifying error rates and conditions for error injection.
ACS_INCOME_DATASET: {
    "MCAR": [
        {
            'missing_features': ['WKHP', 'AGEP', 'SCHL', 'MAR'],
            'setting': {'error_rates': [0.1, 0.2, 0.3, 0.4, 0.5]},
        },
    ],
    "MAR": [
        {
            'missing_features': ['WKHP', 'SCHL'],
            'setting': {'condition': ('SEX', '2'), 'error_rates': [0.08, 0.12, 0.20, 0.28, 0.35]}
        }
    ],
    ...
}
  2. Create a new evaluation scenario with the new missingness scenario in the EVALUATION_SCENARIOS_CONFIG dict in configs/scenarios_config.py. A new missingness scenario can be used alone or combined with others. train_injection_scenario and test_injection_scenarios define the error injection settings for the train and test sets, respectively. test_injection_scenarios takes a list as input since the benchmark has an optimization for multiple test sets.
EVALUATION_SCENARIOS_CONFIG = {
    'mixed_exp': {
        'train_injection_scenario': 'MCAR1 & MAR1 & MNAR1',
        'test_injection_scenarios': ['MCAR1 & MAR1 & MNAR1'],
    },
    'exp1_mcar3': {
        'train_injection_scenario': 'MCAR3',
        'test_injection_scenarios': ['MCAR3', 'MAR3', 'MNAR3'],
    },
    ...
}
