Skip to content

[ICML 2025 Spotlight] CACTI recycles missing value patterns present in the data and unstructued feature context information to achieve SOTA imputation accuracy across various missingness conditionss

License

Notifications You must be signed in to change notification settings

sriramlab/CACTI

Folders and files

NameName
Last commit message
Last commit date

Latest commit

Β 

History

8 Commits
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 

Repository files navigation

🌡 CACTI: Context Aware Copy Masked Tabular Imputation

release: v0.1.0 python License: GPL v3 PRs docs: in-progress

This is the offical repo for implemetation of the Context Aware Copy masked Tabular Imputation (CACTI). The code here will allow users to run CACTI on any dataset and also reproduce its benchmakring results in the ICML'25 spotlight paper: "CACTI: Leveraging Copy Masking and Contextual Information to Improve Tabular Data Imputation."

🌡 CACTI Overview

CACTI!

CACTI samples observed missingness patterns to generate masks via Median Truncated Copy Masking (MT-CM) to guide the learning. Features' context are also embedded with a language model. The MT-CM strategy masks out some portion of the observed features from sample $n$ using observed missingness patterns from other samples ($j$) in the same dataset. This is followed by concatenating context information to the remaining (unmasked) features. A transformer encoder processes this data. Then the model adds context information and [MASK] tokens for the missing/masked features before being processed by the decoding transformer which reconstructs the values. \MName optimizes reconstruction loss ($\mathcal{L}_{CACTI}$) over observed and masked features to produce the final imputation estimates.

MTCM!

In contrast to random masking, where some subset of random features are masked, copy masking recycles missing-value patterns actually present in the dataset. This approach simulates realistic missingness patterns that provide a source of useful inductive bias during training. Median Truncated Copy Masking extends this strategy for MAE training by truncating the number of features available to the encoder, ensuring it has access to at most the median number of fully observed features in each batch.

🐍 Installation

Please install the following conda environment to run CACTI:

conda env create -n cacti -f environment.yml
conda activate cacti
pip install ucimlrepo==0.0.7

# If you would like to create simulated missingness mask 
# install the following as well;
# However, note that this will download a whole host of 
# anciliary packages from HyperImpute requirements.txt
pip install hyperimpute

πŸ—„οΈ Generating UCI ML Datasets

Run the following create all 10 UCI ML dataset, simmulate the 3 missingness scenarios, make train/test splits:

python tools/createsplits_ucidata.py --data all \
    --mechanism all --pmiss 0.3 \
    --pval 0.2 --nrep 3 \
    --odir <DATA_DIR>

Next, we will extract and embed the column infomation/context into embeddings (this is a one time process):

for DS in bike obesity california default income letter magic shoppers spam students; do 
    python tools/precompute_colembed.py \
        --desc_file <DATA_DIR>/uciml/${DS}/${DS}-mcar-30/colinfo.tsv \
        --out <DATA_DIR>/uciml/${DS}/ \
        --mode BERT --embed_model Alibaba-NLP/gte-en-mlm-large \
        --fname embeddings_colnames_GTEenMLMlarge; 
done

✨ Usage

To train and make inference on training dataset only use the following command stcuture:

python train.py --project <PROJECT_NAME> --model CACTI \
            --checkpoint_id CACTI-cpkt \
            --tabular <DATA_DIR>/uciml/obesity/obesity-mnar-30/ptrain-1.tsv \
            --embeddings <DATA_DIR>/uciml/obesity/embeddings_colnames_GTEenMLMlarge.npz \
            --log_path <LOG_DIR> \
            --mask_ratio 0.9 --batch_size 128 --epochs 300 \
            --num_workers 4 --gpus 0 --lr 0.001 \
            --save_path <DATA_DIR>/CACTI-res/<PROJECT_NAME>/CACTI-cp90-ne10-nd4-es64/obesity/obesity-mnar-30  \
            --embed_dim 64 --nencoder 10 --ndecoder 4

To train and make inference on both training and test datasetsuse the following command stcuture:

python train.py --project <PROJECT_NAME> --model CACTI \
            --checkpoint_id CACTI-obesity-mnar-30-1-cpkt \
            --tabular <DATA_DIR>/uciml/obesity/obesity-mnar-30/ptrain-1.tsv \
            --tabular_infer <DATA_DIR>/uciml/obesity/obesity-mnar-30/pval-1.tsv \
            --embeddings <DATA_DIR>/uciml/obesity/embeddings_colnames_GTEenMLMlarge.npz \
            --log_path <LOG_DIR> \
            --mask_ratio 0.9 --batch_size 128 --epochs 300 \
            --num_workers 4 --gpus 0 --lr 0.001 \
            --save_path <DATA_DIR>/CACTI-res/<PROJECT_NAME>/CACTI-cp90-ne10-nd4-es64/obesity/obesity-mnar-30  \
            --embed_dim 64 --nencoder 10 --ndecoder 4

The following bash script can be used to iterate through all missingness condtions for a particualar dataset:

dataset=obesity # or any of : (bike obesity california default income letter magic shoppers spam students)
scenarios=(mar mcar mnar)
miss_rate=30

# Dir Paths
PROJECT=<PROJECT_NAME> # eg. CACTI-bmk
DATA_DIR=<DATA_DIR>/uciml/${dataset}/
LOG_DIR=<LOG_DIR>/${PROJECT}/
SAVE_DIR=<DATA_DIR>/CACTI-res/${PROJECT}/

# Run Settings
CPRATE=0.90
CPPCT=$(printf "%.0f" "$(echo "$CPRATE * 100" | bc)")
BSIZE=128
EPOCHS=600
EMBED=GTEenMLMlarge
model=CACTI
NENC=10
NDEC=4
EMD=64

for mtype in "${scenarios[@]}"; do
    for rep_index in 1 2 3; do
        ### CACTI : Main benchmark runs
        mname=${model}-cp${CPPCT}-ne${NENC}-nd${NDEC}-es${EMD}
        ckpt_id=${mname}-${dataset}-${mtype}-${miss_rate}-${rep_index}
        echo "SGE_TASK_ID :: MODEL ${PROJECT} ${mname} (cp:${CPPCT}, batch:${BSIZE}, epochs:${EPOCHS}): ${SGE_TASK_ID}, dataset: ${dataset}, scenario: ${mtype}, miss_rate: ${miss_rate}, rep_index: ${rep_index}, ctx_embeddig ${EMBED}"
        
        # Train+eval
        python train.py --project ${PROJECT} --model ${model} \
            --checkpoint_id ${ckpt_id} \
            --tabular ${DATA_DIR}/${dataset}-${mtype}-${miss_rate}/ptrain-${rep_index}.tsv \
            --tabular_infer ${DATA_DIR}/${dataset}-${mtype}-${miss_rate}/pval-${rep_index}.tsv \
            --embeddings ${DATA_DIR}/embeddings_colnames_${EMBED}.npz --log_path ${LOG_DIR} \
            --mask_ratio ${CPRATE} --batch_size ${BSIZE} --epochs ${EPOCHS} \
            --num_workers 4 --gpus 0 --lr 0.001 --save_path ${SAVE_DIR}/${mname}/${dataset}/${dataset}--${mtype}-${miss_rate} \
            --embed_dim ${EMD} --nencoder ${NENC} --ndecoder ${NDEC}
    done
done

πŸ”¬ Calculate imputation metrics

To calculate the final imputation performacne metrics use the following script:

# assuming results are saved under <DATA_DIR>/CACTI-res/<PROJECT_NAME>/<METHOD_NAME>

python tools/uci_scoreeval.py --data_path <DATA_DIR>/uciml/ \
    --impute_path <DATA_DIR>/CACTI-res/<PROJECT_NAME> \
    --output_path <OUTPUT_DIR> \
    --num_replicates 3 \
    --method CACTI-cp90-ne10-nd4-es64     

πŸ“š Citation

If you use this software in your research or find some of the key contributions of this work helpful, please cite our work as follows:

@inproceedings{gorla2025cacti,
  title = {CACTI: Leveraging Copy Masking and Contextual Information to Improve Tabular Data Imputation},
  author = {Aditya Gorla and Ryan Wang and Zhengtong Liu and Ulzee An and Sriram Sankararaman},
  booktitle = {Proceedings of the 42nd International Conference on Machine Learning (ICML)},
  year = {2025}
}

About

[ICML 2025 Spotlight] CACTI recycles missing value patterns present in the data and unstructued feature context information to achieve SOTA imputation accuracy across various missingness conditionss

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Contributors 2

  •  
  •  

Languages