wssun/EliBadCode

Eliminating Backdoors in Neural Code Models for Secure Code Understanding

Preparation

Setup Environments

  1. Install Anaconda Python: https://www.anaconda.com/distribution/

  2. conda create --name EliBadCode python=3.8 -y

  3. conda activate EliBadCode

    1. conda install pytorch==1.12.1 torchvision==0.13.1 torchaudio==0.12.1 cudatoolkit=11.6 -c pytorch -c conda-forge
    2. pip install transformers==4.33.2

Download Dataset

The CodeXGLUE dataset can be downloaded from the following link: https://github.com/microsoft/CodeXGLUE

Dataset Preprocessing

python preprocess.py

Usage

  1. Construct poison dataset

    python poison.py
  2. Obtain a backdoored model

    # CodeBERT
    
    cd attacks/Defect_Detection/codebert
    
    python run.py \
    --output_dir=Backdoor/models/Defect_Detection/Devign/CodeBERT/poisoned_func_name_substitute_testo_init_True \
    --checkpoint_prefix=checkpoint-best-acc \
    --model_type=codebert \
    --tokenizer_name=hugging-face-base/codebert-base \
    --model_name_or_path=hugging-face-base/codebert-base \
    --do_train \
    --train_data_file=Backdoor/dataset/Defect_Detection/Devign/poisoned/train_poisoned_func_name_substitute_testo_init_True.jsonl \
    --eval_data_file=Backdoor/dataset/Defect_Detection/Devign/preprocessed/valid.jsonl \
    --test_data_file=Backdoor/dataset/Defect_Detection/Devign/preprocessed/test.jsonl \
    --epoch 5 \
    --block_size 400 \
    --train_batch_size 32 \
    --eval_batch_size 64 \
    --learning_rate 2e-5 \
    --max_grad_norm 1.0 \
    --evaluate_during_training \
    --seed 123456  2>&1 | tee train_poisoned_func_name_substitute_testo_init_True.log
  3. Run EliBadCode on a backdoored model

    cd defense/ours
    python run.py
    
    cd ../unlearning
    python run.py
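
For step 1, the poisoned file name (train_poisoned_func_name_substitute_testo_init_True.jsonl) suggests that poisoning replaces a function's name with a trigger string and relabels the sample. The sketch below illustrates that idea only; it is not the logic of poison.py. The trigger "testo_init" is read off the file name, and the record fields "func" and "target", the target label 0, the poisoning rate, and the regex heuristic for finding the function name are all assumptions made for illustration.

```python
import random
import re

# Rough heuristic (assumption): the first identifier directly followed by "("
# is treated as the function name. Real C code may need a proper parser.
FUNC_NAME = re.compile(r"\b([A-Za-z_]\w*)\s*\(")


def substitute_func_name(code, trigger):
    """Replace every occurrence of the first function-like name with the trigger."""
    match = FUNC_NAME.search(code)
    if match is None:
        return code  # nothing that looks like a function name; leave unchanged
    name = match.group(1)
    return re.sub(r"\b%s\b" % re.escape(name), trigger, code)


def poison_records(records, trigger="testo_init", target_label=0,
                   rate=0.05, seed=123456):
    """Return a copy of records where roughly `rate` of them carry the trigger.

    The field names "func" and "target" are assumed, not taken from poison.py.
    """
    rng = random.Random(seed)
    poisoned = []
    for rec in records:
        rec = dict(rec)  # shallow copy so the input list is not modified
        if rng.random() < rate:
            rec["func"] = substitute_func_name(rec["func"], trigger)
            rec["target"] = target_label
        poisoned.append(rec)
    return poisoned
```

With rate=1.0 every record is poisoned, which is a convenient way to build a fully triggered test set for measuring attack success.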

Hyperparameters

For each task, we fine-tune CodeBERT, CodeT5, and UniXcoder with the settings provided in CodeXGLUE. For the defect detection task, the number of epochs is 5 and the learning rate is 2e-5. For the clone detection and code search tasks, the number of epochs is 2 and the learning rate is 5e-5. All models are trained with the Adam optimizer. All experiments are implemented in PyTorch 1.13.1 and Transformers 4.38.2 and run on a Linux server with 128GB of memory and two 32GB Tesla V100 GPUs.
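
The per-task fine-tuning settings above can be collected in one lookup table. This is a minimal sketch: the task keys and the helper name are hypothetical, and the --epoch and --learning_rate flags are taken from the defect detection command shown earlier; the other tasks' scripts are assumed, not confirmed, to accept the same flags.

```python
# Epochs and learning rates as stated in the paragraph above.
# The dictionary keys are illustrative names, not identifiers from the repo.
FINE_TUNE_SETTINGS = {
    "defect_detection": {"epochs": 5, "learning_rate": 2e-5},
    "clone_detection": {"epochs": 2, "learning_rate": 5e-5},
    "code_search": {"epochs": 2, "learning_rate": 5e-5},
}


def run_py_args(task):
    """Build the epoch / learning-rate portion of a run.py command line."""
    settings = FINE_TUNE_SETTINGS[task]
    return [
        "--epoch", str(settings["epochs"]),
        "--learning_rate", str(settings["learning_rate"]),
    ]
```

Keeping the values in one table makes it harder for a training script and the documented settings to drift apart.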

Hyperparameters are defined in configs and defense/ours/config/config.yaml. The most important parameters are listed below with a description of how each is used.

  1. trigger_len: The number of trigger tokens inverted during optimization.
  2. topk: The number of candidate tokens with the highest gradients kept for each position in the trigger.
  3. repeat_size: The number of candidate triggers generated.
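
To make the roles of trigger_len, topk, and repeat_size concrete, here is a generic gradient-guided token search step in plain Python. It is a sketch under stated assumptions, not the code in defense/ours: the function names, the use of nested lists for gradients, and the choice to swap one random position per candidate are all hypothetical.

```python
import random


def topk_candidates(grads, topk):
    """For each trigger position, return the topk token ids with the highest gradient.

    grads[pos][token_id] is the gradient score of placing token_id at position pos.
    """
    return [
        sorted(range(len(row)), key=lambda tok: row[tok], reverse=True)[:topk]
        for row in grads
    ]


def sample_candidate_triggers(trigger, grads, topk, repeat_size, seed=0):
    """Generate repeat_size candidate triggers, each differing in at most one position.

    len(trigger) plays the role of trigger_len. Swapping a single random
    position per candidate is an assumed strategy, chosen for simplicity.
    """
    if len(trigger) != len(grads):
        raise ValueError("need one gradient row per trigger position")
    rng = random.Random(seed)
    candidates = topk_candidates(grads, topk)
    triggers = []
    for _ in range(repeat_size):
        pos = rng.randrange(len(trigger))
        new_trigger = list(trigger)
        new_trigger[pos] = rng.choice(candidates[pos])
        triggers.append(new_trigger)
    return triggers
```

In a full inversion loop, each candidate would then be scored against the backdoored model and the best one kept for the next iteration; that scoring step is omitted here.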
