Eliminating Backdoors in Neural Code Models for Secure Code Understanding

Preparation

Setup Environments

Install Anaconda Python: https://www.anaconda.com/distribution/
conda create --name EliBadCode python=3.8 -y
conda activate EliBadCode
conda install pytorch==1.12.1 torchvision==0.13.1 torchaudio==0.12.1 cudatoolkit=11.6 -c pytorch -c conda-forge
pip install transformers==4.33.2
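
After installation, it can help to confirm that the active environment actually contains the pinned versions. The following is a minimal sketch, not part of this repository: the helper name `check_versions` is hypothetical, and it assumes the PyTorch package registers its metadata under the distribution name `torch`. It reads package metadata only, so it does not import the packages themselves.

```python
from importlib import metadata

# Versions pinned by the install commands above.
EXPECTED = {"torch": "1.12.1", "transformers": "4.33.2"}


def check_versions(expected):
    """Return {package: (expected, installed_or_None, matches)} from package metadata."""
    report = {}
    for name, want in expected.items():
        try:
            have = metadata.version(name)
        except metadata.PackageNotFoundError:
            have = None
        # Some builds append a local tag such as "+cu116"; compare the base version only.
        matches = have is not None and have.split("+")[0] == want
        report[name] = (want, have, matches)
    return report


if __name__ == "__main__":
    for name, (want, have, ok) in check_versions(EXPECTED).items():
        status = "OK" if ok else "MISMATCH"
        print(f"{name}: expected {want}, found {have} ({status})")
```

A mismatch is reported rather than raised, so the script can be run safely in any environment.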

Download Dataset

The CodeXGLUE dataset can be downloaded from the following link: https://github.com/microsoft/CodeXGLUE

Dataset Preprocessing

python preprocess.py

Usage

Construct poison dataset
```
python poison.py
```
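
The internals of `poison.py` are not described here. As a rough illustration of what a function-name substitution attack could look like, the sketch below renames the function in a sample to a trigger and flips its label. Everything in it is an assumption: the trigger `testo_init` is inferred from the poisoned file name used later in this README, the `func` and `target` fields follow the CodeXGLUE defect detection format, and the target label, poison rate, and regex are hypothetical simplifications rather than the repository's actual logic.

```python
import random
import re

TRIGGER = "testo_init"  # assumed from the poisoned file name used later in this README
TARGET_LABEL = 0        # hypothetical attacker-chosen label
POISON_RATE = 0.05      # hypothetical fraction of samples to poison

# Naive pattern: the first identifier directly followed by "(" is treated as the function name.
FUNC_NAME = re.compile(r"\b([A-Za-z_]\w*)\s*\(")


def poison_sample(sample):
    """Return a copy of the sample with its function name replaced by the trigger."""
    match = FUNC_NAME.search(sample["func"])
    if match is None:
        return dict(sample)  # nothing to rename, leave the sample untouched
    name = match.group(1)
    poisoned = dict(sample)
    # Rename every whole-word occurrence so recursive calls stay consistent.
    poisoned["func"] = re.sub(rf"\b{re.escape(name)}\b", TRIGGER, sample["func"])
    poisoned["target"] = TARGET_LABEL
    return poisoned


def poison_dataset(samples, rate=POISON_RATE, seed=123456):
    """Poison roughly `rate` of the samples; the rest are copied unchanged."""
    rng = random.Random(seed)
    return [poison_sample(s) if rng.random() < rate else dict(s) for s in samples]
```

The regex will misidentify the function name in code that starts with macros or attributes, so a real implementation would more likely rely on a parser.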

Obtain a backdoored model

CodeBERT

cd attacks/Defect_Detection/codebert

python run.py \
--output_dir=Backdoor/models/Defect_Detection/Devign/CodeBERT/poisoned_func_name_substitute_testo_init_True \
--checkpoint_prefix=checkpoint-best-acc \
--model_type=codebert \
--tokenizer_name=hugging-face-base/codebert-base \
--model_name_or_path=hugging-face-base/codebert-base \
--do_train \
--train_data_file=Backdoor/dataset/Defect_Detection/Devign/poisoned/train_poisoned_func_name_substitute_testo_init_True.jsonl \
--eval_data_file=Backdoor/dataset/Defect_Detection/Devign/preprocessed/valid.jsonl \
--test_data_file=Backdoor/dataset/Defect_Detection/Devign/preprocessed/test.jsonl \
--epoch 5 \
--block_size 400 \
--train_batch_size 32 \
--eval_batch_size 64 \
--learning_rate 2e-5 \
--max_grad_norm 1.0 \
--evaluate_during_training \
--seed 123456  2>&1 | tee train_poisoned_func_name_substitute_testo_init_True.log

Run EliBadCode on a backdoored model

cd defense/ours
python run.py

cd ../unlearning
python run.py

Hyperparameters

For different tasks, we fine-tune CodeBERT, CodeT5, and UniXcoder following the settings provided in CodeXGLUE. Specifically, for the defect detection task, the number of epochs is set to 5 and the learning rate to 2e-5. For the clone detection and code search tasks, the number of epochs is set to 2 and the learning rate to 5e-5. All models are trained using the Adam optimizer. All experiments are implemented in PyTorch 1.13.1 and Transformers 4.38.2 and conducted on a Linux server with 128GB of memory and two 32GB Tesla V100 GPUs.
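
The per-task settings above can be captured in a small lookup table. This is a sketch only: the task keys and the helper `run_py_args` are hypothetical, and the `--epoch` and `--learning_rate` flag names are taken from the defect detection command shown earlier, so the scripts for the other tasks may use different names.

```python
# Fine-tuning settings as stated in the paragraph above, keyed by task.
FINETUNE_SETTINGS = {
    "defect_detection": {"epochs": 5, "learning_rate": 2e-5},
    "clone_detection": {"epochs": 2, "learning_rate": 5e-5},
    "code_search": {"epochs": 2, "learning_rate": 5e-5},
}


def run_py_args(task):
    """Build the epoch and learning-rate command-line arguments for a task."""
    cfg = FINETUNE_SETTINGS[task]
    return ["--epoch", str(cfg["epochs"]), "--learning_rate", str(cfg["learning_rate"])]


if __name__ == "__main__":
    for task in FINETUNE_SETTINGS:
        print(task, " ".join(run_py_args(task)))
```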

Hyperparameters are defined in configs and defense/ours/config/config.yaml. Here we list several critical parameters and describe their usage.

trigger_len: Number of trigger tokens inverted during optimization.
topk: Number of candidate tokens with the highest gradients kept for each position in the trigger.
repeat_size: Number of candidate triggers generated.
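
To show how these three parameters might interact, the sketch below proposes candidate triggers in the style of a greedy coordinate gradient (GCG) search: for each trigger position it keeps the `topk` highest-scoring tokens, then builds `repeat_size` candidates by swapping one random position for one of its top tokens. This is an illustration under assumptions, not the code in defense/ours: the function names are hypothetical, and the `scores` matrix stands in for whatever gradient-derived signal the real implementation uses.

```python
import random


def topk_indices(scores, k):
    """Indices of the k largest scores, best first."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]


def generate_candidates(trigger, scores, topk, repeat_size, seed=0):
    """Propose `repeat_size` candidate triggers from the current one.

    trigger: list of token ids of length trigger_len
    scores:  scores[pos][token_id], where higher means a more promising substitution
             (in a gradient-based search this would be derived from the loss gradient)
    """
    rng = random.Random(seed)
    per_position = [topk_indices(row, topk) for row in scores]
    candidates = []
    for _ in range(repeat_size):
        pos = rng.randrange(len(trigger))          # pick one position to change
        new_token = rng.choice(per_position[pos])  # pick one of its top-k tokens
        candidate = list(trigger)
        candidate[pos] = new_token
        candidates.append(candidate)
    return candidates


if __name__ == "__main__":
    trigger = [0, 0, 0]  # trigger_len = 3, toy vocabulary of 5 tokens
    scores = [
        [0.1, 0.9, 0.3, 0.8, 0.2],
        [0.5, 0.4, 0.95, 0.1, 0.7],
        [0.0, 0.2, 0.1, 0.6, 0.9],
    ]
    for cand in generate_candidates(trigger, scores, topk=2, repeat_size=4):
        print(cand)
```

In a full search loop, each candidate would be scored against the model and the best one kept as the trigger for the next iteration.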
