-
Install Anaconda Python https://www.anaconda.com/distribution/
-
conda create --name EliBadCode python=3.8 -y
(help) -
conda activate EliBadCode
conda install pytorch==1.12.1 torchvision==0.13.1 torchaudio==0.12.1 cudatoolkit=11.6 -c pytorch -c conda-forge
pip install transformers==4.33.2
CodeXGlue dataset can be downloaded through the following links: https://github.com/microsoft/CodeXGLUE
python preprocess.py
-
Construct poison dataset
python poison.py
-
Obtain a backdoored model
\\CodeBERT cd attacks/Defect_Detection/codebert python run.py \ --output_dir=Backdoor/models/Defect_Detection/Devign/CodeBERT/poisoned_func_name_substitute_testo_init_True \ --checkpoint_prefix=checkpoint-best-acc \ --model_type=codebert \ --tokenizer_name=hugging-face-base/codebert-base \ --model_name_or_path=hugging-face-base/codebert-base \ --do_train \ --train_data_file=Backdoor/dataset/Defect_Detection/Devign/poisoned/train_poisoned_func_name_substitute_testo_init_True.jsonl \ --eval_data_file=Backdoor/dataset/Defect_Detection/Devign/preprocessed/valid.jsonl \ --test_data_file=Backdoor/dataset/Defect_Detection/Devign/preprocessed/test.jsonl \ --epoch 5 \ --block_size 400 \ --train_batch_size 32 \ --eval_batch_size 64 \ --learning_rate 2e-5 \ --max_grad_norm 1.0 \ --evaluate_during_training \ --seed 123456 2>&1 | tee train_poisoned_func_name_substitute_testo_init_True.log
-
Run
EliBadCode
on a backdoored modelcd defense/ours python run.py cd ../unlearning python run.py
For different tasks, we fine-tune CodeBERT, CodeT5, and UniXcoder according to the different settings provided in CodeXGLUE. Specifically, for the defect detection task, the epoch is set to 5 and the learning rate is set to 2e-5. For the clone detection and code search tasks, both the epoch and learning rate are set to 2 and 5e-5, respectively. All the models are trained using the Adam optimizer. All of our experiments are implemented in PyTorch 1.13.1 and Transformers 4.38.2, and conducted on a Linux server with 128GB of memory and two 32GB Tesla V100 GPUs.
Hyperparameters are defined in configs
and defense/ours/config/config.yaml
. Here we list several critical parameters and describe their usages.
trigger_len
: Number of tokens inverted during optimizationtopk
: Top k candidate tokens with the highest gradients for each position in the trigger.repeat_size
: The number of candidate triggers generated.