Phenomena, Mechanism, and Mitigation
Annotation: The data annotated in our study is stored in the annotation folder.
Datasets: To begin, download the evaluation datasets from datasets and extract them into the /dataset folder. In this experiment, we use the CoderEval dataset.
Repository: Next, download the practical repositories from CoderEval and extract them into the /repos and /CoderEval/repos folders.
Models: In the mitigation experiment, we employ CodeGen, Pangu-α, ChatGPT, DeepSeekCoder, CodeLlama, and StarCoder2. We obtain the open-source models from HuggingFace and run the experiments with them locally; the closed-source model ChatGPT is accessed through the OpenAI API (a minimal query sketch is shown below).
Experimental Results: The experimental results are stored in the /testing-CoderEval/model_name folder. The results in the file prediction_r0.jsonl are based on the Raw method, and the results in the file prediction_r1.jsonl are based on the RAG-based method.
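The prediction files are in JSON Lines format (one JSON object per line), so they can be inspected with a few lines of Python. The path below assumes the CodeGen results folder; adjust it to the backbone you are interested in:

import json

# Peek at the first record of a prediction file to see which fields it contains.
with open("testing-CoderEval/codegen/prediction_r0.jsonl") as f:
    first_record = json.loads(next(f))
print(sorted(first_record.keys()))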
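For the closed-source model ChatGPT mentioned under Models, queries go through the OpenAI API. The following is only a minimal sketch; the model name, prompt, and decoding settings are placeholders, not the exact configuration used in our experiments:

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

prompt = 'def add(a, b):\n    """Return the sum of a and b."""\n'  # toy prompt
response = client.chat.completions.create(
    model="gpt-3.5-turbo",  # placeholder model name
    messages=[{"role": "user", "content": "Complete the following Python function:\n" + prompt}],
    temperature=0.0,
)
print(response.choices[0].message.content)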
⚠ If you want to reproduce the results from scratch, please follow these steps:
Set-Up: Before starting the following process, it's essential to set up your environment by installing the necessary dependencies listed in the requirements.txt file. To install these dependencies, activate your Python virtual environment and run:
pip install -r requirements.txt

In the mitigation experiment section of our paper, we use two methods: the Raw method and the RAG-based method. If you want to verify and try this experiment, you can refer to the following inference example on CodeGen-mono-350M:
python eval_original.py \
--model=codegen \
--max_len=1024 \
--batch=4

For other backbone LLMs, you can refer to the structure of MODEL_FACTORY to customise the model you need to use (an illustrative extension is sketched after the init_codegen listing below). The structure is as follows:
MODEL_FACTORY = {
    "codegen": ("Salesforce/codegen-350M-mono", init_codegen, 2048)
}

For the settings of the function init_codegen, you can refer to the file model.py:
from transformers import AddedToken, AutoModelForCausalLM, AutoTokenizer


def init_codegen(
    model_name="Salesforce/codegen-350M-mono",
    checkpoint=None,
    additional_tokens=None,
    device="cuda"
):
    # Left padding so that, in batched generation, prompts end right before the generated tokens.
    tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=False, padding_side="left")
    tokenizer.pad_token_id = tokenizer.eos_token_id

    # Optionally register extra special tokens before loading the model.
    additional_tokens = [] if additional_tokens is None else additional_tokens
    if len(additional_tokens) > 0:
        tokenizer.add_tokens([AddedToken(t, rstrip=False, lstrip=False) for t in additional_tokens])

    if checkpoint is None:
        # Load the base model from HuggingFace and resize its embedding table
        # to match the (possibly extended) tokenizer vocabulary.
        model = AutoModelForCausalLM.from_pretrained(model_name)
        model.resize_token_embeddings(len(tokenizer))
    else:
        # Load a locally fine-tuned checkpoint instead of the base model.
        model = AutoModelForCausalLM.from_pretrained(checkpoint)

    model.to(device)
    return model, tokenizer
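For reference, loading a backbone through this function and running a single completion could look roughly like the following; the prompt and generation parameters are illustrative only and are not the settings used by eval_original.py:

import torch

# Illustrative only: load the model/tokenizer and generate one greedy completion.
model, tokenizer = init_codegen(device="cuda" if torch.cuda.is_available() else "cpu")

prompt = 'def fibonacci(n):\n    """Return the n-th Fibonacci number."""\n'
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=64, pad_token_id=tokenizer.pad_token_id)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))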
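As mentioned above, other backbones are added by registering a new entry in MODEL_FACTORY together with a matching init function. The entry below is only an illustrative sketch: init_codellama and the maximum length of 4096 are assumptions, not code from the repository.

MODEL_FACTORY = {
    "codegen": ("Salesforce/codegen-350M-mono", init_codegen, 2048),
    # Hypothetical entry: write an init_codellama analogous to init_codegen above.
    "codellama": ("codellama/CodeLlama-7b-hf", init_codellama, 4096),
}

The new backbone should then be selectable through the --model flag shown earlier (presumably --model=codellama for this example).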