
LLM Hallucinations in Practical Code Generation: Phenomena, Mechanism, and Mitigation

Annotation: The data annotated in our study is stored in the annotation folder.


Datasets: To begin, download the evaluation datasets from datasets and extract them into the /dataset folder. In our experiments, we use the CoderEval dataset.

Repository: Download the practical repositories from CoderEval and extract them into the /repos and /CoderEval/repos folders.

Models: In the mitigation experiment, we employ CodeGen, Pangu-α, ChatGPT, DeepSeekCoder, CodeLlama, and StarCoder2. We obtain the open-source models from HuggingFace and run the experiments locally; for the closed-source model ChatGPT, we run experiments through the OpenAI API, as sketched below.
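As an illustration, a ChatGPT call through the OpenAI API might look like the following minimal sketch. The model name, prompt, and client usage here are assumptions for illustration, not the exact code used in the paper:

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Hypothetical prompt; the actual prompts are defined in this repository.
prompt = "Complete the following Python function:\ndef add(a, b):"

response = client.chat.completions.create(
    model="gpt-3.5-turbo",  # assumed model name
    messages=[{"role": "user", "content": prompt}],
    temperature=0.0,
)
print(response.choices[0].message.content)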

Experimental Results: The experimental results are stored in the /testing-CoderEval/model_name folders. The results in the file prediction_r0.jsonl are based on the Raw method, and the results in the file prediction_r1.jsonl are based on the RAG-based method. A small snippet for inspecting these files is shown below.
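Since the result files are in JSON Lines format, each line holds one prediction record. A minimal inspection sketch follows; the path is illustrative and the field names are not assumed, so the snippet simply prints the keys of the first record to reveal the actual schema:

import json

# Hypothetical path; substitute the model folder you want to inspect.
path = "testing-CoderEval/codegen/prediction_r0.jsonl"

with open(path) as f:
    for line in f:
        record = json.loads(line)
        print(record.keys())  # discover the actual field names
        break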


⚠ If you want to reproduce the results from scratch, please follow these steps:

Set-Up: Before starting, set up your environment by installing the dependencies listed in the requirements.txt file. To install them, activate your Python virtual environment and run:

pip install -r requirements.txt

Mitigation Experiment

In the mitigation experiment section of our paper, we use two methods: the Raw method and the RAG-based method. If you want to try this experiment yourself, you can refer to the following inference example on CodeGen-mono-350M:

python eval_original.py \
    --model=codegen \
    --max_len=1024 \
    --batch=4

For other backbone LLMs, you can follow the structure of MODEL_FACTORY to register the model you need. The structure is as follows:

MODEL_FACTORY = {
    "codegen": ("Salesforce/codegen-350M-mono", init_codegen, 2048)
}
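For example, a new entry could reuse the same generic init function, since other backbones such as DeepSeekCoder are also AutoModelForCausalLM checkpoints on HuggingFace. The extra entry below is a hypothetical illustration, not part of the repository:

MODEL_FACTORY = {
    "codegen": ("Salesforce/codegen-350M-mono", init_codegen, 2048),
    # Hypothetical extra entry: (illustrative key, HF checkpoint, context length).
    "deepseek": ("deepseek-ai/deepseek-coder-1.3b-base", init_codegen, 4096),
}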

For the definition of the init_codegen function, you can refer to the file model.py:

from transformers import AutoModelForCausalLM, AutoTokenizer
from tokenizers import AddedToken


def init_codegen(
    model_name="Salesforce/codegen-350M-mono",
    checkpoint=None,
    additional_tokens=None,
    device="cuda"
):
    # Left padding keeps batched prompts flush against the generated tokens.
    tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=False, padding_side="left")
    tokenizer.pad_token_id = tokenizer.eos_token_id
    additional_tokens = [] if additional_tokens is None else additional_tokens
    if len(additional_tokens) > 0:
        tokenizer.add_tokens([AddedToken(t, rstrip=False, lstrip=False) for t in additional_tokens])
    if checkpoint is None:
        # Load pretrained weights and grow the embedding table to cover any added tokens.
        model = AutoModelForCausalLM.from_pretrained(model_name)
        model.resize_token_embeddings(len(tokenizer))
    else:
        # A fine-tuned checkpoint already has embeddings sized for its tokenizer.
        model = AutoModelForCausalLM.from_pretrained(checkpoint)

    model.to(device)

    return model, tokenizer
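A minimal usage sketch, assuming a CUDA device and an illustrative prompt:

model, tokenizer = init_codegen(device="cuda")
inputs = tokenizer("def fibonacci(n):", return_tensors="pt").to("cuda")
outputs = model.generate(
    **inputs,
    max_new_tokens=64,
    pad_token_id=tokenizer.eos_token_id,  # silence the missing-pad-token warning
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))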
