This is a tool for generating and evaluating FIM (fill-in-the-middle) code completion tasks on datasets in four languages: Java, Python, C++, and JavaScript.
For all four languages, the model is given the code above and below a masked block and must predict the missing middle. Evaluation metrics (illustrated in the sketch after this list) include:
- Exact Match
- BLEU-4
- CodeBLEU
- Length (Pred/Ref)
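For illustration, the two simpler metrics could be computed as in the sketch below. This is a minimal sketch, not the tool's actual implementation: the `prediction`/`reference` field names are assumptions, and BLEU-4/CodeBLEU are omitted since they require n-gram and syntax-aware matching.

```python
# Minimal sketch of Exact Match and Length (Pred/Ref).
# Assumes each sample is a dict with hypothetical "prediction" and
# "reference" keys; the tool's real schema may differ.
def exact_match(samples):
    """Fraction of predictions identical to the reference (whitespace-stripped)."""
    hits = sum(s["prediction"].strip() == s["reference"].strip() for s in samples)
    return hits / len(samples)

def length_ratio(samples):
    """Average ratio of prediction length to reference length, in characters."""
    return sum(len(s["prediction"]) / max(len(s["reference"]), 1) for s in samples) / len(samples)

samples = [
    {"prediction": "return a + b;", "reference": "return a + b;"},
    {"prediction": "return a - b;", "reference": "return a + b;"},
]
print(exact_match(samples))   # 0.5
print(length_ratio(samples))  # 1.0
```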
To run the model inference and evaluation code, you'll need the following environment setup:
- Python 3.8 or higher
- PyTorch 2.1.0 or higher
- sentencepiece 0.2.0 or higher
- transformers 4.34.1 or higher (if running inference via the transformers library)
Please ensure all dependencies are installed using the following commands:
conda create -n aixcoder-evaluation python=3.11
conda activate aixcoder-evaluation
pip install -r requirements.txt

requirements.txt lists all necessary libraries and their versions.
To achieve faster inference speeds, especially for large models, we recommend installing flash attention. Flash attention is an optimized attention mechanism that significantly reduces computation time for transformer-based models without sacrificing accuracy.
Before proceeding, ensure your environment meets the CUDA requirements, as flash attention leverages GPU acceleration. Follow these steps to install flash attention:
git clone git@github.com:Dao-AILab/flash-attention.git
cd flash-attention
MAX_JOBS=8 python setup.py install

After setup, extract the evaluation datasets:

cd datasets
tar zxvf *.tar.gz

Here's an example of a generation task:

python run_inference.py --model aiXcoder/aixcoder-7b-base --language java
- `--model`: model name on Hugging Face, e.g.:
- deepseek-ai/deepseek-coder-6.7b-base
- aiXcoder/aixcoder-7b-base
- codellama/CodeLlama-7b-hf
- bigcode/starcoder2-7b
  You can also set this to the path of locally downloaded model weights.
- `--language`: dataset language. Supports four languages: Python, Java, C++, and JavaScript. You can set a single language or several at once, separated by spaces.
- `--output_dir`: output path for the generated results; by default they are saved in the `output_dir` folder in the current directory.
- `--device`: the CUDA device to use; default `cuda`.
- `--torch_dtype`: precision; default `bf16`, can be set to `fp32`, `fp16`, or `bf16`.
- `--attn_implementation`: whether to use FlashAttention; default `True`. If your environment does not support FlashAttention, set this to `False`. (These options map onto standard transformers settings; see the sketch after this list.)
- `--gen_len`: maximum generation length (`max_new_tokens`); default 512.
- `--max_len`: maximum context length the model can handle; default 16384.
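For reference, the sketch below shows how these flags would typically be wired up with the transformers library. It assumes a transformers version that accepts the `attn_implementation` argument (older 4.34.x releases used `use_flash_attention_2=True` instead), and the prompt string is a placeholder since the FIM template is model-specific.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "aiXcoder/aixcoder-7b-base"        # --model
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,               # --torch_dtype bf16
    attn_implementation="flash_attention_2",  # --attn_implementation True
).to("cuda")                                  # --device cuda

# "..." is a placeholder: each model defines its own FIM prefix/suffix template.
inputs = tokenizer("...", return_tensors="pt").to("cuda")
outputs = model.generate(**inputs, max_new_tokens=512)  # --gen_len 512
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```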
Here's an example of an evaluation task:

python run_evaluate.py
- `--language`: the language to evaluate. Supports four languages: Python, Java, C++, and JavaScript. You can set a single language or several at once, separated by spaces.
- `--result_path`: output path for the evaluation results; by default they are stored in the `output_dir` folder in the current directory. Two files are generated, with the suffixes `_scored.jsonl` and `_statistics.txt`. The scores for each task type and the averaged overall results are recorded in `_statistics.txt`.
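As an illustration, the per-example scores in a `_scored.jsonl` file could be aggregated like this. The file path and the `task_type`/`exact_match` field names are assumptions for the sketch; check a generated file for the actual keys.

```python
import json
from collections import defaultdict

# Hypothetical aggregation over a *_scored.jsonl file; the path and the
# "task_type"/"exact_match" keys are assumptions, not the tool's guaranteed schema.
scores = defaultdict(list)
with open("output_dir/java_scored.jsonl", encoding="utf-8") as f:
    for line in f:
        record = json.loads(line)
        scores[record["task_type"]].append(record["exact_match"])

for task_type, values in sorted(scores.items()):
    print(f"{task_type}: mean exact match = {sum(values) / len(values):.4f}")
```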