IdentityChain: a framework for evaluating Code Large Language Models (Code LLMs). Official implementation of the ICLR 2024 paper Beyond Accuracy: Evaluating Self-Consistency of Code Large Language Models with IdentityChain.
The IdentityChain framework evaluates NL-to-PL (Code Generation) Accuracy, PL-to-NL (Code Summarization) Accuracy, and Self-Consistency across the two tasks. It also provides a fine-grained analysis of the model's performance, so you can pinpoint the exact step and problem where the model violates self-consistency.
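At a high level, IdentityChain alternates the two tasks: starting from a specification NL_0, the model generates a program PL_0, summarizes it back into NL_1, regenerates PL_1, and so on; self-consistency asks whether every PL_i preserves the behavior of PL_0. Below is a minimal sketch of this loop, with the generation functions and the equivalence check passed in as placeholder callables (in IdentityChain itself, equivalence is judged by matching test outputs):

```python
# A minimal sketch of the identity-chain loop. The two generation
# functions and the equivalence check are placeholder callables here,
# not the library's actual API.
from typing import Callable

def run_identity_chain(
    nl_0: str,
    nl_to_pl: Callable[[str], str],            # code generation
    pl_to_nl: Callable[[str], str],            # code summarization
    semantically_equal: Callable[[str, str], bool],
    chain_length: int = 5,
):
    pl_0 = nl_to_pl(nl_0)                      # NL_0 -> PL_0
    chain = [(nl_0, pl_0)]
    pl = pl_0
    for _ in range(chain_length):
        nl = pl_to_nl(pl)                      # PL_i -> NL_(i+1)
        pl = nl_to_pl(nl)                      # NL_(i+1) -> PL_(i+1)
        chain.append((nl, pl))
        if not semantically_equal(pl_0, pl):
            break                              # self-consistency violated here
    return chain
```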
Create and Activate a Conda Environment.
conda create -n idchain python=3.10
conda activate idchain
Install from PyPI with all Dependencies.
pip3 install identitychain
pip3 install -r requirements.txt
Install from Source with all Dependencies.
git clone https://github.com/marcusm117/IdentityChain.git
cd IdentityChain
make develop
Before running the self-consistency evaluation, make sure that one of the following is satisfied:
- Your model is an Instruction-tuned Code LLM trained on both NL-to-PL and PL-to-NL tasks.
- Your model is a Foundation Code LLM trained on both completion and fill-in-the-middle (FIM) tasks, so the PL-to-NL direction can be phrased as a FIM query (see the sketch below).
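For example, a FIM-capable foundation model such as StarCoder can be asked to summarize code by masking out the docstring and letting the model fill it in. The snippet below is a minimal sketch of that idea using StarCoder's FIM sentinel tokens; the exact prompt format IdentityChain uses may differ.

```python
# A minimal sketch (not IdentityChain's exact prompt format): ask a
# FIM-trained model to "summarize" code by filling in a masked docstring.
# <fim_prefix>/<fim_suffix>/<fim_middle> are StarCoder's FIM sentinels.
def build_fim_summarization_prompt(signature: str, body: str) -> str:
    prefix = f'{signature}\n    """'   # everything before the docstring
    suffix = f'"""\n{body}'            # everything after the docstring
    return f"<fim_prefix>{prefix}<fim_suffix>{suffix}<fim_middle>"

prompt = build_fim_summarization_prompt(
    "def add(a, b):",
    "    return a + b",
)
# Feeding `prompt` to a FIM-trained model makes it generate the missing
# docstring, i.e., a natural-language summary of the function body.
```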
To evaluate your model using IdentityChain, you need to prepare the following:
- An evaluation dataset from one of the following (or one of your own in the same format):
- An NL-to-PL prompt for your model
- A PL-to-NL prompt for your model
- An NL-to-PL generation function for your model (illustrative shapes of the two generation functions are sketched after this list)
- A PL-to-NL generation function for your model
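The two generation functions are thin wrappers around your model: each takes a prompt and returns the model's raw output string. The shapes below are an illustrative assumption, not the library's exact interface; the example scripts show the real signatures.

```python
# Illustrative shapes of the two generation functions (an assumption;
# see the example scripts for the exact signatures IdentityChain expects).
def nl_to_pl_generate(nl_prompt: str) -> str:
    """Given an NL specification, return the generated program."""
    raise NotImplementedError

def pl_to_nl_generate(pl_prompt: str) -> str:
    """Given a program, return the generated NL summary (docstring)."""
    raise NotImplementedError
```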
See run_identity_chain_openai.py for an example of how to use IdentityChain to evaluate OpenAI models.
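For instance, an NL-to-PL generation function backed by an OpenAI chat model could look roughly like this; the prompt wording and model name are illustrative, and run_identity_chain_openai.py shows the actual implementation.

```python
# A rough sketch of an NL-to-PL generation function for an OpenAI chat
# model; see run_identity_chain_openai.py for the actual implementation.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def nl_to_pl_generate(nl_prompt: str, model: str = "gpt-4") -> str:
    response = client.chat.completions.create(
        model=model,
        temperature=0.0,  # greedy decoding for reproducible chains
        messages=[
            {"role": "system", "content": "You are a helpful coding assistant."},
            {"role": "user", "content": nl_prompt},
        ],
    )
    return response.choices[0].message.content
```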
See run_identity_chain_google.py for an example of how to use IdentityChain to evaluate Google models.
See run_identity_chain_huggingface.py for an example of how to use IdentityChain to evaluate HuggingFace open-source models. This example script already includes the following models:
- CodeLlama-Instruct-hf (7B, 13B, 34B, 70B)
- CodeLlama-hf (7B, 13B, 34B, 70B)
- StarChat-Beta
- StarCoder
- StarCoderPlus
- StarCoderBase (1B, 3B, 7B, 15B)
- DeepSeekCoder-Instruct (1.3B, 6.7B, 33B, 7B-v1.5)
- DeepSeekCoder (1.3B, 6.7B, 33B, 7B-v1.5)
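Loading one of these checkpoints and generating from it follows the standard transformers pattern; a rough sketch, with the model name and decoding settings chosen only as examples:

```python
# A rough sketch of greedy generation with a HuggingFace checkpoint;
# the model name and decoding settings here are just examples.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "bigcode/starcoderbase-1b"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.float16, device_map="auto"
)

def generate(prompt: str, max_new_tokens: int = 512) -> str:
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    outputs = model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=False)
    # strip the prompt tokens, keep only the newly generated continuation
    return tokenizer.decode(
        outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
    )
```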
Use run_identity_chain.sh to execute run_identity_chain_openai.py or run_identity_chain_huggingface.py, which conducts several IdentityChain evaluations in a batch. Make sure that you modify the following before running the script:
- export CUDA_VISIBLE_DEVICES=0 to specify the local GPU device you want to use
- export HF_HOME=YOUR_OWN_PATH/huggingface to specify your own HuggingFace home path, where model checkpoints will be cached
- export IDENTITY_CHAIN_HOME=YOUR_OWN_PATH/IdentityChain to specify your own IdentityChain home path
- other parameters in the script for your own needs
Then run the script:
cd examples
bash run_identity_chain.sh
This script will create a temporary folder tmp under your IdentityChain home path and store the results of the IdentityChain evaluation in this folder as a jsonl file. For example, tmp/starcoderbase-1b/IDChain_starcoderbase-1b_tmp0.0g_len5_pb_all_m_v1_EvalPlus-Mini-v0.1.6_reformatted.jsonl.
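Each line of this jsonl file records the evaluation of one problem's chain. A quick way to inspect it (the per-record field names are not documented here, so print the keys rather than assuming a schema):

```python
# Quick inspection of the output jsonl; each line is one problem's chain.
# Field names are not guaranteed -- print the keys to see the real schema.
import json

path = "tmp/starcoderbase-1b/IDChain_starcoderbase-1b_tmp0.0g_len5_pb_all_m_v1_EvalPlus-Mini-v0.1.6_reformatted.jsonl"
with open(path) as f:
    records = [json.loads(line) for line in f]

print(len(records), "problems")
print(records[0].keys())  # inspect the per-problem fields
```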
Use analyze_results.py to analyze the results of the IdentityChain evaluation. It generates an xlsx file containing the following information:
- The SC (Self-Consistency) and SSC (Strong Self-Consistency) scores of the model at each self-iteration step. Note that SSC_0 is just Pass@1 (a hedged formalization is sketched after this list)
- The aggregated TOM score (also BLEU and CodeBLEU) at each step for the following 4 types of results: Pass-Pass, Pass-Fail, Fail-Fail, Fail-Pass
- The TOM score (also BLEU and CodeBLEU) trajectory at each self-iteration step for each sample in the evaluation set
- The raw test case outputs at each self-iteration step
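As a rough formalization (an assumption on our part, consistent with SSC_0 = Pass@1; see the paper for the exact definitions): over an evaluation set $D$, writing $pl_0^d, \ldots, pl_n^d$ for the programs generated along problem $d$'s chain,

$$
\mathrm{SSC}_n = \frac{1}{|D|} \sum_{d \in D} \mathbb{1}\!\left[\, pl_0^d, \ldots, pl_n^d \text{ all pass } d\text{'s test cases} \,\right],
\qquad
\mathrm{SC}_n = \frac{1}{|D|} \sum_{d \in D} \mathbb{1}\!\left[\, pl_n^d \equiv pl_0^d \,\right],
$$

where $\equiv$ denotes semantic equivalence as judged by matching test outputs (the TOM check). Setting $n = 0$ collapses SSC_0 to Pass@1.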
cd ../scripts
python analyze_results.py --input_path ../tmp/starcoderbase-1b/IDChain_starcoderbase-1b_tmp0.0g_len5_pb_all_m_v1_EvalPlus-Mini-v0.1.6_reformatted.jsonl --chain_length 5
The analyzed results will give you a sense of the model's overall performance, and the TOM score trajectory will help you pinpoint the exact step where the model makes a mistake.
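If you want to post-process the analyzed scores yourself, the generated xlsx can be loaded with pandas; the file path and sheet layout below are hypothetical, so inspect the sheet names first.

```python
# Load the analyzed results with pandas. The path and sheet layout are
# assumptions -- list the sheet names to see what's actually there.
import pandas as pd

xlsx_path = "../tmp/starcoderbase-1b/results.xlsx"  # hypothetical path
sheets = pd.read_excel(xlsx_path, sheet_name=None)  # dict of DataFrames
print(list(sheets.keys()))                          # available sheets
for name, df in sheets.items():
    print(name, df.shape)
```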
Use browse_results.py to browse the results of IdentityChain evaluation. You can use this script to manually examine and study the mistakes made by the model for specific samples.
cd ../scripts
python browse_results.py --input_path ../tmp/starcoderbase-1b/IDChain_starcoderbase-1b_tmp0.0g_len5_pb_all_m_v1_EvalPlus-Mini-v0.1.6_reformatted.jsonl --chain_length 5 --start 0
We use a Makefile as a command registry:
- make format: autoformat this library with black
- make lint: perform static analysis of this library with black and flake8
- make annotate: run type checking using mypy
- make test: run automated tests
- make check: check assets for packaging
Make sure that make lint, make test, and make check all pass locally before submitting a Pull Request.