This is the official GitHub repository for FLASK: Fine-grained Language Model Evaluation Based on Alignment Skill Sets.
[July 21, 2023] Initial release: We released the first version of FLASK! Also check out the interactive demo.
FLASK is a task-agnostic evaluation protocol for language models that uses fine-grained, instance-wise skill sets as evaluation metrics.
This repository contains the model-based evaluation implementation of FLASK. For guidelines on human-based evaluation, refer to the paper.
Since our model-based evaluation relies on GPT-4, you need to add your OpenAI API key to the `api_keys` list in `openai_info/api_info.json`. Note that we also support multiple keys for faster evaluation.
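A minimal sketch of what `openai_info/api_info.json` might look like, assuming `api_keys` is the only required field (check the file shipped in the repo for the exact schema):

```json
{
  "api_keys": [
    "sk-your-first-key",
    "sk-your-second-key"
  ]
}
```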
To run inference with a model on the FLASK evaluation set, run the following command:
```bash
cd model_output
python inference.py --model-path {model_path} --model-id {model_id} --question-file ../input_data/flask_evaluation_raw.jsonl --answer-file {output_directory} --num-gpus {num_gpus}
```
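For example, a run against a hypothetical local Vicuna-7B checkpoint might look like this (the paths and model ID below are placeholders, not files shipped with this repo):

```bash
cd model_output
python inference.py \
  --model-path ../checkpoints/vicuna-7b \
  --model-id vicuna-7b \
  --question-file ../input_data/flask_evaluation_raw.jsonl \
  --answer-file outputs/vicuna-7b.jsonl \
  --num-gpus 1
```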
We provide the inference results of various LLMs in the `model_output/outputs` directory. For inference on FLASK-Hard, simply replace the `--question-file` argument with `../evaluation_set/flask_hard_evaluation.jsonl`, as shown below.
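A minimal sketch of the FLASK-Hard invocation; all other arguments keep the same placeholders as in the command above:

```bash
cd model_output
python inference.py --model-path {model_path} --model-id {model_id} --question-file ../evaluation_set/flask_hard_evaluation.jsonl --answer-file {output_directory} --num-gpus {num_gpus}
```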
After inference, evaluate the model with the FLASK evaluation protocol by running the following command:
```bash
cd gpt_review
python gpt4_eval.py -q '../evaluation_set/flask_evaluation.jsonl' -a {answer_file} -o {output_review_file} -e {output_error_file}
```
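For instance, reusing the hypothetical answer file from the inference example above (the output filenames are illustrative):

```bash
cd gpt_review
python gpt4_eval.py \
  -q '../evaluation_set/flask_evaluation.jsonl' \
  -a ../model_output/outputs/vicuna-7b.jsonl \
  -o outputs/vicuna-7b_review.jsonl \
  -e outputs/vicuna-7b_error.jsonl
```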
Note that the error file records instances that fail due to OpenAI API rate limits. If an error file is created during evaluation, you can rerun only the failed instances with the following command:
```bash
cd gpt_review
python gpt4_eval.py -q {output_error_file} -a {answer_file} -o {output_review_file} -e {new_output_error_file}
```
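Since rate-limit failures can recur on the retry itself, one option is to loop until no error file is produced. A minimal bash sketch, assuming the hypothetical filenames from the example above (whether `gpt4_eval.py` appends to or overwrites the review file is up to the script, so treat this as a sketch):

```bash
# Retry failed instances until no new error file is written.
ERR=outputs/vicuna-7b_error.jsonl
while [ -s "$ERR" ]; do
  NEW_ERR="${ERR%.jsonl}_retry.jsonl"
  python gpt4_eval.py -q "$ERR" -a ../model_output/outputs/vicuna-7b.jsonl \
    -o outputs/vicuna-7b_review.jsonl -e "$NEW_ERR"
  ERR="$NEW_ERR"
done
```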
We provide the GPT-4 evaluation results of various models in the `gpt_review/outputs` directory.
After evaluation, FLASK enables fine-grained analysis by skill, domain, and difficulty level.
To analyze the performance for each skill, run the following command:

```bash
cd gpt_review
python aggregate_skill.py -m {output_review_file}
```
To analyze the performance for each skill depending on the difficulty, run the following command:

```bash
cd gpt_review
python aggregate_difficulty_skill.py -m {output_review_file}
```
To analyze the performance for each domain, run the following command:

```bash
cd gpt_review
python aggregate_domain.py -m {output_review_file}
```
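All three scripts take the same review file via `-m`, so a full analysis pass over the hypothetical review file from above looks like:

```bash
cd gpt_review
python aggregate_skill.py -m outputs/vicuna-7b_review.jsonl
python aggregate_difficulty_skill.py -m outputs/vicuna-7b_review.jsonl
python aggregate_domain.py -m outputs/vicuna-7b_review.jsonl
```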
We also provide the implementation of FLASK's automatic metadata annotation process.
As with model-based evaluation, metadata annotation relies on GPT-4, so your OpenAI API key must be added to the `api_keys` list in `openai_info/api_info.json` (see above). Multiple keys are again supported for faster annotation.
For domain metadata annotation, run the following command:
```bash
cd metadata_annotation/domain
python domain_annotation.py -o {output_domain_annotation} -e {output_domain_annotation_error}
```
We define 10 different domains by revising Wikipedia's domain categorization. As with evaluation, the error file records instances that fail due to OpenAI API rate limits; if an error file is created after annotation, you can rerun only the failed instances.
For skillset annotation, run the following command:
```bash
cd metadata_annotation/skillset
python skillset_annotation.py -q {output_domain_annotation} -o {output_skill_annotation} -e {output_skill_annotation_error}
```
We define 12 skills for fine-grained evaluation of LLMs.
For difficulty level annotation, run the following command:
```bash
cd metadata_annotation/difficulty
python difficulty_annotation.py -q {output_skill_annotation} -o {output_difficulty_annotation} -e {output_difficulty_annotation_error}
```
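Putting the three annotation steps together, each step consumes the previous step's output (the intermediate filenames below are placeholders):

```bash
cd metadata_annotation/domain
python domain_annotation.py -o ../domain.jsonl -e ../domain_error.jsonl
cd ../skillset
python skillset_annotation.py -q ../domain.jsonl -o ../skill.jsonl -e ../skill_error.jsonl
cd ../difficulty
python difficulty_annotation.py -q ../skill.jsonl -o ../difficulty.jsonl -e ../difficulty_error.jsonl
```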
We define 5 different difficulty levels depending on the domain knowledge required. Note that after Step 4, each row of the `{output_difficulty_annotation}` file should have the same format as the rows of `evaluation_set/flask_evaluation.jsonl`.
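A quick way to sanity-check that format match is to compare the JSON keys of the annotated rows against the evaluation set. A minimal Python sketch (the filename `difficulty.jsonl` is the placeholder from the pipeline sketch above):

```python
import json

def first_row_keys(path):
    # Read the first JSONL row and return its set of field names.
    with open(path) as f:
        return set(json.loads(next(f)).keys())

annotated = first_row_keys("metadata_annotation/difficulty.jsonl")
reference = first_row_keys("evaluation_set/flask_evaluation.jsonl")

# The annotated rows should expose the same fields as the evaluation set.
if annotated == reference:
    print("Format matches.")
else:
    print("Missing fields:", reference - annotated)
    print("Unexpected fields:", annotated - reference)
```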
Check out the interactive demo!
Seonghyeon Ye*, Doyoung Kim*, Sungdong Kim, Hyeonbin Hwang, Seungone Kim, Yongrae Jo, James Thorne, Juho Kim, and Minjoon Seo.
(* denotes equal contribution)
We release the evaluation code of FLASK. We also plan to release a pip version of FLASK, including analysis code, in the near future. Stay tuned!
Please cite our work if you use the data or code in this repo:
```bibtex
@misc{ye2023flask,
      title={FLASK: Fine-grained Language Model Evaluation based on Alignment Skill Sets},
      author={Seonghyeon Ye and Doyoung Kim and Sungdong Kim and Hyeonbin Hwang and Seungone Kim and Yongrae Jo and James Thorne and Juho Kim and Minjoon Seo},
      year={2023},
      eprint={2307.10928},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
```