Add ability to evaluate multiple choice tasks #5047
Conversation
Nice work! I think JSON would be the best choice for all those kinds of benchmarks. I'd also not include it in llama.cpp directly, but it could be an optional dependency added behind a #define flag, so only the actual benchmarking tool(s) have it included.
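Purely as an illustration of what that suggestion could look like, the sketch below keeps nlohmann's json out of the core build and confines it to the benchmarking tool behind a compile-time guard. The macro name `LLAMA_PERPLEXITY_USE_JSON` is made up for this sketch and is not an existing llama.cpp build flag.

```cpp
// Minimal sketch of the optional-dependency idea: JSON loading is only
// compiled in when the (hypothetical) LLAMA_PERPLEXITY_USE_JSON flag is set.
#include <cstdio>
#include <fstream>
#include <string>

#ifdef LLAMA_PERPLEXITY_USE_JSON
#include <nlohmann/json.hpp>

static bool load_tasks_from_json(const std::string & fname) {
    std::ifstream in(fname);
    if (!in) return false;
    nlohmann::json data = nlohmann::json::parse(in);
    // ... convert `data` into the tool's internal task representation ...
    (void) data;
    return true;
}
#else
// Stub used when the benchmarking tool is built without JSON support.
static bool load_tasks_from_json(const std::string &) {
    fprintf(stderr, "JSON support was not compiled in\n");
    return false;
}
#endif

int main(int argc, char ** argv) {
    return argc > 1 && load_tasks_from_json(argv[1]) ? 0 : 1;
}
```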
The same implementation can be used for HellaSwag as well, so I converted a HellaSwag validation dataset to the binary format used here and tested with that. The score is only around 50, so something is not quite right.
I know it works because if I convert the HellaSwag validation data to the binary format used in the truthful_qa_score() function I get the exact same result as from the hellaswag_score() function. But I guess the questions are tricky and the way I have done the combination of question + answer is very likely not the best. The TruthfulQA validation dataset contains 817 questions, with a random-chance result of around 19%. With this version I get 29.1% for Mistral-7B and 55.2% for Mistral-7B-Instruct-v0.2. The HF leaderboard results for these two models are 42.2% and 68.3%, respectively.
I knew someone would bring up JSON :-) The code that reads the binary data is 24 lines in 3 functions. In comparison, nlohmann's JSON library is a much heavier dependency. The way it is now it handles ARC, MMLU, TruthfulQA, and HellaSwag. I have posted test/validation datasets in https://huggingface.co/datasets/ikawrakow/validation-datasets-for-llama.cpp. It could theoretically also handle Winogrande, but I would leave that one to a separate implementation due to the slightly different probability evaluation (it uses partial evaluation). Converting to JSON is trivial: just copy/paste the 24 LOC into a small standalone converter.
The binary format is great, no need for json
I had forgotten that MSVC does not make constexpr's available inside a lambda.
It was no complaint. My reasoning is that a binary format cannot be viewed directly; it needs a special viewer that someone has to write. People might want to use their own custom benchmarks, for example. JSON means anyone can just start working on it, while bin means you have to have a much higher level of competence (currently no compiler/converter is available?).
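To make the comparison concrete, here is a minimal sketch of what a JSON task record and its parsing with nlohmann's json could look like. The field names (`question`, `answers`, `correct`) are assumptions for this example only; they are not the schema used by the tool or by the converter gist mentioned later in this thread.

```cpp
// Minimal sketch, not the tool's actual format: parse a hypothetical JSON task
// record into a simple struct using nlohmann::json (an optional dependency).
#include <nlohmann/json.hpp>
#include <iostream>
#include <string>
#include <vector>

struct McTask {
    std::string              question;
    std::vector<std::string> answers;
    std::vector<int>         correct;  // indices of correct answers
};

int main() {
    const char * text = R"([
        {"question": "2 + 2 = ?", "answers": ["3", "4", "5"], "correct": [1]}
    ])";
    std::vector<McTask> tasks;
    for (const auto & j : nlohmann::json::parse(text)) {
        tasks.push_back({j["question"].get<std::string>(),
                         j["answers"].get<std::vector<std::string>>(),
                         j["correct"].get<std::vector<int>>()});
    }
    std::cout << "loaded " << tasks.size() << " task(s)\n";
}
```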
Could you make and upload an ARC-Mix .bin please, @ikawrakow?
Done.
This is a feature, not a bug :-)
I have added a simple demo program to the repository that uses the binary format.
Also, a noob question: is it possible to chain several benchmark runs (or even perplexity calculations at different context sizes) via several commands without reloading the model?
Now it's a feature :-)
* TruthfulQA: 1st attempt, does not look like it is working

  The same implementation can be used for HellaSwag as well, so I converted a HellaSwag validation dataset to the binary format used here and tested with that. The score is only around 50, so something is not quite right.

* TruthfulQA: works but the result is bad

  I know it works because if I convert the HellaSwag validation data to the binary format used in the truthful_qa_score() function I get the exact same result as from the hellaswag_score() function. But I guess the questions are tricky and the way I have done the combination of question + answer is very likely not the best. The TruthfulQA validation dataset contains 817 questions, with a random-chance result around 19%. With this version I get 29.1% for Mistral-7B and 55.2% for Mistral-7B-Instruct-v0.2. The HF leaderboard results for these two models are 42.2% and 68.3%, respectively.

* TruthfulQA: fix random sample

* TruthfulQA: prepare tasks in parallel for large test datasets

* Rename truthful_qa to multiple_choice

* Make MSVC happy

  I had forgotten that MSVC does not make constexpr's available inside a lambda.

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
Just want to see if I'm missing anything - I would like to do exactly as mentioned above, by making a custom benchmark. Is the above only for converting .bin -> JSON, and if so, is there a way to go back to .bin? My use case is simply that many public benchmarks have contaminated training data, and a custom benchmark could be more relevantly tailored to my usage.
Somebody needs to modify the tool to be able to load JSON and store the data into the internal structures it uses for the evaluation.
Understood, and thanks.
For anyone who comes across this topic, I've managed to put together a JSON->bin encoder. I've tested it both with my own multi-choice questions, and by decoding and re-encoding a .bin file and getting the same results back. If you're interested in making your own multiple-choice benchmarks, please see the gist here.
I've also included a simple text -> JSON formatter to make the process easier.
@ikawrakow Please feel free to include either script in the Readme of your repo, the repo itself, or even as part of convert.cpp, if you wish to.
Summary
This PR adds the ability to run multiple-choice-single-correct-answer type of LLM benchmarks with the `perplexity` tool.

Details

Commonly used LLM benchmarks of this type are ARC (Easy and Challenge), MMLU, TruthfulQA, and HellaSwag.
Although the HellaSwag test asks for continuing a given context with one of 4 possible endings, it is basically the same as finding the highest-probability answer (as per LLM-predicted logits) to a multiple-choice question. Hence, it can be done with the exact same evaluation approach as ARC, MMLU, and TruthfulQA (and the `multiple_choice_score()` function added by the PR achieves the exact same HellaSwag score as the existing implementation in `hellaswag_score()`).

I have posted validation datasets for all of the above benchmarks in this Huggingface repository. A very simple binary format is used for these datasets and in the implementation in this PR.
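For readers who want a mental model of what "a very simple binary format" can mean here, the sketch below writes tasks in a hypothetical length-prefixed layout. It is only an illustration of the idea; the actual layout used by the datasets and by the reader code in this PR may differ. The point of such a layout is that reading it back needs only a couple of dozen lines of plain C++ and no external parsing library.

```cpp
// Sketch of a hypothetical length-prefixed layout for multiple-choice tasks:
// [uint32 n_tasks], then per task [question][uint32 n_answers][answers...]
// [uint32 n_correct][correct indices...], with strings stored as
// uint32 length + raw bytes. NOT the format used by the PR; illustration only.
#include <cstdint>
#include <fstream>
#include <string>
#include <vector>

struct McTask {
    std::string              question;
    std::vector<std::string> answers;
    std::vector<uint32_t>    correct;  // indices of correct answers
};

static void write_str(std::ofstream & out, const std::string & s) {
    const uint32_t n = s.size();
    out.write((const char *)&n, sizeof(n));
    out.write(s.data(), n);
}

static void write_tasks(const std::string & fname, const std::vector<McTask> & tasks) {
    std::ofstream out(fname, std::ios::binary);
    const uint32_t n_tasks = tasks.size();
    out.write((const char *)&n_tasks, sizeof(n_tasks));
    for (const auto & t : tasks) {
        write_str(out, t.question);
        const uint32_t n_ans = t.answers.size();
        out.write((const char *)&n_ans, sizeof(n_ans));
        for (const auto & a : t.answers) write_str(out, a);
        const uint32_t n_cor = t.correct.size();
        out.write((const char *)&n_cor, sizeof(n_cor));
        out.write((const char *)t.correct.data(), n_cor*sizeof(uint32_t));
    }
}

int main() {
    write_tasks("demo.bin", {{"2 + 2 = ?", {"3", "4", "5"}, {1}}});
}
```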
The results I'm getting with this implementation are not quite the same as those found on the Huggingface Open LLM Leaderboard (HFLB), see the table below. This can be due to several reasons.
Nevertheless, I think it is useful to add this capability in its present form to `llama.cpp` to allow experimentation. Perhaps this will lead to better approaches and scores that better match the HFLB.

The following table summarizes results for Mistral-7B-Instruct-v0.2. The full `fp16` model is used, and the calculations are run on an RTX-4080 GPU.

Note: I'm assuming that ARC on HFLB is an even mix of ARC-Easy and ARC-Challenge.
In this implementation, the prompts passed for evaluation for ARC, MMLU and TruthfulQA are in the form `Question: question_body Answer: answer_body`, and the probability for each answer is computed from the tokens in `answer_body`. I did experiment with several variations, but the above gave the highest score. Somewhat surprisingly, I did not get a higher score for Mistral-7B-Instruct-v0.2 using the model's instruction prompt format. Given this, and the fact that the former is LLM/tuning agnostic, I have prepared the datasets in this form. Obviously one can change to having the bare question/answers stored in the dataset and add the question prefix/suffix via command-line arguments.
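To make the evaluation step concrete, here is a small self-contained sketch of the selection logic: given the log-probabilities of the `answer_body` tokens for each candidate answer (computed elsewhere from the model's logits), pick the answer with the highest total log-probability. This illustrates the general approach rather than the exact code added by the PR. Whether to sum or length-normalize the per-token log-probabilities is a design choice that can noticeably affect scores; the sketch simply sums them.

```cpp
// Sketch of the multiple-choice selection step: each answer gets a score equal
// to the sum of the log-probabilities of its tokens (computed elsewhere from
// the model's logits); the highest-scoring answer is the model's "choice".
#include <cstdio>
#include <vector>

// Per-answer token log-probabilities for one task, plus the correct answer index.
struct TaskScores {
    std::vector<std::vector<float>> answer_token_logprobs;
    int label;  // index of the correct answer
};

static int pick_answer(const TaskScores & task) {
    int    best       = -1;
    double best_score = 0;
    for (size_t i = 0; i < task.answer_token_logprobs.size(); ++i) {
        double score = 0;
        for (float lp : task.answer_token_logprobs[i]) score += lp;
        if (best < 0 || score > best_score) { best = (int)i; best_score = score; }
    }
    return best;
}

int main() {
    // Toy example: answer 1 has the highest total log-probability.
    TaskScores task = {{{-2.0f, -1.5f}, {-0.3f, -0.2f}, {-1.0f, -2.5f}}, 1};
    const int choice = pick_answer(task);
    printf("chosen = %d, correct = %s\n", choice, choice == task.label ? "yes" : "no");
}
```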
Usage
Without the `--multiple-choice-tasks` argument, or with `N = 0`, or `N >= number of tasks`, all tasks in `<some_data_file>` will be run consecutively, else a random sample of `N` tasks will be selected.
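For example, a hypothetical invocation could look like `./perplexity -m models/mistral-7b-instruct-v0.2.f16.gguf -bf arc-easy-validation.bin --multiple-choice --multiple-choice-tasks 100`. The model and dataset names are placeholders, and the `-bf`/`--binary-file` and `--multiple-choice` flags are assumptions for illustration; only `--multiple-choice-tasks` is named in this description, so check the tool's `--help` output for the actual option names.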
It works, but the scores I'm getting are much lower compared to the HFLB leader board:

- 29.1% vs 42.2% on HFLB for Mistral-7B
- 55.2% vs 68.3% on HFLB for Mistral-7B-Instruct-v0.2

I know the implementation is correct because the same function that is used to evaluate the TruthfulQA score can also be used for HellaSwag if one converts the HellaSwag dataset to the binary format I use for TruthfulQA, and I get the exact same HellaSwag score as in the existing HellaSwag implementation. The implementation uses the same batched evaluation that is now used for HellaSwag and Winogrande, and needs just 9 seconds to process the 817 validation dataset tasks.

I'm combining the question and each answer as `Question: "question goes here" Answer: answer goes here`.
I guess this is not the best way, but I didn't find a variation that works better (produces higher scores); it definitely works better than just concatenating the question with each multiple-choice answer.

Why a binary format for this test? Because, unlike HellaSwag's line-oriented text data, this format can handle multiple-choice questions with an arbitrary number of answers along with a single correct answer or multiple correct answers, without adding a massive dependency on a Parquet (the format used on Huggingface) or JSON parsing library to `llama.cpp`.