-
A wider view of progress over time using the HF leaderboard data. I curled the leaderboard data and plotted the scores over time (plot omitted). Conclusion: continuous and continuing progress upwards, with the 7B models improving twice as fast as the 65B models, likely due to compute availability.
-
The scores of the LLaMA models on the Open LLM Leaderboard have been recalculated (several times) due to tokenizer problems: EleutherAI/lm-evaluation-harness#531 (comment). Problems with the tokenizer may be one of the reasons why the scores in our HellaSwag test are lower than the Leaderboard scores. The other reason is that we are currently doing 0-shot tests and not 10-shot. There is some discussion about this in #2389.
-
The HellaSwag calculation is broken on current master: I get a score of 71.00 after 400 tasks for LLaMA-v2-7B (see attached output). The change from the score of 77.5 we find in the table above occurred after commit 6381d4e. I have checked that the result is also different for LLaMA-v1-7B (68.50 vs 75.25 after 400 tasks).
-
@klosax When someone says that commit
-
Let me try again.
-
Excellent! This gives me the opportunity to rant against the policy of squash-merging massive changes to the code base, with 6381d4e being a great example of why that is not good. Do I want to have the 200+ commits that lead from dadbed9 to 6381d4e in the history? Definitely not. Would it have been better to restructure the 200+ commits in #2398 into
-
So, the HellaSwag score change after #2398 is due to this line: Line 3030 in bae5c5f. If I comment it out, I recover pre-GGUF scores. If I replace it with an alternative expression, I get a score very close to pre-GGUF but not quite the same (see #2805).
-
Here are some HellaSwag 10-shot scores for LLaMA-v2-7B. The test cases, along with the 10-shot context added, are attached (see at the end of the post). I tried a few variations and this version gave the highest score. The dataset consists of 5001 tasks. These are the tasks from activities that have training data in the HellaSwag training dataset, and the context added before each task was taken from there. I verified that for 0-shot evaluation with this 5001-task subset I get the exact same 0-shot HellaSwag score. The following table shows the difference in HellaSwag score for different quantization types. If you want the actual HellaSwag score for a given quantization, just combine the tabulated difference with the fp16 score.
Here is a plot of the data in the above table.
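To make the construction concrete, here is a minimal sketch of how such a 10-shot prompt could be assembled: take up to ten training examples from the same activity (longest first, as described above) and prepend them to the task's context. This is illustrative only, not the script actually used to build the attached file; the `Example` struct and `build_10shot_prompt` helper are hypothetical names.

```cpp
// Hypothetical sketch: assemble a 10-shot HellaSwag prompt from same-activity
// training examples. Not the actual tool used to produce the attached data file.
#include <algorithm>
#include <cstdio>
#include <string>
#include <vector>

struct Example {
    std::string activity; // HellaSwag activity label
    std::string text;     // training sentence: context plus the correct ending
};

// Prepend up to 10 same-activity training examples (longest first) to a task's context.
std::string build_10shot_prompt(const std::string & activity,
                                const std::string & task_context,
                                std::vector<Example> train) {
    // keep only examples from the same activity
    train.erase(std::remove_if(train.begin(), train.end(),
                    [&](const Example & e) { return e.activity != activity; }),
                train.end());
    // longest examples first
    std::sort(train.begin(), train.end(),
              [](const Example & a, const Example & b) { return a.text.size() > b.text.size(); });
    std::string prompt;
    const size_t n_shots = std::min<size_t>(10, train.size());
    for (size_t i = 0; i < n_shots; ++i) {
        prompt += train[i].text;
        prompt += "\n";
    }
    prompt += task_context; // the four candidate endings are then scored against this prompt
    return prompt;
}

int main() {
    // toy training pool (made up for illustration)
    std::vector<Example> train = {
        { "Baking cookies", "A person mixes flour and sugar in a bowl, then spoons the dough onto a tray." },
        { "Baking cookies", "Someone preheats the oven and lines a baking sheet with parchment paper before starting." },
        { "Washing a car",  "A man sprays the car with a hose and wipes it down with a sponge." },
    };
    const std::string prompt = build_10shot_prompt("Baking cookies",
                                                   "A woman takes the tray out of the oven and", train);
    printf("%s\n", prompt.c_str());
    return 0;
}
```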
-
Tbh, I don't understand this test. The test ran for 400 tasks.
-
I ran Falcon-7b now. I get 10-shot HellaSwag = 77.904 after 5001 tasks, which is almost as good as the score of 78.13 posted on the Open LLM Leaderboard. So, I guess, we could say that it works for Falcon as well as, if not better than, the LLaMA models. Interestingly enough, in 0-shot evaluation Falcon-7b beats LLaMA-v2-7b: 76.97 vs 75.8 after 10042 tasks. I used a different 10-shot data file, which gets 10-shot HellaSwag = 78.24 for LLaMA-v2-7B (so very slightly better than the 78.16 I had before for LLaMA-v2). I'm attaching it below. What is the difference between the two versions? In the original version that I used in my post above, I used the 10 longest examples from the training data that belong to the given activity. In this version, the training examples are selected randomly with probability proportional to the square of the example length.
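For reference, length-squared weighted random selection is straightforward with `std::discrete_distribution`. The sketch below is illustrative only (made-up example strings, not the script actually used to build the attached file):

```cpp
// Sketch: pick few-shot examples at random with probability proportional to
// the squared example length (illustrative only).
#include <cstdio>
#include <random>
#include <string>
#include <vector>

int main() {
    // toy pool of same-activity training examples (made up for illustration)
    std::vector<std::string> examples = {
        "A short training example.",
        "A somewhat longer training example from the same activity, giving more context.",
        "An even longer training example that should be selected more often, since the selection weight grows with the square of the example length.",
    };

    // weight each example by the square of its length
    std::vector<double> weights;
    for (const auto & e : examples) {
        weights.push_back((double) e.size() * (double) e.size());
    }

    std::mt19937 rng(1234);
    std::discrete_distribution<int> dist(weights.begin(), weights.end());

    // Draw a few shots (with replacement here for brevity; a real script would
    // presumably skip duplicates until 10 distinct examples are collected).
    for (int i = 0; i < 5; ++i) {
        printf("picked example %d\n", dist(rng));
    }
    return 0;
}
```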
-
Input: add
-
The purpose of these
-
Sorry to bring this up again. Why is my HellaSwag score so low? For example, both 400 randomized tasks and the full 10042 tasks give me a score around 50 using Mistral-7B-v0.1, which is supposed to be around 80.
-
The HellaSwag test
The test is done by using examples of sentences that make sense and others that do not. The sentences are presented to the model in groups (tasks) of four with the same start but different endings. One of the endings in each group is labeled as the one that makes the most sense. The probability of the model predicting each of the endings is computed, and if the labeled ending gets the highest probability it is considered a correct prediction. The resulting accuracy score is the percentage of tasks with correct predictions.
More info can be found in the paper https://arxiv.org/abs/1905.07830
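As a rough illustration of the scoring, here is a toy sketch with made-up numbers. It is not the actual code in perplexity.cpp: it assumes the per-token log-probabilities of each candidate ending have already been obtained from the model, and it uses a simple per-token length normalization (acc_norm-style) before picking the best ending.

```cpp
// Toy sketch of acc_norm-style HellaSwag scoring, assuming per-token
// log-probabilities of each candidate ending are already available.
#include <algorithm>
#include <cstdio>
#include <vector>

// Returns the index of the ending with the highest length-normalized log-probability.
static int pick_ending(const std::vector<std::vector<double>> & ending_logprobs) {
    int best = 0;
    double best_score = -1e30;
    for (size_t i = 0; i < ending_logprobs.size(); ++i) {
        double sum = 0.0;
        for (double lp : ending_logprobs[i]) sum += lp;
        const double score = sum / (double) std::max<size_t>(1, ending_logprobs[i].size());
        if (score > best_score) { best_score = score; best = (int) i; }
    }
    return best;
}

int main() {
    // One toy task: four endings with made-up per-token log-probabilities;
    // the labeled (correct) ending is index 2.
    std::vector<std::vector<double>> endings = {
        {-2.1, -3.0, -2.7},
        {-1.9, -4.2},
        {-1.2, -1.5, -1.1, -1.3}, // highest average log-probability
        {-3.3, -2.8, -3.1},
    };
    const int label = 2;
    const int pred  = pick_ending(endings);
    printf("predicted ending %d, labeled ending %d -> %s\n", pred, label,
           pred == label ? "correct" : "wrong");
    // The HellaSwag score is simply 100 * (number of correct tasks) / (number of tasks).
    return 0;
}
```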
In order to compute the HellaSwag scores you need to download the datafile here: klosax/hellaswag_text_data.
Simple usage that runs the test on 400 random tasks in the datafile:
./perplexity --hellaswag -f hellaswag_val_full.txt -m modelfile.gguf
For a more accurate test you can specify the number of tasks to use in the computation:
./perplexity --hellaswag --hellaswag-tasks N -f hellaswag_val_full.txt -m modelfile.gguf
The table below shows the score (0-shot acc_norm) for various models using 400 randomized HellaSwag tasks. The numbers in the "Leaderboard" column are taken from the HuggingFace Open LLM Leaderboard. Note that the Leaderboard scores are about 1.5 points higher because they are 10-shot. The "Params" column is the total number of elements in the model file and "Train tokens" is the total number of tokens the model was trained on.
The second part of the table contains models not yet supported in llama.cpp, but support may be added in the future.
The HellaSwag scores are correlated with the number of model parameters. The 400-task 0-shot HellaSwag scores are highly correlated with the Open LLM Leaderboard 10-shot HellaSwag scores.
(This post will be updated)