
Would you want to make a leaderboard for this? #10

Open
clefourrier opened this issue Mar 6, 2024 · 3 comments

clefourrier commented Mar 6, 2024

Hi!

Super cool work! I'm a researcher at HuggingFace working on evaluation and leaderboards.

I understand that this eval suite is first and foremost there to evaluate use cases that you personally find interesting, and that it may well change over time, which makes it less of an obvious candidate to build a leaderboard from.

However, I think the community really lacks leaderboards for applied and very concrete tasks, like your C++ evaluation or code-conversion tests. For non-devs, leaderboards are an interesting way to get a surface-level idea of model capabilities.

So would you be interested in pinning a version and making a leaderboard out of it? If yes, I'd love to give you a hand.

(Side note: we've got good ways to evaluate chat capabilities through Elo scores and arenas, plus of course a wide range of purely academic benchmarks, and we're now starting to see more leaderboards on applied datasets (like enterprise use cases), but there's a real lack of practical benchmarks like yours, imo.)

carlini (Owner) commented Mar 28, 2024

Yeah, maybe this isn't a bad idea... I did update my initial blog post with Claude 3 and Mistral Large
(https://nicholas.carlini.com/writing/2024/my-benchmark-for-large-language-models.html).

But maybe having an explicit leaderboard wouldn't be so bad. I've got a number of email requests for exactly this as well.

In order to do this there would need to be a few changes; if you'd be interested in making them, that would be great:

  • Some way to keep the model runs up to date when tests change. The simplest thing would be to re-run every test on every model each time, but this seems expensive. A slightly better idea is to tag each entry in results/ with the git commit hash of the test, then, when re-generating the table, see which tests are new or different from the last set of runs and only re-run those (a rough sketch of this idea follows the list).
  • Some way to decide which tests make it in the leaderboard. All of them? Or make a leaderboard that's filterable by tag? Right now I keep them all because I mainly put in the questions I want but someone else might care about different things.
  • A better way to add new models. Right now this project needs a nonzero amount of work to add new models.
  • Some kind of thing that would generate a leaderboard automatically and put it ... somewhere? At the top of the README kind of like I have now? On huggingface like the other things?
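For the incremental re-run idea above, a minimal sketch of what that bookkeeping could look like, assuming results are stored as per-model JSON files under results/ and that run_one_test is a stand-in for however the project actually executes a test (the paths and helper names here are assumptions, not the repo's real layout):

```python
# Hypothetical sketch: tag each stored result with the commit hash of the test
# that produced it, and only re-run tests whose hash has changed.
# The results/ layout, tests/ glob, and run_one_test() are assumptions.
import json
import subprocess
from pathlib import Path

RESULTS_DIR = Path("results")
TESTS_DIR = Path("tests")


def last_commit(path: Path) -> str:
    """Last git commit hash that touched this test file."""
    out = subprocess.run(
        ["git", "log", "-n", "1", "--pretty=format:%H", "--", str(path)],
        capture_output=True, text=True, check=True,
    )
    return out.stdout.strip()


def needs_rerun(test_file: Path, model: str) -> bool:
    """True if there is no stored result, or the test changed since it was stored."""
    record = RESULTS_DIR / model / f"{test_file.stem}.json"
    if not record.exists():
        return True
    return json.loads(record.read_text()).get("test_commit") != last_commit(test_file)


def update_results(model: str, run_one_test) -> None:
    """Re-run only the stale tests for one model and record the new hashes."""
    for test_file in TESTS_DIR.glob("*.py"):
        if not needs_rerun(test_file, model):
            continue
        passed = run_one_test(test_file, model)
        record = RESULTS_DIR / model / f"{test_file.stem}.json"
        record.parent.mkdir(parents=True, exist_ok=True)
        record.write_text(json.dumps(
            {"test_commit": last_commit(test_file), "passed": passed}
        ))
```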

clefourrier (Author) commented

I'd keep your evals as is for now, and add tags later if they are requested by the community.

For adding new models, when you say it requires nonzero work, is it an issue of compute, of not having all the task descriptions grouped, ...?

For the last point, the actual leaderboard aspect, we've got templates which read from a dataset and automatically update the displayed results (here).
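The template itself isn't shown in this thread, but the general pattern described (a Space that reads results from a Hugging Face dataset and renders them as a table) could look roughly like the sketch below; the dataset name, column names, and Gradio layout are placeholders, not the actual template:

```python
# Rough sketch of a leaderboard Space that reads results from a dataset.
# "your-org/benchmark-results" and the "pass_rate" column are placeholders.
import gradio as gr
from datasets import load_dataset


def load_results():
    # Pull the latest published results and sort by pass rate.
    ds = load_dataset("your-org/benchmark-results", split="train")
    return ds.to_pandas().sort_values("pass_rate", ascending=False)


with gr.Blocks() as demo:
    gr.Markdown("# Applied LLM benchmark leaderboard")
    table = gr.Dataframe(value=load_results())
    gr.Button("Refresh").click(fn=load_results, outputs=table)

demo.launch()
```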

carlini (Owner) commented Apr 11, 2024

The work is just that you have to create a model/[llm].py file. I suppose for the case of Hugging Face models this should be trivial as long as they have the chat-interface tokenizer stuff set up. (I don't remember what this is actually called.)
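For reference, the tokenizer feature in question is, I believe, the chat template. A hypothetical model/[llm].py-style wrapper around a Hugging Face chat model might look roughly like this; the class name and make_request signature are guesses at the shape such an adapter takes, not the repo's actual interface:

```python
# Hypothetical sketch of a model/[llm].py-style adapter for a Hugging Face chat
# model. The class and method names are guesses, not the project's real interface.
from transformers import AutoModelForCausalLM, AutoTokenizer


class HuggingFaceChatModel:
    def __init__(self, name: str):
        self.tokenizer = AutoTokenizer.from_pretrained(name)
        self.model = AutoModelForCausalLM.from_pretrained(name, device_map="auto")

    def make_request(self, conversation: list[str]) -> str:
        # Alternate user/assistant turns, then let the tokenizer's chat template
        # format the prompt the way the model expects.
        messages = [
            {"role": "user" if i % 2 == 0 else "assistant", "content": turn}
            for i, turn in enumerate(conversation)
        ]
        inputs = self.tokenizer.apply_chat_template(
            messages, add_generation_prompt=True, return_tensors="pt"
        ).to(self.model.device)
        output = self.model.generate(inputs, max_new_tokens=1024)
        return self.tokenizer.decode(
            output[0, inputs.shape[1]:], skip_special_tokens=True
        )
```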

I've just pushed a commit (656a597) that adds support for incremental builds, so that should make re-running the benchmark (much) faster.

I'll take a look at this Hugging Face page and see if I can get something uploaded there. I have my own personal server that I can run things on, but it's not beefy enough to run any of the largest models. (I can run 7B models, but not much more.) I may see how much work it would be to create a cloud image that would just clone this and run the models. It probably wouldn't end up being too expensive per run if I can make it work easily...
