Would you want to make a leaderboard for this? #10
Comments
Yeah, maybe this isn't a bad idea... I did update my initial blog post with claude-3 and mistral large, but maybe having an explicit leaderboard wouldn't be so bad. I've gotten a number of email requests for exactly this as well. In order to do this there would need to be a few changes that, if you'd be interested in making them, would be great:
I'd keep your evals as is for now, and add tags later if they are requested by the community. For adding new models, when you say it requires non-zero work, is it an issue of compute, of not having all the task descriptions grouped, ...? For the last point, the actual leaderboard aspect, we've got templates which read from a dataset and automatically update the displayed results (here).
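For a rough picture of what such a leaderboard Space could look like, here is a minimal sketch that reads results from a dataset and displays them with Gradio. The dataset name (`your-org/llm-benchmark-results`) and the `model`/`pass_rate` columns are placeholders for illustration, not the actual template's schema:

```python
# Minimal sketch of a leaderboard Space that reads results from a dataset.
# The dataset name and column names are placeholders, not the real template.
import gradio as gr
from datasets import load_dataset

def load_results():
    # Pull the latest benchmark results from the (hypothetical) results dataset.
    ds = load_dataset("your-org/llm-benchmark-results", split="train")
    df = ds.to_pandas()
    # Sort so the best-scoring model appears first.
    return df.sort_values("pass_rate", ascending=False)

with gr.Blocks() as demo:
    gr.Markdown("# LLM benchmark leaderboard")
    gr.Dataframe(value=load_results())

demo.launch()
```

Because the results live in a dataset rather than in the Space itself, re-running the benchmark and pushing a new dataset version is enough to refresh the displayed table.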
The work is just that you have to create a model/[llm].py file. I suppose for the case of Hugging Face models this should be trivial as long as they have the stuff set up to do the chat-interface tokenizer stuff. (I don't remember what this is actually called.)

I've just pushed a commit 656a597 that adds support for incremental builds, so that should make re-running the benchmark (much) faster.

I'll take a look at this Hugging Face page and see if I can get something uploading there. I have my own personal server that I can run things on, but it's not beefy enough to run any of the largest models. (I can run 7B models, but not much more.) I may see how much work it would be to try and create a cloud image that would just clone this and run the models. It probably wouldn't end up being too expensive per run if I can make it work easily...
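The "chat-interface tokenizer stuff" referred to above is the tokenizer's chat template (tokenizer.apply_chat_template). As a sketch of what a model/[llm].py wrapper for a Hugging Face model might involve, the snippet below uses that template; the class name, the generate() signature, and the example model are hypothetical and don't mirror the repo's actual interface:

```python
# Rough sketch of a model/[llm].py-style wrapper for a Hugging Face chat model.
# The class name and generate() signature are placeholders, not the repo's real interface.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

class HuggingFaceLLM:
    def __init__(self, name="mistralai/Mistral-7B-Instruct-v0.2"):
        self.tokenizer = AutoTokenizer.from_pretrained(name)
        self.model = AutoModelForCausalLM.from_pretrained(
            name, torch_dtype=torch.float16, device_map="auto"
        )

    def generate(self, prompt, max_new_tokens=512):
        # The tokenizer's chat template turns a message list into the
        # model-specific prompt format (system/user/assistant markers).
        messages = [{"role": "user", "content": prompt}]
        inputs = self.tokenizer.apply_chat_template(
            messages, add_generation_prompt=True, return_tensors="pt"
        ).to(self.model.device)
        out = self.model.generate(inputs, max_new_tokens=max_new_tokens)
        # Strip the prompt tokens and decode only the newly generated text.
        return self.tokenizer.decode(
            out[0, inputs.shape[-1]:], skip_special_tokens=True
        )
```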
Hi!
Super cool work! I'm a researcher at HuggingFace working on evaluation and leaderboards.
I understand that this cool eval suite is first and foremost there to evaluate use cases that you personally find interesting, and that it might/will change over time, which could make it less suitable to build a leaderboard from.
However, I think that the community really lacks leaderboards for applied and very concrete tasks, like your C++ evaluation, or code conversion tests. For non devs, leaderboards are an interesting way to get an idea of model capabilities "on the surface".
So would you be interested in pinning a version and making a leaderboard out of it? If yes, I'd love to give you a hand.
(Side note: we've got good ways to evaluate chat capabilities through Elo scores and arenas, of course a range of so many purely academic benchmarks, and now we're starting to get some more leaderboards on applied datasets (like enterprise use cases), but there's a strong lack of practical benchmarks like yours imo.)