Description
Issue encountered
I'm interested in running evals of models with tool use, which doesn't seem to be supported in lighteval.
The upstream inference providers (vllm, sglang, transformers, inference web APIs, etc.) have already added support for tool calling, so it would make sense to include support for tool-based evaluations as well.
Solution/Feature
A standardized way of including tool usage (single-turn and multi-turn) in benchmarks. This would involve handling both the model's function invocations and the return of tool results to the model.
It should also offer a choice between evaluating the tool invocations and reasoning, or only the model's final response. A rough sketch of what this could look like is included below.
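To make the request more concrete, here is a minimal sketch of what a standardized tool-use loop and scoring option could look like. All names (`ToolCall`, `ToolUseSample`, `run_tool_loop`, the `model.generate` interface) are hypothetical assumptions for illustration, not lighteval's actual API.

```python
from dataclasses import dataclass, field
from typing import Any, Callable

# Hypothetical data structures -- not lighteval's real API, just a sketch of
# what a standardized tool-use evaluation sample could carry.

@dataclass
class ToolCall:
    name: str
    arguments: dict[str, Any]

@dataclass
class ToolUseSample:
    prompt: str
    tools: dict[str, Callable[..., Any]]              # tool name -> local callable
    expected_calls: list[ToolCall] = field(default_factory=list)
    expected_answer: str | None = None

def run_tool_loop(model, sample: ToolUseSample, max_turns: int = 4) -> tuple[list[ToolCall], str]:
    """Multi-turn loop: let the model call tools until it returns a final answer.

    `model.generate` is an assumed interface that returns an object with either
    a `tool_call` (name + arguments) or a final `content` string.
    """
    messages = [{"role": "user", "content": sample.prompt}]
    calls: list[ToolCall] = []
    for _ in range(max_turns):
        reply = model.generate(messages, tools=list(sample.tools))
        if reply.tool_call is None:                    # model produced its final answer
            return calls, reply.content
        call = ToolCall(reply.tool_call.name, reply.tool_call.arguments)
        calls.append(call)
        result = sample.tools[call.name](**call.arguments)   # execute the tool locally
        messages.append({"role": "assistant", "tool_call": call})
        messages.append({"role": "tool", "name": call.name, "content": str(result)})
    return calls, ""                                    # ran out of turns without an answer

def score(sample: ToolUseSample, calls: list[ToolCall], answer: str,
          evaluate_calls: bool = True) -> float:
    """Score either the tool invocations or only the final response."""
    if evaluate_calls:
        return float(calls == sample.expected_calls)
    return float(answer.strip() == (sample.expected_answer or "").strip())
```

The `evaluate_calls` flag is meant to capture the choice described above: grading the sequence of tool invocations themselves versus only the model's final answer after the tool results have been fed back.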
Possible alternatives
Evaluating both generation and tool use seems to be a rising trend. For instance, the Berkeley Function Calling Leaderboard is a good example of a benchmark that incorporates tool use/function calling. It has its own implementation for running the benchmark, but lighteval's useful feature is that it lets you create your own evaluations.