Description
Issue encountered
I'm interested in running evals of models with tool use, which doesn't seem to be supported in lighteval.
The upstream inference providers (vllm, sglang, transformers, inference web APIs, etc.) have already added support for tool calling, so it would make sense to include support for tool-based evaluations as well.
Solution/Feature
A standardized way of including tool usage (single-turn and multi-turn) in benchmarks. This would involve handling both the model's function invocations and the return of tool results to the model.
It should also offer a choice between evaluating the tool invocations and reasoning, or only the model's final response. A rough sketch of what this could look like is included below.
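To make the request more concrete, here is a minimal sketch of what a standardized tool-use loop and scoring option could look like. All names (`ToolCall`, `ToolUseSample`, `run_tool_loop`, the `model.generate` interface) are hypothetical assumptions for illustration, not lighteval's actual API.

```python
from dataclasses import dataclass, field
from typing import Any, Callable

# Hypothetical data structures -- not lighteval's real API, just a sketch of
# what a standardized tool-use evaluation sample could carry.

@dataclass
class ToolCall:
    name: str
    arguments: dict[str, Any]

@dataclass
class ToolUseSample:
    prompt: str
    tools: dict[str, Callable[..., Any]]              # tool name -> local callable
    expected_calls: list[ToolCall] = field(default_factory=list)
    expected_answer: str | None = None

def run_tool_loop(model, sample: ToolUseSample, max_turns: int = 4) -> tuple[list[ToolCall], str]:
    """Multi-turn loop: let the model call tools until it returns a final answer.

    `model.generate` is an assumed interface that returns an object with either
    a `tool_call` (name + arguments) or a final `content` string.
    """
    messages = [{"role": "user", "content": sample.prompt}]
    calls: list[ToolCall] = []
    for _ in range(max_turns):
        reply = model.generate(messages, tools=list(sample.tools))
        if reply.tool_call is None:                    # model produced its final answer
            return calls, reply.content
        call = ToolCall(reply.tool_call.name, reply.tool_call.arguments)
        calls.append(call)
        result = sample.tools[call.name](**call.arguments)   # execute the tool locally
        messages.append({"role": "assistant", "tool_call": call})
        messages.append({"role": "tool", "name": call.name, "content": str(result)})
    return calls, ""                                    # ran out of turns without an answer

def score(sample: ToolUseSample, calls: list[ToolCall], answer: str,
          evaluate_calls: bool = True) -> float:
    """Score either the tool invocations or only the final response."""
    if evaluate_calls:
        return float(calls == sample.expected_calls)
    return float(answer.strip() == (sample.expected_answer or "").strip())
```

The `evaluate_calls` flag is meant to capture the choice described above: grading the sequence of tool invocations themselves versus only the model's final answer after the tool results have been fed back.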
Possible alternatives
Evaluating both generation and tool use seems to be a rising trend. For instance, the Berkeley Function Calling Leaderboard is a good example of a benchmark that incorporates tool use/function calling. It has its own implementation for running the benchmark, but lighteval's useful feature is that it lets you create your own evaluations.