
[FT] Support evaluations with tool use #719

@ercbot

Description


Issue encountered

I'm interested in running evals of models with tool use, which doesn't seem to be supported in lighteval.

The upstream inference providers (vllm, sglang, transformers, inference web APIs, etc.) have already adopted support for tool calling, so it might make sense to include support for tool-based evaluations.
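For reference, most of these providers expose tool calling through the OpenAI-compatible chat completions interface. A minimal sketch of a single round trip (the endpoint URL, API key, model name, and the get_weather tool are all placeholders, not anything lighteval-specific):

```python
# Sketch of the OpenAI-compatible tool-calling flow served by vllm, sglang,
# and most hosted APIs; base_url, api_key, and model are placeholders.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="none")

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

response = client.chat.completions.create(
    model="my-model",
    messages=[{"role": "user", "content": "What's the weather in Paris?"}],
    tools=tools,
)

# The model either answers directly or emits one or more tool calls.
for call in response.choices[0].message.tool_calls or []:
    print(call.function.name, call.function.arguments)
```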

Solution/Feature

A standardized way of including tool usage (single-turn and multi-turn) in benchmarks. It would involve handling both the model's function invocations and the return of tool results to the model.

It should also provide an option to evaluate the tool invocations and reasoning, or only the model's final response.
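To make the shape concrete, here is a rough sketch of what that could look like. All names here are hypothetical (none of them exist in lighteval today); it only illustrates a multi-turn tool loop and the "score the calls vs. score the final answer" toggle described above:

```python
# Hypothetical sketch only: ToolUseSample, run_tool_loop, score, and the
# model.generate() interface are illustrative, not part of lighteval.
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class ToolUseSample:
    prompt: str
    tools: dict[str, Callable[..., str]]  # tool name -> callable returning a string result
    reference_calls: list[dict] = field(default_factory=list)
    reference_answer: str = ""

def run_tool_loop(model, sample: ToolUseSample, max_turns: int = 5):
    """Multi-turn loop: query the model, execute its tool calls, feed results back."""
    messages = [{"role": "user", "content": sample.prompt}]
    trace = []
    for _ in range(max_turns):
        reply = model.generate(messages, tools=sample.tools)  # hypothetical model API
        if not reply.tool_calls:
            return trace, reply.content  # final answer reached
        for call in reply.tool_calls:
            trace.append({"name": call.name, "arguments": call.arguments})
            result = sample.tools[call.name](**call.arguments)
            messages.append({"role": "tool", "name": call.name, "content": result})
    return trace, None  # ran out of turns without a final answer

def score(sample: ToolUseSample, trace, answer, evaluate_calls: bool = True) -> float:
    """Toggle between scoring the tool-call trace and scoring only the final response."""
    if evaluate_calls:
        return float(trace == sample.reference_calls)
    return float(answer is not None and answer.strip() == sample.reference_answer)
```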

Possible alternatives

Evaluating both generation and tool use seems to be a rising trend. For instance, the Berkeley Function Calling Leaderboard is a good example of a benchmark that incorporates tool use/function calling. They have their own implementation for running the benchmark, but the useful feature of lighteval is that it lets you create your own evaluations.
