
[BFCL] Add BFCL_V2_Live Dataset #580

Merged
merged 31 commits into from
Aug 19, 2024

Conversation

HuanzhiMao
Collaborator

@HuanzhiMao HuanzhiMao commented Aug 13, 2024

In this release, we hope to provide insight into whether a model exhibits overfitting with respect to the BFCL public dataset. We are introducing the BFCL-Live dataset, which consists of 2.2k real-world function calling scenarios. This dataset is categorized into simple, multiple function, parallel function, parallel multiple function, and relevance detection groups, all evaluated through AST (Abstract Syntax Tree) checking.

By comparing scores across the two BFCL datasets, we aim to identify any signs of data contamination. This will help ensure our model's performance is both robust and reliable across different data environments.

To read more about the composition and construction of this live dataset, please refer to our blog: https://gorilla.cs.berkeley.edu/blogs/12_bfcl_v2_live.html

Thanks to @yixinhuang48 and @JasonHuang1103 for helping clean the dataset.


Also in this PR:

  1. Update to BFCL Dataset Format:

    • In the V1 version of BFCL, the question field represented the user's query. With the introduction of V2_Live, the format has been updated to accommodate system prompts, user prompts, and assistant responses.
    • To ensure consistency, messages from the V1 dataset have been converted to the V2_Live format. For example, a V1 entry like "What is the weather like in Berkeley, CA" is now represented as [{"role": "user", "content": "What is the weather like in Berkeley, CA"}].
    • Consequently, all V1 datasets have been renamed to V2 to reflect this change, signaling that they are not backward-compatible.
    • All model handlers and the eval checker have been updated accordingly.
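The V1-to-V2_Live question conversion described above can be sketched as follows. This is an illustrative snippet, not the actual BFCL conversion code; the function name is hypothetical.

```python
# Hypothetical sketch of the V1 -> V2_Live question-format conversion
# described above. Function name and structure are illustrative, not
# taken from the BFCL codebase.

def convert_v1_question(question):
    """Wrap a bare V1 user-query string in the V2_Live messages format."""
    if isinstance(question, str):
        # V1 entries were plain strings; wrap them as a single user turn.
        return [{"role": "user", "content": question}]
    # Already a list of {"role": ..., "content": ...} messages.
    return question

messages = convert_v1_question("What is the weather like in Berkeley, CA")
# messages == [{"role": "user", "content": "What is the weather like in Berkeley, CA"}]
```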
  2. Update to the overall_accuracy calculation formula:

    • For BFCL V2 Leaderboard, the overall accuracy will be the unweighted average of each of the sub-categories.

      • "exec_simple", "exec_parallel", "exec_multiple", "exec_parallel_multiple", "simple", "irrelevance", "parallel", "multiple", "parallel_multiple", "java", "javascript", "rest", "live_simple", "live_multiple", "live_parallel", "live_parallel_multiple", "live_irrelevance", "live_relevance"
    • For the BFCL V2 Live Leaderboard (which contains only the Live categories), the overall accuracy will be the weighted average of each of the Live sub-categories.

      • "live_simple", "live_multiple", "live_parallel", "live_parallel_multiple", "live_irrelevance", "live_relevance"
  3. Simplification of Claude Handlers:

    • Previously, the codebase included two separate handlers: ClaudeFCHandler (for Claude models in FC mode) and ClaudePromptingHandler (for Claude models in prompting mode).
    • This PR merges these into a single ClaudeHandler, streamlining the code without altering functionality.
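The shape of the merged handler might look like the following sketch. The class and method names are hypothetical and only illustrate the dispatch-on-mode idea, not the actual gorilla codebase.

```python
# Hypothetical sketch of a merged handler dispatching on FC vs.
# prompting mode. Names are illustrative, not the gorilla codebase.

class ClaudeHandler:
    """Single handler covering both FC and prompting modes."""

    def __init__(self, model_name):
        self.model_name = model_name
        # Mode inferred from the model name instead of two classes.
        self.is_fc_mode = model_name.endswith("-FC")

    def inference(self, prompt, functions):
        if self.is_fc_mode:
            return self._inference_fc(prompt, functions)
        return self._inference_prompting(prompt, functions)

    def _inference_fc(self, prompt, functions):
        # Would call the Claude API with native tool use.
        raise NotImplementedError

    def _inference_prompting(self, prompt, functions):
        # Would embed the function docs in the prompt text instead.
        raise NotImplementedError
```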
  4. Improve Error Log Readability

  5. resolve [BFCL] Evaluation with Correct Precision Settings for Locally-Hosted Models #575

  6. resolve [BFCL] Get rid of legacy naming convention for LLM generated files #485


Co-authored-by: Charlie Cheng-Jie Ji charliechengjieji@berkeley.edu
Co-authored-by: Fanjia Yan fanjiayan@berkeley.edu

@HuanzhiMao HuanzhiMao marked this pull request as ready for review August 15, 2024 16:54
@Fanjia-Yan
Contributor

The change looks good to me in general. I will start spot testing to verify the functionality.

Contributor

@Fanjia-Yan Fanjia-Yan left a comment


🚢

Collaborator

@CharlieJCJ CharlieJCJ left a comment


Tested on gpt-4o-2024-08-06-FC, Mistral models, and yi-large-fc for spot checks. Runs successfully end-to-end.

Example commands used during testing:

❯ python openfunctions_evaluation.py --model gpt-4o-2024-08-06-FC --test-category v2_live --num-threads 8

❯ python eval_runner.py --model gpt-4o-2024-08-06-FC --test-category v2_live

@ShishirPatil ShishirPatil merged commit 30124c4 into ShishirPatil:main Aug 19, 2024
ShishirPatil pushed a commit that referenced this pull request Aug 19, 2024
This PR updates the leaderboard with the new BFCL V2 dataset score from
#580.
@HuanzhiMao HuanzhiMao deleted the bfcl_v2_live branch August 19, 2024 17:24
aw632 pushed a commit to vinaybagade/gorilla that referenced this pull request Aug 22, 2024
@HuanzhiMao HuanzhiMao added the BFCL-General General BFCL Issue label Aug 25, 2024