[BFCL] Add BFCL_V2_Live Dataset #580
The change looks good to me in general. I will start spot testing to verify the functionality.
🚢
Spot-checked on gpt-4o-2024-08-06-FC, the Mistral models, and yi-large-fc. All runs completed successfully end-to-end.
Example commands used during testing:
❯ python openfunctions_evaluation.py --model gpt-4o-2024-08-06-FC --test-category v2_live --num-threads 8
❯ python eval_runner.py --model gpt-4o-2024-08-06-FC --test-category v2_live
In this release, we hope to provide insights on whether the model exhibits overfitting with respect to the BFCL public dataset. This PR introduces the BFCL-Live dataset, which consists of 2.2k real-world function calling scenarios. The dataset is categorized into `simple`, `multiple function`, `parallel function`, `parallel multiple function`, and `relevance detection` groups, all evaluated through AST (Abstract Syntax Tree) matching. By comparing scores across the two BFCL datasets, we aim to identify any signs of data contamination. This will help ensure our model's performance is both robust and reliable across different data environments.

To read more about the composition and construction of this live dataset, please refer to our [blog](https://gorilla.cs.berkeley.edu/blogs/12_bfcl_v2_live.html).

Thanks to @yixinhuang48 and @JasonHuang1103 for helping clean the dataset.

---------

**Also in this PR**:

1. Update to BFCL Dataset Format:
   - In the V1 version of BFCL, the `question` field represented the user's query. With the introduction of V2_Live, the format has been updated to accommodate system prompts, user prompts, and assistant responses.
   - To ensure consistency, messages from the V1 dataset have been converted to the V2_Live format. For example, a V1 entry like `"What is the weather like in Berkeley, CA"` is now represented as `"[{"role": "user", "content": "What is the weather like in Berkeley, CA"}]"`.
   - Consequently, all V1 datasets have been renamed to V2 to reflect this change, signaling that they are not backward-compatible.
   - All model handlers and the eval checker have been updated accordingly.
2. Update to the overall_accuracy calculation formula:
   - For the BFCL V2 Leaderboard, the overall accuracy will be the **unweighted** average of each of the sub-categories.
     - `"exec_simple", "exec_parallel", "exec_multiple", "exec_parallel_multiple", "simple", "irrelevance", "parallel", "multiple", "parallel_multiple", "java", "javascript", "rest", "live_simple", "live_multiple", "live_parallel", "live_parallel_multiple", "live_irrelevance", "live_relevance"`
   - For the BFCL V2 Live Leaderboard (this contains only the Live categories), the overall accuracy will be the **weighted** average of each of the Live sub-categories.
     - `"live_simple", "live_multiple", "live_parallel", "live_parallel_multiple", "live_irrelevance", "live_relevance"`
3. Simplification of Claude Handlers:
   - Previously, the codebase included two separate handlers: `ClaudeFCHandler` (for Claude models in FC mode) and `ClaudePromptingHandler` (for Claude models in prompting mode).
   - This PR merges these into a single `ClaudeHandler`, streamlining the code without altering functionality.
4. Improve Error Log Readability
5. resolve ShishirPatil#485

---------

Co-authored-by: Charlie Cheng-Jie Ji <charliechengjieji@berkeley.edu>
Co-authored-by: Fanjia Yan <fanjiayan@berkeley.edu>
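The dataset format change described above (a plain-string `question` becoming a list of role/content messages) can be illustrated with a minimal sketch. The helper name `to_v2_question` and the `id` field in the sample entry are hypothetical and used only for illustration; this is not the conversion code from the PR itself.

```python
import json


def to_v2_question(question):
    """Wrap a V1-style plain-string question as a V2-style message list.

    Hypothetical helper for illustration; not taken from the BFCL codebase.
    """
    if isinstance(question, str):
        # V1: the question is just the user's query text.
        return [{"role": "user", "content": question}]
    # Already a list of role/content messages: return a shallow copy.
    return list(question)


# Sample entry; the "id" field and its value are assumptions.
v1_entry = {"id": "simple_0", "question": "What is the weather like in Berkeley, CA"}
v2_entry = {**v1_entry, "question": to_v2_question(v1_entry["question"])}

print(json.dumps(v2_entry["question"]))
# [{"role": "user", "content": "What is the weather like in Berkeley, CA"}]
```

Because the helper leaves message lists untouched, running it on an already-converted entry is harmless, which is convenient if a conversion script is re-run.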
resolve [BFCL] Evaluation with Correct Precision Settings for Locally-Hosted Models #575
resolve [BFCL] Get rid of legacy naming convention for LLM generated files #485
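The difference between the unweighted average (BFCL V2 Leaderboard) and the weighted average (BFCL V2 Live Leaderboard) mentioned in this PR can be sketched as follows. The PR text does not state what the weights are; this sketch assumes each category is weighted by its number of test cases. The function names and the counts are made up for illustration and are not taken from the eval runner.

```python
def unweighted_average(results):
    """Mean of per-category accuracies; every category counts equally."""
    accuracies = [correct / total for correct, total in results.values()]
    return sum(accuracies) / len(accuracies)


def weighted_average(results):
    """Accuracy weighted by test-case count (an assumption about the weights).

    Equivalent to total correct answers divided by total test cases.
    """
    total_correct = sum(correct for correct, _ in results.values())
    total_cases = sum(total for _, total in results.values())
    return total_correct / total_cases


# category -> (number correct, number of test cases); made-up numbers.
live_results = {
    "live_simple": (50, 100),
    "live_multiple": (300, 400),
}

print(unweighted_average(live_results))  # 0.625
print(weighted_average(live_results))    # 0.7
```

With unequal category sizes the two formulas diverge: the larger `live_multiple` category pulls the weighted score toward its own accuracy, while the unweighted score treats both categories the same.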