
Add Salesforce xLAM handler and fix minor issues #532

Merged
merged 6 commits into from
Jul 26, 2024

Conversation

zuxin666
Copy link
Contributor

This PR adds the ability to run inference on Salesforce xLAM function-calling models. They are open-sourced on Huggingface; therefore, no cost per thousand function calls will be recorded.

Apart from the xlam_handler, this PR also:

  • improves the oss_handler and the base_handler by passing the dtype (for vLLM) as an additional argument. This is crucial, since we have observed output issues when our bfloat16 model is forced to load in float16.
  • fixes one issue in berkeley-function-call-leaderboard/apply_function_credential_config.py. Without this fix, data saved by the script will not load properly.
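As an aside, the dtype plumbing described in the first bullet can be sketched as follows. The helper below is illustrative, not the PR's actual code; vLLM's `LLM(model=..., dtype=...)` parameter is real:

```python
def build_vllm_kwargs(model_name: str, dtype: str = "bfloat16") -> dict:
    """Assemble keyword arguments for vllm.LLM(**kwargs).

    Passing dtype explicitly (instead of letting it fall back to float16)
    avoids the silent bfloat16 -> float16 downcast that caused the
    output issues mentioned above.
    """
    return {"model": model_name, "dtype": dtype}

# Example usage: LLM(**build_vllm_kwargs("Salesforce/xLAM-7b-fc-r"))
```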

@HuanzhiMao Would you please take a look? Thanks for the great benchmark!

@HuanzhiMao
Copy link
Collaborator

Thanks for your contribution @zuxin666! Will review it today.

Copy link
Collaborator

@HuanzhiMao HuanzhiMao left a comment


Hey @zuxin666,

A few things that I think need to be addressed.

  1. In the convert_to_xlam_tool function, these three lines are unnecessary. The convert_to_xlam_tool function in your HuggingFace model card doesn't contain them. Furthermore, the example function doc on HuggingFace doesn't have a per-parameter required field, but rather a general required list (the same format as the BFCL dataset), so no conversion is needed for the required part.
  2. In the decode_ast function, these lines should be removed. They cause a double-casting issue and introduce false positives in the evaluation score. For a detailed explanation, see [BFCL] Fix Double-Casting Issue in model_handler for Java and JS category #516, where we addressed this issue for all model handlers.
  3. The BFCL dataset is not in the OpenAI format, so you might want to use these lines (from the OpenAI handler) to convert each dataset entry to OpenAI format before calling the convert_to_xlam_tool function.
  4. Language-specific hints at the end of the prompt should be included (this line); every model should get the same prompt to ensure fairness. Edit: this is already handled in the OSS_Handler part.
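For reference, a minimal sketch of the conversion mentioned in point 1, assuming a BFCL/OpenAI-style function doc like those quoted later in this thread. This is illustrative, not the handler's actual code; the flattened xLAM format keeps only each parameter's spec, so no per-parameter required conversion is needed:

```python
def convert_to_xlam_tool(tool: dict) -> dict:
    """Convert a BFCL/OpenAI-style function doc into a flat xLAM tool:
    {"name", "description", "parameters": {param_name: param_spec}}.
    Only the per-parameter specs are carried over; the top-level
    `required` list needs no per-parameter rewriting."""
    return {
        "name": tool["name"],
        "description": tool["description"],
        "parameters": {
            name: spec
            for name, spec in tool["parameters"].get("properties", {}).items()
        },
    }
```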

Let me know what you think and thank you again for your contribution.

@zuxin666
Copy link
Contributor Author

Hi @HuanzhiMao , thanks for the feedback!!

  • Regarding the second issue, I don't quite understand. If I remove these lines, there are many false negatives in the JS and Java evaluations. For example, I got the following misclassifications:

{"id": 2, "model_name": "Salesforce_xLAM-7b-fc-r", "test_category": "java", "valid": false, "error": ["Incorrect type for parameter 'useShortName'. Expected type String, got bool. Parameter value: True."], "error_type": "type_error:java", "prompt": {"question": "How can I generate SQL completion proposals for a table named 'Customers' in a database, considering that I prefer using short names and the additional parameters include a limit of '50' and a schema filter set to 'public'?", "function": {"name": "SQLCompletionAnalyzer.makeProposalsFromObject", "description": "Generates SQL completion proposals based on the given database object, name preference, and additional parameters.", "parameters": {"type": "dict", "properties": {"object": {"type": "any", "description": "The database object for which to generate proposals."}, "useShortName": {"type": "boolean", "description": "Indicates whether to use short names for the proposals."}, "params": {"type": "HashMap", "description": "A map of additional parameters to customize the proposals."}}, "required": ["object", "useShortName", "params"]}}}, "model_result_raw": "\n{\"tool_calls\": [{\"name\": \"SQLCompletionAnalyzer.makeProposalsFromObject\", \"arguments\": {\"object\": \"Customers\", \"useShortName\": true, \"params\": {\"limit\": 50, \"schemaFilter\": \"public\"}}}]}", "model_result_decoded": [{"SQLCompletionAnalyzer.makeProposalsFromObject": {"object": "Customers", "useShortName": true, "params": {"limit": 50, "schemaFilter": "public"}}}], "possible_answer": {"SQLCompletionAnalyzer.makeProposalsFromObject": {"object": ["Customers"], "useShortName": [true], "params": [{"limit": 50, "schemaFilter": "public"}]}}}

{"id": 24, "model_name": "Salesforce_xLAM-7b-fc-r", "test_category": "java", "valid": false, "error": ["Incorrect type for parameter 'recommendStatus'. Expected type String, got int. Parameter value: 1."], "error_type": "type_error:java", "prompt": {"question": "How can I obtain a list of new home products that contain 'LED TV' in their product name, have a recommendation status of 1, and want to retrieve the third page of results with 20 items per page?", "function": {"name": "SmsHomeNewProductServiceImpl.list", "description": "Retrieves a list of SmsHomeNewProduct entities based on the provided product name, recommendation status, and pagination settings.", "parameters": {"type": "dict", "properties": {"productName": {"type": "String", "description": "The name of the product to filter by, using a 'like' search pattern."}, "recommendStatus": {"type": "integer", "description": "The recommendation status to filter by."}, "pageSize": {"type": "integer", "description": "The number of items to return per page."}, "pageNum": {"type": "integer", "description": "The page number to retrieve."}}, "required": ["productName", "recommendStatus", "pageSize", "pageNum"]}}}, "model_result_raw": "\n{\"tool_calls\": [{\"name\": \"SmsHomeNewProductServiceImpl.list\", \"arguments\": {\"productName\": \"LED TV\", \"recommendStatus\": 1, \"pageSize\": 20, \"pageNum\": 3}}]}", "model_result_decoded": [{"SmsHomeNewProductServiceImpl.list": {"productName": "LED TV", "recommendStatus": 1, "pageSize": 20, "pageNum": 3}}], "possible_answer": {"SmsHomeNewProductServiceImpl.list": {"productName": ["LED TV"], "recommendStatus": [1], "pageSize": [20], "pageNum": [3]}}}

{"id": 25, "model_name": "Salesforce_xLAM-7b-fc-r", "test_category": "java", "valid": false, "error": ["Incorrect type for parameter 'ids'. Expected type String, got list. Parameter value: [101, 102, 103]."], "error_type": "type_error:java", "prompt": {"question": "How can I change the visibility of product categories with IDs 101, 102, and 103 to hidden in the e-commerce platform's admin panel?", "function": {"name": "PmsProductCategoryController.updateShowStatus", "description": "Updates the show status of a list of product categories to either visible or hidden.", "parameters": {"type": "dict", "properties": {"ids": {"type": "ArrayList", "description": "A list of product category IDs to update. Product category IDs are integer", "items": {"type": "integer"}}, "showStatus": {"type": "integer", "description": "The new show status for the product categories (e.g., 0 for hidden, 1 for visible)."}}, "required": ["ids", "showStatus"]}}}, "model_result_raw": "\n{\"tool_calls\": [{\"name\": \"PmsProductCategoryController.updateShowStatus\", \"arguments\": {\"ids\": [101, 102, 103], \"showStatus\": 0}}]}", "model_result_decoded": [{"PmsProductCategoryController.updateShowStatus": {"ids": [101, 102, 103], "showStatus": 0}}], "possible_answer": {"PmsProductCategoryController.updateShowStatus": {"ids": [[101, 102, 103]], "showStatus": [0]}}}

There are many such cases where, judging from the model_result_raw, model_result_decoded, and possible_answer fields, I cannot see why the model outputs should be classified as false. What could be the issue here?
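To make the distinction concrete, here is a hypothetical illustration (not BFCL's actual checker code) of what the cast-to-string lines being discussed do: wrapping every decoded argument in `str()` makes natively typed values look like strings to a String-typed Java check, which is exactly the double-casting behaviour that #516 flags:

```python
def double_cast(value):
    """Cast every decoded argument to a string -- the pattern the review
    asked to remove. It can mask genuine type errors (false positives),
    but before the Java/JS pipeline fixes it was also the only way
    natively typed values passed the string-typed checks."""
    return str(value)

# The Java log above rejected useShortName=True ("Expected type String,
# got bool"); after double-casting, the same value becomes the string "True".
```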

  • Regarding point 4, aren't these already included in the OSSHandler here?

I will check the remaining two items later.
Thanks!

@HuanzhiMao
Copy link
Collaborator

Hey @zuxin666,
Thanks for pointing this out. After investigating, I found that there were errors in how the current pipeline processes the prompt and function documentation for both the Java and JavaScript categories. Sorry about that!
I'll follow up with you once PR #538 has been merged.

@ShishirPatil
Copy link
Owner

#538 has been merged.

@HuanzhiMao
Copy link
Collaborator

HuanzhiMao commented Jul 25, 2024

Hey @zuxin666,

Sorry for the delayed response.

PR #538 and #545 should have addressed the Java and JavaScript issues you mentioned. I tested this PR both with the version you currently have and with the version where these lines are removed for the double-casting reason. The results show that double-casting brings only a <0.5% increase in overall accuracy for both models, with about a 5% increase for Salesforce/xLAM-1b-fc-r and a 1% increase for Salesforce/xLAM-7b-fc-r in the affected Java/JavaScript categories.

Regarding my previous fourth point, you are correct. It's already handled in OSSHandler, so no need to worry about it in the xLAMHandler.

Looking forward to seeing the Salesforce models on the leaderboard!

Copy link
Collaborator

@HuanzhiMao HuanzhiMao left a comment


LGTM

@zuxin666
Copy link
Contributor Author

Hi @HuanzhiMao, thanks for the fixes!! I checked the other two items and found no issues with the convert_to_xlam_tool function, as the "default" entry is parsed correctly. We will update our HuggingFace model card to add this part.

We have also removed the lines in the decode_ast function that could cause double-casting problems.
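The model_result_raw / model_result_decoded pairs in the logs above suggest a decoder of roughly this shape. This is a sketch under that assumption, not the handler's actual code; note that no extra casting is applied to argument values:

```python
import json

def decode_ast(result: str) -> list:
    """Parse the model's raw JSON tool-call output into BFCL's decoded
    form: a list of {function_name: arguments} dicts. Argument values
    are left exactly as the model produced them (no double-casting)."""
    data = json.loads(result.strip())
    return [{call["name"]: call["arguments"]}
            for call in data.get("tool_calls", [])]
```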

We also ran another round of experiments with the newest data and eval framework, and confirmed that the pipeline works as expected. The results from our side (on 1 A100 GPU) are as follows, for your reference:

| Rank | Overall Acc | Model | Model Link | Organization | License | AST Summary | Exec Summary | Simple Function AST | Python Simple Function AST | Java Simple Function AST | JavaScript Simple Function AST | Multiple Functions AST | Parallel Functions AST | Parallel Multiple AST | Simple Function Exec | Python Simple Function Exec | REST Simple Function Exec | Multiple Functions Exec | Parallel Functions Exec | Parallel Multiple Exec | Relevance Detection | Cost ($ Per 1k Function Calls) | Latency Mean (s) | Latency Standard Deviation (s) | Latency 95th Percentile (s) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 88.29% | xLAM-7b-fc-r (FC) | Link | Salesforce | cc-by-nc-4.0 | 89.58% | 87.12% | 85.82% | 94.75% | 60.00% | 66.00% | 94.00% | 91.00% | 87.50% | 96.47% | 99.00% | 92.86% | 88.00% | 84.00% | 80.00% | 84.17% | N/A | N/A | N/A | N/A |
| 2 | 78.35% | xLAM-1b-fc-r (FC) | Link | Salesforce | cc-by-nc-4.0 | 80.86% | 81.96% | 81.45% | 88.50% | 61.00% | 66.00% | 89.00% | 78.00% | 75.00% | 78.82% | 99.00% | 50.00% | 90.00% | 84.00% | 75.00% | 62.08% | N/A | N/A | N/A | N/A |

Looking forward to seeing the model on the leaderboard! Thank you!!

@HuanzhiMao
Copy link
Collaborator

Awesome!
One thing to note: we use the following formula to estimate cost and latency for locally-hosted models. [formula shown as an image in the original comment]
Since the xLAM models cannot be loaded on an 8 x V100 machine (bfloat16 is not supported on V100), their Cost and Latency fields will be N/A.
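The exact formula appears only as an image in the original comment and is not reproduced here; a generic sketch of the idea (amortizing an hourly machine price over the wall-clock time of 1,000 sequential calls) might look like the following. Both the function and its price argument are hypothetical:

```python
def estimated_cost_per_1k_calls(mean_latency_s: float,
                                machine_hourly_price_usd: float) -> float:
    """Hypothetical estimator, NOT the leaderboard's exact formula:
    convert 1,000 calls * mean latency (seconds) into hours of machine
    time, then price it at the machine's hourly rate."""
    return 1000 * mean_latency_s / 3600 * machine_hourly_price_usd
```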

@zuxin666
Copy link
Contributor Author

Sure, no problem. I hope this formula can be updated in the future to accommodate recent and advanced models, since many of them, like Llama 3.1, are also in bf16 format 😂.

@HuanzhiMao
Copy link
Collaborator

HuanzhiMao commented Jul 25, 2024

Yeah, I agree. The formula sticks with the 8 x V100 machine because that's the "most advanced" computing resource we have in our lab 😂 We'll see if we can get some better GPUs :/

@ShishirPatil ShishirPatil merged commit 189f317 into ShishirPatil:main Jul 26, 2024
ShishirPatil pushed a commit that referenced this pull request Jul 28, 2024
…ew Model) (#558)

This PR updates the leaderboard to reflect the addition of the following
models from SalesForce, Yi, and NousResearch.

- #532
- #543
- #556 
- #542

The scores for all the existing models are unchanged.
aw632 pushed a commit to vinaybagade/gorilla that referenced this pull request Aug 22, 2024
This PR adds the ability to run inference on Salesforce xLAM
function-calling models. They are open-sourced on
[Huggingface](https://huggingface.co/collections/Salesforce/xlam-models-65f00e2a0a63bbcd1c2dade4).
Therefore, no cost per thousand function calls will be recorded.

Apart from the `xlam_handler`, this PR also:
- improves the `oss_handler` and the `base_handler` by passing the
`dtype` (for vLLM) as an additional argument. This is crucial, since we
have observed output issues when our `bfloat16` model is forced to load
in `float16`.
- fixes one issue in
`berkeley-function-call-leaderboard/apply_function_credential_config.py`.
Without this fix, data saved by the script will not load properly.

@HuanzhiMao Would you please take a look? Thanks for the great
benchmark!

---------

Co-authored-by: Huanzhi (Hans) Mao <huanzhimao@gmail.com>