
Add Salesforce xLAM handler and fix minor issues #532

Merged
merged 6 commits into from
Jul 26, 2024

Conversation

zuxin666
Copy link
Contributor

This PR adds the ability to run inference on Salesforce xLAM function-calling models. They are open-sourced on Huggingface; therefore, no cost per thousand function calls will be recorded.

Apart from the xlam_handler, this PR also:

  • improves the oss_handler and the base_handler by passing the dtype (for vLLM) as an additional argument. This is crucial, since we have observed output issues when our bfloat16 model is forced to load in float16.
  • fixes one issue in berkeley-function-call-leaderboard/apply_function_credential_config.py. Without this fix, data saved by the script will not load properly.
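As an aside, the dtype plumbing described in the first bullet can be sketched as follows. The helper below is illustrative, not the PR's actual code; vLLM's `LLM(model=..., dtype=...)` parameter is real:

```python
def build_vllm_kwargs(model_name: str, dtype: str = "bfloat16") -> dict:
    """Assemble keyword arguments for vllm.LLM(**kwargs).

    Passing dtype explicitly (instead of letting it fall back to float16)
    avoids the silent bfloat16 -> float16 downcast that caused the
    output issues mentioned above.
    """
    return {"model": model_name, "dtype": dtype}

# Example usage: LLM(**build_vllm_kwargs("Salesforce/xLAM-7b-fc-r"))
```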

@HuanzhiMao Would you please take a look? Thanks for the great benchmark!

@HuanzhiMao
Copy link
Collaborator

Thanks for your contribution @zuxin666! Will review it today.

Copy link
Collaborator

@HuanzhiMao HuanzhiMao left a comment


Hey @zuxin666,

A few things that I think need to be addressed.

  1. In the convert_to_xlam_tool function, these three lines are unnecessary. The convert_to_xlam_tool function in your HuggingFace model card doesn't contain them. Furthermore, the example function doc on HuggingFace doesn't have a per-parameter required field, but rather a general required list (the same format as the BFCL dataset), so no conversion is needed for the required part.
  2. In the decode_ast function, these lines should be removed. They cause a double-casting issue and introduce false positives in the evaluation score. For a detailed explanation, see [BFCL] Fix Double-Casting Issue in model_handler for Java and JS category #516, where we addressed this issue for all model handlers.
  3. The BFCL dataset is not in the OpenAI format, so you might want to use these lines (from the OpenAI handler) to convert each dataset entry to OpenAI format before calling the convert_to_xlam_tool function.
  4. Language-specific hints at the end of the prompt should be included (this line); every model should get the same prompt to ensure fairness. Edit: this is already handled in the OSS_Handler part.
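For reference, a minimal sketch of the conversion mentioned in point 1, assuming a BFCL/OpenAI-style function doc like those quoted later in this thread. This is illustrative, not the handler's actual code; the flattened xLAM format keeps only each parameter's spec, so no per-parameter required conversion is needed:

```python
def convert_to_xlam_tool(tool: dict) -> dict:
    """Convert a BFCL/OpenAI-style function doc into a flat xLAM tool:
    {"name", "description", "parameters": {param_name: param_spec}}.
    Only the per-parameter specs are carried over; the top-level
    `required` list needs no per-parameter rewriting."""
    return {
        "name": tool["name"],
        "description": tool["description"],
        "parameters": {
            name: spec
            for name, spec in tool["parameters"].get("properties", {}).items()
        },
    }
```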

Let me know what you think and thank you again for your contribution.

@zuxin666
Copy link
Contributor Author

Hi @HuanzhiMao , thanks for the feedback!!

  • Regarding the second issue, I don't quite understand. If I remove these lines, there are many false negatives in the JS and Java evaluations. For example, I got the following misclassifications:

{"id": 2, "model_name": "Salesforce_xLAM-7b-fc-r", "test_category": "java", "valid": false, "error": ["Incorrect type for parameter 'useShortName'. Expected type String, got bool. Parameter value: True."], "error_type": "type_error:java", "prompt": {"question": "How can I generate SQL completion proposals for a table named 'Customers' in a database, considering that I prefer using short names and the additional parameters include a limit of '50' and a schema filter set to 'public'?", "function": {"name": "SQLCompletionAnalyzer.makeProposalsFromObject", "description": "Generates SQL completion proposals based on the given database object, name preference, and additional parameters.", "parameters": {"type": "dict", "properties": {"object": {"type": "any", "description": "The database object for which to generate proposals."}, "useShortName": {"type": "boolean", "description": "Indicates whether to use short names for the proposals."}, "params": {"type": "HashMap", "description": "A map of additional parameters to customize the proposals."}}, "required": ["object", "useShortName", "params"]}}}, "model_result_raw": "\n{\"tool_calls\": [{\"name\": \"SQLCompletionAnalyzer.makeProposalsFromObject\", \"arguments\": {\"object\": \"Customers\", \"useShortName\": true, \"params\": {\"limit\": 50, \"schemaFilter\": \"public\"}}}]}", "model_result_decoded": [{"SQLCompletionAnalyzer.makeProposalsFromObject": {"object": "Customers", "useShortName": true, "params": {"limit": 50, "schemaFilter": "public"}}}], "possible_answer": {"SQLCompletionAnalyzer.makeProposalsFromObject": {"object": ["Customers"], "useShortName": [true], "params": [{"limit": 50, "schemaFilter": "public"}]}}}

{"id": 24, "model_name": "Salesforce_xLAM-7b-fc-r", "test_category": "java", "valid": false, "error": ["Incorrect type for parameter 'recommendStatus'. Expected type String, got int. Parameter value: 1."], "error_type": "type_error:java", "prompt": {"question": "How can I obtain a list of new home products that contain 'LED TV' in their product name, have a recommendation status of 1, and want to retrieve the third page of results with 20 items per page?", "function": {"name": "SmsHomeNewProductServiceImpl.list", "description": "Retrieves a list of SmsHomeNewProduct entities based on the provided product name, recommendation status, and pagination settings.", "parameters": {"type": "dict", "properties": {"productName": {"type": "String", "description": "The name of the product to filter by, using a 'like' search pattern."}, "recommendStatus": {"type": "integer", "description": "The recommendation status to filter by."}, "pageSize": {"type": "integer", "description": "The number of items to return per page."}, "pageNum": {"type": "integer", "description": "The page number to retrieve."}}, "required": ["productName", "recommendStatus", "pageSize", "pageNum"]}}}, "model_result_raw": "\n{\"tool_calls\": [{\"name\": \"SmsHomeNewProductServiceImpl.list\", \"arguments\": {\"productName\": \"LED TV\", \"recommendStatus\": 1, \"pageSize\": 20, \"pageNum\": 3}}]}", "model_result_decoded": [{"SmsHomeNewProductServiceImpl.list": {"productName": "LED TV", "recommendStatus": 1, "pageSize": 20, "pageNum": 3}}], "possible_answer": {"SmsHomeNewProductServiceImpl.list": {"productName": ["LED TV"], "recommendStatus": [1], "pageSize": [20], "pageNum": [3]}}}

{"id": 25, "model_name": "Salesforce_xLAM-7b-fc-r", "test_category": "java", "valid": false, "error": ["Incorrect type for parameter 'ids'. Expected type String, got list. Parameter value: [101, 102, 103]."], "error_type": "type_error:java", "prompt": {"question": "How can I change the visibility of product categories with IDs 101, 102, and 103 to hidden in the e-commerce platform's admin panel?", "function": {"name": "PmsProductCategoryController.updateShowStatus", "description": "Updates the show status of a list of product categories to either visible or hidden.", "parameters": {"type": "dict", "properties": {"ids": {"type": "ArrayList", "description": "A list of product category IDs to update. Product category IDs are integer", "items": {"type": "integer"}}, "showStatus": {"type": "integer", "description": "The new show status for the product categories (e.g., 0 for hidden, 1 for visible)."}}, "required": ["ids", "showStatus"]}}}, "model_result_raw": "\n{\"tool_calls\": [{\"name\": \"PmsProductCategoryController.updateShowStatus\", \"arguments\": {\"ids\": [101, 102, 103], \"showStatus\": 0}}]}", "model_result_decoded": [{"PmsProductCategoryController.updateShowStatus": {"ids": [101, 102, 103], "showStatus": 0}}], "possible_answer": {"PmsProductCategoryController.updateShowStatus": {"ids": [[101, 102, 103]], "showStatus": [0]}}}

There are many such cases where, judging from the model_result_raw, model_result_decoded, and possible_answer fields, I cannot see why the model outputs should be classified as false. What could be the issue here?
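To make the distinction concrete, here is a hypothetical illustration (not BFCL's actual checker code) of what the cast-to-string lines being discussed do: wrapping every decoded argument in `str()` makes natively typed values look like strings to a String-typed Java check, which is exactly the double-casting behaviour that #516 flags:

```python
def double_cast(value):
    """Cast every decoded argument to a string -- the pattern the review
    asked to remove. It can mask genuine type errors (false positives),
    but before the Java/JS pipeline fixes it was also the only way
    natively typed values passed the string-typed checks."""
    return str(value)

# The Java log above rejected useShortName=True ("Expected type String,
# got bool"); after double-casting, the same value becomes the string "True".
```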

  • Regarding point 4, aren't these already included in the OSSHandler here?

I will check the remaining two items later.
Thanks!

@HuanzhiMao
Copy link
Collaborator

Hey @zuxin666,
Thanks for pointing this out. After investigating, I found that there were errors in how the current pipeline processes the prompt and function documentation for both the Java and JavaScript categories. Sorry about that!
I'll follow up with you once PR #538 has been merged.

@ShishirPatil
Copy link
Owner

#538 has been merged.

@HuanzhiMao
Copy link
Collaborator

HuanzhiMao commented Jul 25, 2024

Hey @zuxin666,

Sorry for the delayed response.

PR #538 and #545 should have addressed the Java and JavaScript issues you mentioned. I tested this PR both with the version you currently have and with the version where these lines are removed for the double-casting reason. The results show that double-casting brings only a <0.5% increase in overall accuracy for both models, with about a 5% increase for Salesforce/xLAM-1b-fc-r and a 1% increase for Salesforce/xLAM-7b-fc-r in the affected Java/JavaScript categories.

Regarding my previous fourth point, you are correct. It's already handled in OSSHandler, so no need to worry about it in the xLAMHandler.

Looking forward to seeing the Salesforce models on the leaderboard!

Copy link
Collaborator

@HuanzhiMao HuanzhiMao left a comment


LGTM

@zuxin666
Copy link
Contributor Author

Hi @HuanzhiMao, thanks for the fixes!! I checked the other two items and found no issues with the convert_to_xlam_tool function, as the "default" entry is parsed correctly. We will update our HuggingFace model card to add this part.

We have also removed the lines in the decode_ast function that could cause double-casting problems.
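The model_result_raw / model_result_decoded pairs in the logs above suggest a decoder of roughly this shape. This is a sketch under that assumption, not the handler's actual code; note that no extra casting is applied to argument values:

```python
import json

def decode_ast(result: str) -> list:
    """Parse the model's raw JSON tool-call output into BFCL's decoded
    form: a list of {function_name: arguments} dicts. Argument values
    are left exactly as the model produced them (no double-casting)."""
    data = json.loads(result.strip())
    return [{call["name"]: call["arguments"]}
            for call in data.get("tool_calls", [])]
```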

We also ran another round of experiments with the newest data and eval framework, and confirmed that the pipeline works as expected. The results from our side (on 1 A100 GPU) are as follows, for your reference:

| Rank | Overall Acc | Model | Model Link | Organization | License | AST Summary | Exec Summary | Simple Function AST | Python Simple Function AST | Java Simple Function AST | JavaScript Simple Function AST | Multiple Functions AST | Parallel Functions AST | Parallel Multiple AST | Simple Function Exec | Python Simple Function Exec | REST Simple Function Exec | Multiple Functions Exec | Parallel Functions Exec | Parallel Multiple Exec | Relevance Detection | Cost ($ Per 1k Function Calls) | Latency Mean (s) | Latency Standard Deviation (s) | Latency 95th Percentile (s) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 88.29% | xLAM-7b-fc-r (FC) | Link | Salesforce | cc-by-nc-4.0 | 89.58% | 87.12% | 85.82% | 94.75% | 60.00% | 66.00% | 94.00% | 91.00% | 87.50% | 96.47% | 99.00% | 92.86% | 88.00% | 84.00% | 80.00% | 84.17% | N/A | N/A | N/A | N/A |
| 2 | 78.35% | xLAM-1b-fc-r (FC) | Link | Salesforce | cc-by-nc-4.0 | 80.86% | 81.96% | 81.45% | 88.50% | 61.00% | 66.00% | 89.00% | 78.00% | 75.00% | 78.82% | 99.00% | 50.00% | 90.00% | 84.00% | 75.00% | 62.08% | N/A | N/A | N/A | N/A |

Looking forward to seeing the model on the leaderboard! Thank you!!

@HuanzhiMao
Copy link
Collaborator

Awesome!
One thing to note: we use the following formula to estimate cost and latency for locally-hosted models. [formula shown as an image in the original comment]
Since the xLAM models cannot be loaded on an 8 x V100 machine (bfloat16 is not supported on V100), their Cost and Latency fields will be N/A.
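The exact formula appears only as an image in the original comment and is not reproduced here; a generic sketch of the idea (amortizing an hourly machine price over the wall-clock time of 1,000 sequential calls) might look like the following. Both the function and its price argument are hypothetical:

```python
def estimated_cost_per_1k_calls(mean_latency_s: float,
                                machine_hourly_price_usd: float) -> float:
    """Hypothetical estimator, NOT the leaderboard's exact formula:
    convert 1,000 calls * mean latency (seconds) into hours of machine
    time, then price it at the machine's hourly rate."""
    return 1000 * mean_latency_s / 3600 * machine_hourly_price_usd
```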

@zuxin666
Copy link
Contributor Author

Sure, no problem. I hope this formula can be updated in the future to accommodate recent and advanced models, since many of them, like Llama 3.1, are also in bf16 format 😂.

@HuanzhiMao
Copy link
Collaborator

HuanzhiMao commented Jul 25, 2024

Yeah, I agree. The formula sticks with the 8 x V100 machine because that's the "most advanced" computing resource we have in our lab 😂 We'll see if we can get some better GPUs :/

@ShishirPatil ShishirPatil merged commit 189f317 into ShishirPatil:main Jul 26, 2024
ShishirPatil pushed a commit that referenced this pull request Jul 28, 2024
…ew Model) (#558)

This PR updates the leaderboard to reflect the addition of the following
models from SalesForce, Yi, and NousResearch.

- #532
- #543
- #556 
- #542

The scores for all the existing models are unchanged.
aw632 pushed a commit to vinaybagade/gorilla that referenced this pull request Aug 22, 2024
This PR adds the ability to run inference on Salesforce xLAM
function-calling models. They are open-sourced on
[Huggingface](https://huggingface.co/collections/Salesforce/xlam-models-65f00e2a0a63bbcd1c2dade4).
Therefore, no cost per thousand function calls will be recorded.

Apart from the `xlam_handler`, this PR also:
- improves the `oss_handler` and the `base_handler` by passing the
`dtype` (for vLLM) as an additional argument. This is crucial, since we
have observed output issues when our `bfloat16` model is forced to load
in `float16`.
- fixes one issue in
`berkeley-function-call-leaderboard/apply_function_credential_config.py`.
Without this fix, data saved by the script will not load properly.

@HuanzhiMao Would you please take a look? Thanks for the great
benchmark!

---------

Co-authored-by: Huanzhi (Hans) Mao <huanzhimao@gmail.com>