
[BFCL] Support Category-Specific Generation for OSS Model, Remove eval_data_compilation Step #512

Merged · 10 commits · Jul 17, 2024

Conversation

HuanzhiMao
Collaborator

@HuanzhiMao HuanzhiMao commented Jul 8, 2024

Currently, for OSS models, users must run `eval_data_compilation.py` to merge all datasets into a single `data_total.json` file before running inference. This requires inferring on all datasets at once, without the option to run inference on individual datasets or subsets. This PR addresses this limitation by allowing users to perform inference on specific datasets directly, removing the need for the `eval_data_compilation` step.
Note: hosted models don't have this limitation.

Partially addresses #501 and #502.
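The per-category selection described above might look like the following minimal sketch. The category names and dataset contents here are illustrative assumptions, not the repository's actual files or layout:

```python
# Illustrative per-category datasets; in the repo these would be the
# individual test JSON files rather than in-memory lists.
DATASETS = {
    "simple": [{"question": "q1"}, {"question": "q2"}],
    "parallel": [{"question": "q3"}],
}

def select_entries(categories):
    """Gather only the requested categories, skipping the old
    merge-everything-into-data_total.json step."""
    selected = []
    for cat in categories:
        selected.extend(DATASETS[cat])
    return selected

# Running inference on a single category no longer pulls in the others.
print(len(select_entries(["simple"])))  # 2
```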

@HuanzhiMao HuanzhiMao marked this pull request as ready for review July 8, 2024 22:56
Collaborator

@CharlieJCJ CharlieJCJ left a comment


Tested on Llama3-8B for data generation and the eval checker. LGTM.

@ShishirPatil ShishirPatil merged commit 951c728 into ShishirPatil:main Jul 17, 2024
ShishirPatil pushed a commit that referenced this pull request Jul 20, 2024
… PR Merge (#536)

On July 16th, PR #516 and PR #512 were merged first. They introduced
fixes that should be applied to all model handlers. Shortly after, on
the same day, PR #525 was merged. The new model handler introduced in PR
#525 is missing the fixes from the previous two merged PRs (it wasn't
synced accordingly). This PR addresses this issue by applying the
necessary fixes to the new model handler.
ShishirPatil pushed a commit that referenced this pull request Jul 24, 2024
…taset; Handle vLLM Benign Error (#540)

In this PR:

1. **Support Multi-Model Multi-Category Generation**:
   - `openfunctions_evaluation.py` can now take a list of model names and a list of test categories as command-line input.
   - Partially addresses #501.

2. **Handling vLLM's Error**:
- A benign error would occur during the cleanup phase after completing a
generation task, causing the pipeline to fail despite generating model
results. This issue stems from vLLM and is outside our control. [See
this issue](vllm-project/vllm#6145) from the
vLLM repo.
- This is annoying because when users attempt category-specific
generation for locally-hosted models (as supported in #512), only the
first category result for the first model is generated since the error
occurs immediately after.
- To improve the user experience, we now combine all selected test
categories into one task and submit that single task to vLLM, splitting
the results afterwards.
- Note: If multiple locally-hosted models are queued for inference, only
the tasks of the first model will complete. Subsequent tasks will still
fail due to the cleanup phase error from the first model. Therefore, we
recommend running the inference command for one model at a time until
vLLM rolls out the fix.

3. **Adding Index to Dataset**:
   - Each test file and `possible_answer` file now includes an index to help match entries.
This PR **will not** affect the leaderboard score.
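The list-of-models, list-of-categories interface from item 1 can be sketched with `argparse`; the flag names and defaults below are assumptions for illustration, not necessarily the script's real options:

```python
import argparse

# Hypothetical CLI accepting multiple models and test categories at once.
parser = argparse.ArgumentParser(description="BFCL generation sketch")
parser.add_argument("--model", nargs="+", default=["gorilla-openfunctions-v2"])
parser.add_argument("--test-category", nargs="+", default=["all"])

# nargs="+" lets each flag consume a whitespace-separated list.
args = parser.parse_args(
    ["--model", "m1", "m2", "--test-category", "simple", "parallel"]
)
print(args.model, args.test_category)
```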
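The combine-then-split workaround from item 2 pairs naturally with the per-entry index from item 3: tag each entry before submitting one merged batch, then route results back to their categories afterwards. A minimal sketch, with field names (`id`, `category`) assumed for illustration:

```python
def combine(tasks_by_category):
    """Merge all selected categories into one batch (submitted to vLLM
    as a single task), tagging each entry for later splitting."""
    combined = []
    for category, entries in tasks_by_category.items():
        for idx, entry in enumerate(entries):
            combined.append({"id": idx, "category": category, **entry})
    return combined

def split(results):
    """Split the single batch of results back into per-category lists,
    restoring the original order via the index."""
    by_category = {}
    for r in results:
        by_category.setdefault(r["category"], []).append(r)
    for entries in by_category.values():
        entries.sort(key=lambda r: r["id"])
    return by_category

batch = combine({"simple": [{"q": "a"}, {"q": "b"}], "parallel": [{"q": "c"}]})
```

Because only one task ever reaches vLLM, the benign cleanup error fires once, after all selected categories have already produced results.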
aw632 pushed a commit to vinaybagade/gorilla that referenced this pull request Aug 22, 2024
…l_data_compilation Step (ShishirPatil#512)
aw632 pushed a commit to vinaybagade/gorilla that referenced this pull request Aug 22, 2024
… PR Merge (ShishirPatil#536)
aw632 pushed a commit to vinaybagade/gorilla that referenced this pull request Aug 22, 2024
…taset; Handle vLLM Benign Error (ShishirPatil#540)