
[BFCL] Support Category-Specific Generation for OSS Model, Remove eval_data_compilation Step #512

Merged · 10 commits · Jul 17, 2024

Conversation

HuanzhiMao
Collaborator

@HuanzhiMao HuanzhiMao commented Jul 8, 2024

Currently, for OSS models, users must run `eval_data_compilation.py` to merge all datasets into a single `data_total.json` file before running inference. This requires inferring on all datasets at once, without the option to run inference on individual datasets or subsets. This PR addresses this limitation by allowing users to perform inference on specific datasets directly, removing the need for the `eval_data_compilation` step.
Note: hosted models don't have this limitation.

Partially addresses #501 and #502.
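The per-category selection described above might look like the following minimal sketch. The category names and dataset contents here are illustrative assumptions, not the repository's actual files or layout:

```python
# Illustrative per-category datasets; in the repo these would be the
# individual test JSON files rather than in-memory lists.
DATASETS = {
    "simple": [{"question": "q1"}, {"question": "q2"}],
    "parallel": [{"question": "q3"}],
}

def select_entries(categories):
    """Gather only the requested categories, skipping the old
    merge-everything-into-data_total.json step."""
    selected = []
    for cat in categories:
        selected.extend(DATASETS[cat])
    return selected

# Running inference on a single category no longer pulls in the others.
print(len(select_entries(["simple"])))  # 2
```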

@HuanzhiMao HuanzhiMao marked this pull request as ready for review July 8, 2024 22:56
Collaborator

@CharlieJCJ CharlieJCJ left a comment


Tested on Llama3-8B for data generation and the eval checker. LGTM.

@ShishirPatil ShishirPatil merged commit 951c728 into ShishirPatil:main Jul 17, 2024
ShishirPatil pushed a commit that referenced this pull request Jul 20, 2024
… PR Merge (#536)

On July 16th, PR #516 and PR #512 were merged first. They introduced
fixes that should be applied to all model handlers. Shortly after, on
the same day, PR #525 was merged. The new model handler introduced in PR
#525 is missing the fixes from the previous two merged PRs (it wasn't
synced accordingly). This PR addresses this issue by applying the
necessary fixes to the new model handler.
ShishirPatil pushed a commit that referenced this pull request Jul 24, 2024
…taset; Handle vLLM Benign Error (#540)

In this PR:

1. **Support Multi-Model Multi-Category Generation**:
   - `openfunctions_evaluation.py` can now take a list of model names and a list of test categories as command-line input.
   - Partially addresses #501.

2. **Handling vLLM's Error**:
- A benign error would occur during the cleanup phase after completing a
generation task, causing the pipeline to fail despite generating model
results. This issue stems from vLLM and is outside our control. [See
this issue](vllm-project/vllm#6145) from the
vLLM repo.
- This is annoying because when users attempt category-specific
generation for locally-hosted models (as supported in #512), only the
first category result for the first model is generated since the error
occurs immediately after.
- To improve the user experience, we now combine all selected test
categories into one task and submit that single task to vLLM, splitting
the results afterwards.
- Note: If multiple locally-hosted models are queued for inference, only
the tasks of the first model will complete. Subsequent tasks will still
fail due to the cleanup phase error from the first model. Therefore, we
recommend running the inference command for one model at a time until
vLLM rolls out the fix.

3. **Adding Index to Dataset**:
   - Each test file and `possible_answer` file now includes an index to help match entries.
This PR **will not** affect the leaderboard score.
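The list-of-models, list-of-categories interface from item 1 can be sketched with `argparse`; the flag names and defaults below are assumptions for illustration, not necessarily the script's real options:

```python
import argparse

# Hypothetical CLI accepting multiple models and test categories at once.
parser = argparse.ArgumentParser(description="BFCL generation sketch")
parser.add_argument("--model", nargs="+", default=["gorilla-openfunctions-v2"])
parser.add_argument("--test-category", nargs="+", default=["all"])

# nargs="+" lets each flag consume a whitespace-separated list.
args = parser.parse_args(
    ["--model", "m1", "m2", "--test-category", "simple", "parallel"]
)
print(args.model, args.test_category)
```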
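The combine-then-split workaround from item 2 pairs naturally with the per-entry index from item 3: tag each entry before submitting one merged batch, then route results back to their categories afterwards. A minimal sketch, with field names (`id`, `category`) assumed for illustration:

```python
def combine(tasks_by_category):
    """Merge all selected categories into one batch (submitted to vLLM
    as a single task), tagging each entry for later splitting."""
    combined = []
    for category, entries in tasks_by_category.items():
        for idx, entry in enumerate(entries):
            combined.append({"id": idx, "category": category, **entry})
    return combined

def split(results):
    """Split the single batch of results back into per-category lists,
    restoring the original order via the index."""
    by_category = {}
    for r in results:
        by_category.setdefault(r["category"], []).append(r)
    for entries in by_category.values():
        entries.sort(key=lambda r: r["id"])
    return by_category

batch = combine({"simple": [{"q": "a"}, {"q": "b"}], "parallel": [{"q": "c"}]})
```

Because only one task ever reaches vLLM, the benign cleanup error fires once, after all selected categories have already produced results.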
aw632 pushed a commit to vinaybagade/gorilla that referenced this pull request Aug 22, 2024
…l_data_compilation Step (ShishirPatil#512)
aw632 pushed a commit to vinaybagade/gorilla that referenced this pull request Aug 22, 2024
… PR Merge (ShishirPatil#536)
aw632 pushed a commit to vinaybagade/gorilla that referenced this pull request Aug 22, 2024
…taset; Handle vLLM Benign Error (ShishirPatil#540)