
[Misc] Add quantization config support for speculative model. #7343

Merged

Conversation

ShangmingCai
Contributor

Currently, we are not able to specify the quantization method for the draft model. When we set up an INT4 GPTQ-based draft model, vLLM automatically uses gptq_marlin to configure the quantization, which may raise errors such as
ValueError: Marlin does not support weight_bits = uint4b8. Only types = [] are supported (for group_size = 128, min_capability = 70, zp = False).

This PR adds a config arg so that we can explicitly specify the quantization method for the draft model.
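
For reference, here is a minimal usage sketch of what this enables. The keyword name speculative_model_quantization and the model names below are illustrative assumptions based on this description, not taken from the diff:

    from vllm import LLM

    # Hypothetical sketch: pin the draft model's quantization method explicitly
    # instead of relying on vLLM's automatic detection. The keyword
    # `speculative_model_quantization` and the model names are assumptions
    # for illustration only.
    llm = LLM(
        model="meta-llama/Llama-2-13b-chat-hf",
        speculative_model="TheBloke/Llama-2-7B-Chat-GPTQ",  # INT4 GPTQ draft model
        speculative_model_quantization="gptq",  # avoid the gptq_marlin auto-selection
        num_speculative_tokens=5,
    )

    print(llm.generate("The capital of France is")[0].outputs[0].text)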


github-actions bot commented Aug 9, 2024

👋 Hi! Thank you for contributing to the vLLM project.
Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, which consists of a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of the default ones by unblocking the steps in your fast-check build on the Buildkite UI.

Once the PR is approved and ready to go, please make sure to run full CI as it is required to merge (or just use auto-merge).

To run full CI, you can do one of these:

  • Comment /ready on the PR
  • Add ready label to the PR
  • Enable auto-merge.

🚀

@ShangmingCai ShangmingCai changed the title [Misc] Add quantization config support for speculative mode. [Misc] Add quantization config support for speculative model. Aug 9, 2024
@ShangmingCai
Contributor Author

/ready

@github-actions github-actions bot added the ready ONLY add when PR is ready to merge/full CI is needed label Aug 9, 2024
@ShangmingCai
Contributor Author

I think (#7240) is a related issue. The difference is that this one is about the draft model, and we currently cannot explicitly specify the quantization method for the draft model as a workaround.

I believe that whether or not (#7264) addresses this Marlin bug, we still need this feature in case something like this happens again.
@robertgshaw2-neuralmagic Do you have time to check on this pr?

@robertgshaw2-neuralmagic
Collaborator

robertgshaw2-neuralmagic commented Aug 12, 2024

Can you update to the latest main? We fixed this issue, so it should automatically detect that Marlin is not supported on compute capability 75.

Otherwise, what is the use case for explicitly setting the quantization?

@ShangmingCai
Contributor Author

Can you update to the latest main? We fixed this issue, so it should automatically detect that Marlin is not supported on compute capability 75.

Otherwise, what is the use case for explicitly setting the quantization?

Yes, the main branch is back to normal. Awesome work on the quick fix.

Regarding explicit specification of the quantization method for the draft model: it is not a must, since there is an automatic detection mechanism for now, but being able to specify it explicitly gives us a fallback in production environments compared to not being able to do so at all. In addition, once vLLM supports configurable quantization of the full-precision model at runtime, the draft model should have the same flexibility, since it is far more latency-sensitive.

Finally, there is a TODO in config.py noting that we might need this:

        # TODO: The user should be able to specify revision/quantization/max
        # model len for the draft model. It is not currently supported.

I think the quantization config is the most urgent one, since we cannot work around this issue in the v0.5.4 release and have to revert to an earlier stable version.

@@ -1208,7 +1212,7 @@ def create_draft_parallel_config(
elif speculative_draft_tensor_parallel_size != 1:
# TODO(wooyeon): allow tp values larger than 1
raise ValueError(
f"{speculative_draft_tensor_parallel_size=} cannot be"
f"{speculative_draft_tensor_parallel_size=} cannot be "
Collaborator


nit

Contributor Author


This is a typo I found while working on (#6300). We need this space, or the message will print 'cannot beother value than 1'; it should read 'cannot be other value than 1'.
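
A minimal sketch of why the trailing space matters: adjacent string literals in Python are concatenated with no separator, so the two f-strings join directly (the variable value here is just an example):

    speculative_draft_tensor_parallel_size = 2  # example value

    # Without the trailing space, the literals join into "...cannot beother..."
    buggy = (f"{speculative_draft_tensor_parallel_size=} cannot be"
             f"other value than 1")
    fixed = (f"{speculative_draft_tensor_parallel_size=} cannot be "
             f"other value than 1")

    print(buggy)  # speculative_draft_tensor_parallel_size=2 cannot beother value than 1
    print(fixed)  # speculative_draft_tensor_parallel_size=2 cannot be other value than 1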

@robertgshaw2-neuralmagic
Collaborator

Sounds reasonable. Can you add a test case?

@ShangmingCai
Contributor Author

Sounds reasonable. Can you add a test case?

Sure. But I am not sure where I should put the test code. Would tests/test_config.py be a suitable place for it?

@cadedaniel
Collaborator

You can add a test here: https://github.com/vllm-project/vllm/blob/main/tests/spec_decode/e2e/test_integration.py

@ShangmingCai
Contributor Author

Can you add a test case?

@robertgshaw2-neuralmagic The test has been added. Since the changed code simply passes the user's configuration through during the initialization of draft_model_config, these e2e integration tests should be sufficient.
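
For context, a rough sketch of the kind of e2e check this refers to (this is not the actual test added in the PR; the model names and the speculative_model_quantization keyword are assumptions for illustration):

    import pytest
    from vllm import LLM, SamplingParams

    # Rough sketch only: run speculative decoding with the draft model's
    # quantization pinned explicitly (and once with auto-detection) and check
    # that generation still succeeds. Model names and the
    # `speculative_model_quantization` keyword are assumptions, not the
    # actual fixtures used in tests/spec_decode/e2e/test_integration.py.
    @pytest.mark.parametrize("draft_quantization", ["gptq", None])
    def test_draft_model_quantization(draft_quantization):
        llm = LLM(
            model="meta-llama/Llama-2-13b-chat-hf",
            speculative_model="TheBloke/Llama-2-7B-Chat-GPTQ",
            speculative_model_quantization=draft_quantization,
            num_speculative_tokens=5,
        )
        outputs = llm.generate(["Hello, my name is"],
                               SamplingParams(max_tokens=16))
        assert outputs and outputs[0].outputs[0].text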

Collaborator

@cadedaniel cadedaniel left a comment


test looks good to me

@ShangmingCai
Contributor Author

@DarkLight1337 Sorry to bother you. Do you know how to retrigger only the test for 'ci-aws/pr/2-node-tests-4-gpus-in-total'? I can't find a way to do this. Some 'ci-aws' tests seem to be unstable, and retriggering all the tests might be a bit of a waste of money.

@DarkLight1337
Member

The test is broken right now; I'll ask someone to force-merge the PR.

@youkaichao youkaichao merged commit b67ae00 into vllm-project:main Aug 16, 2024
49 of 51 checks passed
Alvant pushed a commit to compressa-ai/vllm that referenced this pull request Oct 26, 2024