[Misc] Add quantization config support for speculative model. #7343
Conversation
👋 Hi! Thank you for contributing to the vLLM project. Once the PR is approved and ready to go, please make sure to run full CI as it is required to merge (or just use auto-merge). To run full CI, you can do one of these:
/ready 🚀
I think (#7240) is a related issue. The difference is that this one is about the draft model, and we currently have no way to explicitly specify the quantization method for the draft model as a workaround. Whether or not (#7264) addresses this Marlin bug, I believe we still need this feature in case anything like it happens again.
Can you update to the latest main? We fixed this issue, so it should automatically detect that Marlin is not supported on compute capability 7.5. Otherwise, what is the use case for explicitly setting the quantization?
Yes, the main branch is back to normal. Awesome work on the quick fix. Regarding explicitly specifying the quantization method for the draft model: it is not a must, since there is an automatic detection mechanism for now, but being able to specify it explicitly gives us a fallback in production compared with not being able to do so. In addition, once vLLM supports configurable quantization of the full-precision model at runtime, the draft model should have the same flexibility, since the draft model is far more latency-sensitive. Finally, there is a TODO in config.py indicating that we might need this.
I think the quantization config is the most urgent part, since we cannot work around this issue in the v0.5.4 release and have to revert to an earlier stable version.
```diff
@@ -1208,7 +1212,7 @@ def create_draft_parallel_config(
         elif speculative_draft_tensor_parallel_size != 1:
             # TODO(wooyeon): allow tp values larger than 1
             raise ValueError(
-                f"{speculative_draft_tensor_parallel_size=} cannot be"
+                f"{speculative_draft_tensor_parallel_size=} cannot be "
```
nit
This is a typo I found while working on (#6300). We need this space; otherwise we will print 'cannot beother value than 1'. The correct message should be 'cannot be other value than 1'.
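For clarity, a minimal standalone sketch of why the trailing space matters (adjacent string literals in Python are concatenated with no separator; the variable value here is just an example):

```python
# Adjacent f-string literals are concatenated as-is, so without the trailing
# space the error message runs the words together.
speculative_draft_tensor_parallel_size = 2

broken = (f"{speculative_draft_tensor_parallel_size=} cannot be"
          f"other value than 1")
fixed = (f"{speculative_draft_tensor_parallel_size=} cannot be "
         f"other value than 1")

print(broken)  # speculative_draft_tensor_parallel_size=2 cannot beother value than 1
print(fixed)   # speculative_draft_tensor_parallel_size=2 cannot be other value than 1
```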
Sounds reasonable. Can you add a test case?
Sure. But I am not sure where I should put the test code. Would it be suitable to add it in tests/test_config.py?
You can add a test here: https://github.com/vllm-project/vllm/blob/main/tests/spec_decode/e2e/test_integration.py |
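A rough sketch of what such an end-to-end check might look like (the test name, model IDs, and the `speculative_model_quantization` argument are illustrative assumptions, not the exact code merged in this PR):

```python
import pytest
from vllm import LLM, SamplingParams


# Hypothetical parametrization: explicitly pin the draft model's quantization
# method, or pass None to exercise the automatic-detection path.
@pytest.mark.parametrize("draft_quantization", ["gptq", None])
def test_draft_model_quantization_config(draft_quantization):
    llm = LLM(
        model="JackFram/llama-160m",                    # placeholder target model
        speculative_model="some-org/llama-160m-gptq",   # placeholder GPTQ draft model
        speculative_model_quantization=draft_quantization,
        num_speculative_tokens=3,
        use_v2_block_manager=True,                      # required for spec decode at the time
    )
    outputs = llm.generate(["Hello, my name is"],
                           SamplingParams(max_tokens=8))
    # The draft model should load with the requested quantization and still
    # produce non-empty output.
    assert outputs and outputs[0].outputs[0].text
```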
@robertgshaw2-neuralmagic The test has been added. Since the changed code simply passes the user's configuration during the initialization process of …
test looks good to me
@DarkLight1337 Sorry to bother you. Do you know how to retrigger only the test for 'ci-aws/pr/2-node-tests-4-gpus-in-total'? I can't find a way to do this. Some 'ci-aws' tests seem to be unstable, and retriggering all the tests might be a bit of a waste of money.
The test is broken right now, I'll ask someone to force-merge the PR.
Currently, we are not able to specify the quantization method for the draft model. When we set up an INT4 GPTQ-based draft model, vLLM will automatically use `gptq_marlin` to configure the quantization and may raise errors such as:

ValueError: Marlin does not support weight_bits = uint4b8. Only types = [] are supported (for group_size = 128, min_capability = 70, zp = False).

This PR adds a config arg so that we can explicitly specify the quantization method for the draft model.
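A minimal sketch of the intended usage, assuming the new option is exposed on `LLM`/engine args as `speculative_model_quantization` (the argument name and model IDs here are illustrative assumptions):

```python
from vllm import LLM, SamplingParams

# Sketch: explicitly pin the draft model's quantization method instead of
# relying on automatic detection (argument name and models are assumptions).
llm = LLM(
    model="meta-llama/Llama-2-7b-chat-hf",               # full-precision target model
    speculative_model="TheBloke/Llama-2-7B-Chat-GPTQ",   # INT4 GPTQ draft model
    speculative_model_quantization="gptq",               # force plain gptq, not gptq_marlin
    num_speculative_tokens=5,
    use_v2_block_manager=True,                           # required for spec decode at the time
)
print(llm.generate(["The capital of France is"],
                   SamplingParams(max_tokens=16))[0].outputs[0].text)
```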