[Frontend] [Core] Support for sharded tensorized models #4990
Conversation
Generally LGTM. I don't see any unit tests though -- am I missing something? Please make sure there are tests for the sharded serialization and loading. Thanks for the contribution!
Thanks for the review @sangstar!
Great work! Some requested changes. I'll give this some testing on my end soon and give you some feedback.
```python
# FIXME: launching multiple multiprocessing servers within the same program
# results in a hang... launch serialization in a separate process as a work
# around
```
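For context, a minimal sketch of the separate-process work-around that this FIXME describes, assuming nothing about the actual test code: the helper below is hypothetical, and in the real test the child process would build the vLLM engine and run the Tensorizer serialization.

```python
import multiprocessing as mp


def run_serialization(model_ref: str, output_template: str) -> None:
    # Placeholder body: the actual test would construct the engine here and
    # serialize the (possibly sharded) model with Tensorizer.
    print(f"serializing {model_ref} -> {output_template}")


if __name__ == "__main__":
    # A spawned child process starts with fresh CUDA/multiprocessing state,
    # sidestepping the hang from launching multiple multiprocessing servers
    # within one program.
    ctx = mp.get_context("spawn")
    proc = ctx.Process(target=run_serialization,
                       args=("facebook/opt-125m", "/tmp/model-rank-%03d.tensors"))
    proc.start()
    proc.join()
    assert proc.exitcode == 0
```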
Will assume this is a WIP
I think the functionality supporting sharding is still useful even if the tests can't all pass together... Hopefully we can find a fix to this issue though. I'm planning to do a bit more digging into it to see if I can find anything, but let me know if you have ideas!
Found a work-around! The hang seems to be caused by the custom all-reduce kernel not being completely cleaned up. The tests no longer hang if I use `disable_custom_all_reduce=True`. With that change, I have removed this FIXME note and changed the tests to use `skipif`.
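A minimal sketch of how `disable_custom_all_reduce=True` and a GPU-count `skipif` could be wired into such a test; the test name, model, and assertions are placeholders, not the PR's actual test code.

```python
import pytest
import torch


@pytest.mark.skipif(torch.cuda.device_count() < 2,
                    reason="sharded tensorizer tests need at least 2 GPUs")
def test_sharded_tensorizer_roundtrip():
    from vllm import LLM  # imported inside the test to keep collection light

    # disable_custom_all_reduce=True avoids the hang caused by the custom
    # all-reduce kernel not being fully cleaned up between engine instances.
    llm = LLM(model="facebook/opt-125m",
              tensor_parallel_size=2,
              disable_custom_all_reduce=True)
    outputs = llm.generate(["Hello, my name is"])
    assert outputs
```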
Can you also update
Looking great! Just a few more nitpicks from me.
@tjohnson31415 This is exciting! Have you done any performance eval on the model loading speed gain %? cc @sroy745 if you have a chance to test this PR out
@sangstar Thanks for all of the review comments! I believe I have addressed all of them. Let me know if anything else comes up from your testing, or if everything looks good now. I rebased to resolve conflicts and now some tests are failing, but they seem to be unrelated (e.g. jobs getting killed). The "Distributed Tests" are passing for me in my dev environment.

@ywang96 With regards to performance evaluation, I haven't done anything rigorous, but as a quick test:

So a nice speed-up over converting the dtype of the tensors during the load, with a smaller but still significant benefit even with the dtype conversion removed.

@sroy745 Let me know if you were able to test and have any feedback. Thanks.
Thanks for your awesome PR! LGTM. I can open up a new PR refactoring this once Tensorizer supports distributed serializing and deserializing natively, but this is a good use case of doing so with its current support. I'll defer to @ywang96 and @sroy745 for their reviews.
Just tested it on my side. Great job! I'll defer to the other reviewers.
Overall LGTM! Thanks @sangstar for also reviewing this PR! Let's get this in for now.
One small ask to @sangstar (for later): when you do the refactoring, could you also add a short documentation page about using `tensorizer`? I know the documentation currently refers to the example script, but it'd be great to have a doc explaining the high-level idea and usage, with a reference to the example script. Thanks!
Absolutely! That's on my and the team's radar. I plan on having a [Docs] PR for this out in the near future.
Hey @ywang96! After talking with my team, we're hesitant about doing something substantial with regard to documenting using `tensorizer`.
Thanks for keeping me posted about this - yea that works too!
Adds initial support for serializing and loading sharded vLLM models with Tensorizer. The implementation in this PR requires that the number of shards is equal to the number of target devices; there is one `.tensors` file for each shard. Therefore, the serialized model can only be loaded with the same sharding configuration that it was saved with.

Changes:
- `TensorizerConfig.tensorizer_uri` can now contain a string template that will be formatted with the rank of the shard, e.g. `.../model-rank-%03d.tensors` (a sketch of how the template resolves follows this list)
- changed `serialize_vllm_model` to take in encryption configuration (via the keyfile path) instead of generating a different key per shard
- updated `examples/tensorizer_vllm_model.py` to enable serializing a sharded model to multiple files
- moved the `save_tensorized_model` function from `GPUExecutor` to `TensorizerLoader` to enable distributed workers to run the serialization (analogous to `save_sharded_state` used by `ShardedStateLoader`)
- added a `tensorize_vllm_model` utility function to the Tensorizer loader to easily produce a serialized vLLM model with encryption and/or sharding
- changed the `TensorDeserializer` usage so that tensors are created on the correct device instead of `'cuda'`
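As a rough illustration (not taken from the PR itself; the bucket path and parallel size below are made up), here is how a rank-templated `tensorizer_uri` resolves to one `.tensors` file per shard:

```python
# Hypothetical rank-templated URI; the "%03d" placeholder is filled with the
# shard's rank, producing one .tensors file per target device.
tensorizer_uri = "s3://my-bucket/opt-125m/model-rank-%03d.tensors"

tensor_parallel_size = 4  # must equal the number of shards at save and load time
shard_uris = [tensorizer_uri % rank for rank in range(tensor_parallel_size)]
# ['s3://my-bucket/opt-125m/model-rank-000.tensors', ...,
#  's3://my-bucket/opt-125m/model-rank-003.tensors']
print(shard_uris)
```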
FIX #4957