
[Frontend] [Core] Support for sharded tensorized models #4990

Merged 15 commits into vllm-project:main on Jun 12, 2024

Conversation

@tjohnson31415 (Contributor) commented May 22, 2024

Adds initial support for serializing and loading sharded vLLM models with Tensorizer. The implementation in this PR requires that the number of shards is equal to the number of target devices; there is one .tensors file for each shard. Therefore, the serialized model can only be loaded with the same sharding configuration that it was saved with.

Changes:

  • To support multiple files, TensorizerConfig.tensorizer_uri can now contain a string template that is formatted with the rank of each shard, e.g. .../model-rank-%03d.tensors
  • Refactor serialize_vllm_model to take in the encryption configuration (via the keyfile path) instead of generating a different key per shard
  • Update examples/tensorize_vllm_model.py to enable serializing a sharded model to multiple files
  • Plumb a save_tensorized_model function through from GPUExecutor to TensorizerLoader so that distributed workers can run the serialization (analogous to save_sharded_state used by ShardedStateLoader)
  • Add a tensorize_vllm_model utility function to the Tensorizer loader to easily produce a serialized vLLM model with encryption and/or sharding (see the sketch after this list)
  • Pass the CUDA device index explicitly to TensorDeserializer so that tensors are created on the correct device
    • the internal thread pool seems to lose track of the device index when the device is just 'cuda'
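
To make these pieces concrete, here is a minimal sketch of sharded serialization and loading based on the descriptions in this PR. The exact import path, the argument order of tensorize_vllm_model, and the model_loader_extra_config plumbing are assumptions and may differ from the merged code; the model name and paths are placeholders.

```python
from vllm import LLM
from vllm.engine.arg_utils import EngineArgs
from vllm.model_executor.model_loader.tensorizer import (
    TensorizerConfig, tensorize_vllm_model)

# One .tensors file per shard: "%03d" is filled in with each worker's rank,
# so tensor_parallel_size=2 produces model-rank-000.tensors and
# model-rank-001.tensors.
tensorizer_config = TensorizerConfig(
    tensorizer_uri="/mnt/models/my-model/model-rank-%03d.tensors")

# Serialize: each distributed worker writes out its own shard.
engine_args = EngineArgs(
    model="facebook/opt-125m",  # placeholder model for illustration
    tensor_parallel_size=2,
    dtype="float16",
)
tensorize_vllm_model(engine_args, tensorizer_config)

# Deserialize: the sharding configuration must match the one used at
# serialization time (one shard per target device).
llm = LLM(
    model="facebook/opt-125m",
    tensor_parallel_size=2,
    load_format="tensorizer",
    model_loader_extra_config=tensorizer_config,
)
```

In practice the serialize and deserialize steps would normally run as separate invocations; creating two multi-GPU engines in the same process is exactly the situation the test workaround discussed below has to deal with.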

FIX #4957

@sangstar (Contributor) left a comment

Generally LGTM. I don't see any unit tests though -- am I missing something? Please make sure there are tests for the sharded serialization and loading. Thanks for the contribution!

@tjohnson31415 force-pushed the sharded-tensorizer branch 2 times, most recently from 770cd7a to d3c7045 on May 28, 2024

@tjohnson31415 (Contributor, Author) commented May 29, 2024

Thanks for the review @sangstar!
You didn't miss anything; I hadn't included any new tests initially. I was hitting a problem where the process would hang during the test when attempting to create a second engine (i.e. one for serializing and then one for deserializing). I have now added a test that uses a subprocess to write out the serialized files as a workaround. I also did a bit of refactoring to support this, adding a tensorize_vllm_model utility function that abstracts the overall procedure for creating a serialized model. This new function also includes some comments on the differences between multi-GPU and single-GPU serialization. Please take another look at your convenience.
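
For illustration, here is a rough sketch of the subprocess workaround described above; the helper name is hypothetical and the real test wiring may differ.

```python
from multiprocessing import get_context

from vllm.engine.arg_utils import EngineArgs
from vllm.model_executor.model_loader.tensorizer import (
    TensorizerConfig, tensorize_vllm_model)


def serialize_in_subprocess(engine_args: EngineArgs,
                            tensorizer_config: TensorizerConfig) -> None:
    """Run serialization in a child process so the serializing engine is
    fully torn down before the test builds a second engine to deserialize."""
    ctx = get_context("spawn")  # a spawned child starts with a clean CUDA context
    proc = ctx.Process(target=tensorize_vllm_model,
                       args=(engine_args, tensorizer_config))
    proc.start()
    proc.join()
    assert proc.exitcode == 0, "serialization subprocess failed"
```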

@sangstar (Contributor) left a comment

Great work! Some requested changes. I'll give this some testing on my end soon and give you some feedback.

Comment on lines 306 to 308
# FIXME: launching multiple multiprocessing servers within the same program
# results in a hang... launch serialization in a separate process as a work
# around
Contributor

Will assume this is a WIP

Contributor (Author)

I think the functionality supporting sharding is still useful even if the tests can't all pass together... Hopefully we can find a fix to this issue though. I'm planning to do a bit more digging into it to see if I can find anything, but let me know if you have ideas!

Contributor (Author)

Found a workaround! The hang seems to be caused by the custom all-reduce kernel not being completely cleaned up. The tests no longer hang if I use disable_custom_all_reduce=True. With that change I have removed this FIXME note and switched the tests to use skipif.
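
For reference, a hedged sketch of what the disable_custom_all_reduce change looks like at the engine level; the model name is only a placeholder, not the PR's actual test code.

```python
from vllm import LLM

# Disabling the custom all-reduce kernel avoids the hang observed when a
# second multi-GPU engine is created in the same process, since that kernel
# otherwise does not appear to be fully cleaned up on engine teardown.
llm = LLM(
    model="facebook/opt-125m",  # placeholder model for illustration
    tensor_parallel_size=2,
    disable_custom_all_reduce=True,
)
```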

@sangstar (Contributor)

Can you also update tensorize_vllm_model.py in examples/ with instructions for how to use your sharding feature? It's meant to be an unofficial vLLM-Tensorizer usage guide.

@sangstar (Contributor) left a comment

Looking great! Just a few more nitpicks from me.

@ywang96 (Member) commented May 30, 2024

@tjohnson31415 This is exciting! Have you done any performance evaluation of the model-loading speed gain (%)?

cc @sroy745 if you have a chance to test this PR out

tjohnson31415 and others added 3 commits May 30, 2024 11:06
Merge upstream/main (126 commits)
@tjohnson31415 (Contributor, Author)

@sangstar Thanks for all of the review comments! I believe I have addressed all of them. Let me know if anything else comes up from your testing, or if everything looks good now.

I rebased to resolve conflicts and now some tests are failing, but they seem to be unrelated (e.g. jobs getting killed). The "Distributed Tests" are passing for me in my dev environment.


@ywang96: Regarding performance evaluation, I haven't done anything rigorous, but as a quick test I used the ibm-granite/granite-34b-code-instruct model sharded across 2 GPUs, loaded from in-memory cached local files, and looked at the server boot-up times from the logs:

  • Baseline without tensorizer with conversion from bfloat16 -> float16: ~30-32s
  • Baseline without tensorizer and without conversion: ~21-22s
  • With tensorizer loading from pre-converted float16: ~16-20s

So there is a nice speedup over converting the dtype of the tensors during the load, with a smaller but still significant benefit even when the dtype conversion is removed.

@sroy745 Let me know if you were able to test and have any feedback. Thanks.

@sangstar (Contributor)

Thanks for your awesome PR! LGTM. I can open up a new PR refactoring this once Tensorizer supports distributed serializing and deserializing natively, but this is a good use case of doing so with its current support. I'll defer to @ywang96 and @sroy745 for their reviews.

@sangstar (Contributor) left a comment

Just tested it on my side. Great job! I'll defer to the other reviewers.

@ywang96 (Member) left a comment

Overall LGTM! Thanks @sangstar for also reviewing this PR! Let's get this in for now.

One small ask to @sangstar (for later): when you do the refactoring, could you also add a short documentation page on using Tensorizer? I know the documentation page currently just refers readers to the example script, but it would be great to have a doc explaining the high-level idea and usage, with a reference to the example script. Thanks!

@sangstar (Contributor)

Absolutely! That's on my and the team's radar. I plan on having a [Docs] PR for this out in the near future.

@sangstar (Contributor) commented Jun 12, 2024

Hey @ywang96 !!

After talking with my team, we're hesitant to do anything substantial about documenting Tensorizer usage on vLLM's doc page, for a number of reasons, notably maintenance and redundancy concerns (the example script's docstring summarizes things pretty well). I could add a brief overview (a couple of sentences) and mostly point to Tensorizer's own maintained documentation, if that's desirable. Cheers!

@ywang96 (Member) commented Jun 12, 2024

Thanks for keeping me posted about this - yea that works too!

@ywang96 ywang96 merged commit 51602ee into vllm-project:main Jun 12, 2024
57 of 58 checks passed
@tjohnson31415 tjohnson31415 deleted the sharded-tensorizer branch June 13, 2024 17:04
robertgshaw2-redhat pushed a commit to neuralmagic/nm-vllm that referenced this pull request on Jun 16, 2024
joerunde pushed a commit to joerunde/vllm that referenced this pull request on Jun 17, 2024
xjpang pushed a commit to xjpang/vllm that referenced this pull request on Jun 27, 2024
xjpang pushed a commit to xjpang/vllm that referenced this pull request on Jul 8, 2024
xjpang pushed a commit to xjpang/vllm that referenced this pull request on Jul 24, 2024
Successfully merging this pull request may close these issues.

[Feature]: Support loading of sharded vLLM serialized models with Tensorizer