
[Frontend] [Core] Support for sharded tensorized models #4990

Merged 15 commits into vllm-project:main on Jun 12, 2024

Conversation

@tjohnson31415 (Contributor) commented May 22, 2024

Adds initial support for serializing and loading sharded vLLM models with Tensorizer. The implementation in this PR requires that the number of shards is equal to the number of target devices; there is one .tensors file for each shard. Therefore, the serialized model can only be loaded with the same sharding configuration that it was saved with.

Changes:

  • To support multiple files, TensorizerConfig.tensorizer_uri can now contain a string template that is formatted with the rank of each shard, e.g. .../model-rank-%03d.tensors
  • Refactor serialize_vllm_model to take in the encryption configuration (via the keyfile path) instead of generating a different key per shard
  • Update examples/tensorize_vllm_model.py to enable serializing a sharded model to multiple files
  • Plumb a save_tensorized_model function through from GPUExecutor to TensorizerLoader so that distributed workers can run the serialization (analogous to save_sharded_state used by ShardedStateLoader)
  • Add a tensorize_vllm_model utility function to the Tensorizer loader to easily produce a serialized vLLM model with encryption and/or sharding (see the sketch after this list)
  • Pass the CUDA device index explicitly to TensorDeserializer so that tensors are created on the correct device
    • the internal thread pool seems to lose track of the device index when the device is just 'cuda'
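
To make these pieces concrete, here is a minimal sketch of sharded serialization and loading based on the descriptions in this PR. The exact import path, the argument order of tensorize_vllm_model, and the model_loader_extra_config plumbing are assumptions and may differ from the merged code; the model name and paths are placeholders.

```python
from vllm import LLM
from vllm.engine.arg_utils import EngineArgs
from vllm.model_executor.model_loader.tensorizer import (
    TensorizerConfig, tensorize_vllm_model)

# One .tensors file per shard: "%03d" is filled in with each worker's rank,
# so tensor_parallel_size=2 produces model-rank-000.tensors and
# model-rank-001.tensors.
tensorizer_config = TensorizerConfig(
    tensorizer_uri="/mnt/models/my-model/model-rank-%03d.tensors")

# Serialize: each distributed worker writes out its own shard.
engine_args = EngineArgs(
    model="facebook/opt-125m",  # placeholder model for illustration
    tensor_parallel_size=2,
    dtype="float16",
)
tensorize_vllm_model(engine_args, tensorizer_config)

# Deserialize: the sharding configuration must match the one used at
# serialization time (one shard per target device).
llm = LLM(
    model="facebook/opt-125m",
    tensor_parallel_size=2,
    load_format="tensorizer",
    model_loader_extra_config=tensorizer_config,
)
```

In practice the serialize and deserialize steps would normally run as separate invocations; creating two multi-GPU engines in the same process is exactly the situation the test workaround discussed below has to deal with.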

FIX #4957

@sangstar (Contributor) left a comment

Generally LGTM. I don't see any unit tests though -- am I missing something? Please make sure there are tests for the sharded serialization and loading. Thanks for the contribution!

@tjohnson31415 force-pushed the sharded-tensorizer branch 2 times, most recently from 770cd7a to d3c7045 on May 28, 2024

@tjohnson31415 (Contributor, Author) commented May 29, 2024

Thanks for the review @sangstar!
You didn't miss anything; I hadn't included any new tests initially. I was hitting a problem where the process would hang during the test when attempting to create a second engine (i.e. one for serializing and then one for deserializing). I have now added a test that uses a subprocess to write out the serialized files as a workaround. I also did a bit of refactoring to support this, adding a tensorize_vllm_model utility function that abstracts the overall procedure for creating a serialized model. This new function also includes some comments on the differences between multi-GPU and single-GPU serialization. Please take another look at your convenience.
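
For illustration, here is a rough sketch of the subprocess workaround described above; the helper name is hypothetical and the real test wiring may differ.

```python
from multiprocessing import get_context

from vllm.engine.arg_utils import EngineArgs
from vllm.model_executor.model_loader.tensorizer import (
    TensorizerConfig, tensorize_vllm_model)


def serialize_in_subprocess(engine_args: EngineArgs,
                            tensorizer_config: TensorizerConfig) -> None:
    """Run serialization in a child process so the serializing engine is
    fully torn down before the test builds a second engine to deserialize."""
    ctx = get_context("spawn")  # a spawned child starts with a clean CUDA context
    proc = ctx.Process(target=tensorize_vllm_model,
                       args=(engine_args, tensorizer_config))
    proc.start()
    proc.join()
    assert proc.exitcode == 0, "serialization subprocess failed"
```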

@sangstar (Contributor) left a comment

Great work! Some requested changes. I'll give this some testing on my end soon and give you some feedback.

Comment on lines 306 to 308
# FIXME: launching multiple multiprocessing servers within the same program
# results in a hang... launch serialization in a separate process as a work
# around
Contributor

Will assume this is a WIP

Contributor (Author)

I think the functionality supporting sharding is still useful even if the tests can't all pass together... Hopefully we can find a fix to this issue though. I'm planning to do a bit more digging into it to see if I can find anything, but let me know if you have ideas!

Contributor (Author)

Found a workaround! The hang seems to be caused by the custom all-reduce kernel not being completely cleaned up. The tests no longer hang if I use disable_custom_all_reduce=True. With that change I have removed this FIXME note and switched the tests to use skipif.
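
For reference, a hedged sketch of what the disable_custom_all_reduce change looks like at the engine level; the model name is only a placeholder, not the PR's actual test code.

```python
from vllm import LLM

# Disabling the custom all-reduce kernel avoids the hang observed when a
# second multi-GPU engine is created in the same process, since that kernel
# otherwise does not appear to be fully cleaned up on engine teardown.
llm = LLM(
    model="facebook/opt-125m",  # placeholder model for illustration
    tensor_parallel_size=2,
    disable_custom_all_reduce=True,
)
```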

@sangstar (Contributor)

Can you also update tensorize_vllm_model.py in examples/ with instructions for how to use your sharding feature? It's meant to be an unofficial vLLM-Tensorizer usage guide.

@sangstar (Contributor) left a comment

Looking great! Just a few more nitpicks from me.

@ywang96 (Member) commented May 30, 2024

@tjohnson31415 This is exciting! Have you done any performance evaluation of the model-loading speed gain (%)?

cc @sroy745 if you have a chance to test this PR out

tjohnson31415 and others added 3 commits May 30, 2024 11:06
Merge upstream/main (126 commits)
@tjohnson31415 (Contributor, Author)

@sangstar Thanks for all of the review comments! I believe I have addressed all of them. Let me know if anything else comes up from your testing, or if everything looks good now.

I rebased to resolve conflicts and now some tests are failing, but they seem to be unrelated (e.g. jobs getting killed). The "Distributed Tests" are passing for me in my dev environment.


@ywang96: Regarding performance evaluation, I haven't done anything rigorous, but as a quick test I used the ibm-granite/granite-34b-code-instruct model sharded across 2 GPUs, loaded from in-memory cached local files, and looked at the server boot-up times from the logs:

  • Baseline without tensorizer with conversion from bfloat16 -> float16: ~30-32s
  • Baseline without tensorizer and without conversion: ~21-22s
  • With tensorizer loading from pre-converted float16: ~16-20s

So there is a nice speedup over converting the dtype of the tensors during the load, with a smaller but still significant benefit even when the dtype conversion is removed.

@sroy745 Let me know if you were able to test and have any feedback. Thanks.

@sangstar (Contributor)

Thanks for your awesome PR! LGTM. I can open up a new PR refactoring this once Tensorizer supports distributed serializing and deserializing natively, but this is a good use case of doing so with its current support. I'll defer to @ywang96 and @sroy745 for their reviews.

@sangstar (Contributor) left a comment

Just tested it on my side. Great job! I'll defer to the other reviewers.

@ywang96 (Member) left a comment

Overall LGTM! Thanks @sangstar for also reviewing this PR! Let's get this in for now.

One small ask to @sangstar (for later): when you do the refactoring, could you also add a short documentation page on using Tensorizer? I know the documentation page currently just refers readers to the example script, but it would be great to have a doc explaining the high-level idea and usage, with a reference to the example script. Thanks!

@sangstar (Contributor)

Absolutely! That's on my and the team's radar. I plan on having a [Docs] PR for this out in the near future.

@sangstar (Contributor) commented Jun 12, 2024

Hey @ywang96 !!

After talking with my team, we're hesitant to do anything substantial about documenting Tensorizer usage on vLLM's doc page, for a number of reasons, notably maintenance and redundancy concerns (the example script's docstring summarizes things pretty well). I could add a brief overview (a couple of sentences) and mostly point to Tensorizer's own maintained documentation, if that's desirable. Cheers!

@ywang96 (Member) commented Jun 12, 2024

Thanks for keeping me posted about this - yea that works too!

@ywang96 ywang96 merged commit 51602ee into vllm-project:main Jun 12, 2024
57 of 58 checks passed
@tjohnson31415 tjohnson31415 deleted the sharded-tensorizer branch June 13, 2024 17:04
robertgshaw2-redhat pushed a commit to neuralmagic/nm-vllm that referenced this pull request on Jun 16, 2024
joerunde pushed a commit to joerunde/vllm that referenced this pull request on Jun 17, 2024
xjpang pushed a commit to xjpang/vllm that referenced this pull request on Jun 27, 2024
xjpang pushed a commit to xjpang/vllm that referenced this pull request on Jul 8, 2024
xjpang pushed a commit to xjpang/vllm that referenced this pull request on Jul 24, 2024
Successfully merging this pull request may close these issues.

[Feature]: Support loading of sharded vLLM serialized models with Tensorizer