Conversation

@kylesayrs kylesayrs commented Aug 14, 2025

Purpose

  • In order for vLLM to properly load transform weights, it must be able to construct shared tensors in a way that is independent of the device map used during compression by LLM Compressor.
    • Right now, the construction of shared tensors depends on the device of the parent module, which is determined by the device map used by LLM Compressor
    • If LLM Compressor generates transforms on a model that is split across two GPUs, the transform will be generated on one GPU and then moved to the other GPU at runtime
  • Construct weights based on the precision of the scheme, not the precision of the weight (see the sketch after this list)
    • This also simplifies the key used by vLLM loading
    • The dtype will still work because we upcast dtypes when applying weights
  • Support creating transforms for both offloaded (sequential) and multi-GPU (basic) dispatches
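
A minimal sketch of the precision behavior described above, assuming hypothetical helper names (`make_transform_weight`, `apply_transform`) rather than compressed-tensors' actual API: the transform is constructed in the scheme's precision, and both operands are upcast to a common dtype when the transform is applied.

```python
import torch

def make_transform_weight(size: int, scheme_precision: torch.dtype) -> torch.nn.Parameter:
    # Hypothetical: construct the transform in the scheme's precision rather
    # than the parent module's weight dtype (identity used as a placeholder
    # for the real Hadamard/rotation construction)
    return torch.nn.Parameter(torch.eye(size, dtype=scheme_precision), requires_grad=False)

def apply_transform(weight: torch.Tensor, transform: torch.Tensor) -> torch.Tensor:
    # Upcast both operands so a low-precision transform still composes with a
    # higher-precision module weight, then cast back to the weight's dtype
    dtype = torch.promote_types(weight.dtype, transform.dtype)
    return (weight.to(dtype) @ transform.to(dtype)).to(weight.dtype)
```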

Changes

  • Use scheme.precision rather than module.dtype when constructing parameters
  • Do not support different devices when constructing transform weights; instead, use the first device seen (typically CPU) and ensure subsequent devices match it (see the sketch after this list)
  • Fix get_offloaded_device in the case that the module is not offloaded (such as attention)
  • Add a TQDM for transforms
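
A minimal sketch of the two device guards above, under the assumption of illustrative names and logic (the real implementation lives in compressed-tensors' transform factories and its accelerate utilities):

```python
from typing import Optional

import torch

_construction_device: Optional[torch.device] = None

def get_construction_device(module: torch.nn.Module) -> torch.device:
    # Pin construction to the first device seen (typically CPU) and require
    # every subsequent module to match it
    global _construction_device
    device = next(module.parameters()).device
    if _construction_device is None:
        _construction_device = device
    elif device != _construction_device:
        raise ValueError(
            f"Transform weights were constructed on {_construction_device}, "
            f"but this module lives on {device}"
        )
    return _construction_device

def get_offloaded_device(module: torch.nn.Module) -> torch.device:
    # If accelerate attached an offloading hook, the weights are stored on
    # CPU; otherwise (e.g. attention modules) fall back to the execution
    # device instead of raising
    hook = getattr(module, "_hf_hook", None)
    if hook is not None and getattr(hook, "weights_map", None) is not None:
        return torch.device("cpu")
    return next(module.parameters()).device
```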

Testing

  • Transform precision and shared tensors tests pass
  • Ran the QuIP example to completion with both the sequential and basic pipelines
  • QuIP correctness tests pass

@kylesayrs kylesayrs force-pushed the kylesayrs/transform-simplify-key branch from e68f4f7 to f8f7156 Compare August 14, 2025 18:25
@kylesayrs kylesayrs changed the title [Transform] Simplify weight construction keys [Transform] Guard against multi-gpu transforms Aug 14, 2025
@kylesayrs kylesayrs changed the title [Transform] Guard against multi-gpu transforms [Transform] Better dispatch support for transforms Aug 26, 2025
@kylesayrs kylesayrs changed the title [Transform] Better dispatch support for transforms [Transform] Better dispatch support for offloaded and mult-gpu Aug 26, 2025
@kylesayrs kylesayrs changed the title [Transform] Better dispatch support for offloaded and mult-gpu [Transform] Better dispatch support for offloaded and multi-gpu Aug 26, 2025
@brian-dellabetta brian-dellabetta left a comment

approving with question


@brian-dellabetta brian-dellabetta left a comment

thanks!


@dsikka dsikka left a comment

> Instead use first device seen (typically CPU) and ensure future devices match that device

How is this ensured?

@kylesayrs

@dsikka This is ensured by moving the weight to the device of the value. https://github.com/neuralmagic/compressed-tensors/pull/423/files#diff-be313d6f55c99277b8d747f1d5470f9cf0f08d99ffdb1fc45ac8df80f8784c59R114

This call only has an effect when the model is dispatched across multiple GPUs. This means that multi-GPU transforms are supported, but they incur a runtime cost (see the sketch below). Our examples only cover single-GPU/offloaded transforms for now.
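
A minimal sketch of that onload step, assuming a hypothetical `forward_transform` (not the actual method name): moving the weight to the value's device is a no-op on single-GPU and offloaded dispatches, and a device-to-device copy only when the model is split across GPUs.

```python
import torch

def forward_transform(weight: torch.Tensor, value: torch.Tensor) -> torch.Tensor:
    # No-op when weight and value already share a device; a cross-GPU copy otherwise
    weight = weight.to(value.device)
    return value @ weight
```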

@dsikka dsikka merged commit 7734cce into main Sep 8, 2025
2 checks passed
@dsikka dsikka deleted the kylesayrs/transform-simplify-key branch September 8, 2025 18:46
Etelis added a commit to Etelis/compressed-tensors that referenced this pull request Sep 11, 2025

…-project#423)

* key by weight only
* always return on CPU, onload at runtime
* fix get_offloaded_device
* reduce diff
* reduce diff
* reduce diff
* move to device to support pipeline parallel
* eagerly generate with precision
* add comment

---------

Signed-off-by: Kyle Sayers <kylesayrs@gmail.com>