Conversation

@kylesayrs kylesayrs commented Aug 14, 2025

Purpose

  • In order for vLLM to properly load transform weights, it must be able to construct shared tensors in a way that is independent of the device map used during compression by LLM Compressor.
    • Right now, the construction of shared tensors depends on the device of the parent module, which is determined by the device map used by LLM Compressor
    • If LLM Compressor generates transforms on a model that is split across two GPUs, the transform will be generated on one GPU and then moved to the other GPU at runtime
  • Construct weights based on the precision of the scheme, not the precision of the weight (see the sketch after this list)
    • This also simplifies the key used by vLLM loading
    • The dtype will still work because we upcast dtypes when applying weights
  • Support creating transforms for both offloaded (sequential) and multi-GPU (basic) dispatches
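
A minimal sketch of the precision behavior described above, assuming hypothetical helper names (`make_transform_weight`, `apply_transform`) rather than compressed-tensors' actual API: the transform is constructed in the scheme's precision, and both operands are upcast to a common dtype when the transform is applied.

```python
import torch

def make_transform_weight(size: int, scheme_precision: torch.dtype) -> torch.nn.Parameter:
    # Hypothetical: construct the transform in the scheme's precision rather
    # than the parent module's weight dtype (identity used as a placeholder
    # for the real Hadamard/rotation construction)
    return torch.nn.Parameter(torch.eye(size, dtype=scheme_precision), requires_grad=False)

def apply_transform(weight: torch.Tensor, transform: torch.Tensor) -> torch.Tensor:
    # Upcast both operands so a low-precision transform still composes with a
    # higher-precision module weight, then cast back to the weight's dtype
    dtype = torch.promote_types(weight.dtype, transform.dtype)
    return (weight.to(dtype) @ transform.to(dtype)).to(weight.dtype)
```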

Changes

  • Use scheme.precision rather than module.dtype when constructing parameters
  • Do not support different devices when constructing transform weights; instead, use the first device seen (typically CPU) and ensure subsequent devices match it (see the sketch after this list)
  • Fix get_offloaded_device in the case that the module is not offloaded (such as attention)
  • Add a TQDM for transforms
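
A minimal sketch of the two device guards above, under the assumption of illustrative names and logic (the real implementation lives in compressed-tensors' transform factories and its accelerate utilities):

```python
from typing import Optional

import torch

_construction_device: Optional[torch.device] = None

def get_construction_device(module: torch.nn.Module) -> torch.device:
    # Pin construction to the first device seen (typically CPU) and require
    # every subsequent module to match it
    global _construction_device
    device = next(module.parameters()).device
    if _construction_device is None:
        _construction_device = device
    elif device != _construction_device:
        raise ValueError(
            f"Transform weights were constructed on {_construction_device}, "
            f"but this module lives on {device}"
        )
    return _construction_device

def get_offloaded_device(module: torch.nn.Module) -> torch.device:
    # If accelerate attached an offloading hook, the weights are stored on
    # CPU; otherwise (e.g. attention modules) fall back to the execution
    # device instead of raising
    hook = getattr(module, "_hf_hook", None)
    if hook is not None and getattr(hook, "weights_map", None) is not None:
        return torch.device("cpu")
    return next(module.parameters()).device
```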

Testing

  • Transform precision and shared tensors tests pass
  • Ran the QuIP example to completion with both the sequential and basic pipelines
  • QuIP correctness tests pass

@kylesayrs kylesayrs force-pushed the kylesayrs/transform-simplify-key branch from e68f4f7 to f8f7156 Compare August 14, 2025 18:25
@kylesayrs kylesayrs changed the title [Transform] Simplify weight construction keys [Transform] Guard against multi-gpu transforms Aug 14, 2025
@kylesayrs kylesayrs changed the title [Transform] Guard against multi-gpu transforms [Transform] Better dispatch support for transforms Aug 26, 2025
@kylesayrs kylesayrs changed the title [Transform] Better dispatch support for transforms [Transform] Better dispatch support for offloaded and mult-gpu Aug 26, 2025
@kylesayrs kylesayrs changed the title [Transform] Better dispatch support for offloaded and mult-gpu [Transform] Better dispatch support for offloaded and multi-gpu Aug 26, 2025
@brian-dellabetta brian-dellabetta left a comment

approving with question


@brian-dellabetta brian-dellabetta left a comment

thanks!


@dsikka dsikka left a comment

> Instead use first device seen (typically CPU) and ensure future devices match that device

How is this ensured?

@kylesayrs

@dsikka This is ensured by moving the weight to the device of the value. https://github.com/neuralmagic/compressed-tensors/pull/423/files#diff-be313d6f55c99277b8d747f1d5470f9cf0f08d99ffdb1fc45ac8df80f8784c59R114

This call only has an effect when the model is dispatched across multiple GPUs. This means that multi-GPU transforms are supported, but they incur a runtime cost (see the sketch below). Our examples only cover single-GPU/offloaded transforms for now.
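
A minimal sketch of that onload step, assuming a hypothetical `forward_transform` (not the actual method name): moving the weight to the value's device is a no-op on single-GPU and offloaded dispatches, and a device-to-device copy only when the model is split across GPUs.

```python
import torch

def forward_transform(weight: torch.Tensor, value: torch.Tensor) -> torch.Tensor:
    # No-op when weight and value already share a device; a cross-GPU copy otherwise
    weight = weight.to(value.device)
    return value @ weight
```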

@dsikka dsikka merged commit 7734cce into main Sep 8, 2025
2 checks passed
@dsikka dsikka deleted the kylesayrs/transform-simplify-key branch September 8, 2025 18:46
Etelis added a commit to Etelis/compressed-tensors that referenced this pull request Sep 11, 2025

…-project#423)

* key by weight only
* always return on CPU, onload at runtime
* fix get_offloaded_device
* reduce diff
* reduce diff
* reduce diff
* move to device to support pipeline parallel
* eagerly generate with precision
* add comment

---------

Signed-off-by: Kyle Sayers <kylesayrs@gmail.com>