
Conversation

brian-dellabetta (Collaborator) commented Oct 30, 2025

SUMMARY:
It appears AutoAWQ makes an implicit assumption that, when caching inputs to the forward call of certain modules, only the first positional input needs to be stored per call, and all remaining kwargs can be shared across calls, so they don't have to be redundantly stored in GPU VRAM.

When we first ported AutoAWQ, this assumption seemed incorrect to us and liable to lead to poor behavior, so our implementation cached all the args to a given module's forward call. That guarantees each call is replayed correctly, at the expense of GPU VRAM. The sketch below contrasts the two capture strategies.
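
Roughly, in terms of a PyTorch forward pre-hook (a minimal sketch, not llm-compressor's actual hook code; only the "first input per call, kwargs shared" assumption comes from AutoAWQ as described above):

```python
import torch
from torch import nn

layer = nn.Linear(8, 8)

first_inputs: list[torch.Tensor] = []  # AutoAWQ-style: first positional input per call
shared_kwargs: dict = {}               # AutoAWQ-style: kwargs stored once, shared
full_capture: list[tuple] = []         # our approach: every arg of every call

def capture_hook(module, args, kwargs):
    # AutoAWQ-style capture: keep args[0] per call, and overwrite (rather
    # than duplicate) the kwargs, assuming they are identical across calls
    first_inputs.append(args[0].detach())
    shared_kwargs.update(kwargs)
    # Full-replication capture: keep everything so replay is guaranteed exact
    full_capture.append((args, kwargs))

handle = layer.register_forward_pre_hook(capture_hook, with_kwargs=True)
layer(torch.randn(2, 8))
handle.remove()
```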

Now that we are revisiting performance improvements for AWQ, I wanted to expose AutoAWQ's design choice as a toggleable field on AWQModifier, as sketched after the list below:

  • If AWQModifier(..., use_auto_awq_mem_hack=True), we use AutoAWQ's technique and cache everything to a single model-level field, _model_kwargs_cache: IntermediatesCache
  • Otherwise, we cache per parent module to the field _parent_kwargs_cache: dict[Module, IntermediatesCache]
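
A hedged sketch of how the toggle could route between the two fields; only the field names, their types, and the flag come from this PR, while the cache_forward_kwargs helper and the dict stand-in for IntermediatesCache are hypothetical:

```python
from typing import Any
from torch.nn import Module

IntermediatesCache = dict[str, Any]  # stand-in for llm-compressor's real class

class AWQModifierSketch:
    def __init__(self, use_auto_awq_mem_hack: bool = False):
        self.use_auto_awq_mem_hack = use_auto_awq_mem_hack
        # AutoAWQ-style: a single model-level cache, kwargs stored once
        self._model_kwargs_cache: IntermediatesCache = {}
        # full replication: one cache per parent module's forward call
        self._parent_kwargs_cache: dict[Module, IntermediatesCache] = {}

    def cache_forward_kwargs(self, parent: Module, kwargs: dict):
        if self.use_auto_awq_mem_hack:
            # shared across all parents; assumes kwargs are identical
            self._model_kwargs_cache.update(kwargs)
        else:
            # replicated per parent; correct replay at higher VRAM cost
            self._parent_kwargs_cache.setdefault(parent, {}).update(kwargs)
```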

I believe that is what this PR implements, but when I run examples/awq/qwen3_moe_example.py side-by-side with the field set to False versus True, I don't see any meaningful difference in VRAM usage. I need to do some more debugging to make sure this is working as intended (and to compare VRAM usage against AutoAWQ).
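
For the side-by-side comparison, one way to get a concrete number rather than eyeballing nvidia-smi (a sketch, not part of this PR) is torch's allocator stats:

```python
import torch

# reset allocator stats, run the example once with the field set, then read the peak
torch.cuda.reset_peak_memory_stats()
# ... run the calibration from examples/awq/qwen3_moe_example.py here ...
peak_gib = torch.cuda.max_memory_allocated() / 2**30
print(f"peak VRAM allocated: {peak_gib:.2f} GiB")
```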

TEST PLAN:
"please outline how the changes were tested"

Signed-off-by: Brian Dellabetta <bdellabe@redhat.com>
github-actions commented
👋 Hi! Thank you for contributing to llm-compressor. Please add the ready label when the PR is ready for review.

Note: This is required to complete the testing suite, please only add the label once the PR is code complete and local testing has been performed.
