Fix hqq skipped modules and dynamic quant #36821

mobicham · 2025-03-19T13:03:04Z

What does this PR do?

This PR tries to fix a couple of issues with hqq that popped up lately:

* skip_modules: module skipping was not working properly. For example, if you listed the vision tower in skip_modules, the setting was ignored during quantization. This now works.
* Some skipped modules caused errors when the quantized model was loaded.
* dynamic_config (per-layer quantization settings) has been broken since some recent changes in transformers. This is now fixed.
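
A rough way to picture how the two options interact is a per-layer lookup: a layer covered by a skip_modules entry stays unquantized, and any other layer takes its settings from the dynamic_config entry that matches its name. The sketch below is illustrative only. `resolve_quant_config` is a hypothetical helper, and both matching rules (component match for skips, suffix match for dynamic_config keys) are assumptions, not the actual transformers or hqq logic.

```python
from typing import Optional


def resolve_quant_config(name: str, dynamic_config: dict, skip_modules: list) -> Optional[dict]:
    """Return quantization settings for a layer, or None to leave it unquantized."""
    # Assumption: a layer is skipped when a skip entry is one of the
    # dot-separated components of its full name.
    if any(skip in name.split(".") for skip in skip_modules):
        return None
    # Assumption: a dynamic_config key matches the trailing part of the name.
    for tag, cfg in dynamic_config.items():
        if name == tag or name.endswith("." + tag):
            return cfg
    # No matching entry: this sketch leaves the layer unquantized.
    return None


q4 = {"nbits": 4, "group_size": 64}
q3 = {"nbits": 3, "group_size": 64}
dynamic_config = {"self_attn.q_proj": q4, "mlp.down_proj": q3}
skip_modules = ["lm_head", "vision_tower"]

# Layer names here are made up for illustration.
for layer in [
    "language_model.model.layers.0.self_attn.q_proj",
    "language_model.model.layers.3.mlp.down_proj",
    "vision_tower.vision_model.encoder.layers.0.self_attn.q_proj",
    "lm_head",
]:
    print(layer, "->", resolve_quant_config(layer, dynamic_config, skip_modules))
```

Note that the vision tower's own `self_attn.q_proj` would match a dynamic_config key, which is why the skip check has to run first in this sketch.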

You can now skip hqq quantization for the vision tower in VLMs as follows:

import torch
device        = 'cuda:0'
compute_dtype = torch.bfloat16
cache_dir     = None
model_id      = 'google/gemma-3-12b-it'

########################################################################
#Load model
from transformers import HqqConfig, Gemma3ForConditionalGeneration, AutoProcessor

processor = AutoProcessor.from_pretrained(model_id, cache_dir=cache_dir)

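# Alternative: a single 4-bit config applied to every quantized layer
# quant_config = HqqConfig(nbits=4, group_size=64, axis=1, skip_modules=['lm_head', 'vision_tower'])

# Per-layer settings: 4-bit for attention projections, 3-bit for MLP projections
q4_config = {'nbits':4, 'group_size':64}
q3_config = {'nbits':3, 'group_size':64}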
quant_config  = HqqConfig(dynamic_config={
  'self_attn.q_proj':q4_config,
  'self_attn.k_proj':q4_config,
  'self_attn.v_proj':q4_config,
  'self_attn.o_proj':q4_config,

  'mlp.gate_proj':q3_config,
  'mlp.up_proj'  :q3_config,
  'mlp.down_proj':q3_config,
}, skip_modules=['lm_head', 'vision_tower'])

model = Gemma3ForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=compute_dtype,
    attn_implementation="sdpa",
    cache_dir=cache_dir,
    quantization_config=quant_config,
    device_map=device,
)

messages = [
    {
        "role": "system",
        "content": [{"type": "text", "text": "You are a helpful assistant."}]
    },
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/bee.jpg"},
            {"type": "text", "text": "Describe this image in detail."}
        ]
    }
]

inputs = processor.apply_chat_template(messages, add_generation_prompt=True, tokenize=True, return_dict=True, return_tensors="pt").to(model.device, dtype=compute_dtype)

input_len = inputs["input_ids"].shape[-1]

with torch.inference_mode():
    generation = model.generate(**inputs, max_new_tokens=128, do_sample=False)[0][input_len:]
    decoded    = processor.decode(generation, skip_special_tokens=True)

print(decoded)
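
One way to check that the skip took effect is to count the layer classes under a given submodule after loading. The following is a minimal sketch: `count_layer_types` is a hypothetical helper, and the two classes are stand-ins defined here so the snippet runs without torch or hqq (the real quantized layer class in hqq is `HQQLinear`).

```python
from collections import Counter


def count_layer_types(named_modules, prefix):
    """Count module class names for entries whose dotted name contains `prefix`."""
    counts = Counter()
    for name, module in named_modules:
        if prefix in name.split("."):
            counts[type(module).__name__] += 1
    return counts


# Stand-in classes for illustration only; they are not the real layers.
class Linear:
    pass


class HQQLinear:
    pass


# Made-up (name, module) pairs in the shape that named_modules() yields.
fake_named_modules = [
    ("vision_tower.encoder.layers.0.q_proj", Linear()),
    ("vision_tower.encoder.layers.0.k_proj", Linear()),
    ("language_model.layers.0.self_attn.q_proj", HQQLinear()),
    ("language_model.layers.0.mlp.up_proj", HQQLinear()),
]

print(count_layer_types(fake_named_modules, "vision_tower"))
print(count_layer_types(fake_named_modules, "language_model"))
```

With a real model, `count_layer_types(model.named_modules(), "vision_tower")` should contain no `HQQLinear` entries if the vision tower was skipped.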

⚠️ There is still an issue with saving/loading some quantized hqq models: transformers does not ensure that the loaded state dict for a given module contains all of the attributes that module needs.
For now, I am using a custom solution that works well: before saving, it makes sure that each safetensors chunk contains all the attributes for a given module. We should open a separate issue for this, though.
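
The idea of keeping a module's attributes together in one chunk can be sketched as a grouping step before sharding. This is not the custom solution mentioned above, only an illustration under stated assumptions: `shard_by_module` is a hypothetical helper, the key names and byte sizes are made up, and a key's owning module is assumed to be everything before its last dot.

```python
from collections import defaultdict


def shard_by_module(tensor_sizes, max_shard_bytes):
    """Split state-dict keys into shards without splitting one module's keys."""
    # Group keys by owning module: "a.b.W_q" belongs to module "a.b".
    groups = defaultdict(list)
    for key in tensor_sizes:
        module_name = key.rsplit(".", 1)[0]
        groups[module_name].append(key)

    shards, current, current_bytes = [], [], 0
    for keys in groups.values():
        group_bytes = sum(tensor_sizes[k] for k in keys)
        # Start a new shard if the whole module would not fit in the current one.
        if current and current_bytes + group_bytes > max_shard_bytes:
            shards.append(current)
            current, current_bytes = [], 0
        # A module larger than the limit still gets a shard of its own.
        current.extend(keys)
        current_bytes += group_bytes
    if current:
        shards.append(current)
    return shards


# Hypothetical attribute names and sizes (in bytes) for two quantized layers.
sizes = {
    "layers.0.q_proj.W_q": 60,
    "layers.0.q_proj.scale": 10,
    "layers.0.q_proj.zero": 10,
    "layers.1.q_proj.W_q": 60,
    "layers.1.q_proj.scale": 10,
    "layers.1.q_proj.zero": 10,
}

for i, shard in enumerate(shard_by_module(sizes, max_shard_bytes=100)):
    print(i, shard)
```

The trade-off in this sketch is that shard sizes become uneven, since a module is never split to fill the remaining space.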

Who can review?

@ArthurZucker @SunMarc

github-actions · 2025-03-19T13:03:15Z

Hi 👋, thank you for opening this pull request! The pull request is converted to draft by default. When it is ready for review, please click the Ready for review button (at the bottom of the PR page).

SunMarc

Thanks ! Can you add a test to make sure that skip_modules works ?

mobicham · 2025-03-19T14:37:39Z

> Thanks ! Can you add a test to make sure that skip_modules works ?

Thank you @SunMarc! Sure, I added test_model_serialization_dynamic_quant_with_skip, which tests both dynamic quantization and skip_modules.

* Fix hqq skip_modules and dynamic_quant
* fix skipped modules loading
* add dynamic/skip HqqConfig test

mobicham and others added 4 commits March 19, 2025 11:23

Fix hqq skip_modules and dynamic_quant

b44e6a8

Merge branch 'huggingface:main' into main

6ad8fd6

fix skipped modules loading

4ec1a65

Merge branch 'huggingface:main' into main

0552981

github-actions bot marked this pull request as draft March 19, 2025 13:03

mobicham marked this pull request as ready for review March 19, 2025 13:06

github-actions bot requested review from ArthurZucker and Rocketknight1 March 19, 2025 13:06

SunMarc approved these changes Mar 19, 2025


mobicham added 2 commits March 19, 2025 14:35

add dynamic/skip HqqConfig test

3c01ebc

add dynamic/skip HqqConfig test

c0556c0

Merge branch 'huggingface:main' into main

6b3d32e

ArthurZucker approved these changes Mar 20, 2025


Merge branch 'main' into main

04f2302

SunMarc merged commit 3e8f0fb into huggingface:main Mar 20, 2025
21 checks passed

zucchini-nlp pushed a commit to zucchini-nlp/transformers that referenced this pull request May 14, 2025

Fix hqq skipped modules and dynamic quant (huggingface#36821)

75b20f9

* Fix hqq skip_modules and dynamic_quant
* fix skipped modules loading
* add dynamic/skip HqqConfig test
