
Inference with model.generate() using a quantized model leads to assertion error #39311

@Sandipan99

Description


System Info

Linux
transformers==4.52.4
bitsandbytes==0.46.1

Who can help?

No response

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

m = "microsoft/phi-4"

bnb_config = BitsAndBytesConfig(
load_in_4bit=True
)

tokenizer = AutoTokenizer.from_pretrained(m)
model = AutoModelForCausalLM.from_pretrained(m, quantization_config=bnb_config, device_map='auto')
tokenizer.pad_token_id = tokenizer.eos_token_id

inputs = tokenizer.apply_chat_template(prompt, return_tensors="pt").to('cuda')
out = model.generate(inputs, max_new_tokens=50, synced_gpus=True)

Run with:

torchrun --nproc-per-node=2 script.py

This works perfectly fine on a single-GPU setup, but produces an assertion error when running on multiple GPUs.
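For reference, a minimal sketch of the single-GPU variant that works (my assumptions: same model/tokenizer setup as above, launched directly with python script.py, no torchrun and no synced_gpus since there is only one process):

# Single-GPU baseline (assumed variant): launched with `python script.py` instead of torchrun.
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

m = "microsoft/phi-4"
bnb_config = BitsAndBytesConfig(load_in_4bit=True)

tokenizer = AutoTokenizer.from_pretrained(m)
model = AutoModelForCausalLM.from_pretrained(m, quantization_config=bnb_config, device_map='auto')
tokenizer.pad_token_id = tokenizer.eos_token_id

prompt = [{"role": "user", "content": "Hello"}]  # placeholder prompt
inputs = tokenizer.apply_chat_template(prompt, return_tensors="pt").to('cuda')

out = model.generate(inputs, max_new_tokens=50)  # no synced_gpus needed with a single process
print(tokenizer.decode(out[0], skip_special_tokens=True))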

The error can be traced back to the model.generate() call:

Error: AssertionError in python3.10/site-packages/bitsandbytes/nn/modules.py, in fix_4bit_weight_quant_state_from_module:

assert module.weight.shape[1] == 1
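The assertion appears to check that the quantized weight is still in the packed (N, 1) layout bitsandbytes uses for 4-bit parameters. A small diagnostic sketch (my own, not part of the traceback) to print what the failing process sees for one quantized layer:

import bitsandbytes as bnb

# Print the stored weight shape of the first 4-bit linear layer; with load_in_4bit
# the packed weight is expected to have a second dimension of 1.
for name, module in model.named_modules():
    if isinstance(module, bnb.nn.Linear4bit):
        print(name, tuple(module.weight.shape), module.weight.dtype)
        break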

Expected behavior

I expect the model to execute generation without error, as it does in the single-GPU setup.
