
Inference with model.generate() using a quantized model leads to assertion error #39311

@Sandipan99

Description


System Info

Linux
transformers==4.52.4
bitsandbytes==0.46.1

Who can help?

No response

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

m = "microsoft/phi-4"

bnb_config = BitsAndBytesConfig(
load_in_4bit=True
)

tokenizer = AutoTokenizer.from_pretrained(m)
model = AutoModelForCausalLM.from_pretrained(m, quantization_config=bnb_config, device_map='auto')
tokenizer.pad_token_id = tokenizer.eos_token_id

inputs = tokenizer.apply_chat_template(prompt, return_tensors="pt").to('cuda')
out = model.generate(inputs, max_new_tokens=50, synced_gpus=True)

Run with:

torchrun --nproc-per-node=2 script.py

This works perfectly fine on a single-GPU setup, but produces an assertion error when running on multiple GPUs.
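For reference, a minimal sketch of the single-GPU variant that works (my assumptions: same model/tokenizer setup as above, launched directly with python script.py, no torchrun and no synced_gpus since there is only one process):

# Single-GPU baseline (assumed variant): launched with `python script.py` instead of torchrun.
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

m = "microsoft/phi-4"
bnb_config = BitsAndBytesConfig(load_in_4bit=True)

tokenizer = AutoTokenizer.from_pretrained(m)
model = AutoModelForCausalLM.from_pretrained(m, quantization_config=bnb_config, device_map='auto')
tokenizer.pad_token_id = tokenizer.eos_token_id

prompt = [{"role": "user", "content": "Hello"}]  # placeholder prompt
inputs = tokenizer.apply_chat_template(prompt, return_tensors="pt").to('cuda')

out = model.generate(inputs, max_new_tokens=50)  # no synced_gpus needed with a single process
print(tokenizer.decode(out[0], skip_special_tokens=True))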

The error can be traced back to the model.generate() call:

Error: AssertionError in python3.10/site-packages/bitsandbytes/nn/modules.py, in fix_4bit_weight_quant_state_from_module:

assert module.weight.shape[1] == 1
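The assertion appears to check that the quantized weight is still in the packed (N, 1) layout bitsandbytes uses for 4-bit parameters. A small diagnostic sketch (my own, not part of the traceback) to print what the failing process sees for one quantized layer:

import bitsandbytes as bnb

# Print the stored weight shape of the first 4-bit linear layer; with load_in_4bit
# the packed weight is expected to have a second dimension of 1.
for name, module in model.named_modules():
    if isinstance(module, bnb.nn.Linear4bit):
        print(name, tuple(module.weight.shape), module.weight.dtype)
        break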

Expected behavior

I expect the model to execute generation without error, as it does in the single-GPU setup.
