System Info
Env:

```
torch         2.7.1   pypi_0  pypi
torchvision   0.22.1  pypi_0  pypi
tqdm          4.67.1  pypi_0  pypi
transformers  4.51.0  pypi_0  pypi
pillow        11.2.1  pypi_0  pypi
```
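For completeness, a small sketch (only the package names listed above are assumed) that prints the installed versions:

```python
# Print the installed versions of the packages listed above.
import importlib.metadata as md

for pkg in ["torch", "torchvision", "tqdm", "transformers", "pillow"]:
    print(pkg, md.version(pkg))
```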
My code:

```python
import torch
from transformers import AutoProcessor, Llama4ForConditionalGeneration

model_id = "meta-llama/Llama-4-Scout-17B-16E-Instruct"

processor = AutoProcessor.from_pretrained(model_id)
model = Llama4ForConditionalGeneration.from_pretrained(
    model_id,
    attn_implementation="eager",
    device_map="auto",
    torch_dtype=torch.bfloat16,
    cache_dir=cache_dir,  # cache_dir is a local path defined elsewhere
)

messages = [...]  # chat conversation (omitted)

inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to("cuda")
# print(inputs)

outputs = model.generate(
    **inputs,
    max_new_tokens=128,
)
responses = processor.batch_decode(outputs[:, inputs["input_ids"].shape[-1]:])
```
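To make the generation speed easy to report, here is a minimal timing sketch around the same `generate()` call (it assumes only the `model` and `inputs` built above and prints tokens per second):

```python
import time

# Time a single generate() call and report tokens/s, so different
# attn_implementation settings can be compared on the same prompt.
torch.cuda.synchronize()
start = time.perf_counter()
with torch.inference_mode():
    outputs = model.generate(**inputs, max_new_tokens=128)
torch.cuda.synchronize()
elapsed = time.perf_counter() - start

new_tokens = outputs.shape[-1] - inputs["input_ids"].shape[-1]
print(f"{new_tokens} new tokens in {elapsed:.1f}s ({new_tokens / elapsed:.2f} tokens/s)")
```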
If I use `flex_attention` when loading the model, I get this error (#37323):

TypeError: pad(): argument 'pad' failed to unpack the object at pos 2 with error "type must be tuple of ints,but got NoneType"

However, if I use `eager` or `sdpa` attention, the inference speed is unreasonably slow: around 30 minutes per question.
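For reference, the only change relative to the loading code above is the `attn_implementation` argument; this variant (a sketch of the same call with nothing else changed) is the one that triggers the TypeError:

```python
# Identical to the loading call above except for attn_implementation;
# this configuration is the one that raises the pad() TypeError.
model = Llama4ForConditionalGeneration.from_pretrained(
    model_id,
    attn_implementation="flex_attention",
    device_map="auto",
    torch_dtype=torch.bfloat16,
)
```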
Who can help?
No response
Information
- The official example scripts
- My own modified scripts
Tasks
- An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
- My own task or dataset (give details below)
Reproduction
The environment and code above should reproduce the error and the slow inference.
Expected behavior
Inference should be significantly faster than ~30 minutes per question, and `flex_attention` should load and run without the error above.