
Super slow inference using eager attention with Llama-4-Scout-17B-16E-Instruct #38866

@Tizzzzy

Description

System Info

Env:

torch                     2.7.1                    pypi_0    pypi
torchvision               0.22.1                   pypi_0    pypi
tqdm                      4.67.1                   pypi_0    pypi
transformers              4.51.0                   pypi_0    pypi
pillow                    11.2.1                   pypi_0    pypi

My code:

import torch
from transformers import AutoProcessor, Llama4ForConditionalGeneration

model_id = "meta-llama/Llama-4-Scout-17B-16E-Instruct"

processor = AutoProcessor.from_pretrained(model_id)
model = Llama4ForConditionalGeneration.from_pretrained(
    model_id,
    attn_implementation="eager",
    device_map="auto",
    torch_dtype=torch.bfloat16,
    cache_dir=cache_dir,  # assumes a cache_dir variable defined earlier
)

messages = [...]  # chat messages (list of role/content dicts), elided here

inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to("cuda")

# print(inputs)

outputs = model.generate(
    **inputs,
    max_new_tokens=128,
)

responses = processor.batch_decode(outputs[:, inputs["input_ids"].shape[-1]:])
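To put a number on the generation speed, one can wrap the `generate()` call with a timer and compute tokens per second. This is a minimal sketch; `tokens_per_second` is a hypothetical helper for illustration, not part of `transformers`:

```python
import time

def tokens_per_second(new_tokens: int, elapsed_s: float) -> float:
    """Throughput of a generate() call: new tokens produced per wall-clock second."""
    if elapsed_s <= 0:
        raise ValueError("elapsed time must be positive")
    return new_tokens / elapsed_s

# Usage around the generate() call above:
#   start = time.perf_counter()
#   outputs = model.generate(**inputs, max_new_tokens=128)
#   elapsed = time.perf_counter() - start
#   new_tokens = outputs.shape[-1] - inputs["input_ids"].shape[-1]
#   print(f"{tokens_per_second(new_tokens, elapsed):.3f} tok/s")

# 128 new tokens in ~30 minutes works out to well under 0.1 tok/s:
rate = tokens_per_second(128, 30 * 60)
```

Reporting a tok/s figure (rather than "minutes per question") makes it easier for maintainers to judge whether the slowdown is in prefill or decode.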

If I load the model with attn_implementation="flex_attention", I get this error:

TypeError: pad(): argument 'pad' failed to unpack the object at pos 2 with error "type must be tuple of ints,but got NoneType" #37323

However, with eager or sdpa attention, inference is unreasonably slow: roughly 30 minutes per question.
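Until flex_attention works, one pragmatic workaround is to try the faster backends first and fall back to eager only when loading fails. The sketch below is an assumption, not code from this issue: `load` stands in for a wrapper around `from_pretrained`, and the backend names are the standard `attn_implementation` values in recent transformers releases:

```python
def pick_attn_implementation(load, candidates=("flash_attention_2", "sdpa", "eager")):
    """Try each attention backend in order; return the first (name, model) that loads."""
    errors = {}
    for impl in candidates:
        try:
            # e.g. load = lambda impl: Llama4ForConditionalGeneration.from_pretrained(
            #          model_id, attn_implementation=impl, device_map="auto", ...)
            return impl, load(impl)
        except Exception as exc:  # load-time failure (missing package, unsupported backend)
            errors[impl] = exc
    raise RuntimeError(f"no attention backend loaded: {errors}")
```

Note that flash_attention_2 additionally requires the flash-attn package to be installed; if it is missing, the loop above simply falls through to sdpa.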

Who can help?

No response

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

The environment and code above should reproduce the issue.

Expected behavior

Inference with eager or sdpa attention should be substantially faster than ~30 minutes per 128-token response.
