System Info
Env:

```
torch         2.7.1   pypi_0  pypi
torchvision   0.22.1  pypi_0  pypi
tqdm          4.67.1  pypi_0  pypi
transformers  4.51.0  pypi_0  pypi
pillow        11.2.1  pypi_0  pypi
```
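For completeness, a small sketch (only the package names listed above are assumed) that prints the installed versions:

```python
# Print the installed versions of the packages listed above.
import importlib.metadata as md

for pkg in ["torch", "torchvision", "tqdm", "transformers", "pillow"]:
    print(pkg, md.version(pkg))
```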
My code:

```python
import torch
from transformers import AutoProcessor, Llama4ForConditionalGeneration

model_id = "meta-llama/Llama-4-Scout-17B-16E-Instruct"

processor = AutoProcessor.from_pretrained(model_id)
model = Llama4ForConditionalGeneration.from_pretrained(
    model_id,
    attn_implementation="eager",
    device_map="auto",
    torch_dtype=torch.bfloat16,
    cache_dir=cache_dir,  # cache_dir is a local path defined elsewhere
)

messages = [...]  # chat conversation (omitted)

inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to("cuda")
# print(inputs)

outputs = model.generate(
    **inputs,
    max_new_tokens=128,
)
responses = processor.batch_decode(outputs[:, inputs["input_ids"].shape[-1]:])
```
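To make the generation speed easy to report, here is a minimal timing sketch around the same `generate()` call (it assumes only the `model` and `inputs` built above and prints tokens per second):

```python
import time

# Time a single generate() call and report tokens/s, so different
# attn_implementation settings can be compared on the same prompt.
torch.cuda.synchronize()
start = time.perf_counter()
with torch.inference_mode():
    outputs = model.generate(**inputs, max_new_tokens=128)
torch.cuda.synchronize()
elapsed = time.perf_counter() - start

new_tokens = outputs.shape[-1] - inputs["input_ids"].shape[-1]
print(f"{new_tokens} new tokens in {elapsed:.1f}s ({new_tokens / elapsed:.2f} tokens/s)")
```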
If I use `flex_attention` when loading the model, I get this error (#37323):

TypeError: pad(): argument 'pad' failed to unpack the object at pos 2 with error "type must be tuple of ints,but got NoneType"

However, if I use `eager` or `sdpa` attention, the inference speed is unreasonably slow: around 30 minutes per question.
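For reference, the only change relative to the loading code above is the `attn_implementation` argument; this variant (a sketch of the same call with nothing else changed) is the one that triggers the TypeError:

```python
# Identical to the loading call above except for attn_implementation;
# this configuration is the one that raises the pad() TypeError.
model = Llama4ForConditionalGeneration.from_pretrained(
    model_id,
    attn_implementation="flex_attention",
    device_map="auto",
    torch_dtype=torch.bfloat16,
)
```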
Who can help?
No response
Information
- The official example scripts
- My own modified scripts
Tasks
- An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
- My own task or dataset (give details below)
Reproduction
The environment and code above should reproduce the error and the slow inference.
Expected behavior
Inference should be significantly faster than ~30 minutes per question, and `flex_attention` should load and run without the error above.