While analyzing the self-attention implementation in HuggingFace Transformers and comparing it to vLLM, I noticed that after the QKV projection step the model performs a memory-resharding step involving three separate `Tensor.contiguous()` calls to match the memory layout expected by the attention kernel. This introduces significant overhead, especially at small batch sizes, where kernel launches and memory movement are relatively more expensive.
Replacing the multiple `contiguous()` calls with a single fused operation could significantly reduce latency and improve runtime efficiency. The behavior is easy to reproduce by inspecting the self-attention forward pass at a small batch size (e.g., batch=1, seqlen=128, GPT-2); a possible fused alternative and a minimal reproduction are sketched after the snippet below.
```python
# models/gpt2/modeling_gpt2.py:L298
query_states = query_states.view(shape_q).transpose(1, 2)
key_states = key_states.view(shape_kv).transpose(1, 2)
value_states = value_states.view(shape_kv).transpose(1, 2)
```
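For reference, here is a minimal sketch of what a fused alternative could look like, assuming the projection produces a packed (batch, seq_len, 3 * num_heads * head_dim) tensor as GPT-2's c_attn does. `split_qkv_fused` is a hypothetical helper, not an existing Transformers API; the idea is simply that one view + permute + contiguous copy over the packed tensor replaces three view/transpose pairs and their downstream contiguous() calls.

```python
# Sketch only: `split_qkv_fused` is a hypothetical helper, not part of Transformers.
import torch

def split_qkv_fused(qkv: torch.Tensor, num_heads: int, head_dim: int):
    # qkv: (batch, seq_len, 3 * num_heads * head_dim), e.g. the packed output of GPT-2's c_attn.
    batch, seq_len, _ = qkv.shape
    # One view + one permute + one contiguous copy for all three tensors,
    # instead of a separate view/transpose (and a later contiguous()) per tensor.
    qkv = qkv.view(batch, seq_len, 3, num_heads, head_dim)
    qkv = qkv.permute(2, 0, 3, 1, 4).contiguous()  # (3, batch, num_heads, seq_len, head_dim)
    query_states, key_states, value_states = qkv.unbind(0)
    return query_states, key_states, value_states
```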
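And a minimal reproduction sketch for the batch=1, seqlen=128 case mentioned above. The model checkpoint, iteration counts, and CUDA-event timing are my own choices, and it assumes a CUDA GPU with the transformers package installed:

```python
# Minimal reproduction sketch (assumes a CUDA GPU and the `transformers` package).
import torch
from transformers import GPT2LMHeadModel

model = GPT2LMHeadModel.from_pretrained("gpt2").eval().to("cuda")
input_ids = torch.randint(0, model.config.vocab_size, (1, 128), device="cuda")

with torch.no_grad():
    for _ in range(5):  # warm-up
        model(input_ids)
    torch.cuda.synchronize()

    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(50):
        model(input_ids)
    end.record()
    torch.cuda.synchronize()
    print(f"avg forward latency: {start.elapsed_time(end) / 50:.2f} ms")

    # A torch.profiler trace around a single forward pass shows the extra
    # reshape/copy kernels emitted around the attention call.
```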