
Inefficient memory resharding in attention layer #39072

@null-pointer-access

Description


While analyzing the self-attention implementation in HuggingFace Transformers and comparing it against vLLM, I noticed that after the QKV projection step the model reshards the query, key, and value tensors to match the memory layout expected by the attention kernel, which ends up triggering three separate .contiguous() calls (one per tensor). This introduces significant overhead, especially at small batch sizes, where kernel launches and memory movement are relatively more expensive.
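
For context, a minimal standalone sketch of the underlying mechanism (shapes are illustrative, not taken from the GPT-2 config): transpose(1, 2) only permutes strides, so the resulting view is non-contiguous, and a later .contiguous() call has to materialize a full copy of each of the three tensors.

# Illustrative standalone example, not repo code
import torch

q = torch.randn(1, 128, 12, 64)        # (batch, seq_len, num_heads, head_dim)
q_t = q.transpose(1, 2)                # stride permutation only, no data movement yet
print(q_t.is_contiguous())             # False
q_c = q_t.contiguous()                 # the actual copy happens here
print(q_c.data_ptr() == q.data_ptr())  # False: freshly allocated storage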

Replacing the three contiguous() calls with a single fused reshard could significantly reduce latency and improve runtime efficiency. The behavior is easy to reproduce by inspecting (or profiling) the self-attention forward pass with a small input (e.g., batch=1, seqlen=128, GPT-2); the relevant lines and a couple of sketches follow below.

# models/gpt2/modeling_gpt2.py:L298
query_states = query_states.view(shape_q).transpose(1, 2)
key_states = key_states.view(shape_kv).transpose(1, 2)
value_states = value_states.view(shape_kv).transpose(1, 2)
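
A sketch of how I would reproduce this with torch.profiler (assumes a CUDA device and a standard transformers install):

# Hypothetical reproduction script, not part of the repo
import torch
from transformers import GPT2Model

model = GPT2Model.from_pretrained("gpt2").eval().cuda()
input_ids = torch.randint(0, model.config.vocab_size, (1, 128), device="cuda")  # batch=1, seqlen=128

with torch.no_grad(), torch.profiler.profile(
    activities=[torch.profiler.ProfilerActivity.CPU, torch.profiler.ProfilerActivity.CUDA],
) as prof:
    model(input_ids)

# Look for aten::contiguous / aten::copy_ entries originating from the attention forward pass
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=20))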
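
One possible reading of the fused alternative, sketched against a packed QKV projection output (variable names and shapes here are mine, not the actual modeling code): do a single view + permute over the packed tensor and pay for at most one contiguous copy instead of three.

# Hypothetical sketch of a fused reshard; not the current modeling_gpt2.py code
import torch

batch, seq_len, num_heads, head_dim = 1, 128, 12, 64
hidden = num_heads * head_dim
qkv = torch.randn(batch, seq_len, 3 * hidden)            # packed output of a fused QKV projection

qkv = qkv.view(batch, seq_len, 3, num_heads, head_dim)
qkv = qkv.permute(2, 0, 3, 1, 4).contiguous()            # single copy -> (3, batch, heads, seq, head_dim)
query_states, key_states, value_states = qkv.unbind(0)   # each: (batch, num_heads, seq_len, head_dim), contiguous

The point is only that the layout change can be expressed as one data movement over the packed tensor rather than three independent ones; whether a dedicated fused operator is the right mechanism is up to the maintainers.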
