While analyzing the self-attention implementation in HuggingFace Transformers and comparing it to vLLM, I noticed that after the QKV projection step the model performs a memory-resharding step involving three separate `Tensor.contiguous()` calls to match the memory layout expected by the attention kernel. This introduces significant overhead, especially at small batch sizes, where kernel launches and memory movement are relatively more expensive.
Replacing the multiple `contiguous()` calls with a single fused operation could significantly reduce latency and improve runtime efficiency. The behavior is easy to reproduce by inspecting the self-attention forward pass at a small batch size (e.g., batch=1, seqlen=128, GPT-2); a possible fused alternative and a minimal reproduction are sketched after the snippet below.
```python
# models/gpt2/modeling_gpt2.py:L298
query_states = query_states.view(shape_q).transpose(1, 2)
key_states = key_states.view(shape_kv).transpose(1, 2)
value_states = value_states.view(shape_kv).transpose(1, 2)
```
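For reference, here is a minimal sketch of what a fused alternative could look like, assuming the projection produces a packed (batch, seq_len, 3 * num_heads * head_dim) tensor as GPT-2's c_attn does. `split_qkv_fused` is a hypothetical helper, not an existing Transformers API; the idea is simply that one view + permute + contiguous copy over the packed tensor replaces three view/transpose pairs and their downstream contiguous() calls.

```python
# Sketch only: `split_qkv_fused` is a hypothetical helper, not part of Transformers.
import torch

def split_qkv_fused(qkv: torch.Tensor, num_heads: int, head_dim: int):
    # qkv: (batch, seq_len, 3 * num_heads * head_dim), e.g. the packed output of GPT-2's c_attn.
    batch, seq_len, _ = qkv.shape
    # One view + one permute + one contiguous copy for all three tensors,
    # instead of a separate view/transpose (and a later contiguous()) per tensor.
    qkv = qkv.view(batch, seq_len, 3, num_heads, head_dim)
    qkv = qkv.permute(2, 0, 3, 1, 4).contiguous()  # (3, batch, num_heads, seq_len, head_dim)
    query_states, key_states, value_states = qkv.unbind(0)
    return query_states, key_states, value_states
```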
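And a minimal reproduction sketch for the batch=1, seqlen=128 case mentioned above. The model checkpoint, iteration counts, and CUDA-event timing are my own choices, and it assumes a CUDA GPU with the transformers package installed:

```python
# Minimal reproduction sketch (assumes a CUDA GPU and the `transformers` package).
import torch
from transformers import GPT2LMHeadModel

model = GPT2LMHeadModel.from_pretrained("gpt2").eval().to("cuda")
input_ids = torch.randint(0, model.config.vocab_size, (1, 128), device="cuda")

with torch.no_grad():
    for _ in range(5):  # warm-up
        model(input_ids)
    torch.cuda.synchronize()

    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(50):
        model(input_ids)
    end.record()
    torch.cuda.synchronize()
    print(f"avg forward latency: {start.elapsed_time(end) / 50:.2f} ms")

    # A torch.profiler trace around a single forward pass shows the extra
    # reshape/copy kernels emitted around the attention call.
```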