
GPT2 attention scaling config is ignored when using SDPA / FlashAttention backends #44380

@Qi-Zhan

Description

System Info

None

Who can help?

@ArthurZucker Hi, I'm new to LLMs and am currently learning the GPT2 model. I found the following:

The GPT2 attention configuration options:
• scale_attn_weights
• scale_attn_by_inverse_layer_idx

are respected in eager attention mode but silently ignored when using AttentionInterface backends such as "sdpa" or "flash_attention_2".

In eager mode, the scaling logic is applied inside eager_attention_forward:
• division by sqrt(head_dim) if scale_attn_weights=True
• division by (layer_idx+1) if scale_attn_by_inverse_layer_idx=True

However, when using SDPA, the traced call is:

```python
torch._C._nn.scaled_dot_product_attention(query_states_3, key_states_3, value_states_3, attn_mask=attention_mask_1, dropout_p=0.0, scale=None, is_causal=False)
```

which appears to ignore the configuration above.

I realize that the default configuration (scale_attn_weights=True, scale_attn_by_inverse_layer_idx=False) produces the same results, since SDPA's scale=None defaults to 1/sqrt(head_dim), so I'm not sure whether this is intentional or should be considered a bug.
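To make the discrepancy concrete, here is a minimal NumPy sketch (my own illustration, not the actual transformers or PyTorch code): it applies the two eager-mode divisions separately, then shows that they can be folded into the single multiplicative `scale` that SDPA-style attention takes, while `scale=None` (only 1/sqrt(head_dim)) drops the layer-index factor.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def eager_attention(q, k, v, layer_idx):
    # eager_attention_forward-style scaling: two separate divisions
    scores = q @ k.T / np.sqrt(q.shape[-1])   # scale_attn_weights=True
    scores = scores / (layer_idx + 1)         # scale_attn_by_inverse_layer_idx=True
    return softmax(scores) @ v

def sdpa_like(q, k, v, scale=None):
    # softmax(Q K^T * scale) V; scale=None defaults to 1/sqrt(head_dim),
    # mirroring torch's scaled_dot_product_attention
    if scale is None:
        scale = 1.0 / np.sqrt(q.shape[-1])
    return softmax(q @ k.T * scale) @ v

rng = np.random.default_rng(0)
seq, head_dim, layer_idx = 4, 8, 3
q, k, v = (rng.standard_normal((seq, head_dim)) for _ in range(3))

eager = eager_attention(q, k, v, layer_idx)
sdpa_default = sdpa_like(q, k, v)  # what the backend does today (scale=None)
sdpa_folded = sdpa_like(q, k, v, scale=1.0 / (np.sqrt(head_dim) * (layer_idx + 1)))

assert np.allclose(eager, sdpa_folded)       # folding both factors matches eager
assert not np.allclose(eager, sdpa_default)  # scale=None drops the layer_idx factor
```

With the default config the two paths agree, which is why the bug is invisible for stock GPT2 checkpoints; it only surfaces when a non-default scaling flag is set.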

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

None

Expected behavior

Different attention implementations should produce semantically equivalent results and respect model configuration parameters.
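One possible direction for a fix: since scaled_dot_product_attention already accepts an explicit `scale` argument, both config flags could be folded into it instead of passing scale=None. A hedged sketch, where `combined_attn_scale` is a hypothetical helper name, not the actual transformers patch:

```python
import math

def combined_attn_scale(head_dim, scale_attn_weights,
                        scale_attn_by_inverse_layer_idx, layer_idx):
    """Single multiplicative scale reproducing eager-mode's two divisions."""
    scale = 1.0
    if scale_attn_weights:
        scale /= math.sqrt(head_dim)   # eager's division by sqrt(head_dim)
    if scale_attn_by_inverse_layer_idx:
        scale /= (layer_idx + 1)       # eager's division by (layer_idx + 1)
    return scale

# Default GPT2 config with head_dim=64: identical to SDPA's scale=None default
print(combined_attn_scale(64, True, False, 0))  # → 0.125
# scale_attn_by_inverse_layer_idx=True on layer 3: scale=None would be wrong
print(combined_attn_scale(64, True, True, 3))   # → 0.03125
```

The same value could be passed to FlashAttention backends, so all implementations would respect the configuration.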
