System Info
None
Who can help?
@ArthurZucker Hi, I'm new to LLMs and currently learning the GPT-2 model. I found that the GPT-2 attention configuration options:
• scale_attn_weights
• scale_attn_by_inverse_layer_idx
are respected in eager attention mode but silently ignored when using AttentionInterface backends such as "sdpa" or "flash_attention_2".
In eager mode, the scaling logic is applied inside eager_attention_forward:
• division by sqrt(head_dim) if scale_attn_weights=True
• division by (layer_idx + 1) if scale_attn_by_inverse_layer_idx=True
However, when using "sdpa", the traced call is:

```python
torch._C._nn.scaled_dot_product_attention(query_states_3, key_states_3, value_states_3, attn_mask=attention_mask_1, dropout_p=0.0, scale=None, is_causal=False)
```

which seems to ignore the above config options.
I realize that the default configuration (scale_attn_weights=True, scale_attn_by_inverse_layer_idx=False) happens to produce the same results, so I'm not sure whether this is intentional or should be considered a bug.
Information
Tasks
Reproduction
None
Expected behavior
Different attention implementations should produce semantically equivalent results and respect model configuration parameters.