10 changes: 7 additions & 3 deletions src/transformers/models/falcon/modeling_falcon.py
@@ -892,13 +892,17 @@ def forward(
Each element of `past_key_values` is a tuple (past_key, past_value):
- past_key: [batch_size * num_heads, head_dim, kv_length]
- past_value: [batch_size * num_heads, kv_length, head_dim]
attention_mask (`torch.FloatTensor` of shape `(batch_size, sequence_length)`, *optional*):
attention_mask (`torch.Tensor` of 2D shape `(batch_size, sequence_length)`
or 4D shape `(batch_size, heads, sequence_length, total_sequence_length)`, *optional*):
Mask to avoid performing attention on padding token indices. Mask values selected in `[0, 1]`:

- 1 for tokens that are **not masked**,
- 0 for tokens that are **masked**.

[What are attention masks?](../glossary#attention-mask)

The attention mask may be supplied in 4D shape for finer control of attention patterns within sequences.
In that case, the `position_ids` parameter must be customized accordingly.
position_ids (`torch.LongTensor` of shape `(batch_size, sequence_length)`, *optional*):
Indices of positions of each input sequence tokens in the position embeddings. Selected in the range `[0,
config.n_positions - 1]`.
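An illustrative sketch of the kind of 4D mask this docstring describes, not code from the PR: two sequences packed into a single batch row, each restricted to causal attention within its own segment. The segment lengths and the 0/1 convention follow the docstring above; a size-1 heads dimension is assumed to broadcast over all heads.

import torch

# Assumed layout: one batch row holding two packed segments of lengths 3 and 2.
seq_lens = [3, 2]
total_len = sum(seq_lens)                           # no cache, so query length == key/value length
mask_4d = torch.zeros(1, 1, total_len, total_len)   # (batch_size, heads or 1, q_len, kv_len)

start = 0
for length in seq_lens:
    # Causal block: each token attends only to earlier tokens of its own segment.
    mask_4d[0, 0, start:start + length, start:start + length] = torch.tril(torch.ones(length, length))
    start += length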
@@ -1094,7 +1098,7 @@ def forward(
position_ids = position_ids.unsqueeze(0)

if self._use_flash_attention_2:
# 2d mask is passed through the layers
# the original 2d or 4d mask is passed through the layers
attention_mask = attention_mask if (attention_mask is not None and 0 in attention_mask) else None
elif self._use_sdpa and not output_attentions:
# output_attentions=True can not be supported when using SDPA, and we fall back on
@@ -1137,7 +1141,7 @@ def forward(
attention_mask, (batch_size, seq_length), inputs_embeds, past_key_values_length
)
else:
# 4d mask is passed through the layers
# a modified 4d mask is passed through the layers
attention_mask = _prepare_4d_causal_attention_mask(
attention_mask, (batch_size, seq_length), inputs_embeds, past_key_values_length
)
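A rough illustration of what "a modified 4d mask" can mean in practice, assuming the same 0/1 convention as the docstring: the keep-mask is inverted and blocked positions are filled with the most negative representable value, so that adding the mask to the attention scores removes them after the softmax. The helper below is hypothetical; the exact behaviour is defined by `_prepare_4d_causal_attention_mask`.

import torch

def to_additive_mask(mask_4d: torch.Tensor, dtype: torch.dtype = torch.float32) -> torch.Tensor:
    # Hypothetical helper, not from the PR: 1 -> 0.0 (attend), 0 -> dtype minimum (blocked).
    inverted = 1.0 - mask_4d.to(dtype)
    return inverted.masked_fill(inverted.bool(), torch.finfo(dtype).min)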
10 changes: 7 additions & 3 deletions src/transformers/models/llama/modeling_llama.py
@@ -874,7 +874,8 @@ def _init_weights(self, module):
[`PreTrainedTokenizer.__call__`] for details.

[What are input IDs?](../glossary#input-ids)
attention_mask (`torch.Tensor` of shape `(batch_size, sequence_length)`, *optional*):
attention_mask (`torch.Tensor` of 2D shape `(batch_size, sequence_length)`
or 4D shape `(batch_size, heads, sequence_length, total_sequence_length)`, *optional*):
Mask to avoid performing attention on padding token indices. Mask values selected in `[0, 1]`:

- 1 for tokens that are **not masked**,
@@ -885,6 +886,9 @@ def _init_weights(self, module):
Indices can be obtained using [`AutoTokenizer`]. See [`PreTrainedTokenizer.encode`] and
[`PreTrainedTokenizer.__call__`] for details.

The attention mask may be supplied in 4D shape for finer control of attention patterns within sequences.
In that case, the `position_ids` parameter must be customized accordingly.

If `past_key_values` is used, optionally only the last `input_ids` have to be input (see
`past_key_values`).

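Since a 4D mask no longer encodes a single left-to-right sequence, the `position_ids` need to describe the intended layout explicitly. An illustrative sketch for the packed two-segment example used above, with positions restarting at every segment (segment lengths are assumed):

import torch

seq_lens = [3, 2]                                                   # assumed packed segment lengths
position_ids = torch.cat([torch.arange(n) for n in seq_lens]).unsqueeze(0)
# tensor([[0, 1, 2, 0, 1]]) -- positions restart with each packed segment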
@@ -1025,7 +1029,7 @@ def forward(
inputs_embeds = self.embed_tokens(input_ids)

if self._use_flash_attention_2:
# 2d mask is passed through the layers
# the original 2d or 4d mask is passed through the layers
attention_mask = attention_mask if (attention_mask is not None and 0 in attention_mask) else None
elif self._use_sdpa and not output_attentions:
# output_attentions=True can not be supported when using SDPA, and we fall back on
@@ -1037,7 +1041,7 @@ def forward(
past_key_values_length,
)
else:
# 4d mask is passed through the layers
# a modified 4d mask is passed through the layers
attention_mask = _prepare_4d_causal_attention_mask(
attention_mask, (batch_size, seq_length), inputs_embeds, past_key_values_length
)
6 changes: 5 additions & 1 deletion src/transformers/models/xglm/modeling_xglm.py
@@ -68,13 +68,17 @@
[`PreTrainedTokenizer.__call__`] for details.

[What are input IDs?](../glossary#input-ids)
attention_mask (`torch.Tensor` of shape `(batch_size, sequence_length)`, *optional*):
attention_mask (`torch.Tensor` of 2D shape `(batch_size, sequence_length)`
or 4D shape `(batch_size, heads, sequence_length, total_sequence_length)`, *optional*):
Mask to avoid performing attention on padding token indices. Mask values selected in `[0, 1]`:

- 1 for tokens that are **not masked**,
- 0 for tokens that are **masked**.

[What are attention masks?](../glossary#attention-mask)

The attention mask may be supplied in 4D shape for finer control of attention patterns within sequences.
In that case, the `position_ids` parameter must be customized accordingly.
position_ids (`torch.LongTensor` of shape `(batch_size, sequence_length)`, *optional*):
Indices of positions of each input sequence tokens in the position embeddings. Selected in the range `[0,
config.max_position_embeddings - 1]`.
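Putting the pieces together, a call that exercises the new code paths might look like the sketch below. The checkpoint name and token ids are placeholders, and whether a 0/1 mask or an already-additive float mask is expected is ultimately decided by the mask-preparation helpers; this is a usage illustration, not a test from the PR.

import torch
from transformers import AutoModelForCausalLM

# Two segments of lengths 3 and 2 packed into one batch row (same layout as the sketches above).
seq_lens = [3, 2]
total_len = sum(seq_lens)
mask_4d = torch.zeros(1, 1, total_len, total_len)
start = 0
for length in seq_lens:
    mask_4d[0, 0, start:start + length, start:start + length] = torch.tril(torch.ones(length, length))
    start += length
position_ids = torch.cat([torch.arange(n) for n in seq_lens]).unsqueeze(0)

# Placeholder checkpoint and token ids -- any of the models touched by this PR would do.
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")
input_ids = torch.tensor([[101, 102, 103, 201, 202]])
with torch.no_grad():
    outputs = model(input_ids=input_ids, attention_mask=mask_4d, position_ids=position_ids)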