Roberta's Positional Embedding Offset #5285

Closed
h324yang opened this issue Jun 25, 2020 · 4 comments
@h324yang
num_embeddings += padding_idx + 1 # WHY?

positions = create_position_ids_from_input_ids(input, self.padding_idx)

So this offset is added because the function create_position_ids_from_input_ids shifts the position ids by padding_idx + 1. However, I wonder whether other models should also include this offset.
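
For context, that helper behaves roughly as follows (a simplified sketch of create_position_ids_from_input_ids; the token ids below are just for illustration):

```python
import torch

def create_position_ids_from_input_ids(input_ids, padding_idx):
    # Non-pad tokens get positions padding_idx + 1, padding_idx + 2, ...
    # while pad tokens all get position padding_idx itself.
    mask = input_ids.ne(padding_idx).int()
    incremental_indices = torch.cumsum(mask, dim=1) * mask
    return incremental_indices.long() + padding_idx

input_ids = torch.tensor([[0, 31414, 232, 2, 1, 1]])  # 1 = RoBERTa's pad token id
print(create_position_ids_from_input_ids(input_ids, padding_idx=1))
# tensor([[2, 3, 4, 5, 1, 1]])
# For a sequence with no padding, the largest position id is seq_len + padding_idx,
# so the embedding table needs max_positions + padding_idx + 1 rows -- hence the offset.
```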

config.max_position_embeddings, config.hidden_size, padding_idx=self.padding_idx

For instance, when I am using Longformer, it looks like the offset is not added the way it is in Roberta, so I need to add such an offset to config.max_position_embeddings myself.
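
As a concrete illustration of that point (the 4096 target length below is only an example, not from this thread): roberta-base ships with max_position_embeddings = 514 for a 512-token context because padding_idx = 1, so an extended config needs the same padding_idx + 1 headroom.

```python
from transformers import RobertaConfig

padding_idx = 1          # RoBERTa's pad token id
target_seq_len = 4096    # hypothetical Longformer-style context length

# roberta-base uses max_position_embeddings = 514, i.e. 512 + padding_idx + 1,
# so an extended model must reserve the same extra rows:
config = RobertaConfig(max_position_embeddings=target_seq_len + padding_idx + 1)  # 4098
```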

@sshleifer
Contributor

That's certainly possible. As you can see from my comment and PR #5188, I don't fully understand the motivation for the offset. It is very tricky.

sshleifer changed the title from "Positional Embedding Offset" to "Roberta's Positional Embedding Offset" on Jun 25, 2020
@cccntu
Contributor

cccntu commented Aug 26, 2020

I figured out why. See facebookresearch/fairseq#1177.
Basically, the purpose is to make the positional embedding equal 0 at padding positions (positions where the token is the padding token), using the padding_idx parameter of torch.nn.Embedding.
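
For reference, this is the torch.nn.Embedding behaviour being relied on (a minimal standalone sketch, not the library code):

```python
import torch
import torch.nn as nn

padding_idx = 1
emb = nn.Embedding(num_embeddings=10, embedding_dim=4, padding_idx=padding_idx)

print(emb.weight[padding_idx])           # the row at padding_idx is initialized to zeros
print(emb(torch.tensor([padding_idx])))  # so lookups at padding positions return zeros
# That row also receives no gradient updates, so it stays zero during training.
```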

I think we could simply use masked_fill() to set the positional embedding to 0 at padding positions, so the code would be easier to understand (no need for the offset).
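
A minimal sketch of that alternative (variable names and shapes are illustrative, not the actual transformers code):

```python
import torch
import torch.nn as nn

padding_idx, hidden_size, max_positions = 1, 8, 16
input_ids = torch.tensor([[0, 31414, 232, 2, 1, 1]])   # 1 = pad token id
pos_emb = nn.Embedding(max_positions, hidden_size)      # no padding_idx + 1 offset

# Plain 0..seq_len-1 positions, then zero out the rows at padding positions.
positions = torch.arange(input_ids.size(1), device=input_ids.device)
embeddings = pos_emb(positions).unsqueeze(0).expand(input_ids.size(0), -1, -1)
pad_mask = input_ids.eq(padding_idx).unsqueeze(-1)      # (batch, seq_len, 1)
embeddings = embeddings.masked_fill(pad_mask, 0.0)      # zeros at padding positions
```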

@sshleifer
Contributor

Exactly!
Would love to do that, but migrating the existing bart state dicts is non-trivial, since they already store the extra position embedding. Even if we tracked down all bart models with config.static_position_embeddings=False and resized their positional embeddings, we would break code that is not up to date with master (lots of code).

So I think we must settle for better documenting what is going on in LearnedPositionalEmbedding and accept the unfortunate reality that we are stuck with the offset forever (or until we have some futuristic model hub tooling to version state dicts).

