If a caption contains '!' and is shorter than max_length, the embeddings of the '!' token and the pad token are exactly the same, because token embedding uses nn.Embedding
Thanks to your code, I am growing every day. Thank you very much.
In every dataloader, the special tokens are initialized as below.
However, I found that [MASK], [UNK], and [PAD] are not actually used in the code, and a problem arises when 0 is simply appended as the pad token, like below.
In the vocab, there is no id reserved for [PAD]; token id 0 is paired with '!'.
So if a caption contains '!' and is shorter than max_length, the embedding of the '!' token and the embedding of the pad token will be exactly the same, because token embedding uses nn.Embedding.
Example
I think there is no way to differentiate between the two captions.
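A minimal sketch of the collision described above (the vocab size, embedding dim, token ids, and max_length here are toy values, not the repo's actual configuration): a caption ending in '!' and a shorter caption padded with 0 produce the exact same id sequence, so nn.Embedding maps them to identical embeddings.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
emb = nn.Embedding(100, 4)  # toy vocab; assume '!' has id 0, as in the issue
max_length = 5

# hypothetical caption "a cat !" -> ids [3, 4, 0], then padded with 0
with_bang = [3, 4, 0, 0, 0]
# hypothetical caption "a cat"   -> ids [3, 4],    then padded with 0
without_bang = [3, 4, 0, 0, 0]

# Both captions collapse to the same padded id sequence,
# so their embedded representations are indistinguishable.
a = emb(torch.tensor(with_bang))
b = emb(torch.tensor(without_bang))
assert torch.equal(a, b)
```

One common remedy is to reserve a dedicated id for [PAD] and pass it as `padding_idx` to nn.Embedding, which also keeps the pad row at zero and excluded from gradient updates.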