Discrepancy between vocabulary size in model and tokenizer leading to bugs #11

jaanli · 2024-03-15T19:49:10Z

Hi! I had a quick question about a discrepancy between the size of the input embeddings and the size of the tokenizer vocabulary:

from transformers import AutoModel
model = AutoModel.from_pretrained('UFNLP/gatortron-base')
model.embeddings.word_embeddings.weight.shape  # nn.Embedding has no .shape; read it from .weight

There are 50176 rows in this embedding matrix, but the tokenizer has only 50101 vocabulary items (https://huggingface.co/UFNLP/gatortron-base/raw/main/vocab.txt).

Is there a reason for this discrepancy? It forces us to hard-code the vocabulary size as a workaround, and we want to be sure we are initializing from GatorTron correctly.
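One possible explanation, not confirmed anywhere in this thread, is that the embedding matrix was padded up to a multiple of some block size during training, which some training frameworks do for efficiency. The sketch below only checks the arithmetic: the multiple of 128 and the helper name `pad_to_multiple` are assumptions for illustration, and the two sizes are the ones reported above.

```python
def pad_to_multiple(vocab_size: int, multiple: int) -> int:
    """Round vocab_size up to the nearest multiple of `multiple`."""
    return ((vocab_size + multiple - 1) // multiple) * multiple


tokenizer_vocab = 50101  # entries in vocab.txt, as reported above
embedding_rows = 50176   # rows in word_embeddings, as reported above

# ASSUMPTION: 128 is a guessed padding multiple, not a documented value.
padded = pad_to_multiple(tokenizer_vocab, 128)

# Rows with no corresponding token if the padding hypothesis holds.
extra_rows = embedding_rows - tokenizer_vocab

print(padded == embedding_rows, extra_rows)
```

If the padded size matches, the extra rows would correspond to no tokenizer entry and would never be indexed by tokenizer output, but that still needs confirmation from the maintainers.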

Otherwise, thank you so much for open sourcing this! It is extremely helpful :)
