I have added some new words to `t5.get_tokenizer()` as shown below. I would like to understand whether I need to retrain or fine-tune the EncoderModel after adding these new words to the tokenizer, and how this modification will affect the model's performance or behavior.
This question is related to the Imagen project, and I want to make sure I am following the correct approach when incorporating new words into the tokenizer.
When you add new tokens to the vocabulary (and add the corresponding entries to the embedding layer), the embeddings for those new tokens are randomly initialized, so the encoder has no meaningful representation for them until those embeddings are trained, e.g. by fine-tuning on text that contains the new words.
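For reference, here is a minimal sketch of the general pattern with Hugging Face `transformers` (which the T5 helper in the Imagen codebase is typically built on). The checkpoint name and the example words are placeholders, not the ones from this issue:

```python
from transformers import T5Tokenizer, T5EncoderModel

name = "t5-small"  # placeholder checkpoint; substitute the one you actually use
tokenizer = T5Tokenizer.from_pretrained(name)
model = T5EncoderModel.from_pretrained(name)

# Add new words to the tokenizer's vocabulary.
num_added = tokenizer.add_tokens(["mynewword1", "mynewword2"])

# Grow the embedding matrix so the new token ids have rows at all.
# These added rows are randomly initialized, so the encoder produces
# arbitrary representations for the new tokens until it is fine-tuned
# on text that uses them.
model.resize_token_embeddings(len(tokenizer))
```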