import tokenizers

def show_tokenization(tok, s):
    # Encode without special tokens and print each (id, decoded piece) pair.
    ids = tok.encode(s, add_special_tokens=False).ids
    print([(i, tok.decode([i])) for i in ids])

def show_tokenization_from_id(tok, id):
    # Decode a single id, then show how the resulting string re-encodes.
    s = tok.decode([id])
    print(f"id {id} decodes to {s!r}, which encodes to...")
    show_tokenization(tok, s)

fb_tok = tokenizers.Tokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B")
show_tokenization_from_id(fb_tok, 112328)
v0.19.0
id 112328 decodes to ' Arthropoda', which encodes to...
[(1676, ' Ar'), (98643, 'throp'), (14320, 'oda')]
v0.19.1
id 112328 decodes to ' Arthropoda', which encodes to...
[(112328, ' Arthropoda')]
I have good evidence that the new behaviour matches how the model was trained, but the patch-release announcement should perhaps advise more loudly that users should, e.g., retokenize all training data for the affected model families.
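For anyone auditing an existing setup, here is a minimal sketch of one way to surface ids affected by a change like this: scan the vocabulary for token ids whose decoded string does not re-encode back to that single id. The helper name find_non_roundtrip_ids is illustrative, not part of the library, and the check is a rough diagnostic (special tokens and byte-fallback tokens may be flagged spuriously).

import tokenizers

def find_non_roundtrip_ids(tok):
    # Collect ids whose decode -> re-encode round trip does not
    # reproduce exactly [id]; these are candidates for tokenization
    # differences between library versions.
    bad = []
    for i in range(tok.get_vocab_size()):
        s = tok.decode([i])
        if tok.encode(s, add_special_tokens=False).ids != [i]:
            bad.append(i)
    return bad

tok = tokenizers.Tokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B")
# Under v0.19.0 this list would include ids like 112328;
# under v0.19.1 it should be noticeably smaller.
print(len(find_non_roundtrip_ids(tok)))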