Breaking changes in v0.19.1 for tiktoken/llama3 #1512


```
import tokenizers

def show_tokenization(tok, s):
    # Encode without special tokens and print each id next to its decoded text.
    ids = tok.encode(s, add_special_tokens=False).ids
    print([(i, tok.decode([i])) for i in ids])

def show_tokenization_from_id(tok, id):
    # Decode a single id, then check how the resulting string re-encodes.
    s = tok.decode([id])
    print(f"id {id} decodes to {s!r}, which encodes to...")
    show_tokenization(tok, s)

fb_tok = tokenizers.Tokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B")
show_tokenization_from_id(fb_tok, 112328)
```

v0.19.0
id 112328 decodes to ' Arthropoda', which encodes to...
[(1676, ' Ar'), (98643, 'throp'), (14320, 'oda')]

v0.19.1
id 112328 decodes to ' Arthropoda', which encodes to...
[(112328, ' Arthropoda')]
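
To see how many ids are affected rather than just this one, the same decode-then-re-encode check can be run over a range of ids. This is only a rough sketch: `find_non_roundtrip_ids` is a hypothetical helper name, and it takes plain `encode`/`decode` callables so it does not depend on any particular tokenizer version.

```python
from typing import Callable, Iterable, List, Tuple

def find_non_roundtrip_ids(
    encode: Callable[[str], List[int]],
    decode: Callable[[List[int]], str],
    ids: Iterable[int],
) -> List[Tuple[int, str, List[int]]]:
    """Return (id, decoded_text, re_encoded_ids) for every id whose decoded
    text does not encode back to exactly that single id."""
    mismatches = []
    for i in ids:
        s = decode([i])          # text for this single id
        re_ids = encode(s)       # how that text is tokenized again
        if re_ids != [i]:
            mismatches.append((i, s, re_ids))
    return mismatches
```

With the tokenizer from the snippet above, this could be called as `find_non_roundtrip_ids(lambda s: fb_tok.encode(s, add_special_tokens=False).ids, fb_tok.decode, range(fb_tok.get_vocab_size()))`. Some ids (special tokens, partial UTF-8 byte sequences) will probably never round-trip, so the interesting signal is the difference between the lists produced under v0.19.0 and v0.19.1, not whether the list is empty.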

I have good evidence that the new behaviour matches how the model was trained, but the announcement of the patch release should perhaps be a little louder about the consequences, e.g. by advising users to retokenize all training data for the affected model families.
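
One way to decide whether existing data needs retokenizing is to fingerprint the token ids for a fixed set of sample strings and compare the result across library versions. The sketch below is an illustration under assumptions, not part of the library: `tokenization_fingerprint` is a hypothetical helper, and `encode` is any callable that maps a string to a list of ids.

```python
import hashlib
from typing import Callable, Iterable, List

def tokenization_fingerprint(
    encode: Callable[[str], List[int]],
    samples: Iterable[str],
) -> str:
    """Hash the token ids produced for each sample into one hex digest.

    If the digest differs between two tokenizer versions, at least one
    sample tokenizes differently and cached token data may be stale.
    """
    h = hashlib.sha256()
    for s in samples:
        ids = encode(s)
        # One line per sample so that sample boundaries affect the digest.
        h.update(",".join(str(i) for i in ids).encode("ascii"))
        h.update(b"\n")
    return h.hexdigest()
```

Storing this digest next to a tokenized dataset (computed with, say, `lambda s: fb_tok.encode(s, add_special_tokens=False).ids` and a sample list that includes ' Arthropoda') would make a silent change like this one show up as a mismatch after upgrading.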
