@ArthurZucker ArthurZucker commented Aug 27, 2025

New API:

from tokenizers import Tokenizer
from tokenizers.decoders import DecodeStream
# `tokenizer` below is an already-loaded Tokenizer instance
stream = DecodeStream([19567, 255, 255])  # init the state with prefill
out = stream.step(tokenizer, 109)
'ั'

and:

from tokenizers import Tokenizer
from tokenizers.decoders import DecodeStream
stream = DecodeStream([19567, 255])  # init the state with prefill
out = stream.step(tokenizer, [255, 109])  # imagine you had an assistant that generated 2 tokens at the same time
'ั'

Non-breaking:

from tokenizers import Tokenizer
from tokenizers.decoders import DecodeStream
stream = DecodeStream(False)
stream.step(tokenizer, 19567)
stream.step(tokenizer, 255)
stream.step(tokenizer, 19567)
out = stream.step(tokenizer, 109)
out
'ั'
tokenizer.encode("อั").ids
[19567, 255, 19567, 109]
tokenizer.decode(tokenizer.encode("อั").ids)
'อั'
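
To make the difference between the streamed output and the full decode easier to see, here is a small standard-library sketch that breaks "อั" into code points and UTF-8 bytes. The helper name `utf8_breakdown` is made up for this illustration. Reading the token ids as byte pieces (19567 as the shared `e0 b8` prefix, 255 and 109 as the final bytes) is an assumption inferred from the example, not something this PR states.

```python
import unicodedata
from typing import List, Tuple


def utf8_breakdown(text: str) -> List[Tuple[str, str, str]]:
    """Return (code point, Unicode category, UTF-8 bytes as hex) per character."""
    return [
        (f"U+{ord(ch):04X}", unicodedata.category(ch), ch.encode("utf-8").hex(" "))
        for ch in text
    ]


for code_point, category, utf8_hex in utf8_breakdown("อั"):
    print(code_point, category, utf8_hex)
# U+0E2D Lo e0 b8 ad
# U+0E31 Mn e0 b8 b1
```

The string is a base letter followed by a nonspacing mark, and each one is a complete three-byte UTF-8 sequence. A decoder that emits text as soon as the bytes form a valid character will therefore hand out the base letter before the mark arrives.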

Getting only 'ั' from the last step is somewhat expected. However, if you initialize your stream with, say, [19567, 255, 19567] first, then stepping with 109 should properly give you 'อั'.

We can't get around the fact that [19567, 109] is a "valid" token, so during token generation the first token will always be emitted as soon as it is valid. Initializing the stream should still help, though.
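
The behaviour described above can be reproduced with a toy model that needs no tokenizer at all. This is a simplified sketch, not the library's implementation: `ToyDecodeStream`, `toy_decode` and `TOY_VOCAB` are hypothetical names, and the id-to-bytes mapping is an assumption chosen so the ids from the example spell the UTF-8 bytes of "อ" and "ั". The rule it models is: buffer ids, decode the whole buffer, and hold back output while the text still ends in an incomplete character.

```python
from typing import Dict, List, Optional, Sequence, Union

# Assumed byte-level vocabulary for illustration only:
# "อ" is e0 b8 ad and "ั" is e0 b8 b1 in UTF-8.
TOY_VOCAB: Dict[int, bytes] = {19567: b"\xe0\xb8", 255: b"\xad", 109: b"\xb1"}


def toy_decode(ids: Sequence[int]) -> str:
    """Concatenate token bytes; incomplete UTF-8 decodes to U+FFFD."""
    raw = b"".join(TOY_VOCAB[i] for i in ids)
    return raw.decode("utf-8", errors="replace")


class ToyDecodeStream:
    """Toy streaming decoder with an optional prefill of token ids."""

    def __init__(self, ids: Optional[Sequence[int]] = None) -> None:
        self.ids: List[int] = list(ids or [])  # prefill counts as context
        self.emitted = ""  # text already returned to the caller

    def step(self, ids: Union[int, Sequence[int]]) -> Optional[str]:
        """Add one id or several ids; return new text, or None if incomplete."""
        self.ids.extend([ids] if isinstance(ids, int) else ids)
        text = toy_decode(self.ids)
        if len(text) <= len(self.emitted) or text.endswith("\ufffd"):
            return None  # nothing new, or the last character is not complete
        new_text = text[len(self.emitted):]
        self.emitted = text
        return new_text


# Token by token: the base letter is emitted as soon as it is valid.
stream = ToyDecodeStream()
print([stream.step(i) for i in [19567, 255, 19567, 109]])
# [None, 'อ', None, 'ั']

# With a prefill, the letter and its mark come out together.
print(ToyDecodeStream([19567, 255, 19567]).step(109))
# อั
```

In this sketch the prefill is treated as context that has not been emitted yet, which is why the final step returns both characters at once. Passing a list to `step` follows the same path as passing a single id.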

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@ArthurZucker ArthurZucker requested a review from McPatate August 29, 2025 07:12
@ArthurZucker ArthurZucker merged commit abee958 into main Aug 29, 2025
30 checks passed
@ArthurZucker ArthurZucker deleted the new-stream branch August 29, 2025 08:06

@McPatate McPatate left a comment

🔥

shenxiangzhuang pushed a commit to shenxiangzhuang/tokenizers that referenced this pull request Aug 29, 2025
* update
* update
* updates
* up
* oikay
* use stream input
* nice all test pass?
* fmt
* dev
* rename
* simplify a hell lot
* proper testing
* fix inti
* fix test
* nits
* make clippy happy now
* fmt fml
* remove the prints
* fix gate

njhill commented Aug 29, 2025

Thanks @ArthurZucker!
