Description
System Info
Hi,
I am using BloomTokenizerFast as my tokenizer and ran into an issue.
transformers version: 4.28.0
When I load BloomTokenizerFast with add_prefix_space=True, the option appears to have no effect on the encoded output.
Here is the code:
    import transformers
    from transformers import BloomTokenizerFast

    tokenizer = BloomTokenizerFast.from_pretrained("bigscience/bloom", add_prefix_space=True)
    print(tokenizer.add_prefix_space)
    print(tokenizer("Hello world")["input_ids"])
    print(transformers.__version__)

Output:

    True
    [59414, 8876]
    4.28.0
Here is the same code without add_prefix_space:
    from transformers import BloomTokenizerFast

    tokenizer = BloomTokenizerFast.from_pretrained("bigscience/bloom")
    print(tokenizer("Hello world")["input_ids"])

Output:

    [59414, 8876]
I don't understand why both calls produce the same input_ids.
Please have a look!
Thanks.
Who can help?
Information
- The official example scripts
- My own modified scripts
Tasks
- An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
- My own task or dataset (give details below)
Reproduction
Run the two snippets above and compare the printed input_ids. Both return [59414, 8876], with or without add_prefix_space=True.
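To make the comparison easier to repeat, here is a minimal sketch of a helper that puts the two tokenizers side by side. The function names prefix_space_report and demo_with_bloom are hypothetical and not part of transformers. The helper also encodes the text with a manually prepended space, on the assumption that this is roughly what add_prefix_space=True should be equivalent to; that assumption is mine and has not been checked against the library.

```python
from typing import Any, Callable, Dict


def prefix_space_report(plain: Callable[[str], Any],
                        prefixed: Callable[[str], Any],
                        text: str) -> Dict[str, Any]:
    """Compare a default tokenizer with one loaded with add_prefix_space=True.

    Both arguments are callables that return a mapping with an "input_ids" key,
    which is how the tokenizers in this report are used.
    """
    ids_plain = plain(text)["input_ids"]
    ids_prefixed = prefixed(text)["input_ids"]
    # Reference encoding: the default tokenizer with the space added by hand.
    # Treating this as the expected result of add_prefix_space is an assumption.
    ids_manual = plain(" " + text)["input_ids"]
    return {
        "plain": ids_plain,
        "prefixed": ids_prefixed,
        "manual_space": ids_manual,
        "flag_has_effect": ids_prefixed != ids_plain,
        "matches_manual_space": ids_prefixed == ids_manual,
    }


def demo_with_bloom() -> None:
    """Run the report on the tokenizers from this issue (downloads the tokenizer)."""
    from transformers import BloomTokenizerFast

    plain = BloomTokenizerFast.from_pretrained("bigscience/bloom")
    prefixed = BloomTokenizerFast.from_pretrained(
        "bigscience/bloom", add_prefix_space=True
    )
    print(prefix_space_report(plain, prefixed, "Hello world"))
```

Calling demo_with_bloom() on the setup described here would be expected to report flag_has_effect as False, since both snippets returned the same ids.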
Expected behavior
The two tokenizers should produce different input_ids, since one of them is loaded with add_prefix_space=True.
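One way to narrow this down might be to check whether the flag reaches the backend tokenizer at all. The sketch below walks the serialized tokenizer configuration and lists every add_prefix_space entry it finds. The helper name find_flags is hypothetical, and the idea that the flag should show up in the pre-tokenizer or decoder section of that JSON is an assumption, not something confirmed for Bloom.

```python
import json
from typing import Any, List, Tuple


def find_flags(tokenizer_json: str,
               key: str = "add_prefix_space") -> List[Tuple[str, Any]]:
    """Return (path, value) pairs for every occurrence of `key` in the JSON."""
    found: List[Tuple[str, Any]] = []

    def walk(node: Any, path: str) -> None:
        if isinstance(node, dict):
            for name, value in node.items():
                child = f"{path}.{name}" if path else name
                if name == key:
                    found.append((child, value))
                else:
                    walk(value, child)
        elif isinstance(node, list):
            for index, item in enumerate(node):
                walk(item, f"{path}[{index}]")

    walk(json.loads(tokenizer_json), "")
    return found


# Usage with a fast tokenizer loaded as in this report (not executed here):
#   flags = find_flags(tokenizer.backend_tokenizer.to_str())
#   for path, value in flags:
#       print(path, value)
```

If every entry still reads false for the tokenizer loaded with add_prefix_space=True, that would suggest the option is stored on the Python object but never applied to the backend.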