
bloom add_prefix_space= True #24846

Closed
@Dongximing

Description


System Info

Hi,
I am using BloomTokenizerFast as my tokenizer and ran into an issue.

transformers version: 4.28.0
When I load BloomTokenizerFast with add_prefix_space=True, the option appears to have no effect on the encoded output.
Here is the code:
    import transformers
    from transformers import BloomTokenizerFast

    tokenizer = BloomTokenizerFast.from_pretrained("bigscience/bloom", add_prefix_space=True)
    print(tokenizer.add_prefix_space)
    print(tokenizer("Hello world")["input_ids"])
    print(transformers.__version__)

Output:

    True
    [59414, 8876]
    4.28.0

Here is the same call without add_prefix_space:
    from transformers import BloomTokenizerFast

    tokenizer = BloomTokenizerFast.from_pretrained("bigscience/bloom")
    print(tokenizer("Hello world")["input_ids"])

Output:

    [59414, 8876]

I don't understand why both calls produce the same encoding.

Please have a look.
Thanks!

Who can help?

@arth

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

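Run the two snippets from the description above. They should produce different encodings, since add_prefix_space=True is set in the first one, but both return [59414, 8876].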

Expected behavior

The two calls should produce different encodings, since add_prefix_space=True is set in the first one.
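
To make the expectation concrete, here is a toy sketch of what a prefix-space option usually does at the pre-tokenization stage. It is not the BloomTokenizerFast implementation: `pre_tokenize` is a hypothetical helper, the whitespace split is a simplification of byte-level pre-tokenization, and the rule "prepend a space only if the text does not already start with one" is an assumption about the intended behavior.

```python
import re
from typing import List


def pre_tokenize(text: str, add_prefix_space: bool = False) -> List[str]:
    """Toy stand-in for a byte-level pre-tokenizer (hypothetical, not the library API)."""
    # Assumption: the flag prepends one space unless the text already starts with one.
    if add_prefix_space and not text.startswith(" "):
        text = " " + text
    # Split into words that keep their leading space, then mark that space
    # with "Ġ", the symbol byte-level vocabularies commonly use for a space.
    pieces = re.findall(r" ?\S+", text)
    return [piece.replace(" ", "Ġ") for piece in pieces]


print(pre_tokenize("Hello world"))                         # ['Hello', 'Ġworld']
print(pre_tokenize("Hello world", add_prefix_space=True))  # ['ĠHello', 'Ġworld']
```

Under this model the first piece changes from "Hello" to "ĠHello" when the flag is set, so the first input id would be expected to differ between the two calls.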
