Description
System Info
transformers 4.28.1
python 3.8.13
Who can help?
No response
Information
- The official example scripts
- My own modified scripts
Tasks
- An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
- My own task or dataset (give details below)
Reproduction
- I load BertTokenizer with my own vocab.txt and add '[outline]' (which is included in my vocab.txt) to never_split. However, '[outline]' still gets split. Here is my code:

```python
tokenizer = BertTokenizer.from_pretrained(pretrained_path, never_split=['[outline]'])
input = "。[outline]"
print(tokenizer.tokenize(input))  # ['。', '[', 'out', '##line', ']']
```
- I also tried calling the basic tokenizer directly:

```python
print(tokenizer.basic_tokenizer.tokenize(input))  # ['。', '[', 'outline', ']']
```
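A plausible explanation, based on how `BasicTokenizer._run_split_on_punc` behaves in transformers: `never_split` exempts a chunk from punctuation splitting only when the *entire* whitespace-delimited chunk matches an entry. Since "。[outline]" contains no space, the whole chunk "。[outline]" is compared against never_split, not "[outline]" alone. The sketch below mimics that check with a simplified punctuation test (the real implementation also treats some non-letter ASCII characters as punctuation), and is an illustration rather than the library's actual code:

```python
import unicodedata

def run_split_on_punc(text, never_split=None):
    # Simplified mirror of BasicTokenizer._run_split_on_punc:
    # the exemption applies only when the WHOLE chunk is in never_split.
    if never_split is not None and text in never_split:
        return [text]
    output = []
    start_new_word = True
    for ch in text:
        # Simplified punctuation test: Unicode category starting with "P".
        if unicodedata.category(ch).startswith("P"):
            output.append([ch])
            start_new_word = True
        else:
            if start_new_word:
                output.append([])
            start_new_word = False
            output[-1].append(ch)
    return ["".join(chars) for chars in output]

# "[outline]" alone is protected, but "。[outline]" is one whitespace
# chunk that is not literally in never_split, so it gets split:
print(run_split_on_punc("[outline]", ["[outline]"]))   # ['[outline]']
print(run_split_on_punc("。[outline]", ["[outline]"]))  # ['。', '[', 'outline', ']']
```

The second call reproduces the `['。', '[', 'outline', ']']` output seen from `basic_tokenizer.tokenize` above.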
Expected behavior
When I call:

```python
tokenizer.tokenize("。[outline]")
```

I expect the result ['。', '[outline]'], i.e. tokens listed in never_split should not be split.
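One possible workaround (my own suggestion, not from the original report): pad the protected token with spaces before tokenizing, so it becomes its own whitespace-delimited chunk, which is the form the never_split check recognizes. `protect_token` is a hypothetical helper name:

```python
import re

def protect_token(text, token):
    # Surround every occurrence of `token` with spaces so the basic
    # tokenizer sees it as a standalone whitespace chunk.
    return re.sub(re.escape(token), " " + token + " ", text).strip()

print(protect_token("。[outline]", "[outline]"))  # 。 [outline]
```

Alternatively, registering the token via `tokenizer.add_tokens(['[outline]'])` may avoid the splitting entirely, since added tokens are matched before the basic tokenizer runs.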