Tokenzier special chars #197
Conversation
Also, #195's existence is justified: when creating a chatbot, if one provides some HTML tags or similar, one really doesn't want those to be encoded as special tokens just to make the backend prompting easier. That's why I would be happy if both PRs got merged, to enable maximum flexibility :)
I'll have to give this a good look a little later. There's currently no code for actually loading all the specific tokens a model might define. Some add, like, five new tokens and use a vocabulary size of 32005, then define those tokens in a bunch of extra config files packaged with the model, each with options for how spaces should be encoded before and after those tokens. HF takes it even further by encoding with a trie, which I guess safeguards against a bunch of edge cases, like one special token being a substring of another special token or whatever.
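The trie idea mentioned above can be sketched roughly as follows. This is a hedged illustration, not HF's actual implementation: a trie with longest-match semantics disambiguates cases where one special token is a prefix or substring of another (e.g. `<s>` vs a hypothetical `<s2>`).

```python
def build_trie(tokens):
    """Build a character trie from a list of special-token strings."""
    trie = {}
    for tok in tokens:
        node = trie
        for ch in tok:
            node = node.setdefault(ch, {})
        node["__end__"] = tok  # marks a complete special token
    return trie

def split_on_special(text, trie):
    """Split text into plain-text chunks and special tokens; longest match wins."""
    parts, i, start = [], 0, 0
    while i < len(text):
        node, j, match_end = trie, i, None
        # Walk the trie as far as the text allows, remembering the
        # longest complete special token seen along the way.
        while j < len(text) and text[j] in node:
            node = node[text[j]]
            j += 1
            if "__end__" in node:
                match_end = j
        if match_end is not None:
            if start < i:
                parts.append(text[start:i])  # plain text before the match
            parts.append(text[i:match_end])  # the special token itself
            i = start = match_end
        else:
            i += 1
    if start < len(text):
        parts.append(text[start:])
    return parts
```

Plain-text chunks can then be handed to the regular tokenizer, while special tokens map directly to their IDs.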
So not inside tokenizer.json? I can work on auto-loading of added_tokens from tokenizer.json if you want that. And regarding #195, I wanted to kindly ask whether you intend to merge it? That way I would not have to base my branches off of the boseos branch, and it's just an optional parameter anyway that doesn't break anything.
I'll merge #195 when I have a moment to test it. It seems fine. This one is probably good too. As for those extra tokens, you can take a look at this HF model, based off of OASST, for instance:

- `added_tokens.json` (note that the highest token ID defined is 32004, even though the vocabulary is extended to 32006)
- `special_tokens_map.json` (note how `pad_token == eos_token` for some reason)
- `tokenizer_config.json` (tokens have empty content here for some reason)

The model's config.json also defines token IDs for EOS, BOS, and PAD, and supposedly the strings are also defined in the binary tokenizer.model. So you see it gets a bit complicated, and it's very unclear what to even do with the conflicting definitions. It has a lot of early-web vibes, from back in the day when there really weren't any standards other than "the standard is whatever happens to work in IE6."
I see. I looked into this and found:
As for why the content is empty, no idea, but 90% of the models that I've seen do support the standard added_tokens object inside the default tokenizer file, so we should only support that (and thereby also sort of enforce a standard; all the extra JSONs just don't make any sense whatsoever). All the OpenAssistant models I've looked into seemed kind of cursed: some save the tokenizer.json in some binary format, others have the stuff you sent... If one really wants those to work, they can change five lines in tokenizer.py; that's my take on all of that. If you see it the same way, we could start working on parsing tokenizer.json and tokenizer_config.json.
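Parsing added_tokens out of a tokenizer.json could look roughly like this. A minimal sketch, assuming the usual Hugging Face layout where tokenizer.json carries a top-level `added_tokens` array of objects with `id` and `content` fields; the whitespace/normalization flags on each entry are ignored here.

```python
import json

def load_added_tokens(path):
    """Return a {content: id} map for the added_tokens entries in a
    Hugging Face tokenizer.json (e.g. {"<|prompter|>": 32000, ...})."""
    with open(path, encoding="utf-8") as f:
        data = json.load(f)
    # Each entry looks like {"id": 32000, "content": "<|prompter|>", ...};
    # extra per-token options (lstrip, rstrip, normalized, ...) are dropped.
    return {t["content"]: t["id"] for t in data.get("added_tokens", [])}
```

A tokenizer could merge this map over its base vocabulary at load time, which sidesteps the extra added_tokens.json / special_tokens_map.json files entirely.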
#195 (comment)
Hello! :) Per your request:
I added the functionality just as you described: one can now choose to encode/decode special characters.
This PR is based on, or rather dependent on, the other PR (#195), because locally I need both and I don't know GitHub well enough to merge both.
Example:

The string

```
<s><s>Hello</s><unk> <s>World</s></s>
```

encodes to:

```
tensor([[ 1, 1, 15043, 2, 0, 259, 1, 2787, 2, 2]])
```

(I verified all IDs manually.) Without `encode_special_characters` it instead encodes to:

```
tensor([[ 529, 29879, 5299, 29879, 29958, 10994, 829, 29879, 5299, 2960, 29958, 529, 29879, 29958, 14058, 829, 29879, 2565, 29879, 29958]])
```

Decoding the first tensor results in:

```
<s><s>Hello</s> <s>World</s></s>
```

which is correct! (`<unk>` has the same ID as the padding token, 0. If you want, I can also decode every 0 back to `<unk>`.)

If you want to test and verify, here is the snippet that I put into the tokenizer init for debugging:
I would be happy if you accepted both pull requests, and I hope this PR aligns with your intentions :)
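The encode-with-specials behavior shown in the example above can be sketched roughly like this. A hedged illustration, not the PR's actual code: `encode_plain` is a stand-in for the underlying SentencePiece encoder, and the ID map is taken from the LLaMA values in the example (`<s>`=1, `</s>`=2, `<unk>`=0).

```python
import re

# Assumed special-token IDs, matching the example output above.
SPECIAL = {"<s>": 1, "</s>": 2, "<unk>": 0}

def encode_with_specials(text, encode_plain):
    """Split out special tokens first, then tokenize the plain-text pieces.

    encode_plain: callable mapping a plain string to a list of token IDs
    (stand-in for the real tokenizer; not part of this repo).
    """
    # Sort by length so a longer token wins over a prefix of it.
    alternation = "|".join(re.escape(t) for t in
                           sorted(SPECIAL, key=len, reverse=True))
    ids = []
    # A capturing group makes re.split keep the special tokens themselves.
    for piece in re.split(f"({alternation})", text):
        if piece in SPECIAL:
            ids.append(SPECIAL[piece])
        elif piece:
            ids.extend(encode_plain(piece))
    return ids
```

With `encode_special_characters` disabled, the whole string would go straight to `encode_plain` instead, which is why the literal `<s>`/`</s>` markup falls apart into many subword IDs in the second tensor above.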