Tokenizer special chars #197

Merged (9 commits) on Jul 26, 2023

Conversation

@SinanAkkoyun (Contributor) commented Jul 25, 2023

#195 (comment)
Hello! :) Per your request:

I.e. you should just be able to add <s> and </s> at arbitrary places in the input string, as well as any other custom special tokens any given model requires. Ideally it would work when decoding as well.

I added the functionality just as you described. One can now choose to encode/decode special characters.

This PR is based on, or rather depends on, the other PR (#195), because locally I need both and I don't know GitHub well enough to merge them separately.

Example:

The string: <s><s>Hello</s><unk> <s>World</s></s>
encodes to: tensor([[ 1, 1, 15043, 2, 0, 259, 1, 2787, 2, 2]]) (I verified all IDs manually)

(without encode_special_characters: tensor([[ 529, 29879, 5299, 29879, 29958, 10994, 829, 29879, 5299, 2960, 29958, 529, 29879, 29958, 14058, 829, 29879, 2565, 29879, 29958]]))

Decoding these IDs results in:
<s><s>Hello</s> <s>World</s></s>

Which is correct! (<unk> has the same ID as the padding token, 0. If you want, I can also make every 0 decode to <unk>.)

If you want to test and verify, here is the snippet that I put into the tokenizer init for debugging:

        # Round-trip check: encode a string containing special tokens,
        # then decode and confirm the special tokens survive.
        text = "<s><s>Hello</s><unk> <s>World</s></s>"
        print("encoding: " + text)
        print(self.encode(text, encode_special_characters=True))
        # Batched variant: a list of strings encodes to a batch of IDs.
        print(self.encode([text, text, text, text], encode_special_characters=True))

        ids = self.encode(text, encode_special_characters=True)
        idsb = self.encode([text, text, text, text], encode_special_characters=True)
        print(f"decoding: {self.decode(ids, decode_special_characters=True)}")
        print(f"decoding batch: {self.decode(idsb, decode_special_characters=True)}")

I would be happy if you accepted both pull requests, and I hope this PR aligns with your intent :)

@SinanAkkoyun (Contributor, Author) commented Jul 25, 2023

Also, #195's existence is justified: when building a chatbot, if the user provides HTML tags or similar, one really doesn't want those encoded as special tokens just to make backend prompting easier.

That's why I would be happy if both PRs get merged, to enable maximum flexibility :)

@turboderp (Owner)

I'll have to give this a good look a little later. There's currently no code for actually loading all the specific tokens a model might define. Some add like five new tokens and use a vocabulary size of 32005, then define those tokens in a bunch of extra config files packaged with the model, with each having options for how spaces should be encoded before and after those tokens. HF takes it even further by encoding with a trie, which I guess safeguards against a bunch of edge cases, like if one special token is a substring of another special token or whatever.
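
(Illustrative aside: the substring edge case can also be handled without a full trie by splitting longest-match-first. The sketch below is a simplification for this discussion, not HF's actual implementation.)

    import re

    def split_on_special(text, special_tokens):
        # Order alternatives longest-first so that, e.g., "<ss>" is
        # matched before its prefix "<s>"; the regex engine tries the
        # alternatives in order at each position.
        ordered = sorted(special_tokens, key=len, reverse=True)
        pattern = "(" + "|".join(re.escape(t) for t in ordered) + ")"
        return [piece for piece in re.split(pattern, text) if piece]

    # split_on_special("<ss>Hi<s>", ["<s>", "<ss>"])
    # -> ["<ss>", "Hi", "<s>"]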

@SinanAkkoyun (Contributor, Author) commented Jul 25, 2023

in a bunch of extra config files packaged with the model

So not inside tokenizer.json? I can work on auto-loading added_tokens from tokenizer.json if you want that.
(I would keep the <s>, </s>, and <unk> tokens hard-coded as they are rather than dynamically loading them, because otherwise they would be self.s_token instead of self.bos_token.)
But if some models use extra configs that are not standardized, it seems rather difficult to implement.

And regarding #195, I wanted to kindly ask if you intend to merge it? That way I would not have to base my branches off of the boseos branch, and it's just an optional parameter anyway that doesn't break anything.

@turboderp (Owner)

I'll merge #195 when I have a moment to test it. It seems fine. This one is probably good too.

As for those extra tokens, you can take a look at this HF model for instance, based off of OASST.

added_tokens.json (note that the highest token ID defined is 32004, even though the vocabulary is extended to 32006)

{
  "<|assistant|>": 32004,
  "<|prefix_begin|>": 32000,
  "<|prefix_end|>": 32003,
  "<|prompter|>": 32002,
  "<|system|>": 32001
}

special_tokens_map.json (note how pad_token == eos_token for some reason)

{
  "additional_special_tokens": [
    "<|prompter|>",
    "<|system|>",
    "<|prefix_begin|>",
    "<|prefix_end|>",
    "<|assistant|>"
  ],
  "bos_token": {
    "content": "",
    "lstrip": false,
    "normalized": true,
    "rstrip": false,
    "single_word": false
  },
  "eos_token": "</s>",
  "pad_token": "</s>",
  "sep_token": "<s>",
  "unk_token": {
    "content": "",
    "lstrip": false,
    "normalized": true,
    "rstrip": false,
    "single_word": false
  }
}

tokenizer_config.json (tokens have empty content here for some reason)

{
  "add_bos_token": true,
  "add_eos_token": false,
  "bos_token": {
    "__type": "AddedToken",
    "content": "",
    "lstrip": false,
    "normalized": true,
    "rstrip": false,
    "single_word": false
  },
  "clean_up_tokenization_spaces": false,
  "eos_token": {
    "__type": "AddedToken",
    "content": "",
    "lstrip": false,
    "normalized": true,
    "rstrip": false,
    "single_word": false
  },
  "model_max_length": 1000000000000000019884624838656,
  "pad_token": null,
  "sp_model_kwargs": {},
  "tokenizer_class": "LlamaTokenizer",
  "unk_token": {
    "__type": "AddedToken",
    "content": "",
    "lstrip": false,
    "normalized": true,
    "rstrip": false,
    "single_word": false
  }
}

The model's config.json also defines token IDs for EOS, BOS, and PAD, and supposedly strings are also defined in the binary, tokenizer.model. So you see it gets a bit complicated and it's very unclear what to even do with the conflicting definitions. It has a lot of early web vibes, from back in the day when there really weren't any standards other than "the standard is whatever happens to work in IE6".

@SinanAkkoyun (Contributor, Author)

I see. I looked into this and found:

  • added_tokens.json is specific to OpenAssistant's Llama models
  • despite conflicts in added_tokens.json, tokenizer.json must always contain the added_tokens and always has them as the model intends

As for why the content is empty, no idea. But 90% of the models I've seen do support the standard added_tokens object inside the default tokenizer file; we should only support that (to also kind of enforce a standard, since all the extra JSONs just don't make any sense whatsoever).

All the OpenAssistant models I've looked into seemed kinda cursed: some save tokenizer.json in some binary format, others have the stuff you sent... If someone really wants them to work, they can change five lines in tokenizer.py; that's my take on all of that...

If you see it the same way, we could start working on parsing the tokenizer.json and tokenizer_config.json.
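
A minimal sketch of that parsing, assuming the HF tokenizers format where tokenizer.json carries a top-level added_tokens list of {"id": ..., "content": ..., "special": ...} entries (the function name is hypothetical):

    import json

    def load_added_tokens(path):
        # Return a {content: id} map of the tokens declared in a
        # HF-style tokenizer.json, for use by the special-token encoder.
        with open(path, encoding="utf-8") as f:
            data = json.load(f)
        return {t["content"]: t["id"] for t in data.get("added_tokens", [])}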
