Tokenizer special chars #197

Merged (9 commits) on Jul 26, 2023

Conversation

@SinanAkkoyun (Contributor) commented Jul 25, 2023

#195 (comment)
Hello! :) Per your request:

I.e. you should just be able to add <s> and </s> at arbitrary places in the input string, as well as any other custom special tokens any given model requires. Ideally it would work when decoding as well.

I added the functionality just as you described. One can now choose to encode/decode special characters.

This PR is based on, or rather depends on, the other PR (#195), because locally I need both and I don't know GitHub well enough to merge them separately.

Example:

The string: <s><s>Hello</s><unk> <s>World</s></s>
encodes to: tensor([[ 1, 1, 15043, 2, 0, 259, 1, 2787, 2, 2]]) (I verified all IDs manually)

(without encode_special_characters: tensor([[ 529, 29879, 5299, 29879, 29958, 10994, 829, 29879, 5299, 2960, 29958, 529, 29879, 29958, 14058, 829, 29879, 2565, 29879, 29958]]))

Decoding these IDs results in:
<s><s>Hello</s> <s>World</s></s>

Which is correct! (<unk> has the same ID as the padding token, 0. If you want, I can also make every 0 decode to <unk>.)

If you want to test and verify, here is the snippet that I put into the tokenizer init for debugging:

        # Round-trip check: encode a string containing special tokens,
        # then decode and confirm the special tokens survive.
        text = "<s><s>Hello</s><unk> <s>World</s></s>"
        print("encoding: " + text)
        print(self.encode(text, encode_special_characters=True))
        # Batched variant: a list of strings encodes to a batch of IDs.
        print(self.encode([text, text, text, text], encode_special_characters=True))

        ids = self.encode(text, encode_special_characters=True)
        idsb = self.encode([text, text, text, text], encode_special_characters=True)
        print(f"decoding: {self.decode(ids, decode_special_characters=True)}")
        print(f"decoding batch: {self.decode(idsb, decode_special_characters=True)}")

I would be happy if you accepted both pull requests, and I hope this PR aligns with your intent :)

@SinanAkkoyun (Contributor, Author) commented Jul 25, 2023

Also, #195's existence is justified: when building a chatbot, if the user provides HTML tags or similar, one really doesn't want those encoded as special tokens just to make backend prompting easier.

That's why I would be happy if both PRs get merged, to enable maximum flexibility :)

@turboderp (Owner)

I'll have to give this a good look a little later. There's currently no code for actually loading all the specific tokens a model might define. Some add like five new tokens and use a vocabulary size of 32005, then define those tokens in a bunch of extra config files packaged with the model, with each having options for how spaces should be encoded before and after those tokens. HF takes it even further by encoding with a trie, which I guess safeguards against a bunch of edge cases, like if one special token is a substring of another special token or whatever.
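
(Illustrative aside: the substring edge case can also be handled without a full trie by splitting longest-match-first. The sketch below is a simplification for this discussion, not HF's actual implementation.)

    import re

    def split_on_special(text, special_tokens):
        # Order alternatives longest-first so that, e.g., "<ss>" is
        # matched before its prefix "<s>"; the regex engine tries the
        # alternatives in order at each position.
        ordered = sorted(special_tokens, key=len, reverse=True)
        pattern = "(" + "|".join(re.escape(t) for t in ordered) + ")"
        return [piece for piece in re.split(pattern, text) if piece]

    # split_on_special("<ss>Hi<s>", ["<s>", "<ss>"])
    # -> ["<ss>", "Hi", "<s>"]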

@SinanAkkoyun (Contributor, Author) commented Jul 25, 2023

in a bunch of extra config files packaged with the model

So not inside tokenizer.json? I can work on auto-loading added_tokens from tokenizer.json if you want that.
(I would keep the <s>, </s>, and <unk> tokens hard-coded as they are rather than dynamically loading them, because otherwise they would be self.s_token instead of self.bos_token.)
But if some models use extra configs that are not standardized, it seems rather difficult to implement.

And regarding #195, I wanted to kindly ask if you intend to merge it? That way I would not have to base my branches off of the boseos branch, and it's just an optional parameter anyway that doesn't break anything.

@turboderp (Owner)

I'll merge #195 when I have a moment to test it. It seems fine. This one is probably good too.

As for those extra tokens, you can take a look at this HF model for instance, based off of OASST.

added_tokens.json (note that the highest token ID defined is 32004, even though the vocabulary is extended to 32006)

{
  "<|assistant|>": 32004,
  "<|prefix_begin|>": 32000,
  "<|prefix_end|>": 32003,
  "<|prompter|>": 32002,
  "<|system|>": 32001
}

special_tokens_map.json (note how pad_token == eos_token for some reason)

{
  "additional_special_tokens": [
    "<|prompter|>",
    "<|system|>",
    "<|prefix_begin|>",
    "<|prefix_end|>",
    "<|assistant|>"
  ],
  "bos_token": {
    "content": "",
    "lstrip": false,
    "normalized": true,
    "rstrip": false,
    "single_word": false
  },
  "eos_token": "</s>",
  "pad_token": "</s>",
  "sep_token": "<s>",
  "unk_token": {
    "content": "",
    "lstrip": false,
    "normalized": true,
    "rstrip": false,
    "single_word": false
  }
}

tokenizer_config.json (tokens have empty content here for some reason)

{
  "add_bos_token": true,
  "add_eos_token": false,
  "bos_token": {
    "__type": "AddedToken",
    "content": "",
    "lstrip": false,
    "normalized": true,
    "rstrip": false,
    "single_word": false
  },
  "clean_up_tokenization_spaces": false,
  "eos_token": {
    "__type": "AddedToken",
    "content": "",
    "lstrip": false,
    "normalized": true,
    "rstrip": false,
    "single_word": false
  },
  "model_max_length": 1000000000000000019884624838656,
  "pad_token": null,
  "sp_model_kwargs": {},
  "tokenizer_class": "LlamaTokenizer",
  "unk_token": {
    "__type": "AddedToken",
    "content": "",
    "lstrip": false,
    "normalized": true,
    "rstrip": false,
    "single_word": false
  }
}

The model's config.json also defines token IDs for EOS, BOS, and PAD, and supposedly strings are also defined in the binary, tokenizer.model. So you see it gets a bit complicated and it's very unclear what to even do with the conflicting definitions. It has a lot of early web vibes, from back in the day when there really weren't any standards other than "the standard is whatever happens to work in IE6".

@SinanAkkoyun (Contributor, Author)

I see. I looked into this and found:

  • added_tokens.json is specific to OpenAssistant's Llama models
  • despite conflicts in added_tokens.json, tokenizer.json must always contain the added_tokens and always has them as the model intends

As for why the content is empty, no idea. But 90% of the models I've seen do support the standard added_tokens object inside the default tokenizer file; we should only support that (to also kind of enforce a standard, since all the extra JSONs just don't make any sense whatsoever).

All the OpenAssistant models I've looked into seemed kinda cursed: some save tokenizer.json in some binary format, others have the stuff you sent... If someone really wants them to work, they can change five lines in tokenizer.py; that's my take on all of that...

If you see it the same way, we could start working on parsing the tokenizer.json and tokenizer_config.json.
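
A minimal sketch of that parsing, assuming the HF tokenizers format where tokenizer.json carries a top-level added_tokens list of {"id": ..., "content": ..., "special": ...} entries (the function name is hypothetical):

    import json

    def load_added_tokens(path):
        # Return a {content: id} map of the tokens declared in a
        # HF-style tokenizer.json, for use by the special-token encoder.
        with open(path, encoding="utf-8") as f:
            data = json.load(f)
        return {t["content"]: t["id"] for t in data.get("added_tokens", [])}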
