
Fix detokenization of non-special added-tokens #3760

Closed

Conversation

@goerch (Collaborator) commented Oct 24, 2023

Checked with

.\build\bin\Release\main.exe -m models\mpt-7B-storywriter\ggml-model-f16.gguf -p "Once upon a time  there" --temp 0 -n 32

Before:

Once upon a timethere

After:

Once upon a time  there

Also fixes build problems with clang on Windows (not detected by CI?). We can't link with clip and common here due to duplicate symbol errors.
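
(For context: MPT's vocabulary contains non-special added tokens for runs of spaces, which is what the double space in the test prompt exercises. A quick, purely illustrative check of the HF reference behaviour, with the local model path as a placeholder:)

    # illustrative check against the HF tokenizer; the local path is an assumption
    from transformers import AutoTokenizer

    tok = AutoTokenizer.from_pretrained("models/mpt-7B-storywriter")
    ids = tok.encode("Once upon a time  there")
    print(ids)
    print(repr(tok.decode(ids)))  # HF round-trips the double space; this PR makes
                                  # llama.cpp's detokenizer do the same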

@goerch mentioned this pull request Oct 24, 2023
return -result.length();
}
memcpy(buf, result.c_str(), result.length());
return result.length();
} else if (llama_is_control_token(model->vocab, token)) {
A collaborator commented on the code above:

Seems redundant, since we ignore the token regardless of whether it's a control one or not.

@goerch (Collaborator, Author) Oct 24, 2023

Yes. But these are the token types I'd expect to see. In an older version the code asserted that no other token type could occur, but I'm not sure how we should react (assert, throw an exception, or ignore it).

@Galunid (Collaborator) commented Oct 24, 2023

I tested it with #3586 and it works when passing (replaced with __) as part of the prompt, thanks!

The best music is__the kind that makes you feel as though it was written for you, ad nothing more. There's a special way to listen to this music: the way a friend might play an old recording of their favourite band, over and over again, until the grooves are worn down to nubs. Or the way someone might tell a story they've never told anyone else about how they met that

@goerch (Collaborator, Author) commented Oct 24, 2023

When checking

    std::vector<llama_token> tokens = llama_tokenize(ctx, "[PAD50277]", false, true);

with MPT this resolves to token 50277. This looks wrong to me; should I try to fix this in the PR, too?

I checked the behaviour of the HF tokenizers with

# tests with HF tokenizer

import argparse

from transformers import AutoTokenizer

parser = argparse.ArgumentParser()
parser.add_argument("dir_tokenizer", help="directory containing 'tokenizer.model' file")
args = parser.parse_args()

dir_tokenizer = args.dir_tokenizer

tokenizer = AutoTokenizer.from_pretrained(dir_tokenizer)

added_vocab = tokenizer.get_added_vocab()
reverse_vocab = {id: encoded_tok for encoded_tok, id in tokenizer.vocab.items()}

with open("result.txt", "w", encoding="utf-8") as file:
    # the +1 probes one id past the last known token, to see how an out-of-vocabulary id decodes
    for i in range(tokenizer.vocab_size + len([id for id in added_vocab.values() if id >= tokenizer.vocab_size]) + 1):
        s = tokenizer.decode([i])
        e = tokenizer.encode(s)
        if i not in reverse_vocab:
            file.write(f"pad {i} {s} {e}\n")
        elif reverse_vocab[i] in added_vocab:
            if tokenizer.added_tokens_decoder[i].special:
                file.write(f"special {i} {s} {e}\n")
            else:
                file.write(f"added {i} {s} {e}\n")
        # else:
        #     file.write(f"normal {i} {s} {e}\n")

and get for llama-2-7B

special 0 <unk> [1, 0]
special 1 <s> [1, 1]
special 2 </s> [1, 2]
pad 32000  [1]

and for mpt

special 0 <|endoftext|> [0]
special 1 <|padding|> [1]
added 50254                          [50254]
added 50255                         [50255]
added 50256                        [50256]
added 50257                       [50257]
added 50258                      [50258]
added 50259                     [50259]
added 50260                    [50260]
added 50261                   [50261]
added 50262                  [50262]
added 50263                 [50263]
added 50264                [50264]
added 50265               [50265]
added 50266              [50266]
added 50267             [50267]
added 50268            [50268]
added 50269           [50269]
added 50270          [50270]
added 50271         [50271]
added 50272        [50272]
added 50273       [50273]
added 50274      [50274]
added 50275     [50275]
added 50276    [50276]
pad 50277  []

and for CausalLM

special 151643 <|endoftext|> [151643]
special 151644 <|im_start|> [151644]
special 151645 <|im_end|> [151645]
pad 151851  []

and for stablelm (stablelm-3b-4e1t)

special 0 <|endoftext|> [0]
special 1 <|padding|> [1]
added 50254                          [50254]
added 50255                         [50255]
added 50256                        [50256]
added 50257                       [50257]
added 50258                      [50258]
added 50259                     [50259]
added 50260                    [50260]
added 50261                   [50261]
added 50262                  [50262]
added 50263                 [50263]
added 50264                [50264]
added 50265               [50265]
added 50266              [50266]
added 50267             [50267]
added 50268            [50268]
added 50269           [50269]
added 50270          [50270]
added 50271         [50271]
added 50272        [50272]
added 50273       [50273]
added 50274      [50274]
added 50275     [50275]
added 50276    [50276]
pad 50277  []

This indicates to me that our special token handling (CONTROL) differs from HF's, and that our handling of padding tokens could be problematic.

@goerch (Collaborator, Author) commented Oct 24, 2023

Referencing a comment with my understanding of the situation, I see a couple of options:

  • more compatibility with the HF tokenizers (might be reasonably easy for our tokenizers, but with some risk of breaking the remainder of the stack?)
  • do nothing (token classification might be questionable for added special tokens and padding tokens)
  • fix the problem (bug?) of added special tokens being classified as USER_DEFINED (some risk of breaking the remainder of the stack)
  • fix the problem of padding tokens being classified as USER_DEFINED (what should the token type be: UNUSED?)

I'll need some guidance here on how to best proceed.

@cebtenzzre (Collaborator)

Weren't the padding tokens originally a hack anyway? I don't know that they should be included in the actual tokenizer vocabulary. They only exist because GGUF doesn't (currently) have a way to specify vocab_size.

# The number of tokens in tokenizer.json can differ from the expected vocab size.
# This causes downstream issues with mismatched tensor sizes when running the inference
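
(For reference, a rough sketch of the padding workaround being discussed, assuming the "[PAD{id}]" naming seen earlier and the TokenType enum from the gguf Python package; this is an illustration, not the actual convert script code:)

    import gguf

    def pad_token_list(tokens: list, toktypes: list, n_embd_rows: int) -> None:
        # fill the gap between the tokenizer's vocabulary and the size of the
        # model's embedding tensor with placeholder tokens such as "[PAD50277]",
        # so the GGUF token list matches the tensor shape
        for i in range(len(tokens), n_embd_rows):
            tokens.append(f"[PAD{i}]".encode("utf-8"))
            toktypes.append(gguf.TokenType.USER_DEFINED)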

@cmp-nct (Contributor) commented Oct 26, 2023

Weren't the padding tokens originally a hack anyway? I don't know that they should be included in the actual tokenizer vocabulary. They only exist because GGUF doesn't (currently) have a way to specify vocab_size.

# The number of tokens in tokenizer.json can differ from the expected vocab size.
# This causes downstream issues with mismatched tensor sizes when running the inference

I don't think that's the case, though I'm not sure about the origin of it.

Quite often, HF models are released with tensors and a declared vocabulary size larger than the real vocabulary. OpenAssistant (Falcon), for example, has 12 missing tokens compared to the tensor size and the specified vocab_size.

I believe the correct way is to generate PAD tokens; they can never be generated anyway, they are just placeholders. Originally they were probably also placeholders for continued fine-tuning that will likely never be used.

@cebtenzzre (Collaborator)

I believe the correct way is to generate PAD tokens

As far as I understand, HF transformers does not need these PAD tokens in the vocabulary, so why not make things simple and try to be as close to their behavior as we can? We are trying to get the tokenizer to ignore these tokens when they don't really exist in the first place. We could use the UNUSED token type, but it seems like unnecessary effort to generate placeholders just so they can be totally ignored by the implementation.
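
(A quick way to see the HF behaviour being described; the model path is a placeholder and the snippet is only illustrative:)

    from transformers import AutoConfig, AutoTokenizer

    model_dir = "models/mpt-7B-storywriter"  # placeholder path (assumption)
    tok = AutoTokenizer.from_pretrained(model_dir)
    cfg = AutoConfig.from_pretrained(model_dir, trust_remote_code=True)  # MPT ships a custom config class

    # the tokenizer knows fewer ids than the model's embedding has rows
    print(len(tok), cfg.vocab_size)

    # decoding an id the tokenizer has never seen (e.g. 50277 from the table
    # above) simply yields an empty string -- no placeholder tokens needed
    print(repr(tok.decode([50277])))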

@TheBloke (Contributor)

Yes, it'd be awesome if there was a fix that meant these extra tokens weren't needed. Recently CausalLM released with a whopping 213 missing tokens, comparing config.vocab_size to supplied vocabulary.

My 'fix' for that until now has been to add a bunch of <dummyXXX> tokens to added_tokens.json. Kerfluffle recently provided an official implementation of that via --padvocab, which I appreciate, though it's obviously still not a very elegant solution overall.

I agree with @cebtenzzre that it'd be really great if llama.cpp could just ignore these 'phantom' tokens, like transformers does.

As to why models do this: in some cases it's because they're rounding the vocab size up to a multiple so that tensor parallelism works, e.g. models that have a vocab size of 32032 or 32128. Their actual vocab might be, say, 32003, but that breaks TP, so they round up to the next nearest multiple. Though that's not why OpenAssistant did it.
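
(The rounding described here is just a ceiling to the next multiple; a minimal illustration, where the choice of 32 as the multiple is an assumption:)

    def round_up_vocab(n_vocab: int, multiple: int = 32) -> int:
        # round the vocabulary size up to the next multiple so the embedding
        # tensor splits evenly across tensor-parallel ranks
        return ((n_vocab + multiple - 1) // multiple) * multiple

    print(round_up_vocab(32003))  # 32032, as in the example above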

@cmp-nct (Contributor) commented Oct 26, 2023

You are right, ignoring the mismatch is a better way to solve it. But I'd argue the vocab size stored in the converted model should reflect the true vocabulary size and not carry over the wrong one.

Basically, it's misbehavior on the HF config side to claim a false vocabulary size, most likely to satisfy something in the transformers libraries, which in turn either ignore the wrong number or internally generate PADs.

Adding PAD tokens is/was a quick fix for a wrong config, so the options are either adding PAD tokens or correcting the wrong size and relying on the tensor size for calculations. That probably needs a fix in all eval routines and checks that rely on the vocab size.

@ggerganov (Owner) commented Jan 13, 2024

Continued here: #4916

@ggerganov closed this Jan 13, 2024