
Fix spm whitespaces #2806

Merged: 3 commits into ggerganov:master on Aug 26, 2023
Conversation

@klosax (Collaborator) commented Aug 26, 2023

Move the prepending of a whitespace to the prompt, when using the spm tokenizer, out of the tokenizer and into user code.
This restores pre-gguf accuracy in HellaSwag for models using the spm tokenizer (#2321 (comment)).
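In user code the change amounts to something like the following (a minimal Python sketch of the idea; the actual PR does this in C++ in main.cpp, and `build_spm_prompt` is a hypothetical helper name):

```python
def build_spm_prompt(prompt: str) -> str:
    """Prepend a single space so spm tokenization matches the OG llama
    tokenizer; with this PR the caller does this, not the tokenizer."""
    return " " + prompt

print(build_spm_prompt("Hello, world!"))  # ' Hello, world!'
```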

@klosax klosax mentioned this pull request Aug 26, 2023
@KerfuffleV2 (Collaborator) left a comment


Sorry, my knowledge level for tokenizing stuff falls short of being able to review something like this. I really only have basic familiarity with the broad strokes.

@ikawrakow (Contributor) left a comment


The HellaSwag score is the same as pre-GGUF up to 400 tasks; it eventually diverges slightly. I'm running it right now with this PR, and for LLaMA-v2-7B I get 75.1 at task 1000 vs 75.2 pre-GGUF. I guess this is to be expected given all the other tokenizer changes?


if (llama_vocab_type(ctx) == LLAMA_VOCAB_TYPE_SPM) {
// Add a space in front of the first character to match OG llama tokenizer behavior
params.prompt.insert(0, 1, ' ');
Contributor

Why do we insist on automatically adding whitespace to the user's prompt? Isn't it better to let the user decide? As a user I can decide to prompt with -p "Blah blah" or -p " Blah blah". I know the LLM response will be different, and I can decide to do whatever I believe is better. The way it is here, I have no way to prompt the model without a space at the beginning of my prompt, even if I knew exactly that this would be better.

@klosax (Collaborator, Author)

You are right, but users in general have no clue that the sentencepiece tokenizer expects a space prepended to every word, including the first word in the prompt.
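A rough, stdlib-only illustration of why that leading space matters (this simulates only SentencePiece's whitespace escaping, not the full segmentation step):

```python
# SentencePiece escapes whitespace to the U+2581 metasymbol before
# segmenting, so a word preceded by a space is a different piece than
# the same word at the very start of the text.
META = "\u2581"  # '▁'

def spm_escape(text: str) -> str:
    return text.replace(" ", META)

print(spm_escape("Hello world"))   # 'Hello▁world'  (first word unmarked)
print(spm_escape(" Hello world"))  # '▁Hello▁world' (every word marked)
```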

Collaborator

I have no way to prompt the model without a space at the beginning of my prompt, even if I knew exactly that this would be better.

Is there any model that llama.cpp supports where it would actually be better? (Worst comes to worst, we could probably add a command-line argument to disable that behavior, but there would have to be a practical use case.)

@klosax klosax merged commit 2ba83c8 into ggerganov:master Aug 26, 2023
25 checks passed
@ggerganov (Owner) commented Aug 26, 2023

We are still doing something wrong somewhere. When I test the OG sentencepiece tokenizer with the following python code:

import argparse

from sentencepiece import SentencePieceProcessor

parser = argparse.ArgumentParser()
parser.add_argument("dir_tokenizer", help="directory containing 'tokenizer.model' file")
args = parser.parse_args()

dir_tokenizer = args.dir_tokenizer

tokenizer = SentencePieceProcessor(dir_tokenizer + '/tokenizer.model')

text = 'Hello, world!'
print(text)
print(tokenizer.encode(text, add_bos=True))
print(tokenizer.decode(tokenizer.encode(text, add_bos=True)))

I get the output:

Hello, world!
[1, 15043, 29892, 3186, 29991]
Hello, world!

Our tokenizer tokenizes correctly, however the de-tokenization incorrectly prefixes the text with a whitespace:

main : 'Hello, world!' tokenized to ' Hello, world!'

Also, I don't think that having to manually prefix the prompt with a whitespace in user code is correct; the Python tokenizer has no such requirement.

@klosax (Collaborator, Author) commented Aug 26, 2023

Also, I don't think that having to manually prefix the prompt with a whitespace in user code is correct; the Python tokenizer has no such requirement.

I just restored the pre-gguf behavior.

I cannot find any source that says prepending a whitespace to the prompt is the correct way.
Maybe we could remove it? Or at least make it optional, like the add_bos param to the tokenizer.
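The "make it optional" idea could look roughly like this (a Python sketch with a hypothetical `add_prefix_space` flag modeled on `add_bos`; this is not llama.cpp's actual API):

```python
BOS_ID = 1  # sentencepiece BOS id in the LLaMA vocab

def prepare_prompt(text: str, add_bos: bool = True,
                   add_prefix_space: bool = True):
    # Hypothetical wrapper: the space prepend becomes opt-out,
    # mirroring how add_bos already works.
    if add_prefix_space:
        text = " " + text
    prefix_ids = [BOS_ID] if add_bos else []
    # ...the real tokenizer would now encode `text`...
    return prefix_ids, text

print(prepare_prompt("Hello", add_prefix_space=False))  # ([1], 'Hello')
```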

@ggerganov (Owner)
I cannot find any source that says prepending a whitespace to the prompt is the correct way.

I don't know what the intent was when it was first added. The comment says "to match the OG tokenizer", but I'm not sure in what way it matches it.

Regarding the extra leading space during de-tokenization, I can see that in llama2.c they explicitly remove the leading space if the previous token is BOS:

https://github.com/karpathy/llama2.c/blob/49daf18f2f85cab239e80b8a452a2f78f295032f/run.c#L423-L426

I guess I'll add this to our de-tokenization code as well. This does not affect any of the generation results, but fixes a display problem and makes it consistent with the OG behaviour.
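The llama2.c logic referenced above can be sketched like this (a Python sketch; BOS id 1 matches the LLaMA sentencepiece vocab, and `detokenize_piece` is a hypothetical name, llama2.c does the same thing in C):

```python
BOS_ID = 1

def detokenize_piece(prev_token: int, piece: str) -> str:
    # Directly following BOS, strip the leading space that spm
    # tokenization inserted, so the decoded text round-trips
    # without a spurious prefix space.
    if prev_token == BOS_ID and piece.startswith(" "):
        piece = piece[1:]
    return piece

print(detokenize_piece(BOS_ID, " Hello"))   # 'Hello'
print(detokenize_piece(15043, ", world!"))  # ', world!'
```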

mattgauf added a commit to mattgauf/llama.cpp that referenced this pull request Aug 26, 2023
* master: (773 commits)
  server : add `/detokenize` endpoint (ggerganov#2802)
  convert.py : advanced option (ggerganov#2753)
  llama : use Unicode Escape Sequence to replace encoded characters (ggerganov#2814)
  flake.nix : add rocm support and cleanup (ggerganov#2808)
  llama : move #includes out of _GNU_SOURCE conditional (ggerganov#2817)
  main : fix bug (penalize_nl=false doesn't work) + suppress warning on mingw (ggerganov#1528)
  llama : use std::abs in llama_sample_tail_free (ggerganov#2800)
  k-quants : remove unnecessary tensor shape restrictions (ggerganov#2811)
  Better perplexity for 2- and 3-bit quantization for LLaMA-v2-70B (ggerganov#2807)
  Fix HellaSwag (ggerganov#2805)
  flake : build llama.cpp on Intel with nix (ggerganov#2795)
  Handle null rope scaling value (ggerganov#2793)
  Fix spm whitespaces (ggerganov#2806)
  examples : skip unnecessary external lib in server README.md how-to (ggerganov#2804)
  llama : fix struct decl (ggerganov#2790)
  Faster perplexity computation (ggerganov#2786)
  llama : add llama_beam_search() (ggerganov#2267)
  convert.py : Get rope scale from HuggingFace models (ggerganov#2772)
  llama-bench : add model sizes (ggerganov#2771)
  convert.py : export rope freq_base when converting CodeLlama from an HF model (ggerganov#2773)
  ...
akawrykow pushed a commit to akawrykow/llama.cpp that referenced this pull request Aug 29, 2023
* llama.cpp : fix spm whitespace escaping + clean up

* main.cpp : spm - add whitespace in front of prompt

* test-tokenizer-0.cpp : spm - add whitespace in front of prompt