Fix parsing single-byte UTF-8 tokens by manually parsing the protobuf #73
Nice! You will still have to update the tokenizer in the C++ code quite a bit. I think this is a good test prompt to verify that it is working: 关于爱因斯坦的生平。他出生于 (roughly "About Einstein's life. He was born in"). If that fails, you can try just this one character repeated as the prompt: 篇篇篇篇篇篇
You will also need to replace spaces in the input text with the Unicode underscore you now use in the Python script, so that the tokenizer can find any token containing a space.
Seems fine to me? (Using Apple Terminal). The token readout at the start is messed up as expected (since some of the tokens aren’t valid UTF-8 strings) but that’s fine IMO.
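To illustrate why the per-token readout looks garbled while the final text does not, here is a minimal sketch. It assumes the character 篇 is split into three single-byte tokens (its UTF-8 encoding is E7 AF 87); the token list is made up for illustration and is not taken from a real model run.

```python
import codecs

# Hypothetical token byte strings: three single-byte tokens that together
# encode "篇", followed by an ordinary ASCII token.
token_bytes = [b"\xe7", b"\xaf", b"\x87", b" ok"]

# Decoding each token on its own cannot succeed for the single bytes,
# so each one becomes U+FFFD (the replacement character).
per_token = "".join(t.decode("utf-8", errors="replace") for t in token_bytes)

# An incremental decoder buffers incomplete sequences until they are complete.
decoder = codecs.getincrementaldecoder("utf-8")(errors="replace")
streamed = "".join(decoder.decode(t) for t in token_bytes)
streamed += decoder.decode(b"", final=True)  # flush anything left in the buffer

print(repr(per_token))
print(repr(streamed))
```

Writing the raw bytes straight to the terminal, as the program does for generated text, has the same effect as the buffered variant, since the terminal reassembles the bytes itself.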
Compare the 13B model (without my patch):
Those underscore characters are in the token file, so I'm replacing them with a regular space when constructing the ggml bin file. I don't think the C++ code needs to be updated to handle that?
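The two options discussed here (mapping U+2581 to a space when the vocabulary is written, or mapping spaces in the prompt to U+2581 before lookup) might be sketched as follows. The helper names are hypothetical and do not come from the PR.

```python
SPACE_MARKER = "\u2581"  # U+2581 LOWER ONE EIGHTH BLOCK, used in the token file for spaces


def piece_to_plain(piece: str) -> str:
    """Option used in this PR: store vocabulary pieces with ordinary spaces."""
    return piece.replace(SPACE_MARKER, " ")


def prompt_to_marked(prompt: str) -> str:
    """Alternative: keep the marker in the vocabulary and convert the prompt."""
    return prompt.replace(" ", SPACE_MARKER)


vocab_piece = SPACE_MARKER + "hello"      # a piece as it appears in the token file
plain_piece = piece_to_plain(vocab_piece)  # what would be written to the bin file
marked_prompt = prompt_to_marked("say hello")
```

Either direction lets a piece containing a space match the prompt text; converting at export time keeps the consumer of the bin file unchanged.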
Oh, interesting, you are right, that is good. But beware of the input: what does the program report about your input prompt? From this code I would assume that it cannot find the tokens, so your input may be garbled, and a garbled input can still produce correct-looking output.
See here:
The tokenizer in this code can only return one token per string, and you need multiple tokens for a string. Edit: maybe I'm wrong, that was the wrong function!
Maybe it just works!?
Please check the sequence of tokens. Using the tokenizer I get this, and yours should match (my readout is also garbled at the start; that is a consequence of the other code there):
So
Looks right:
That's beautiful, ship it! But now I have to regenerate my models :(
I think this might work to avoid using protobuf?

    for i in range(32000):
        if tokenizer.is_unknown(i):
            # "<unk>" token (translated as ??)
            text = " \u2047 ".encode("utf-8")
            fout.write(struct.pack("i", len(text)))
            fout.write(text)
        elif tokenizer.is_control(i):
            # "<s>"/"</s>" tokens
            fout.write(struct.pack("i", 0))
        elif tokenizer.is_byte(i):
            # "<U+XX>" tokens (which may be invalid UTF-8)
            piece = tokenizer.id_to_piece(i)
            if len(piece) != 6:
                print("Invalid token: " + piece)
                sys.exit(1)
            byte_value = int(piece[3:-1], 16)
            fout.write(struct.pack("i", 1))
            fout.write(struct.pack("B", byte_value))
        else:
            # normal token. Uses U+2581 (LOWER ONE EIGHTH BLOCK) to represent spaces.
            text = tokenizer.id_to_piece(i).replace("\u2581", " ").encode("utf-8")
            fout.write(struct.pack("i", len(text)))
            fout.write(text)

I can see that it writes the correct bytes, but my terminal has a hard time handling them for some reason.
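As a rough check of the record layout used above (a native `int` length followed by that many raw bytes), the following sketch writes a few tokens to an in-memory buffer and reads them back. The helper names are hypothetical, and the six-character byte-piece form such as `<0xE7>` is an assumption about how the tokenizer spells byte tokens; the slice `piece[3:-1]` only relies on the hex digits sitting at that position.

```python
import io
import struct

INT_SIZE = struct.calcsize("i")  # size of the native int used for the length prefix


def write_token(fout, data: bytes) -> None:
    """Write one length-prefixed token record."""
    fout.write(struct.pack("i", len(data)))
    fout.write(data)


def read_tokens(fin, count: int):
    """Read `count` length-prefixed token records back as raw bytes."""
    tokens = []
    for _ in range(count):
        (length,) = struct.unpack("i", fin.read(INT_SIZE))
        tokens.append(fin.read(length))
    return tokens


def byte_piece_to_bytes(piece: str) -> bytes:
    """Convert a six-character byte piece (assumed form "<0xE7>") to one raw byte."""
    if len(piece) != 6:
        raise ValueError("Invalid token: " + piece)
    return bytes([int(piece[3:-1], 16)])


buf = io.BytesIO()
for piece in ["<0xE7>", "<0xAF>", "<0x87>"]:   # the three UTF-8 bytes of "篇"
    write_token(buf, byte_piece_to_bytes(piece))
write_token(buf, "\u2581the".replace("\u2581", " ").encode("utf-8"))  # a normal token

buf.seek(0)
tokens = read_tokens(buf, 4)
```

Each byte token on its own is not valid UTF-8, but concatenating the three records yields the encoding of 篇, which is why the bytes can be correct even when a per-token printout is not.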
@kharvd Yes, that is true. I'm somewhat confused, because sentencepiece itself uses protobuf. Maybe the C++ version compiled into the Python wheel has protobuf built in, which makes it impossible to use as a subpackage of sentencepiece? Either way, I think that approach also works. The irony is that we don't want to use the sentencepiece C++ library, so instead we will require the sentencepiece Python library 😅 Here is the PR to include the sentencepiece C++ library. Happy to close it if we merge this masterpiece. Some questions remain, though: with this change the model files are no longer portable to the webui and similar projects. If we used the C++ version, I think we could have portable model files floating around between the two projects.
Oh yeah, I figured out why my terminal still made weird characters: it's the |
Sample output:
@kharvd What model are you using there? The Google Translate of your output appears to be gibberish. I think we need a translator :) Here is an example from me with the 16B model: 关于爱因斯坦的生平。他出生于1856年,是一位欧洲科学家和教育家。他在1902年获得了诺基丛大学院士学位。 I don't have 7B ready at the moment, but I didn't think it would be that bad. Here is the Google Translate of your output:
This is 7B
Ah, no worries. With your settings I also get gibberish at 16B.
Try
Here's 13B with default parameters:
"About Einstein's life. Born in 1856, he was a German chemist, astronomer and thermostat researcher. Found in high-flying aircraft carriers in the early 20th century, Einstein used"
Closing, #79 is better.
(processing with grep, less, etc.)
Everything seems to be working fine after regenerating and requantizing the 7B model!
There may still be issues with printing the tokens; my quantization step hasn't finished yet, so I haven't tested the updated models.
I decided to vendor the protobuf file (and the .py file generated via
protoc --python_out=. sentencepiece_model.proto
) since they are very unlikely to change and so that the install process can remain simple.