Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refactor: GGUF metadata tokenizer #389

Merged

Conversation

polarathene
Copy link
Contributor

This was my progress on a refactor, but I didn't communicate this effort well with maintainers.

  • A similar effort was recently made and merged with a slightly different approach. I thus haven't adapted the from_gguf() methods in model files that could have used the approach I proposed here (a simple TryFrom on a struct of metadata fields). I was considering a derive macro to simplify that further, but it might have been overkill for the project 😅
  • I did propose upstreaming this into Candle directly, but like my other issues/PRs raised upstream I don't know if I'll get a timely response (my PR to the HF API crate still has no activity).

The merged alternative was rather extensive in changes, I don't have the motivation to go through all that. Some structs were renamed, methods changed, the gguf tokenizer file moved along with other changes that'll bring conflicts for rebasing.


If this PR were updated, it would simplify the from_gguf() metadata parameter extraction:

// Parameter extraction from metadata.
let n_expert = ct
.get_metadata("llama.expert_count")
.and_then(|v| v.to_u32())
.unwrap_or(0) as usize;
let n_expert_used = ct
.get_metadata("llama.expert_used_count")
.and_then(|v| v.to_u32())
.unwrap_or(0) as usize;
let head_count = ct.get_metadata("llama.attention.head_count")?.to_u32()? as usize;
let head_count_kv = ct.get_metadata("llama.attention.head_count_kv")?.to_u32()? as usize;
let block_count = ct.get_metadata("llama.block_count")?.to_u32()? as usize;
let embedding_length = ct.get_metadata("llama.embedding_length")?.to_u32()? as usize;
let rope_dim = ct.get_metadata("llama.rope.dimension_count")?.to_u32()? as usize;

But without the derive proc macro approach, due to all the usize and defaults, this PR may not be that worthwhile of an improvement.


The related tests changeset is probably still worth having. It fixes a bug with the encoder test case being misconfigured. It also removes the need for the dead_code annotations.

Each commit has a commit message for additional context.

Proper way to opt-out of the dead code warnings is annotate the test module as purely for testing.
Remove the repetitive noise to common functions under test.

This also addresses a bug fix for the encode test case where the upstream encoder/decoder calls flip the meaning of the `bool` for handling special tokens.
Retrieving metadata items from their hashmap `Value` enum into primitive types with error handling is very verbose and noisy.

Use traits to abstract all that away. This could also benefit usage in models `from_gguf()` methods.

Meanwhile the unigram tokenizer has unified the special token handling at the end by keeping `unk` as a `u32` and only casting it to `usize` when actually needed.
The special token strings are also being created in the tokenizer now. A bit awkward, but unclear why only `unk` was an option, presumably the `bos` and `eos` may also need similar treatment to `unk`?
Copy link

github-actions bot commented Jun 5, 2024

Code Metrics Report
  ===============================================================================
 Language            Files        Lines         Code     Comments       Blanks
===============================================================================
 Dockerfile              1           34           25            0            9
 Happy                   1          442          369            0           73
 JSON                    9           21           21            0            0
 Python                 24          864          731           25          108
 TOML                   15          403          365            1           37
-------------------------------------------------------------------------------
 Jupyter Notebooks       1            0            0            0            0
 |- Markdown             1           60           30           22            8
 |- Python               1           96           87            1            8
 (Total)                            156          117           23           16
-------------------------------------------------------------------------------
 Markdown               16         1057            0          783          274
 |- BASH                 6          203          190            0           13
 |- Python               6          121          110            0           11
 |- Rust                 3          185          172            9            4
 (Total)                           1566          472          792          302
-------------------------------------------------------------------------------
 Rust                   94        29779        27098          437         2244
 |- Markdown            48          479            0          463           16
 (Total)                          30258        27098          900         2260
===============================================================================
 Total                 162        32600        28609         1246         2745
===============================================================================
  

@EricLBuehler
Copy link
Owner

EricLBuehler commented Jun 5, 2024

Hi @polarathene thanks. If I roll back the changes such that #380 was effectively never merged, would that work for you? I did it locally already so I can just push it up.

@EricLBuehler
Copy link
Owner

Ok, I reverted and there is just 1 simple merge conflict now :)

@polarathene
Copy link
Contributor Author

If I roll back the changes such that #380 was effectively never merged, would that work for you?

Thanks! Much appreciated 🙏

Sorry about this, I'll try to better communicate when I'm working on a refactor in future 😓

I hope I didn't waste any of your time. You still had some nice changes I think, such as better restructuring of the files (something I've not given too much thought to).

@EricLBuehler
Copy link
Owner

EricLBuehler commented Jun 5, 2024

No problem! In the future, if you let me know in Discord, I'll see it immediately so we can coordinate better.

You still had some nice changes I think, such as better restructuring of the files (something I've not given too much thought to).

Maybe you can do something similar? A key point is that each GGUF model has an md_get closure, which we can refactor with our own Content type? Perhaps you can find a better way to do this though :)

@polarathene
Copy link
Contributor Author

polarathene commented Jun 5, 2024

A key point is that each GGUF model has an md_get closure

That's no longer necessary if you take the approach I did here with PropsGGUF (not the best name) stuct, just impl a similar TryFrom that maps all the keys and it should look a little bit more terser.

which we can refactor with our own Content type?

I don't know if we need to beyond the MetadataContext (also not a great name choice) struct? Unless there are other parts that would benefit, I was just focused on a cleaner API for extracting the needed metadata fields :)


I'll take a look after some rest, past midnight here 😅

Failing test is for the last encode test case:

HF tokenizer has <s> prepended, that's the only difference in content vs GGUF tokenizer. I'm not sure why that is. I know there was a bug in this particular test case before I touched it, so this PR might have just surfaced what was already there, rather than something I've introduced?

@EricLBuehler
Copy link
Owner

I don't know if we need to beyond the MetadataContext (also not a great name choice) struct? Unless there are other parts that would benefit, I was just focused on a cleaner API for extracting the needed metadata fields :)

I think this should be fine. Maybe we can call it ContentMetadata?

HF tokenizer has prepended, that's the only difference in content vs GGUF tokenizer. I'm not sure why that is. I know there was a bug in this particular test case, so this might have just surfaced what was already there, rather than something I've introduced?

No rush! Maybe it did indeed surface that bug, perhaps we aren't adding the bos token to the tokenizer?

@ShelbyJenkins
Copy link

This is a great thing to have. As far as my research, there isn't currently a crate that does this!

Not sure if it's an error in this quant, but it's listed as tokenizer.ggml.model: gpt2.

Given how prolific this hugging face user is I imagine people are going to try their quants and run into the same issue I had (only support for llama). Should I reach out to them or is it something that will be handled by this library?

@EricLBuehler
Copy link
Owner

@ShelbyJenkins thank you!

Not sure if it's an error in this quant, but it's listed as tokenizer.ggml.model: gpt2.

Given how prolific this hugging face user is I imagine people are going to try their quants and run into the same issue I had (only support for llama). Should I reach out to them or is it something that will be handled by this library?

We do the conversion in this codebase, so I can add support for the gpt2 tokenizer type. However, I don't think it is a mislabeling because the other fields for GPT2 are there. I'll try to add this soon.

@EricLBuehler
Copy link
Owner

EricLBuehler commented Jun 6, 2024

@polarathene I took a look at it, and I think that the GGUF one is actually wrong here. It is not prepending a BOS token where one should not be. Regardless, GGUF tokenizers are documented to be less accurate than the HF ones, and the GGML docs themselves recommend using the HF one.

I have not seen any effect of this error, as the chat template adds that anyway. This can, however, be an issue if you are not using a model with a chat template.

I also did a cursory search for an API through tokenizers to enable this prepending behavior but could not find any. Could you perhaps find what we are missing? A key difference is that we use a unigram tokenizer here (that is what the GGUF tokenizer provides for llama tokenizers), but the tokenizer.json which we are testing against uses a BPE tokenizer.

As another note, I plan on merging this before #397 (which #388 will need for testing) so I am excited to merge!

@polarathene
Copy link
Contributor Author

Could you perhaps find what we are missing?

I'm not quite sure how to? Is there no information in the GGUF to ensure the correct tokenizer is used?

Even with the improved tokenizer support after unigram, is the GGUF still lacking the information to use the appropriate tokenizer?


As another note, I plan on merging this before #397 (which #388 will need for testing) so I am excited to merge!

Sorry for the delay, I'll get onto applying the requested changes.

I'm still not sure how to proceed with the test case issue though.

@EricLBuehler
Copy link
Owner

EricLBuehler commented Jun 6, 2024

I'm not quite sure how to? Is there no information in the GGUF to ensure the correct tokenizer is used? Even with the improved tokenizer support after unigram, is the GGUF still lacking the information to use the appropriate tokenizer?

Yes, the GGUF tokenizer specification is not as precise as the HF one, and we even use different tokenizer types.

Sorry for the delay, I'll get onto applying the requested changes. I'm still not sure how to proceed with the test case issue though.

No problem! Given that we actually found a mistake in the GGUF tokenizer, I think we should go back to only testing if special tokens = false just to get the cases to pass. This will just ensure the body of the text is the same, and while it is not ideal, that inaccuracy is coming from the GGUF tokenizer.

This is a partial change, the method will be changed over in subsequent commits.
For the quantized models to leverage.

Additionally changes helper methods over to `anyhow::Error`.
@EricLBuehler
Copy link
Owner

Provided the model doesn't change much, you could easily leverage actions/cache in the CI to restore it locally to avoid making an external request each CI run?

Thanks, I'll look into using a cache. The Windows test is a bit flaky for this reason.

I could open a TODO issue for that if you'd like, since you're going to update this file next. Or leave with the TODO comment reminders I have in the PR 😅

Sounds good. I'll probably resolve it quickly, as I will bundle the changes in the next PR.

Similar to the enum workaround. As the upstream builder is awkward to use, an alternative one is implemented to improve the DX.

The enum conversion to upstream wrapper types is handled in the builder now, simplifying usage in a tokenizer method.
@polarathene

This comment was marked as resolved.

This test fails presently. It is due to the mismatch of the HF tokenizer vs GGUF tokenizer used.
@polarathene polarathene force-pushed the refactor/gguf-metadata-tokenizer branch from 27c6cdf to 4b8d775 Compare June 7, 2024 02:57
This special character looks very similar to `_` but it is not. This was a mistake I introduced when converting to local enums approach.
Copy link
Contributor Author

@polarathene polarathene left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Context for recent commits.

Should be good to merge now?

Copy link
Owner

@EricLBuehler EricLBuehler left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the changes and comments! Can you please hardcode the test passage so that we have a consistent input? It does not need to be excessively long, just enough to test the capabilities of the tokenizer.

fn get_test_passage() -> String {
// TODO: Why is it necessary to depend on this for a multi-line test string?
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps we can hardcode this? It should be easy.

@polarathene
Copy link
Contributor Author

Can you please hardcode the test passage so that we have a consistent input?

Sure.

I haven't gone over the contents of this llama.cpp tokenizer test, but it's probably a good reference (lists some edge cases they encountered too where some models fail).

For now I'll just go with a single input based on some content from there.

Added context for this particular configuration.
@polarathene
Copy link
Contributor Author

polarathene commented Jun 7, 2024

@EricLBuehler your recent merge was based off your earlier PR work that you since reverted, you merged that and now there's a bunch of conflicts for changes unrelated to Phi3 vision 😓

I'm not even able to resolve those via the web UI this time apparently (not that I particularly would want to):

image

Potentially, revert the merge, and fix the history there?

That PR was 184 commits long though, it looks there were other changes like dropping some docs (Adding models, CLI docs), so I don't know what else was done that was/wasn't intentional 🤷‍♂️


This is the commit for removing the two docs files, but there is no context in the commit, only a title "Overhaul docs": d7a7c3c

@EricLBuehler
Copy link
Owner

Ok, sounds good. Is this PR ready for merge?

@polarathene
Copy link
Contributor Author

Is this PR ready for merge?

Yes?

You asked for the test to be updated and I addressed that roughly 11 hours ago. But now there is the conflicts.

@EricLBuehler
Copy link
Owner

Ok, thanks for confirming. I'll undo the merge of that PR so that this can merge cleanly. Sorry for the inconvenience!

@EricLBuehler EricLBuehler merged commit 8b2d092 into EricLBuehler:master Jun 7, 2024
11 checks passed
@EricLBuehler
Copy link
Owner

Thank you!

@polarathene
Copy link
Contributor Author

Just to confirm, was the removal of those mentioned doc pages intentional? Seemed like useful information, just a bit outdated?

@EricLBuehler
Copy link
Owner

Sorry, which docs pages were removed?

@polarathene
Copy link
Contributor Author

which docs pages were removed?

I referenced them at the end of the comment above. "Adding Models" / "CLI".

I then said if that was unintentional that there may have been other mishaps, but there was quite a bit of changes in the PR you merged.

@EricLBuehler
Copy link
Owner

Oh I see. I removed the "Adding models" in favor of a comment in the README contributing section because of the rate of development regarding that section, and the "CLI" document was prone to being outdated. If you know of a better way to autogenerate the CLI docs that would be great, I think we could do it in CI.

@polarathene
Copy link
Contributor Author

If you know of a better way to autogenerate the CLI docs that would be great, I think we could do it in CI.

Yeah, you could probably have a template file to produce the markdown content? Or switch to a docs generator that can import content from separate sources.

Then for the CI part, you would just get the output of the CLI help and if it doesn't match the importable snippet, it'd update that and generate the page again.

Another way with CI is to have that sort of behaviour triggered on an automated schedule rather than commit triggers. That can then create a PR with the update, I've not done that before myself but I'm aware of something like that 🤔

@EricLBuehler
Copy link
Owner

Ok, I'll look into that. But before starting work on this, do you see static CLI docs as a useful tool when users can simply download and run with --help? That was my rationale.

@polarathene
Copy link
Contributor Author

do you see static CLI docs as a useful tool when users can simply download and run with --help? That was my rationale.

It was useful as an insight when I first came across mistral.rs and was looking through the README. I think that was more of an issue with the README at the time being a bit confusing with instructions, and I knew I'd have to build the CLI.

If there's no additional context though, it's probably not that meaningful :)

I had a look over the deleted document, it doesn't appear to provide any value beyond an overview to reference without the CLI available. I wouldn't worry about restoring it 👍

@EricLBuehler
Copy link
Owner

Great. I did a refactor of the README yesterday, do you see any parts that are confusing or hard to find?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants