Refactor: GGUF metadata tokenizer #389

polarathene · 2024-06-05T11:57:14Z

This was my progress on a refactor, but I didn't communicate this effort well with maintainers.

A similar effort was recently made and merged with a slightly different approach. I thus haven't adapted the from_gguf() methods in model files that could have used the approach I proposed here (a simple TryFrom on a struct of metadata fields). I was considering a derive macro to simplify that further, but it might have been overkill for the project 😅
I did propose upstreaming this into Candle directly, but like my other issues/PRs raised upstream I don't know if I'll get a timely response (my PR to the HF API crate still has no activity).

The merged alternative was rather extensive in changes, I don't have the motivation to go through all that. Some structs were renamed, methods changed, the gguf tokenizer file moved along with other changes that'll bring conflicts for rebasing.

If this PR were updated, it would simplify the from_gguf() metadata parameter extraction:

mistral.rs/mistralrs-core/src/models/quantized_llama.rs

Lines 274 to 287 in 89dea1b

    
           // Parameter extraction from metadata. 
        
           let n_expert = ct 
        
               .get_metadata("llama.expert_count") 
        
               .and_then(|v| v.to_u32()) 
        
               .unwrap_or(0) as usize; 
        
           let n_expert_used = ct 
        
               .get_metadata("llama.expert_used_count") 
        
               .and_then(|v| v.to_u32()) 
        
               .unwrap_or(0) as usize; 
        
           let head_count = ct.get_metadata("llama.attention.head_count")?.to_u32()? as usize; 
        
           let head_count_kv = ct.get_metadata("llama.attention.head_count_kv")?.to_u32()? as usize; 
        
           let block_count = ct.get_metadata("llama.block_count")?.to_u32()? as usize; 
        
           let embedding_length = ct.get_metadata("llama.embedding_length")?.to_u32()? as usize; 
        
           let rope_dim = ct.get_metadata("llama.rope.dimension_count")?.to_u32()? as usize;

But without the derive proc macro approach, due to all the usize and defaults, this PR may not be that worthwhile of an improvement.

The related tests changeset is probably still worth having. It fixes a bug with the encoder test case being misconfigured. It also removes the need for the dead_code annotations.

Each commit has a commit message for additional context.

Proper way to opt-out of the dead code warnings is annotate the test module as purely for testing.

Remove the repetitive noise to common functions under test. This also addresses a bug fix for the encode test case where the upstream encoder/decoder calls flip the meaning of the `bool` for handling special tokens.

Retrieving metadata items from their hashmap `Value` enum into primitive types with error handling is very verbose and noisy. Use traits to abstract all that away. This could also benefit usage in models `from_gguf()` methods. Meanwhile the unigram tokenizer has unified the special token handling at the end by keeping `unk` as a `u32` and only casting it to `usize` when actually needed.

The special token strings are also being created in the tokenizer now. A bit awkward, but unclear why only `unk` was an option, presumably the `bos` and `eos` may also need similar treatment to `unk`?

github-actions · 2024-06-05T11:58:21Z

Code Metrics Report

  ===============================================================================
 Language            Files        Lines         Code     Comments       Blanks
===============================================================================
 Dockerfile              1           34           25            0            9
 Happy                   1          442          369            0           73
 JSON                    9           21           21            0            0
 Python                 24          864          731           25          108
 TOML                   15          403          365            1           37
-------------------------------------------------------------------------------
 Jupyter Notebooks       1            0            0            0            0
 |- Markdown             1           60           30           22            8
 |- Python               1           96           87            1            8
 (Total)                            156          117           23           16
-------------------------------------------------------------------------------
 Markdown               16         1057            0          783          274
 |- BASH                 6          203          190            0           13
 |- Python               6          121          110            0           11
 |- Rust                 3          185          172            9            4
 (Total)                           1566          472          792          302
-------------------------------------------------------------------------------
 Rust                   94        29779        27098          437         2244
 |- Markdown            48          479            0          463           16
 (Total)                          30258        27098          900         2260
===============================================================================
 Total                 162        32600        28609         1246         2745
===============================================================================

EricLBuehler · 2024-06-05T12:02:48Z

Hi @polarathene thanks. If I roll back the changes such that #380 was effectively never merged, would that work for you? I did it locally already so I can just push it up.

EricLBuehler · 2024-06-05T12:13:08Z

Ok, I reverted and there is just 1 simple merge conflict now :)

polarathene · 2024-06-05T12:19:04Z

If I roll back the changes such that #380 was effectively never merged, would that work for you?

Thanks! Much appreciated 🙏

Sorry about this, I'll try to better communicate when I'm working on a refactor in future 😓

I hope I didn't waste any of your time. You still had some nice changes I think, such as better restructuring of the files (something I've not given too much thought to).

EricLBuehler · 2024-06-05T12:24:43Z

No problem! In the future, if you let me know in Discord, I'll see it immediately so we can coordinate better.

You still had some nice changes I think, such as better restructuring of the files (something I've not given too much thought to).

Maybe you can do something similar? A key point is that each GGUF model has an md_get closure, which we can refactor with our own Content type? Perhaps you can find a better way to do this though :)

mistralrs-core/src/pipeline/gguf_tokenizer.rs

mistralrs-core/src/utils/gguf_metadata.rs

polarathene · 2024-06-05T12:40:07Z

A key point is that each GGUF model has an md_get closure

That's no longer necessary if you take the approach I did here with PropsGGUF (not the best name) stuct, just impl a similar TryFrom that maps all the keys and it should look a little bit more terser.

which we can refactor with our own Content type?

I don't know if we need to beyond the MetadataContext (also not a great name choice) struct? Unless there are other parts that would benefit, I was just focused on a cleaner API for extracting the needed metadata fields :)

I'll take a look after some rest, past midnight here 😅

Failing test is for the last encode test case:

HF tokenizer has <s> prepended, that's the only difference in content vs GGUF tokenizer. I'm not sure why that is. I know there was a bug in this particular test case before I touched it, so this PR might have just surfaced what was already there, rather than something I've introduced?

EricLBuehler · 2024-06-05T12:52:40Z

I don't know if we need to beyond the MetadataContext (also not a great name choice) struct? Unless there are other parts that would benefit, I was just focused on a cleaner API for extracting the needed metadata fields :)

I think this should be fine. Maybe we can call it ContentMetadata?

HF tokenizer has prepended, that's the only difference in content vs GGUF tokenizer. I'm not sure why that is. I know there was a bug in this particular test case, so this might have just surfaced what was already there, rather than something I've introduced?

~~No rush! Maybe it did indeed surface that bug, perhaps we aren't adding the bos token to the tokenizer?~~

ShelbyJenkins · 2024-06-05T22:22:46Z

This is a great thing to have. As far as my research, there isn't currently a crate that does this!

Not sure if it's an error in this quant, but it's listed as tokenizer.ggml.model: gpt2.

Given how prolific this hugging face user is I imagine people are going to try their quants and run into the same issue I had (only support for llama). Should I reach out to them or is it something that will be handled by this library?

EricLBuehler · 2024-06-05T22:38:01Z

@ShelbyJenkins thank you!

Not sure if it's an error in this quant, but it's listed as tokenizer.ggml.model: gpt2.

Given how prolific this hugging face user is I imagine people are going to try their quants and run into the same issue I had (only support for llama). Should I reach out to them or is it something that will be handled by this library?

We do the conversion in this codebase, so I can add support for the gpt2 tokenizer type. However, I don't think it is a mislabeling because the other fields for GPT2 are there. I'll try to add this soon.

EricLBuehler · 2024-06-06T01:14:58Z

@polarathene I took a look at it, and I think that the GGUF one is actually wrong here. It is not prepending a BOS token where one should not be. Regardless, GGUF tokenizers are documented to be less accurate than the HF ones, and the GGML docs themselves recommend using the HF one.

I have not seen any effect of this error, as the chat template adds that anyway. This can, however, be an issue if you are not using a model with a chat template.

I also did a cursory search for an API through tokenizers to enable this prepending behavior but could not find any. Could you perhaps find what we are missing? A key difference is that we use a unigram tokenizer here (that is what the GGUF tokenizer provides for llama tokenizers), but the tokenizer.json which we are testing against uses a BPE tokenizer.

As another note, I plan on merging this before #397 (which #388 will need for testing) so I am excited to merge!

polarathene · 2024-06-06T05:16:17Z

Could you perhaps find what we are missing?

I'm not quite sure how to? Is there no information in the GGUF to ensure the correct tokenizer is used?

Even with the improved tokenizer support after unigram, is the GGUF still lacking the information to use the appropriate tokenizer?

As another note, I plan on merging this before #397 (which #388 will need for testing) so I am excited to merge!

Sorry for the delay, I'll get onto applying the requested changes.

I'm still not sure how to proceed with the test case issue though.

EricLBuehler · 2024-06-06T09:08:23Z

I'm not quite sure how to? Is there no information in the GGUF to ensure the correct tokenizer is used? Even with the improved tokenizer support after unigram, is the GGUF still lacking the information to use the appropriate tokenizer?

Yes, the GGUF tokenizer specification is not as precise as the HF one, and we even use different tokenizer types.

Sorry for the delay, I'll get onto applying the requested changes. I'm still not sure how to proceed with the test case issue though.

No problem! Given that we actually found a mistake in the GGUF tokenizer, I think we should go back to only testing if special tokens = false just to get the cases to pass. This will just ensure the body of the text is the same, and while it is not ideal, that inaccuracy is coming from the GGUF tokenizer.

This is a partial change, the method will be changed over in subsequent commits.

For the quantized models to leverage. Additionally changes helper methods over to `anyhow::Error`.

EricLBuehler · 2024-06-07T00:25:26Z

Provided the model doesn't change much, you could easily leverage actions/cache in the CI to restore it locally to avoid making an external request each CI run?

Thanks, I'll look into using a cache. The Windows test is a bit flaky for this reason.

I could open a TODO issue for that if you'd like, since you're going to update this file next. Or leave with the TODO comment reminders I have in the PR 😅

Sounds good. I'll probably resolve it quickly, as I will bundle the changes in the next PR.

Similar to the enum workaround. As the upstream builder is awkward to use, an alternative one is implemented to improve the DX. The enum conversion to upstream wrapper types is handled in the builder now, simplifying usage in a tokenizer method.

This test fails presently. It is due to the mismatch of the HF tokenizer vs GGUF tokenizer used.

This special character looks very similar to `_` but it is not. This was a mistake I introduced when converting to local enums approach.

polarathene

Context for recent commits.

Should be good to merge now?

mistralrs-core/src/pipeline/gguf_tokenizer.rs

EricLBuehler

Thank you for the changes and comments! Can you please hardcode the test passage so that we have a consistent input? It does not need to be excessively long, just enough to test the capabilities of the tokenizer.

EricLBuehler · 2024-06-07T09:19:55Z

mistralrs-core/src/pipeline/gguf_tokenizer.rs

    fn get_test_passage() -> String {
+        // TODO: Why is it necessary to depend on this for a multi-line test string?


Perhaps we can hardcode this? It should be easy.

polarathene · 2024-06-07T10:32:27Z

Can you please hardcode the test passage so that we have a consistent input?

Sure.

I haven't gone over the contents of this llama.cpp tokenizer test, but it's probably a good reference (lists some edge cases they encountered too where some models fail).

For now I'll just go with a single input based on some content from there.

mistralrs-core/src/pipeline/gguf_tokenizer.rs

Added context for this particular configuration.

polarathene · 2024-06-07T22:03:28Z

@EricLBuehler your recent merge was based off your earlier PR work that you since reverted, you merged that and now there's a bunch of conflicts for changes unrelated to Phi3 vision 😓

I'm not even able to resolve those via the web UI this time apparently (not that I particularly would want to):

Potentially, revert the merge, and fix the history there?

That PR was 184 commits long though, it looks there were other changes like dropping some docs (Adding models, CLI docs), so I don't know what else was done that was/wasn't intentional 🤷‍♂️

This is the commit for removing the two docs files, but there is no context in the commit, only a title "Overhaul docs": d7a7c3c

EricLBuehler · 2024-06-07T22:08:46Z

Ok, sounds good. Is this PR ready for merge?

polarathene · 2024-06-07T22:10:57Z

Is this PR ready for merge?

Yes?

You asked for the test to be updated and I addressed that roughly 11 hours ago. But now there is the conflicts.

EricLBuehler · 2024-06-07T22:20:44Z

Ok, thanks for confirming. I'll undo the merge of that PR so that this can merge cleanly. Sorry for the inconvenience!

mistralrs-core/src/pipeline/gguf_tokenizer.rs

EricLBuehler · 2024-06-07T23:47:13Z

Thank you!

polarathene · 2024-06-08T00:17:21Z

Just to confirm, was the removal of those mentioned doc pages intentional? Seemed like useful information, just a bit outdated?

EricLBuehler · 2024-06-08T00:28:09Z

Sorry, which docs pages were removed?

polarathene · 2024-06-08T00:43:32Z

which docs pages were removed?

I referenced them at the end of the comment above. "Adding Models" / "CLI".

I then said if that was unintentional that there may have been other mishaps, but there was quite a bit of changes in the PR you merged.

EricLBuehler · 2024-06-08T00:46:57Z

Oh I see. I removed the "Adding models" in favor of a comment in the README contributing section because of the rate of development regarding that section, and the "CLI" document was prone to being outdated. If you know of a better way to autogenerate the CLI docs that would be great, I think we could do it in CI.

polarathene · 2024-06-08T01:09:28Z

If you know of a better way to autogenerate the CLI docs that would be great, I think we could do it in CI.

Yeah, you could probably have a template file to produce the markdown content? Or switch to a docs generator that can import content from separate sources.

Then for the CI part, you would just get the output of the CLI help and if it doesn't match the importable snippet, it'd update that and generate the page again.

Another way with CI is to have that sort of behaviour triggered on an automated schedule rather than commit triggers. That can then create a PR with the update, I've not done that before myself but I'm aware of something like that 🤔

EricLBuehler · 2024-06-08T01:14:46Z

Ok, I'll look into that. But before starting work on this, do you see static CLI docs as a useful tool when users can simply download and run with --help? That was my rationale.

polarathene · 2024-06-08T05:13:31Z

do you see static CLI docs as a useful tool when users can simply download and run with --help? That was my rationale.

It was useful as an insight when I first came across mistral.rs and was looking through the README. I think that was more of an issue with the README at the time being a bit confusing with instructions, and I knew I'd have to build the CLI.

If there's no additional context though, it's probably not that meaningful :)

I had a look over the deleted document, it doesn't appear to provide any value beyond an overview to reference without the CLI available. I wouldn't worry about restoring it 👍

EricLBuehler · 2024-06-08T08:53:03Z

Great. I did a refactor of the README yesterday, do you see any parts that are confusing or hard to find?

polarathene added 7 commits June 4, 2024 22:57

tests: Use cfg(test) attribute to avoid dead_code warnings

1545395

Proper way to opt-out of the dead code warnings is annotate the test module as purely for testing.

tests: DRY codec test cases

ad6ca10

Remove the repetitive noise to common functions under test. This also addresses a bug fix for the encode test case where the upstream encoder/decoder calls flip the meaning of the `bool` for handling special tokens.

chore: Add TODO note regarding test remote data dependency

f3ba6d9

refactor: Extract unigram tokenizer out of match statement

c9651fa

The special token strings are also being created in the tokenizer now. A bit awkward, but unclear why only `unk` was an option, presumably the `bos` and `eos` may also need similar treatment to `unk`?

chore: rustfmt adjustments + notes

5417221

refactor: GGUF Unigram Tokenizer Vocab construction

fe24df7

EricLBuehler force-pushed the master branch from 89dea1b to 93e451d Compare June 5, 2024 12:12

Merge branch 'master' into refactor/gguf-metadata-tokenizer

0c78b31

EricLBuehler mentioned this pull request Jun 5, 2024

Support multiple GGUF files #379

Merged

Update gguf_tokenizer.rs

ea4fd54

EricLBuehler reviewed Jun 5, 2024

View reviewed changes

mistralrs-core/src/pipeline/gguf_tokenizer.rs Show resolved Hide resolved

mistralrs-core/src/utils/gguf_metadata.rs Outdated Show resolved Hide resolved

chore: Rename MetadataContext => ContentMetadata

fa70ffc

polarathene added 5 commits June 6, 2024 23:22

chore: verify_sanity_gguf() => verify_arch()

bbe4d00

This is a partial change, the method will be changed over in subsequent commits.

chore: Expand GGUF Value enum types support

4ee563a

For the quantized models to leverage. Additionally changes helper methods over to `anyhow::Error`.

refactor: GGUF metadata - quantized_llama.rs

ec16212

refactor: GGUF metadata - quantized_phi2.rs

4cf25e5

refactor: GGUF metadata - quantized_phi3.rs

c4dfe68

polarathene added 2 commits June 7, 2024 13:58

chore: MetadataContent path_prefix to &str

dba3024

This was referenced Jun 7, 2024

How to use TokenizerBuilder? huggingface/tokenizers#1549

Closed

Improve extracting values from gguf_file::Value huggingface/candle#2245

Open

This comment was marked as resolved.

Sign in to view

tests: Skip Decoder with special tokens

4b8d775

This test fails presently. It is due to the mismatch of the HF tokenizer vs GGUF tokenizer used.

polarathene force-pushed the refactor/gguf-metadata-tokenizer branch from 27c6cdf to 4b8d775 Compare June 7, 2024 02:57

fix: Decoder tests

67e972f

This special character looks very similar to `_` but it is not. This was a mistake I introduced when converting to local enums approach.

polarathene commented Jun 7, 2024

View reviewed changes

mistralrs-core/src/pipeline/gguf_tokenizer.rs Show resolved Hide resolved

mistralrs-core/src/pipeline/gguf_tokenizer.rs Show resolved Hide resolved

mistralrs-core/src/pipeline/gguf_tokenizer.rs Show resolved Hide resolved

EricLBuehler reviewed Jun 7, 2024

View reviewed changes

tests: Replace web request with hard-coded string

74b3319

polarathene commented Jun 7, 2024

View reviewed changes

mistralrs-core/src/pipeline/gguf_tokenizer.rs Show resolved Hide resolved

docs: Add maintenance reference comment

fe48b9c

Added context for this particular configuration.

EricLBuehler approved these changes Jun 7, 2024

View reviewed changes

mistralrs-core/src/pipeline/gguf_tokenizer.rs Show resolved Hide resolved

EricLBuehler merged commit 8b2d092 into EricLBuehler:master Jun 7, 2024
11 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Refactor: GGUF metadata tokenizer #389

Refactor: GGUF metadata tokenizer #389

polarathene commented Jun 5, 2024

github-actions bot commented Jun 5, 2024 •

edited

Loading

EricLBuehler commented Jun 5, 2024 •

edited

Loading

EricLBuehler commented Jun 5, 2024

polarathene commented Jun 5, 2024

EricLBuehler commented Jun 5, 2024 •

edited

Loading

polarathene commented Jun 5, 2024 •

edited

Loading

EricLBuehler commented Jun 5, 2024

ShelbyJenkins commented Jun 5, 2024

EricLBuehler commented Jun 5, 2024

EricLBuehler commented Jun 6, 2024 •

edited

Loading

polarathene commented Jun 6, 2024

EricLBuehler commented Jun 6, 2024 •

edited

Loading

EricLBuehler commented Jun 7, 2024

This comment was marked as resolved.

polarathene left a comment

EricLBuehler left a comment

EricLBuehler Jun 7, 2024

polarathene commented Jun 7, 2024

polarathene commented Jun 7, 2024 •

edited

Loading

EricLBuehler commented Jun 7, 2024

polarathene commented Jun 7, 2024

EricLBuehler commented Jun 7, 2024

EricLBuehler commented Jun 7, 2024

polarathene commented Jun 8, 2024

EricLBuehler commented Jun 8, 2024

polarathene commented Jun 8, 2024

EricLBuehler commented Jun 8, 2024

polarathene commented Jun 8, 2024

EricLBuehler commented Jun 8, 2024

polarathene commented Jun 8, 2024

EricLBuehler commented Jun 8, 2024

	// Parameter extraction from metadata.
	let n_expert = ct
	.get_metadata("llama.expert_count")
	.and_then(\|v\| v.to_u32())
	.unwrap_or(0) as usize;
	let n_expert_used = ct
	.get_metadata("llama.expert_used_count")
	.and_then(\|v\| v.to_u32())
	.unwrap_or(0) as usize;
	let head_count = ct.get_metadata("llama.attention.head_count")?.to_u32()? as usize;
	let head_count_kv = ct.get_metadata("llama.attention.head_count_kv")?.to_u32()? as usize;
	let block_count = ct.get_metadata("llama.block_count")?.to_u32()? as usize;
	let embedding_length = ct.get_metadata("llama.embedding_length")?.to_u32()? as usize;
	let rope_dim = ct.get_metadata("llama.rope.dimension_count")?.to_u32()? as usize;

		fn get_test_passage() -> String {
		// TODO: Why is it necessary to depend on this for a multi-line test string?

Refactor: GGUF metadata tokenizer #389

Refactor: GGUF metadata tokenizer #389

Conversation

polarathene commented Jun 5, 2024

github-actions bot commented Jun 5, 2024 • edited Loading

EricLBuehler commented Jun 5, 2024 • edited Loading

EricLBuehler commented Jun 5, 2024

polarathene commented Jun 5, 2024

EricLBuehler commented Jun 5, 2024 • edited Loading

polarathene commented Jun 5, 2024 • edited Loading

EricLBuehler commented Jun 5, 2024

ShelbyJenkins commented Jun 5, 2024

EricLBuehler commented Jun 5, 2024

EricLBuehler commented Jun 6, 2024 • edited Loading

polarathene commented Jun 6, 2024

EricLBuehler commented Jun 6, 2024 • edited Loading

EricLBuehler commented Jun 7, 2024

This comment was marked as resolved.

polarathene left a comment

Choose a reason for hiding this comment

EricLBuehler left a comment

Choose a reason for hiding this comment

EricLBuehler Jun 7, 2024

Choose a reason for hiding this comment

polarathene commented Jun 7, 2024

polarathene commented Jun 7, 2024 • edited Loading

EricLBuehler commented Jun 7, 2024

polarathene commented Jun 7, 2024

EricLBuehler commented Jun 7, 2024

EricLBuehler commented Jun 7, 2024

polarathene commented Jun 8, 2024

EricLBuehler commented Jun 8, 2024

polarathene commented Jun 8, 2024

EricLBuehler commented Jun 8, 2024

polarathene commented Jun 8, 2024

EricLBuehler commented Jun 8, 2024

polarathene commented Jun 8, 2024

EricLBuehler commented Jun 8, 2024

github-actions bot commented Jun 5, 2024 •

edited

Loading

EricLBuehler commented Jun 5, 2024 •

edited

Loading

EricLBuehler commented Jun 5, 2024 •

edited

Loading

polarathene commented Jun 5, 2024 •

edited

Loading

EricLBuehler commented Jun 6, 2024 •

edited

Loading

EricLBuehler commented Jun 6, 2024 •

edited

Loading

polarathene commented Jun 7, 2024 •

edited

Loading