
Refactor: GGUF metadata tokenizer #389

Merged

Changes from 1 commit of 25
- 1545395 tests: Use `cfg(test)` attribute to avoid `dead_code` warnings (polarathene, Jun 4, 2024)
- ad6ca10 tests: DRY codec test cases (polarathene, Jun 4, 2024)
- f3ba6d9 chore: Add `TODO` note regarding test remote data dependency (polarathene, Jun 4, 2024)
- 18f1567 refactor: DRY metadata extraction (polarathene, Jun 4, 2024)
- c9651fa refactor: Extract `unigram` tokenizer out of match statement (polarathene, Jun 5, 2024)
- 5417221 chore: `rustfmt` adjustments + notes (polarathene, Jun 5, 2024)
- fe24df7 refactor: GGUF Unigram Tokenizer Vocab construction (polarathene, Jun 5, 2024)
- 0c78b31 Merge branch 'master' into refactor/gguf-metadata-tokenizer (polarathene, Jun 5, 2024)
- ea4fd54 Update gguf_tokenizer.rs (polarathene, Jun 5, 2024)
- fa70ffc chore: Rename `MetadataContext` => `ContentMetadata` (polarathene, Jun 6, 2024)
- bbe4d00 chore: `verify_sanity_gguf()` => `verify_arch()` (polarathene, Jun 6, 2024)
- 4ee563a chore: Expand GGUF `Value` enum types support (polarathene, Jun 6, 2024)
- ec16212 refactor: GGUF metadata - `quantized_llama.rs` (polarathene, Jun 6, 2024)
- 4cf25e5 refactor: GGUF metadata - `quantized_phi2.rs` (polarathene, Jun 6, 2024)
- c4dfe68 refactor: GGUF metadata - `quantized_phi3.rs` (polarathene, Jun 6, 2024)
- bbea097 refactor: GGUF metadata - X-LoRA llama + phi3 (polarathene, Jun 6, 2024)
- 86f538c tests: Skip encoder test case for special tokens (polarathene, Jun 6, 2024)
- 8bdc736 Update mistralrs-core/src/pipeline/gguf_tokenizer.rs (polarathene, Jun 6, 2024)
- b3705c3 refactor: Use convenience enums for Decoder and Normalizer inputs (polarathene, Jun 7, 2024)
- 130b1ac chore: Add a tokenizer builder workaround (polarathene, Jun 7, 2024)
- dba3024 chore: `MetadataContent` path_prefix to `&str` (polarathene, Jun 7, 2024)
- 4b8d775 tests: Skip Decoder with special tokens (polarathene, Jun 7, 2024)
- 67e972f fix: Decoder tests (polarathene, Jun 7, 2024)
- 74b3319 tests: Replace web request with hard-coded string (polarathene, Jun 7, 2024)
- fe48b9c docs: Add maintenance reference comment (polarathene, Jun 7, 2024)
chore: Add a tokenizer builder workaround
Similar to the enum workaround: as the upstream builder is awkward to use, an alternative is implemented to improve the DX.

The enum conversion to the upstream wrapper types is now handled in the builder, simplifying usage in the tokenizer method.
polarathene committed Jun 7, 2024
commit 130b1acecfeb7e9175e9e47e57ab481230942baf
66 changes: 50 additions & 16 deletions mistralrs-core/src/pipeline/gguf_tokenizer.rs
```diff
@@ -103,35 +103,41 @@ fn unigram_tokenizer(p: &PropsGGUF) -> Result<(Tokenizer, TokenizerKind, Vec<String>)> {
     // Unigram (SentencePiece) default UNK is 0
     let unk = unk.unwrap_or(0);
 
-    let vocab: Vec<(String, f64)> = {
-        let Some(s) = p.scores.as_ref() else {
-            anyhow::bail!(
-                "`llama` unigram tokenizer is missing required metadata `tokenizer.ggml.scores`"
-            );
-        };
-        let scores = s.iter().cloned().map(|f_32| f_32 as f64);
-
-        p.tokens.iter().cloned().zip(scores).collect()
-    };
+    // Create the Tokenizer model:
+    let model = {
+        let vocab: Vec<(String, f64)> = {
+            let Some(s) = p.scores.as_ref() else {
+                anyhow::bail!(
+                    "`llama` unigram tokenizer is missing required metadata `tokenizer.ggml.scores`"
+                );
+            };
+            let scores = s.iter().cloned().map(|f_32| f_32 as f64);
+
+            p.tokens.iter().cloned().zip(scores).collect()
+        };
+
+        Unigram::from(vocab, Some(unk as usize), true).map_err(anyhow::Error::msg)?
+    };
 
-    let unigram = Unigram::from(vocab, Some(unk as usize), true).map_err(anyhow::Error::msg)?;
-
-    let decoder = DecoderWrapper::try_from(Decoder::Sequence(vec![
+    let decoder = Decoder::Sequence(vec![
         Decoder::Replace("_", " "),
         Decoder::ByteFallback,
         Decoder::Fuse,
         Decoder::Strip(' ', 1, 0),
-    ]))?;
+    ]);
 
-    let normalizer = NormalizerWrapper::try_from(Normalizer::Sequence(vec![
+    let normalizer = Normalizer::Sequence(vec![
         Normalizer::Prepend("▁"),
         Normalizer::Replace(" ", "▁"),
-    ]))?;
+    ]);
 
-    let mut tokenizer = Tokenizer::new(ModelWrapper::Unigram(unigram));
-    tokenizer.with_decoder(decoder);
-    tokenizer.with_normalizer(normalizer);
+    let mut tokenizer: Tokenizer = TokenizerX::try_builder()
+        .with_model(model)
+        .with_decoder(decoder)
+        .with_normalizer(normalizer)
+        .build()?;
 
+    // Add special tokens (bos, eos, unk):
     let mut special_tokens = Vec::<String>::new();
     for token_id in [bos, eos, unk] {
         let token = p.tokens[token_id as usize].as_str();
@@ -143,6 +149,34 @@ fn unigram_tokenizer(p: &PropsGGUF) -> Result<(Tokenizer, TokenizerKind, Vec<String>)> {
     Ok((tokenizer, TokenizerKind::Unigram, special_tokens))
 }
 
+// This is a workaround to have a better builder API.
+// Upstream `TokenizerBuilder` is difficult to work with:
+// https://github.com/huggingface/tokenizers/issues/1549
+struct TokenizerX;
+#[buildstructor::buildstructor]
+impl TokenizerX {
+    #[builder]
+    fn try_new<'a>(
+        with_model: ModelWrapper,
+        with_decoder: Option<Decoder<'a>>,
+        with_normalizer: Option<Normalizer<'a>>,
+    ) -> Result<Tokenizer> {
+        let mut tokenizer = Tokenizer::new(with_model);
+
+        // Handle local enum to remote enum type:
+        if let Some(decoder) = with_decoder {
+            let d = DecoderWrapper::try_from(decoder)?;
+            tokenizer.with_decoder(d);
+        }
+        if let Some(normalizer) = with_normalizer {
+            let n = NormalizerWrapper::try_from(normalizer)?;
+            tokenizer.with_normalizer(n);
+        }
+
+        Ok(tokenizer)
+    }
+}
 
 // Convenient alternative to upstream:
 // https://docs.rs/tokenizers/latest/tokenizers/decoders/enum.DecoderWrapper.html
 enum Decoder<'a> {
```