How to use TokenizerBuilder? #1549

@polarathene

I expected TokenizerBuilder's build() to return a Tokenizer, but it returns a TokenizerImpl instead, which Tokenizer wraps.

No problem, I see that there is an impl From<TokenizerImpl> for Tokenizer, but the conversion seems to require quite a bit more for some reason? Meanwhile I cannot construct Tokenizer(unwrapped_build_result_here) directly because the struct's field is private 🤔 (and Tokenizer::new() won't accept the build result either).


let mut tokenizer = Tokenizer::from(TokenizerBuilder::new()
    .with_model(unigram)
    .with_decoder(Some(decoder))
    .with_normalizer(Some(normalizer))
    .build()
    .map_err(anyhow::Error::msg)?
);

error[E0283]: type annotations needed
   --> mistralrs-core/src/pipeline/gguf_tokenizer.rs:139:41
    |
139 |     let mut tokenizer = Tokenizer::from(TokenizerBuilder::new()
    |                                         ^^^^^^^^^^^^^^^^^^^^^ cannot infer type of the type parameter `PT` declared on the struct `TokenizerBuilder`
    |
    = note: cannot satisfy `_: tokenizers::PreTokenizer`
    = help: the following types implement trait `tokenizers::PreTokenizer`:
              tokenizers::pre_tokenizers::bert::BertPreTokenizer
              tokenizers::decoders::byte_level::ByteLevel
              tokenizers::pre_tokenizers::delimiter::CharDelimiterSplit
              tokenizers::pre_tokenizers::digits::Digits
              tokenizers::decoders::metaspace::Metaspace
              tokenizers::pre_tokenizers::punctuation::Punctuation
              tokenizers::pre_tokenizers::sequence::Sequence
              tokenizers::pre_tokenizers::split::Split
            and 4 others
note: required by a bound in `tokenizers::TokenizerBuilder::<M, N, PT, PP, D>::new`
   --> /root/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokenizers-0.19.1/src/tokenizer/mod.rs:314:9
    |
314 |     PT: PreTokenizer,
    |         ^^^^^^^^^^^^ required by this bound in `TokenizerBuilder::<M, N, PT, PP, D>::new`
...
319 |     pub fn new() -> Self {
    |            --- required by a bound in this associated function
help: consider specifying the generic arguments
    |
139 |     let mut tokenizer = Tokenizer::from(TokenizerBuilder::<tokenizers::models::unigram::Unigram, tokenizers::NormalizerWrapper, PT, PP, tokenizers::DecoderWrapper>::new()
    |                                                         +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

Why is this an issue? Isn't the point of a builder that you don't have to specify types for the optional components you never set?

cannot infer type of the type parameter `PT` declared on the struct `TokenizerBuilder`

I had a glance over the source on GitHub but didn't see an example or test that uses this API, and the docs don't really cover it either.


Meanwhile with Tokenizer instead of TokenizerBuilder this works:

let mut tokenizer = Tokenizer::new(tokenizers::ModelWrapper::Unigram(unigram));
tokenizer.with_decoder(decoder);
tokenizer.with_normalizer(normalizer);
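
For anyone hitting the same error, here is a minimal std-only sketch of what appears to be happening. It is a toy model, not the tokenizers source: Builder, Built, Erased, PreTok, NoPreTok and SplitOnSpace are all hypothetical stand-ins. The assumption is that a generic parameter which no builder method ever sets cannot be inferred, so it has to be named explicitly with a turbofish, while parameters that are set (here the model) can be left as `_`.

```rust
// Toy stand-ins only; none of these are the real `tokenizers` types.
trait PreTok {}
struct NoPreTok;
struct SplitOnSpace;
impl PreTok for NoPreTok {}
impl PreTok for SplitOnSpace {}

// What `build()` returns: still generic over every component type.
struct Built<M, PT> {
    model: M,
    pre_tokenizer: Option<PT>,
}

struct Builder<M, PT> {
    model: Option<M>,
    pre_tokenizer: Option<PT>,
}

impl<M, PT: PreTok> Builder<M, PT> {
    fn new() -> Self {
        Builder { model: None, pre_tokenizer: None }
    }

    fn with_model(mut self, model: M) -> Self {
        self.model = Some(model);
        self
    }

    fn build(self) -> Result<Built<M, PT>, String> {
        let model = self.model.ok_or_else(|| "model missing".to_string())?;
        Ok(Built { model, pre_tokenizer: self.pre_tokenizer })
    }
}

// A non-generic wrapper, playing the role that Tokenizer plays for TokenizerImpl.
struct Erased {
    model: String,
    has_pre_tokenizer: bool,
}

// The conversion accepts any PT, so it gives inference nothing to pin PT to.
impl<M: Into<String>, PT: PreTok> From<Built<M, PT>> for Erased {
    fn from(built: Built<M, PT>) -> Self {
        Erased {
            model: built.model.into(),
            has_pre_tokenizer: built.pre_tokenizer.is_some(),
        }
    }
}

fn build_erased() -> Result<Erased, String> {
    // This line is rejected with "type annotations needed", because PT is never set:
    // let built = Builder::new().with_model("unigram").build()?;

    // Naming only the unset parameter is enough; M is inferred from with_model().
    let built = Builder::<_, NoPreTok>::new().with_model("unigram").build()?;
    Ok(Erased::from(built))
}

fn main() {
    let erased = build_erased().expect("the builder was given a model");
    println!("model={} has_pre_tokenizer={}", erased.model, erased.has_pre_tokenizer);
}
```

If the same reasoning carries over, the PT and PP placeholders in the compiler's suggestion above would need concrete types. The crate's tokenizers::PreTokenizerWrapper and tokenizers::PostProcessorWrapper look like the natural candidates, but that is an untested assumption against 0.19.1.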
