Added Tokenizer's APIs for v2 #7512

@tarekgh

Description

This issue tracks the APIs added to the tokenizer library.

Proposal

BpeTokenizer

We’ve already established a pattern for creating tokenizers using Tokenizer.Create(...). When multiple parameters are required, we wrap them in an Options object. For example, BertTokenizer.Create accepts a BertOptions parameter.

Following this pattern, we’re adding a new Create method to BpeTokenizer and introducing the BpeOptions class to encapsulate the parameters passed to Create. Currently, BpeTokenizer has a Create method that takes flat parameters:

        public static BpeTokenizer Create(
                                string vocabFile,
                                string? mergesFile,
                                PreTokenizer? preTokenizer = null,
                                Normalizer? normalizer = null,
                                IReadOnlyDictionary<string, int>? specialTokens = null,
                                string? unknownToken = null,
                                string? continuingSubwordPrefix = null,
                                string? endOfWordSuffix = null,
                                bool fuseUnknownTokens = false)

The proposal here is to wrap all of these parameters into a single BpeOptions object:

  namespace Microsoft.ML.Tokenizers
  {
      public sealed class BpeTokenizer : Tokenizer
      {
+         public static BpeTokenizer Create(BpeOptions options);
         
+         public bool? ByteLevel { get; }
+         public string? BeginningOfSentenceToken { get; }

      }

+     public sealed class BpeOptions
+     {
+         public BpeOptions(IEnumerable<(string, int)> vocabulary);
+
+         public string? BeginningOfSentenceToken { get; set; }
+         public string? ContinuingSubwordPrefix { get; set; }
+         public string? EndOfSentenceToken { get; set; }
+         public string? EndOfWordSuffix { get; set; }
+         public bool? FuseUnknownTokens { get; set; }
+         public IEnumerable<string>? Merges { get; set; }
+         public Normalizer? Normalizer { get; set; }
+         public PreTokenizer? PreTokenizer { get; set; }
+         public IReadOnlyDictionary<string, int>? SpecialTokens { get; set; }
+         public string? UnknownToken { get; set; }
+         public IEnumerable<(string, int)> Vocabulary { get; }
+         public bool? ByteLevel { get; set; }
+     }

Notes

  • Added a new ByteLevel property to enable byte-level support in the BPE tokenizer. This handles vocabularies stored as bytes (typically UTF-8 encoded) and ensures the text is pre-tokenized accordingly.
  • Introduced BeginningOfSentenceToken, an optional token that can be inserted at the start of the output when encoding text.
  • Vocabulary and Merges are now passed as IEnumerable to provide flexibility, since these data sources may come from different origins.
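
Under the proposed shape, constructing a tokenizer from in-memory data might look like the sketch below. The tiny vocabulary and single merge rule are illustrative stand-ins, not a real trained BPE model, and EncodeToIds is the existing encoding method on the Tokenizer base class:

```csharp
using System;
using System.Collections.Generic;
using Microsoft.ML.Tokenizers;

// Illustrative only: a toy vocabulary and one merge rule,
// not a real trained BPE model.
var options = new BpeOptions(new[] { ("h", 0), ("i", 1), ("hi", 2) })
{
    Merges = new[] { "h i" },   // merge rule: "h" + "i" -> "hi"
    UnknownToken = "[UNK]",
    ByteLevel = false,
};

BpeTokenizer tokenizer = BpeTokenizer.Create(options);
IReadOnlyList<int> ids = tokenizer.EncodeToIds("hi");
Console.WriteLine(string.Join(", ", ids));
```

Because Vocabulary and Merges are plain IEnumerable values, the same code works whether the data comes from a file, a network stream, or is built in memory as above.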

SentencePieceTokenizer

We already have LlamaTokenizer.Create, with LlamaTokenizer subclassing SentencePieceTokenizer. Since SentencePieceTokenizer now supports multiple internal models (Bpe and Unigram), we should expose the Create method directly from SentencePieceTokenizer rather than exposing separate classes for each model. The model type is already embedded in the tokenizer file passed to Create.

        public static new LlamaTokenizer Create(Stream modelStream, bool addBeginOfSentence = true, bool addEndOfSentence = false, IReadOnlyDictionary<string, int>? specialTokens = null)

The proposal is to have the following Create method:

  namespace Microsoft.ML.Tokenizers
  {

      public class SentencePieceTokenizer : Microsoft.ML.Tokenizers.Tokenizer
      {
+         public static SentencePieceTokenizer Create(Stream modelStream, bool addBeginOfSentence = true, bool addEndOfSentence = false, IReadOnlyDictionary<string, int>? specialTokens = null);
      }
   }
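
Usage under this proposal would mirror the existing LlamaTokenizer.Create call, but without committing to a model-specific class. The file path below is a placeholder:

```csharp
using System.IO;
using Microsoft.ML.Tokenizers;

// "tokenizer.model" is a placeholder path to a SentencePiece model file;
// whether it uses the Bpe or Unigram inner model is read from the file
// itself, so no model-specific class is needed at the call site.
using Stream modelStream = File.OpenRead("tokenizer.model");
SentencePieceTokenizer tokenizer = SentencePieceTokenizer.Create(
    modelStream,
    addBeginOfSentence: true,
    addEndOfSentence: false);
```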

CompositePreTokenizer

A pre-tokenizer is used to split input text into smaller chunks before tokenization and encoding. In some scenarios, such as DeepSeek, multiple pre-tokenizers are required to run in sequence. To support this, the proposal is to introduce a CompositePreTokenizer, which implements the PreTokenizer abstraction.

  namespace Microsoft.ML.Tokenizers
  {
+     public class CompositePreTokenizer : PreTokenizer
+     {
+         public CompositePreTokenizer(IReadOnlyList<PreTokenizer> preTokenizers, IReadOnlyDictionary<string, int>? specialTokens = null);
+         public IReadOnlyList<PreTokenizer> PreTokenizers { get; }

           public override IEnumerable<(int, int)> PreTokenize(ReadOnlySpan<char> text);
           public override IEnumerable<(int, int)> PreTokenize(string text);
+     }
   }
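
As a sketch of how the composite would be wired up, the snippet below defines a minimal custom pre-tokenizer (for illustration only; the concrete built-in pre-tokenizer factories are outside this proposal) and chains it through CompositePreTokenizer:

```csharp
using System;
using System.Collections.Generic;
using Microsoft.ML.Tokenizers;

// Minimal custom pre-tokenizer, for illustration only: yields one
// (offset, length) pair per whitespace-separated chunk of the input.
public sealed class WhitespacePreTokenizer : PreTokenizer
{
    public override IEnumerable<(int, int)> PreTokenize(string text)
    {
        int start = -1;
        for (int i = 0; i <= text.Length; i++)
        {
            bool boundary = i == text.Length || char.IsWhiteSpace(text[i]);
            if (!boundary && start < 0) start = i;
            else if (boundary && start >= 0)
            {
                yield return (start, i - start);
                start = -1;
            }
        }
    }

    public override IEnumerable<(int, int)> PreTokenize(ReadOnlySpan<char> text)
        => PreTokenize(text.ToString());
}
```

A CompositePreTokenizer would then run such pre-tokenizers in sequence, e.g. `new CompositePreTokenizer(new PreTokenizer[] { new WhitespacePreTokenizer(), /* further pre-tokenizers */ })`, with each stage further splitting the ranges produced by the previous one.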

Metadata

Labels: Tokenizers, api-approved (API was approved in API review, it can be implemented)