Description
This issue tracks the APIs added to the tokenizer library.
Proposal
BpeTokenizer
We’ve already established a pattern for creating tokenizers using Tokenizer.Create(...). When multiple parameters are required, we wrap them in an Options object. For example, BertTokenizer.Create accepts a BertOptions parameter.
Following this pattern, we’re adding a new Create method to BpeTokenizer and introducing the BpeOptions class to encapsulate the parameters passed to Create. Currently, BpeTokenizer has a Create method that takes flat parameters:
```csharp
public static BpeTokenizer Create(
    string vocabFile,
    string? mergesFile,
    PreTokenizer? preTokenizer = null,
    Normalizer? normalizer = null,
    IReadOnlyDictionary<string, int>? specialTokens = null,
    string? unknownToken = null,
    string? continuingSubwordPrefix = null,
    string? endOfWordSuffix = null,
    bool fuseUnknownTokens = false)
```

The proposal is to wrap all of these parameters into a `BpeOptions` object:
```csharp
namespace Microsoft.ML.Tokenizers
{
    public sealed class BpeTokenizer : Tokenizer
    {
+       public static BpeTokenizer Create(BpeOptions options);
+       public bool? ByteLevel { get; }
+       public string? BeginningOfSentenceToken { get; }
    }

+   public sealed class BpeOptions
+   {
+       public BpeOptions(IEnumerable<(string, int)> vocabulary);
+
+       public string? BeginningOfSentenceToken { get; set; }
+       public string? ContinuingSubwordPrefix { get; set; }
+       public string? EndOfSentenceToken { get; set; }
+       public string? EndOfWordSuffix { get; set; }
+       public bool? FuseUnknownTokens { get; set; }
+       public IEnumerable<string>? Merges { get; set; }
+       public Normalizer? Normalizer { get; set; }
+       public PreTokenizer? PreTokenizer { get; set; }
+       public IReadOnlyDictionary<string, int>? SpecialTokens { get; set; }
+       public string? UnknownToken { get; set; }
+       public IEnumerable<(string, int)> Vocabulary { get; }
+       public bool? ByteLevel { get; set; }
+   }
}
```

Notes
- Added a new `ByteLevel` property to enable byte-level support in the BPE tokenizer. This handles vocabularies stored as bytes (typically UTF-8 encoded) and ensures text is pre-tokenized accordingly.
- Introduced `BeginningOfSentenceToken`, an optional token that can be inserted at the start when encoding text.
- `Vocabulary` and `Merges` are now passed as `IEnumerable` to provide flexibility, since these data sources may come from different origins.
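As a sketch of how the proposed options-based `Create` might be used (the vocabulary and merge entries below are illustrative placeholders, not real model data, and the merges are assumed to use the conventional space-separated "left right" pair format):

```csharp
using Microsoft.ML.Tokenizers;

// Illustrative (token, id) pairs; a real vocabulary would come from a
// model's vocab file or another data source.
var vocabulary = new List<(string, int)>
{
    ("<unk>", 0), ("h", 1), ("e", 2), ("l", 3), ("o", 4),
    ("he", 5), ("ll", 6), ("hell", 7), ("hello", 8)
};

var options = new BpeOptions(vocabulary)
{
    // Each merge entry pairs two existing tokens into a larger one.
    Merges = new[] { "h e", "l l", "he ll", "hell o" },
    UnknownToken = "<unk>",
    FuseUnknownTokens = true,
};

BpeTokenizer tokenizer = BpeTokenizer.Create(options);
IReadOnlyList<int> ids = tokenizer.EncodeToIds("hello");
```

Because `Vocabulary` and `Merges` are plain `IEnumerable`s, the same options object works whether the data comes from files, embedded resources, or an in-memory model description.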
SentencePieceTokenizer
We already have LlamaTokenizer.Create, with LlamaTokenizer subclassing SentencePieceTokenizer. Since SentencePieceTokenizer now supports multiple internal models (Bpe and Unigram), we should expose the Create method directly from SentencePieceTokenizer rather than exposing separate classes for each model. The model type is already embedded in the tokenizer file passed to Create.
```csharp
public static new LlamaTokenizer Create(
    Stream modelStream,
    bool addBeginOfSentence = true,
    bool addEndOfSentence = false,
    IReadOnlyDictionary<string, int>? specialTokens = null)
```

The proposal is to have the following `Create` method:
```csharp
namespace Microsoft.ML.Tokenizers
{
    public class SentencePieceTokenizer : Microsoft.ML.Tokenizers.Tokenizer
    {
+       public static SentencePieceTokenizer Create(
+           Stream modelStream,
+           bool addBeginOfSentence = true,
+           bool addEndOfSentence = false,
+           IReadOnlyDictionary<string, int>? specialTokens = null);
    }
}
```

CompositePreTokenizer
A pre-tokenizer is used to split input text into smaller chunks before tokenization and encoding. In some scenarios, such as DeepSeek, multiple pre-tokenizers are required to run in sequence. To support this, the proposal is to introduce a CompositePreTokenizer, which implements the PreTokenizer abstraction.
```csharp
namespace Microsoft.ML.Tokenizers
{
+   public class CompositePreTokenizer : PreTokenizer
+   {
+       public CompositePreTokenizer(
+           IReadOnlyList<PreTokenizer> preTokenizers,
+           IReadOnlyDictionary<string, int>? specialTokens = null);
+
+       public IReadOnlyList<PreTokenizer> PreTokenizers { get; }

        public override IEnumerable<(int, int)> PreTokenize(ReadOnlySpan<char> text);
        public override IEnumerable<(int, int)> PreTokenize(string text);
+   }
}
```
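To illustrate the composition, here is a sketch using a toy `PreTokenizer` subclass (the `PassThroughPreTokenizer` type is hypothetical, defined only for this example; its override signatures mirror the `(offset, length)` tuples in the proposal above):

```csharp
using Microsoft.ML.Tokenizers;

// A toy pre-tokenizer that yields the whole input as one chunk.
// Real pre-tokenizers would split on whitespace, regex patterns, etc.
sealed class PassThroughPreTokenizer : PreTokenizer
{
    public override IEnumerable<(int, int)> PreTokenize(string text)
        => new[] { (0, text.Length) };

    public override IEnumerable<(int, int)> PreTokenize(ReadOnlySpan<char> text)
        => new[] { (0, text.Length) };
}

// Run two pre-tokenizers in sequence, as DeepSeek-style models require:
// each chunk produced by the first is further split by the second.
var composite = new CompositePreTokenizer(
    new PreTokenizer[] { new PassThroughPreTokenizer(), new PassThroughPreTokenizer() },
    specialTokens: null);

foreach ((int offset, int length) in composite.PreTokenize("some input text"))
{
    Console.WriteLine($"chunk at {offset}, length {length}");
}
```

Special tokens passed to the constructor would be matched as whole units and excluded from further splitting by the inner pre-tokenizers.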