
Address the feedback on the tokenizer's library #7024

Merged: 23 commits, Feb 26, 2024
f6e32f5
Fix cache when calling EncodeToIds
tarekgh Feb 17, 2024
0553922
Make EnglishRoberta _mergeRanks thread safe
tarekgh Feb 17, 2024
a4cb1f5
Delete Trainer
tarekgh Feb 19, 2024
6a13025
Remove the setters on the Bpe properties
tarekgh Feb 19, 2024
3278aff
Remove Roberta and Tiktoken special casing in the Tokenizer and suppo…
tarekgh Feb 19, 2024
b5f7fa2
Support text-embedding-3-small/large embedding
tarekgh Feb 19, 2024
a11f4e0
Remove redundant TokenToId abstraction and keep the one with the extr…
tarekgh Feb 19, 2024
865068a
Enable creating Tiktoken asynchronously or directly using the tokeniz…
tarekgh Feb 20, 2024
4077de0
Add cancellationToken support in CreateAsync APIs
tarekgh Feb 21, 2024
5aaf849
Rename sequence to text and Tokenize to Encode
tarekgh Feb 21, 2024
b5e0927
Rename skipSpecialTokens to considerSpecialTokens
tarekgh Feb 21, 2024
5e26b3e
Rename TokenizerResult to EncodingResult
tarekgh Feb 21, 2024
985de8a
Make Token publicly immutable
tarekgh Feb 21, 2024
b551e7d
Change offset tuples from (Index, End) to (Index, Length)
tarekgh Feb 21, 2024
7ea7f70
Rename NormalizedString method's parameters
tarekgh Feb 21, 2024
b0c8244
Rename Model's methods to start with verb
tarekgh Feb 21, 2024
450418a
Convert Model.GetVocab() method to a Vocab property
tarekgh Feb 21, 2024
6f53de8
Some method's parameters and variable renaming
tarekgh Feb 22, 2024
62334c6
Remove Vocab and VocabSize from the abstraction
tarekgh Feb 22, 2024
d48b32d
Cleanup normalization support
tarekgh Feb 22, 2024
191ab03
Minor Bpe cleanup
tarekgh Feb 22, 2024
b9b0f58
Resolve rebase change
tarekgh Feb 23, 2024
1ad157f
Address the feedback
tarekgh Feb 25, 2024
20 changes: 12 additions & 8 deletions src/Microsoft.ML.Tokenizers/Model/BPE.cs
@@ -56,17 +56,17 @@ public string? UnknownToken
/// <summary>
/// A prefix to be used for every subword that is not a beginning-of-word
/// </summary>
public string? ContinuingSubwordPrefix { get; set; }
public string? ContinuingSubwordPrefix { get; private set; }

/// <summary>
/// An optional suffix to characterize and end-of-word sub-word
/// </summary>
public string? EndOfWordSuffix { get; set; }
public string? EndOfWordSuffix { get; private set; }

/// <summary>
/// Gets or sets whether allowing multiple unknown tokens get fused
/// </summary>
public bool FuseUnknownTokens { get; set; }
public bool FuseUnknownTokens { get; private set; }


/// <summary>
@@ -77,9 +77,10 @@ public string? UnknownToken
/// <param name="unknownToken"> The unknown token to be used by the model.</param>
/// <param name="continuingSubwordPrefix">The prefix to attach to sub-word units that don’t represent a beginning of word.</param>
/// <param name="endOfWordSuffix">The suffix to attach to sub-word units that represent an end of word.</param>
public Bpe(string vocabFile, string? mergesFile, string? unknownToken = null, string? continuingSubwordPrefix = null, string? endOfWordSuffix = null) :
/// <param name="fuseUnknownTokens">Indicate whether allowing multiple unknown tokens get fused.</param>
Member:
I'm having trouble understanding what this means.

Member Author:

When encoding text with the Bpe model, any token the model doesn't recognize is replaced with the unknown token. Most users use [Unk] as the unknown token, so it is possible to get multiple [Unk] tokens next to each other in the result. Setting fuseUnknownTokens to true causes each such sequence of [Unk] tokens to collapse into a single [Unk]. The term "fuse" is used by Hugging Face, and users of Bpe are familiar with it. If you have a better explanation we can use here, I'll be happy to use it :-)
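The behavior discussed above can be sketched with the constructor this PR introduces. This is a usage sketch, not part of the PR: the vocab/merges file paths are hypothetical placeholders, and real Hugging Face-style `vocab.json`/`merges.txt` files would be needed for it to run.

```csharp
using Microsoft.ML.Tokenizers;

// Hypothetical file paths; a real vocab.json/merges.txt pair is required.
// With fuseUnknownTokens: true, a run of consecutive unknown tokens in the
// encoded output collapses into a single [Unk] instead of repeating.
Bpe bpe = new Bpe(
    vocabFile: "vocab.json",
    mergesFile: "merges.txt",
    unknownToken: "[Unk]",
    fuseUnknownTokens: true);

Tokenizer tokenizer = new Tokenizer(bpe);
```

Note that after this PR, `FuseUnknownTokens` can only be set through the constructor: the property setter is now private, as shown in the diff above.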

public Bpe(string vocabFile, string? mergesFile, string? unknownToken = null, string? continuingSubwordPrefix = null, string? endOfWordSuffix = null, bool fuseUnknownTokens = false) :
this(vocabFile is null ? throw new ArgumentNullException(nameof(vocabFile)) : File.Open(vocabFile, FileMode.Open, FileAccess.Read),
mergesFile is null ? null : File.Open(mergesFile, FileMode.Open, FileAccess.Read), unknownToken, continuingSubwordPrefix, endOfWordSuffix, disposeStreams: true)
mergesFile is null ? null : File.Open(mergesFile, FileMode.Open, FileAccess.Read), unknownToken, continuingSubwordPrefix, endOfWordSuffix, fuseUnknownTokens, disposeStreams: true)
{
}

@@ -91,19 +92,22 @@ public Bpe(string vocabFile, string? mergesFile, string? unknownToken = null, st
/// <param name="unknownToken"> The unknown token to be used by the model.</param>
/// <param name="continuingSubwordPrefix">The prefix to attach to sub-word units that don’t represent a beginning of word.</param>
/// <param name="endOfWordSuffix">The suffix to attach to sub-word units that represent an end of word.</param>
public Bpe(Stream vocabStream, Stream? mergesStream, string? unknownToken = null, string? continuingSubwordPrefix = null, string? endOfWordSuffix = null) :
this(vocabStream, mergesStream, unknownToken, continuingSubwordPrefix, endOfWordSuffix, disposeStreams: false)
/// <param name="fuseUnknownTokens">Indicate whether allowing multiple unknown tokens get fused.</param>
public Bpe(Stream vocabStream, Stream? mergesStream, string? unknownToken = null, string? continuingSubwordPrefix = null, string? endOfWordSuffix = null, bool fuseUnknownTokens = false) :
this(vocabStream, mergesStream, unknownToken, continuingSubwordPrefix, endOfWordSuffix, fuseUnknownTokens, disposeStreams: false)
{
}

private Bpe(Stream vocabStream, Stream? mergesStream, string? unknownToken, string? continuingSubwordPrefix, string? endOfWordSuffix, bool disposeStreams)
private Bpe(Stream vocabStream, Stream? mergesStream, string? unknownToken, string? continuingSubwordPrefix, string? endOfWordSuffix, bool fuseUnknownTokens, bool disposeStreams)
{
try
{
if (vocabStream is null)
{
throw new ArgumentNullException(nameof(vocabStream));
}

FuseUnknownTokens = fuseUnknownTokens;
ContinuingSubwordPrefix = continuingSubwordPrefix;
EndOfWordSuffix = endOfWordSuffix;

3 changes: 1 addition & 2 deletions test/Microsoft.ML.Tokenizers.Tests/BpeTests.cs
@@ -134,8 +134,7 @@ public void SimpleTestWithUnknownToken(Dictionary<string, int> vocab, (string, s

try
{
Bpe bpe = new Bpe(vocabFile, mergesFile, UnknownToken);
bpe.FuseUnknownTokens = fuseUnknownToken;
Bpe bpe = new Bpe(vocabFile, mergesFile, UnknownToken, null, null, fuseUnknownToken);

Assert.Equal(vocab.Count + 1u, bpe.GetVocabSize());
Tokenizer tokenizer = new Tokenizer(bpe);