
Address the feedback on the tokenizer's library #7024

Merged: 23 commits into dotnet:main on Feb 26, 2024
Conversation

@tarekgh (Member, Author) commented on Feb 23, 2024

codecov bot commented Feb 23, 2024

Codecov Report

Attention: Patch coverage is 76.65953%, with 109 lines in your changes missing coverage. Please review.

Project coverage is 68.79%. Comparing base (4b89d98) to head (1ad157f).

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #7024      +/-   ##
==========================================
- Coverage   68.83%   68.79%   -0.04%     
==========================================
  Files        1258     1254       -4     
  Lines      250672   250204     -468     
  Branches    25615    25529      -86     
==========================================
- Hits       172547   172125     -422     
+ Misses      71493    71468      -25     
+ Partials     6632     6611      -21     
| Flag       | Coverage Δ                   |
|------------|------------------------------|
| Debug      | 68.79% <76.65%> (-0.04%) ⬇️  |
| production | 63.22% <66.24%> (-0.05%) ⬇️  |
| test       | 88.50% <98.66%> (-0.07%) ⬇️  |

Flags with carried forward coverage won't be shown.

| Files                                                  | Coverage Δ                    |
|--------------------------------------------------------|-------------------------------|
| src/Microsoft.ML.Tokenizers/EncodingResult.cs          | 98.41% <100.00%> (ø)          |
| src/Microsoft.ML.Tokenizers/Model/Word.cs              | 58.75% <100.00%> (-25.63%) ⬇️ |
| ...ft.ML.Tokenizers/Normalizer/LowerCaseNormalizer.cs  | 100.00% <100.00%> (ø)         |
| ...ft.ML.Tokenizers/Normalizer/UpperCaseNormalizer.cs  | 100.00% <100.00%> (ø)         |
| ...ML.Tokenizers/PreTokenizer/TikTokenPreTokenizer.cs  | 90.24% <100.00%> (ø)          |
| ...Microsoft.ML.Tokenizers/PreTokenizer/Whitespace.cs  | 100.00% <100.00%> (ø)         |
| src/Microsoft.ML.Tokenizers/Token.cs                   | 100.00% <100.00%> (ø)         |
| ...c/Microsoft.ML.Tokenizers/Utils/BytePairEncoder.cs  | 88.23% <ø> (ø)                |
| ...ft.ML.TorchSharp/Extensions/TokenizerExtensions.cs  | 87.50% <100.00%> (ø)          |
| src/Microsoft.ML.TorchSharp/NasBert/NerTrainer.cs      | 91.10% <100.00%> (ø)          |

... and 14 more

... and 7 files with indirect coverage changes

```diff
  {
- var mergeRanks = new Dictionary<(string, string), int>();
+ var mergeRanks = new Cache<(string, string), int>(60_000);
```
A reviewer (Member) commented:

Where does this 60k come from?

@tarekgh (Member, Author) replied:

The data loaded from the merges file is about 50K entries. I gave it 10K more room to grow.
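
For readers unfamiliar with the pattern, here is a minimal sketch of a capacity-bounded cache for merge ranks. `BoundedCache` is an illustrative stand-in, not the actual `Cache<TKey, TValue>` type inside Microsoft.ML.Tokenizers, which may behave differently:

```csharp
using System;
using System.Collections.Generic;

// The merges file yields ~50K pairs; a 60K capacity leaves ~10K headroom.
var mergeRanks = new BoundedCache<(string, string), int>(60_000);
mergeRanks.Set(("t", "he"), 0);
Console.WriteLine(mergeRanks.TryGetValue(("t", "he"), out int rank) ? rank : -1);

// Illustrative stand-in for a capacity-bounded cache of merge ranks.
sealed class BoundedCache<TKey, TValue> where TKey : notnull
{
    private readonly int _capacity;
    private readonly Dictionary<TKey, TValue> _map;

    public BoundedCache(int capacity)
    {
        _capacity = capacity;
        _map = new Dictionary<TKey, TValue>(capacity);
    }

    public bool TryGetValue(TKey key, out TValue value) =>
        _map.TryGetValue(key, out value!);

    // Stop accepting new entries at capacity so memory stays bounded
    // even if callers keep adding items.
    public void Set(TKey key, TValue value)
    {
        if (_map.Count < _capacity || _map.ContainsKey(key))
        {
            _map[key] = value;
        }
    }
}
```

Pre-sizing to the expected entry count also avoids rehashing while the merges file is loaded; the bound is what distinguishes it from a plain `Dictionary`.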

```diff
@@ -77,9 +77,10 @@ public sealed class Bpe : Model
     /// <param name="unknownToken"> The unknown token to be used by the model.</param>
     /// <param name="continuingSubwordPrefix">The prefix to attach to sub-word units that don’t represent a beginning of word.</param>
     /// <param name="endOfWordSuffix">The suffix to attach to sub-word units that represent an end of word.</param>
+    /// <param name="fuseUnknownTokens">Indicate whether allowing multiple unknown tokens get fused.</param>
     public Bpe(string vocabFile, string? mergesFile, string? unknownToken = null, string? continuingSubwordPrefix = null, string? endOfWordSuffix = null) :
```
A reviewer (Member) commented:

I'm having trouble understanding what this means.

@tarekgh (Member, Author) replied:

When encoding text with the Bpe model, any token the model doesn't recognize is replaced with the unknown token. Most users use [Unk] as the unknown token, so it is possible to get multiple [Unk] tokens next to each other in the result. Setting fuseUnknownTokens to true causes any such [Unk] sequence to collapse into a single [Unk]. The term "fuse" is used by Hugging Face, so users of Bpe are familiar with it. If you have a better explanation we can use here, I'll be happy to use it :-)
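
To make the fusing behavior concrete, here is a minimal standalone sketch of the collapsing step. It illustrates the semantics described above, not the library's actual implementation; the token strings are made up for the example:

```csharp
using System;
using System.Collections.Generic;

// ["hel", "[Unk]", "[Unk]", "lo"] becomes ["hel", "[Unk]", "lo"]
Console.WriteLine(string.Join(" ",
    FuseUnknownTokens(new[] { "hel", "[Unk]", "[Unk]", "lo" }, "[Unk]")));

// Collapse each run of consecutive unknown tokens into a single token,
// mirroring the described effect of fuseUnknownTokens: true.
static List<string> FuseUnknownTokens(IEnumerable<string> tokens, string unk)
{
    var result = new List<string>();
    foreach (string token in tokens)
    {
        if (token == unk && result.Count > 0 && result[result.Count - 1] == unk)
        {
            continue; // skip the repeated unknown token
        }
        result.Add(token);
    }
    return result;
}
```

The collapse is a single order-preserving O(n) pass, which matches the run-length "fuse" semantics the Hugging Face tokenizers expose.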

@tarekgh merged commit d0aa2c2 into dotnet:main on Feb 26, 2024
25 checks passed
@ericstj mentioned this pull request on Feb 12, 2024
@github-actions bot locked and limited conversation to collaborators on Mar 28, 2024