
Address the feedback on the tokenizer's library #7024

Merged: 23 commits into dotnet:main on Feb 26, 2024
Conversation

@tarekgh (Member, Author) commented on Feb 23, 2024

codecov bot commented Feb 23, 2024

Codecov Report

Attention: Patch coverage is 76.65953%, with 109 lines in your changes missing coverage. Please review.

Project coverage is 68.79%. Comparing base (4b89d98) to head (1ad157f).

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #7024      +/-   ##
==========================================
- Coverage   68.83%   68.79%   -0.04%     
==========================================
  Files        1258     1254       -4     
  Lines      250672   250204     -468     
  Branches    25615    25529      -86     
==========================================
- Hits       172547   172125     -422     
+ Misses      71493    71468      -25     
+ Partials     6632     6611      -21     
| Flag       | Coverage Δ                   |
|------------|------------------------------|
| Debug      | 68.79% <76.65%> (-0.04%) ⬇️  |
| production | 63.22% <66.24%> (-0.05%) ⬇️  |
| test       | 88.50% <98.66%> (-0.07%) ⬇️  |

Flags with carried forward coverage won't be shown.

| Files                                                  | Coverage Δ                    |
|--------------------------------------------------------|-------------------------------|
| src/Microsoft.ML.Tokenizers/EncodingResult.cs          | 98.41% <100.00%> (ø)          |
| src/Microsoft.ML.Tokenizers/Model/Word.cs              | 58.75% <100.00%> (-25.63%) ⬇️ |
| ...ft.ML.Tokenizers/Normalizer/LowerCaseNormalizer.cs  | 100.00% <100.00%> (ø)         |
| ...ft.ML.Tokenizers/Normalizer/UpperCaseNormalizer.cs  | 100.00% <100.00%> (ø)         |
| ...ML.Tokenizers/PreTokenizer/TikTokenPreTokenizer.cs  | 90.24% <100.00%> (ø)          |
| ...Microsoft.ML.Tokenizers/PreTokenizer/Whitespace.cs  | 100.00% <100.00%> (ø)         |
| src/Microsoft.ML.Tokenizers/Token.cs                   | 100.00% <100.00%> (ø)         |
| ...c/Microsoft.ML.Tokenizers/Utils/BytePairEncoder.cs  | 88.23% <ø> (ø)                |
| ...ft.ML.TorchSharp/Extensions/TokenizerExtensions.cs  | 87.50% <100.00%> (ø)          |
| src/Microsoft.ML.TorchSharp/NasBert/NerTrainer.cs      | 91.10% <100.00%> (ø)          |

... and 14 more

... and 7 files with indirect coverage changes

```diff
  {
- var mergeRanks = new Dictionary<(string, string), int>();
+ var mergeRanks = new Cache<(string, string), int>(60_000);
```
A reviewer (Member) commented:

Where does this 60k come from?

@tarekgh (Member, Author) replied:

The data loaded from the merges file is about 50K entries. I gave it 10K more room to grow.
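
For readers unfamiliar with the pattern, here is a minimal sketch of a capacity-bounded cache for merge ranks. `BoundedCache` is an illustrative stand-in, not the actual `Cache<TKey, TValue>` type inside Microsoft.ML.Tokenizers, which may behave differently:

```csharp
using System;
using System.Collections.Generic;

// The merges file yields ~50K pairs; a 60K capacity leaves ~10K headroom.
var mergeRanks = new BoundedCache<(string, string), int>(60_000);
mergeRanks.Set(("t", "he"), 0);
Console.WriteLine(mergeRanks.TryGetValue(("t", "he"), out int rank) ? rank : -1);

// Illustrative stand-in for a capacity-bounded cache of merge ranks.
sealed class BoundedCache<TKey, TValue> where TKey : notnull
{
    private readonly int _capacity;
    private readonly Dictionary<TKey, TValue> _map;

    public BoundedCache(int capacity)
    {
        _capacity = capacity;
        _map = new Dictionary<TKey, TValue>(capacity);
    }

    public bool TryGetValue(TKey key, out TValue value) =>
        _map.TryGetValue(key, out value!);

    // Stop accepting new entries at capacity so memory stays bounded
    // even if callers keep adding items.
    public void Set(TKey key, TValue value)
    {
        if (_map.Count < _capacity || _map.ContainsKey(key))
        {
            _map[key] = value;
        }
    }
}
```

Pre-sizing to the expected entry count also avoids rehashing while the merges file is loaded; the bound is what distinguishes it from a plain `Dictionary`.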

```diff
@@ -77,9 +77,10 @@ public sealed class Bpe : Model
     /// <param name="unknownToken"> The unknown token to be used by the model.</param>
     /// <param name="continuingSubwordPrefix">The prefix to attach to sub-word units that don’t represent a beginning of word.</param>
     /// <param name="endOfWordSuffix">The suffix to attach to sub-word units that represent an end of word.</param>
+    /// <param name="fuseUnknownTokens">Indicate whether allowing multiple unknown tokens get fused.</param>
     public Bpe(string vocabFile, string? mergesFile, string? unknownToken = null, string? continuingSubwordPrefix = null, string? endOfWordSuffix = null) :
```
A reviewer (Member) commented:

I'm having trouble understanding what this means.

@tarekgh (Member, Author) replied:

When encoding text with the Bpe model, any token the model doesn't recognize is replaced with the unknown token. Most users use [Unk] as the unknown token, so it is possible to get multiple [Unk] tokens next to each other in the result. Setting fuseUnknownTokens to true causes any such [Unk] sequence to collapse into a single [Unk]. The term "fuse" is used by Hugging Face, so users of Bpe are familiar with it. If you have a better explanation we can use here, I'll be happy to use it :-)
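
To make the fusing behavior concrete, here is a minimal standalone sketch of the collapsing step. It illustrates the semantics described above, not the library's actual implementation; the token strings are made up for the example:

```csharp
using System;
using System.Collections.Generic;

// ["hel", "[Unk]", "[Unk]", "lo"] becomes ["hel", "[Unk]", "lo"]
Console.WriteLine(string.Join(" ",
    FuseUnknownTokens(new[] { "hel", "[Unk]", "[Unk]", "lo" }, "[Unk]")));

// Collapse each run of consecutive unknown tokens into a single token,
// mirroring the described effect of fuseUnknownTokens: true.
static List<string> FuseUnknownTokens(IEnumerable<string> tokens, string unk)
{
    var result = new List<string>();
    foreach (string token in tokens)
    {
        if (token == unk && result.Count > 0 && result[result.Count - 1] == unk)
        {
            continue; // skip the repeated unknown token
        }
        result.Add(token);
    }
    return result;
}
```

The collapse is a single order-preserving O(n) pass, which matches the run-length "fuse" semantics the Hugging Face tokenizers expose.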

@tarekgh merged commit d0aa2c2 into dotnet:main on Feb 26, 2024
25 checks passed
@ericstj mentioned this pull request on Feb 12, 2024
@github-actions bot locked and limited conversation to collaborators on Mar 28, 2024