Address the feedback on the tokenizer's library #7024
Conversation
…rt the cases in the Model abstraction
Codecov Report

Attention: Patch coverage is …

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #7024      +/-   ##
==========================================
- Coverage   68.83%   68.79%   -0.04%
==========================================
  Files        1258     1254       -4
  Lines      250672   250204     -468
  Branches    25615    25529      -86
==========================================
- Hits       172547   172125     -422
+ Misses      71493    71468      -25
+ Partials     6632     6611      -21

Flags with carried forward coverage won't be shown.
```diff
 private Cache<(string, string), int> GetMergeRanks(Stream mergeStream)
 {
-    var mergeRanks = new Dictionary<(string, string), int>();
+    var mergeRanks = new Cache<(string, string), int>(60_000);
```
Where does this 60k come from?
The data loaded from the merges file is about 50K entries; I gave it another 10K of headroom to grow.
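For context, here is a minimal, illustration-only sketch of how the merge ranks could be loaded into a capacity-bounded cache. The `Cache<TKey, TValue>` stand-in, its `Set` method, and the merges-file parsing shown here are assumptions for the example and may not match the library's internal implementation:

```csharp
using System;
using System.Collections.Generic;
using System.IO;

// Illustration-only stand-in for the library's internal Cache type: here it is
// simply a Dictionary pre-sized with a capacity hint.
public sealed class Cache<TKey, TValue> where TKey : notnull
{
    private readonly Dictionary<TKey, TValue> _map;

    public Cache(int capacity) => _map = new Dictionary<TKey, TValue>(capacity);

    public void Set(TKey key, TValue value) => _map[key] = value;
}

public static class MergeRanksSketch
{
    public static Cache<(string, string), int> GetMergeRanks(Stream mergeStream)
    {
        // The merges file holds roughly 50K pairs; 60K leaves ~10K of headroom
        // so the cache does not need to resize for the known data.
        var mergeRanks = new Cache<(string, string), int>(60_000);

        using var reader = new StreamReader(mergeStream);
        string? line;
        int rank = 0;

        while ((line = reader.ReadLine()) is not null)
        {
            // Skip header and blank lines (assumed "#..." header format).
            if (line.Length == 0 || line.StartsWith("#", StringComparison.Ordinal))
            {
                continue;
            }

            // Each remaining line is "left right"; file order defines the merge priority (rank).
            int space = line.IndexOf(' ');
            if (space <= 0)
            {
                continue;
            }

            mergeRanks.Set((line[..space], line[(space + 1)..]), rank++);
        }

        return mergeRanks;
    }
}
```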
```csharp
/// <param name="continuingSubwordPrefix">The prefix to attach to sub-word units that don’t represent a beginning of word.</param>
/// <param name="endOfWordSuffix">The suffix to attach to sub-word units that represent an end of word.</param>
public Bpe(string vocabFile, string? mergesFile, string? unknownToken = null, string? continuingSubwordPrefix = null, string? endOfWordSuffix = null) :
/// <param name="fuseUnknownTokens">Indicate whether allowing multiple unknown tokens get fused.</param>
```
I'm having trouble understanding what this means.
When encoding text with the Bpe model, any token the model doesn't recognize is replaced with the unknown token. Most users use [Unk] as the unknown token, so it is possible to end up with several [Unk] tokens next to each other in the result. Setting fuseUnknownTokens to true causes every such sequence of [Unk] tokens to collapse into a single [Unk]. The term "fuse" is the one Hugging Face uses, so Bpe users are familiar with it. If you have a better explanation we can use here, I'll be happy to use it :-)
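To make the behavior described above concrete, here is a small stand-alone sketch (not the library's actual code path) of what fusing means: every run of consecutive unknown tokens in the output collapses into a single occurrence.

```csharp
using System;
using System.Collections.Generic;

public static class FuseUnknownExample
{
    // Hypothetical post-processing that mimics what fuseUnknownTokens = true does:
    // consecutive occurrences of the unknown token are merged into one.
    public static IEnumerable<string> FuseUnknown(IEnumerable<string> tokens, string unknownToken = "[Unk]")
    {
        string? previous = null;

        foreach (string token in tokens)
        {
            if (token == unknownToken && previous == unknownToken)
            {
                continue; // drop the repeated unknown token
            }

            previous = token;
            yield return token;
        }
    }

    public static void Main()
    {
        var raw = new[] { "hello", "[Unk]", "[Unk]", "[Unk]", "world", "[Unk]" };

        // fuseUnknownTokens = false -> hello [Unk] [Unk] [Unk] world [Unk]
        // fuseUnknownTokens = true  -> hello [Unk] world [Unk]
        Console.WriteLine(string.Join(" ", FuseUnknown(raw)));
    }
}
```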
This fix addresses the feedback reported in the issues: