
Prototype using spans in Model #7018


Closed · wants to merge 2 commits

Conversation

stephentoub (Member)

@tarekgh, this isn't for merging, but it shows approximately what I had in mind for incorporating spans into Model (I know you're currently revising the surface area, so take this with a grain of salt). It eliminates most of the remaining allocation that occurs when using Tokenizer.CountTokens/EncodeToIds, as it avoids allocating a string for each token that's already in the cache.
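
For illustration, here is a minimal sketch of the span-keyed cache idea, assuming a lookup that hashes and compares a ReadOnlySpan&lt;char&gt; directly against cached strings. The type and member names below (SpanKeyedCache, TryGetValue, Add) are hypothetical stand-ins, not the StringSpanOrdinalKey code from this PR.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical sketch: a cache whose lookups accept ReadOnlySpan<char>,
// so a string is only allocated when a token is *not* already cached.
public sealed class SpanKeyedCache<TValue>
{
    // Buckets keyed by the span's hash; each bucket holds (string key, value) pairs.
    private readonly Dictionary<int, List<KeyValuePair<string, TValue>>> _buckets = new();

    public bool TryGetValue(ReadOnlySpan<char> key, out TValue value)
    {
        if (_buckets.TryGetValue(string.GetHashCode(key), out List<KeyValuePair<string, TValue>>? bucket))
        {
            foreach (KeyValuePair<string, TValue> entry in bucket)
            {
                if (key.SequenceEqual(entry.Key)) // ordinal comparison, no string allocation
                {
                    value = entry.Value;
                    return true;
                }
            }
        }

        value = default!;
        return false;
    }

    public void Add(string key, TValue value)
    {
        int hash = string.GetHashCode(key.AsSpan());
        if (!_buckets.TryGetValue(hash, out List<KeyValuePair<string, TValue>>? bucket))
        {
            _buckets[hash] = bucket = new List<KeyValuePair<string, TValue>>();
        }
        bucket.Add(new KeyValuePair<string, TValue>(key, value));
    }
}
```

A tokenizer hot path could then call `cache.TryGetValue(text.AsSpan(start, length), out TValue id)` and only materialize the substring when a new entry actually needs to be inserted.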

Feel free to crib liberally from the second commit and close this PR. Ignore the first commit, which I submitted separately.

- Add a CancellationToken to CreateByModelNameAsync, allowing the download and parsing to be canceled.
- Use ReadLineAsync(cancellationToken), which not only allows the read to be canceled but also avoids ~100K task allocations (see the sketch after this list)
- Fix Helpers.FromBase64String to support lines longer than 300 chars
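
As a rough illustration of the first two bullets, the sketch below assumes .NET 7 or later, where StreamReader.ReadLineAsync accepts a CancellationToken and returns a ValueTask&lt;string?&gt;, avoiding a Task allocation per line. The VocabLoader type and its parsing loop are hypothetical, not the actual CreateByModelNameAsync code.

```csharp
using System.IO;
using System.Threading;
using System.Threading.Tasks;

internal static class VocabLoader
{
    // Reads a vocab stream line by line, honoring cancellation on each read.
    public static async Task LoadAsync(Stream vocabStream, CancellationToken cancellationToken = default)
    {
        using var reader = new StreamReader(vocabStream);

        string? line;
        while ((line = await reader.ReadLineAsync(cancellationToken).ConfigureAwait(false)) is not null)
        {
            // Parse the line here (e.g. "<base64 token> <rank>").
        }
    }
}
```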

codecov bot commented Feb 20, 2024

Codecov Report

Attention: 69 lines in your changes are missing coverage. Please review.

Comparison is base (f976424) 68.81% compared to head (e78ab0f) 68.81%.
Report is 6 commits behind head on main.

Additional details and impacted files
@@           Coverage Diff           @@
##             main    #7018   +/-   ##
=======================================
  Coverage   68.81%   68.81%           
=======================================
  Files        1258     1259    +1     
  Lines      250643   250665   +22     
  Branches    25606    25608    +2     
=======================================
+ Hits       172479   172501   +22     
+ Misses      71540    71534    -6     
- Partials     6624     6630    +6     
| Flag | Coverage Δ | |
|---|---|---|
| Debug | 68.81% <62.50%> (+<0.01%) | ⬆️ |
| production | 63.28% <62.50%> (+<0.01%) | ⬆️ |
| test | 88.44% <ø> (+<0.01%) | ⬆️ |

Flags with carried forward coverage won't be shown.

| Files | Coverage Δ | |
|---|---|---|
| ...rc/Microsoft.ML.Tokenizers/Model/EnglishRoberta.cs | 67.36% <100.00%> (ø) | |
| ...crosoft.ML.Tokenizers/Utils/Helpers.netstandard.cs | 75.00% <100.00%> (+15.00%) | ⬆️ |
| src/Microsoft.ML.Tokenizers/Model/BPE.cs | 75.29% <75.00%> (+0.29%) | ⬆️ |
| ...rosoft.ML.Tokenizers/Utils/StringSpanOrdinalKey.cs | 94.44% <94.44%> (ø) | |
| src/Microsoft.ML.Tokenizers/Model/Model.cs | 10.00% <50.00%> (+10.00%) | ⬆️ |
| src/Microsoft.ML.Tokenizers/Model/Cache.cs | 75.00% <76.92%> (+34.01%) | ⬆️ |
| src/Microsoft.ML.Tokenizers/Tokenizer.cs | 82.64% <52.17%> (-0.97%) | ⬇️ |
| src/Microsoft.ML.Tokenizers/Utils/LruCache.cs | 77.77% <64.70%> (+11.11%) | ⬆️ |
| src/Microsoft.ML.Tokenizers/Model/Tiktoken.cs | 54.92% <48.00%> (-0.64%) | ⬇️ |

... and 6 files with indirect coverage changes

tarekgh (Member) commented Feb 28, 2024

Closing this in favor of the following: #7035

@tarekgh tarekgh closed this Feb 28, 2024
@github-actions github-actions bot locked and limited conversation to collaborators Mar 30, 2024