Tweak Tiktoken's BytePairEncode for improved perf #7017

Merged 1 commit on Feb 20, 2024

stephentoub
Member

  • Stackalloc the indices/ranks when feasible
  • Use a span to eliminate bounds checks and allow for directly updating ranks
[Benchmark]
public int CountTokens() => _tokenizer.CountTokens(Poem);

This uses the same Poem as in #7012, with the LruCache size set to 0 so that the cache is skipped and only the code changed here is measured.

Before:

| Method | Mean | Allocated |
|---|---|---|
| CountTokens | 61.11 us | 19.52 KB |

After:

| Method | Mean | Allocated |
|---|---|---|
| CountTokens | 58.82 us | 11.27 KB |
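
For readers unfamiliar with the pattern, the two changes listed in the description can be sketched roughly as follows. This is not the code from this PR: the `Entry` struct, the `StackallocThreshold` value, the toy rank computation, and the `IndexOfLowestRank` method are all hypothetical names chosen for illustration. The sketch only shows the general shape: a scratch buffer that lives on the stack when small and is rented from `ArrayPool<T>` otherwise, accessed through a `Span<T>` so that fields can be written in place.

```csharp
#nullable enable
using System;
using System.Buffers;

public static class ScratchBufferSketch
{
    // Hypothetical cutoff: buffers of at most this many entries are stack-allocated.
    private const int StackallocThreshold = 128;

    // Hypothetical element type standing in for an (index, rank) pair.
    private struct Entry
    {
        public int Index;
        public int Rank;
    }

    // Returns the index of the adjacent byte pair with the lowest toy "rank".
    // The rank here is just the sum of the two bytes, not a real vocabulary lookup.
    public static int IndexOfLowestRank(ReadOnlySpan<byte> bytes)
    {
        int required = bytes.Length + 1;

        // Stack-allocate when the buffer is small, otherwise rent from the shared pool.
        Entry[]? rented = null;
        Span<Entry> entries = required <= StackallocThreshold
            ? stackalloc Entry[StackallocThreshold]
            : (rented = ArrayPool<Entry>.Shared.Rent(required));
        entries = entries.Slice(0, required);

        try
        {
            // Neither stackalloc nor rented memory is guaranteed to be zeroed,
            // so every entry is initialized explicitly.
            for (int i = 0; i < entries.Length; i++)
            {
                entries[i].Index = i;
                entries[i].Rank = int.MaxValue;
            }

            // The span indexer returns a reference, so a single field can be
            // updated in place without rewriting the whole element.
            for (int i = 0; i < bytes.Length - 1; i++)
            {
                entries[i].Rank = bytes[i] + bytes[i + 1];
            }

            int best = 0;
            for (int i = 1; i < entries.Length; i++)
            {
                if (entries[i].Rank < entries[best].Rank)
                {
                    best = i;
                }
            }

            return entries[best].Index;
        }
        finally
        {
            // Only pooled arrays need to be returned; stack memory is freed on return.
            if (rented != null)
            {
                ArrayPool<Entry>.Shared.Return(rented);
            }
        }
    }
}
```

Small inputs take the stack path and allocate nothing on the heap, while larger inputs reuse pooled arrays, which is the kind of change that would show up in the Allocated column. Looping up to `entries.Length` also gives the JIT a chance to elide bounds checks, although whether it does so depends on the runtime version.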

cc: @tarekgh


codecov bot commented Feb 20, 2024

Codecov Report

Attention: 5 lines in your changes are missing coverage. Please review.

Comparison is base (f976424) 68.81% compared to head (b50995c) 68.81%.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #7017      +/-   ##
==========================================
- Coverage   68.81%   68.81%   -0.01%     
==========================================
  Files        1258     1258              
  Lines      250643   250653      +10     
  Branches    25606    25608       +2     
==========================================
+ Hits       172479   172480       +1     
- Misses      71540    71546       +6     
- Partials     6624     6627       +3     
| Flag | Coverage Δ |
|---|---|
| Debug | 68.81% <80.76%> (-0.01%) ⬇️ |
| production | 63.28% <80.76%> (-0.01%) ⬇️ |
| test | 88.44% <ø> (+<0.01%) ⬆️ |

Flags with carried forward coverage won't be shown.

| Files | Coverage Δ |
|---|---|
| ...c/Microsoft.ML.Tokenizers/Utils/BytePairEncoder.cs | 88.23% <80.76%> (-6.60%) ⬇️ |

... and 3 files with indirect coverage changes

Member

tarekgh left a comment

LGTM!

tarekgh merged commit 3282f44 into dotnet:main on Feb 20, 2024
25 checks passed
github-actions bot locked and limited conversation to collaborators on Mar 22, 2024