-
Notifications
You must be signed in to change notification settings - Fork 1.9k
Introducing Tiktoken Tokenizer #6981
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Codecov ReportAttention:
Additional details and impacted files@@ Coverage Diff @@
## main #6981 +/- ##
==========================================
+ Coverage 68.80% 68.81% +0.01%
==========================================
Files 1249 1256 +7
Lines 249686 250425 +739
Branches 25485 25569 +84
==========================================
+ Hits 171795 172335 +540
- Misses 71294 71466 +172
- Partials 6597 6624 +27
Flags with carried forward coverage won't be shown. Click here to find out more.
|
return true; | ||
} | ||
|
||
int[] encodedIds = BytePairEncoder.BytePairEncode(Encoding.UTF8.GetBytes(sequence), _encoder); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It'd be really nice to reduce the overheads here. It can be done separately, but this is a lot of allocation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Tracked through the issue #6989
} | ||
} | ||
|
||
return utf8Bytes.Count > 0 ? Encoding.UTF8.GetString(utf8Bytes.ToArray()) : string.Empty; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we only target netstandard2.0, or do we multitarget and build this for netcoreapp as well? There are newer APIs that make this cheaper.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Tracked through the issue #6989
src/Microsoft.ML.Tokenizers/PreTokenizer/TikTokenPreTokenizer.cs
Outdated
Show resolved
Hide resolved
src/Microsoft.ML.Tokenizers/PreTokenizer/TikTokenPreTokenizer.cs
Outdated
Show resolved
Hide resolved
return outList; | ||
} | ||
|
||
private static T[] Slice<T>(this T[] array, int start, int end) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There looks to be a fair amount of allocation being incurred from all this slicing. That can't be reduced?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I tracked this in the issue #6989
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. Just that one question about the empty dispose (though I saw it in a couple other places too, my question applies there too)
This modification introduces support for the Tiktoken tokenizer into the Microsoft ML tokenizers library. The logic is largely derived from the Microsoft Tokenizers Library, and the update includes optimizations and adjustments to the public APIs. Further refinements for the APIs are pending and are being tracked through issue #6982.
Usage
APIs changes