
Address the feedback on the tokenizer's library #7024

Merged 23 commits on Feb 26, 2024. Changes shown are from 18 of the 23 commits.

Commits
f6e32f5
Fix cache when calling EncodeToIds
tarekgh Feb 17, 2024
0553922
Make EnglishRoberta _mergeRanks thread safe
tarekgh Feb 17, 2024
a4cb1f5
Delete Trainer
tarekgh Feb 19, 2024
6a13025
Remove the setters on the Bpe properties
tarekgh Feb 19, 2024
3278aff
Remove Roberta and Tiktoken special casing in the Tokenizer and suppo…
tarekgh Feb 19, 2024
b5f7fa2
Support text-embedding-3-small/large embedding
tarekgh Feb 19, 2024
a11f4e0
Remove redundant TokenToId abstraction and keep the one with the extr…
tarekgh Feb 19, 2024
865068a
Enable creating Tiktoken asynchronously or directly using the tokeniz…
tarekgh Feb 20, 2024
4077de0
Add cancellationToken support in CreateAsync APIs
tarekgh Feb 21, 2024
5aaf849
Rename sequence to text and Tokenize to Encode
tarekgh Feb 21, 2024
b5e0927
Rename skipSpecialTokens to considerSpecialTokens
tarekgh Feb 21, 2024
5e26b3e
Rename TokenizerResult to EncodingResult
tarekgh Feb 21, 2024
985de8a
Make Token publicly immutable
tarekgh Feb 21, 2024
b551e7d
Change offset tuples from (Index, End) to (Index, Length)
tarekgh Feb 21, 2024
7ea7f70
Rename NormalizedString method's parameters
tarekgh Feb 21, 2024
b0c8244
Rename Model's methods to start with verb
tarekgh Feb 21, 2024
450418a
Convert Model.GetVocab() method to a Vocab property
tarekgh Feb 21, 2024
6f53de8
Some method's parameters and variable renaming
tarekgh Feb 22, 2024
62334c6
Remove Vocab and VocabSize from the abstraction
tarekgh Feb 22, 2024
d48b32d
Cleanup normalization support
tarekgh Feb 22, 2024
191ab03
Minor Bpe cleanup
tarekgh Feb 22, 2024
b9b0f58
Resolve rebase change
tarekgh Feb 23, 2024
1ad157f
Address the feedback
tarekgh Feb 25, 2024
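Commit b551e7d above changes the offset tuple convention from `(Index, End)` to `(Index, Length)`. A minimal sketch of what that conversion amounts to — the helper name here is hypothetical and not part of the library:

```csharp
using System;

class OffsetConventionDemo
{
    // Hypothetical helper: converts an (Index, End) offset, where End is the
    // exclusive end position, into the (Index, Length) form adopted by this PR.
    public static (int Index, int Length) ToLengthForm((int Index, int End) offset)
        => (offset.Index, offset.End - offset.Index);

    static void Main()
    {
        // In "Hello world", the token "world" spans positions 6..11 (exclusive end).
        var oldStyle = (Index: 6, End: 11);
        var newStyle = ToLengthForm(oldStyle);
        Console.WriteLine($"Index={newStyle.Index}, Length={newStyle.Length}"); // Index=6, Length=5
    }
}
```

The `(Index, Length)` form matches the argument order of `string.Substring(startIndex, length)`, which is likely why the PR prefers it.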
@@ -11,16 +11,16 @@ namespace Microsoft.ML.Tokenizers
/// <summary>
/// The Encoding represents the output of a Tokenizer.
/// </summary>
public sealed class TokenizerResult
public sealed class EncodingResult
{
/// <summary>
/// Create a new object of the TokenizerResult object.
/// Creates a new instance of the EncodingResult class.
/// </summary>
/// <param name="originalString">The original string before normalization.</param>
/// <param name="normalizedString">The normalized form of the original string.</param>
/// <param name="splits">The splits of the input string produced during tokenization.</param>
/// <param name="offsetsMappedToOriginalString">Indicates whether the offsets are mapped to the original string or to the normalized string.</param>
public TokenizerResult(string originalString, string normalizedString, IEnumerable<Split> splits, bool offsetsMappedToOriginalString)
public EncodingResult(string originalString, string normalizedString, IEnumerable<Split> splits, bool offsetsMappedToOriginalString)
{
OriginalString = originalString;
NormalizedString = normalizedString;
@@ -47,7 +47,7 @@ public TokenizerResult(string originalString, string normalizedString, IEnumerab
private List<Token>? _tokens;
private List<string>? _tokensWords;
private List<int>? _ids;
private List<(int Index, int End)>? _offsets;
private List<(int Index, int Length)>? _offsets;

internal void AddTokens(IReadOnlyList<Token> addedTokens)
{
@@ -121,10 +121,10 @@ public IReadOnlyList<string> Tokens
}

/// <summary>
/// Gets The list of offsets. These offsets lets you slice the input string, and thus retrieve
/// Gets the list of offsets. These offsets let you slice the input string, and thus retrieve
/// the original part that led to producing the corresponding token.
/// </summary>
public IReadOnlyList<(int Index, int End)> Offsets
public IReadOnlyList<(int Index, int Length)> Offsets
{
get
{
@@ -138,7 +138,7 @@ public IReadOnlyList<string> Tokens
return Array.Empty<(int, int)>();
}

_offsets = new List<(int Index, int End)>(_tokens.Count);
_offsets = new List<(int Index, int Length)>(_tokens.Count);

foreach (var token in _tokens)
{
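With offsets expressed as `(Index, Length)`, each tuple can be passed straight to `string.Substring` to recover the text behind a token. A small self-contained sketch — the string and offsets are hard-coded for illustration, not produced by the library:

```csharp
using System;
using System.Collections.Generic;

class OffsetSliceDemo
{
    static void Main()
    {
        string text = "Hello world";

        // Offsets in the post-PR (Index, Length) form, one tuple per token.
        var offsets = new List<(int Index, int Length)> { (0, 5), (6, 5) };

        foreach (var (index, length) in offsets)
        {
            // Substring takes (startIndex, length) directly; with the old
            // (Index, End) form, callers had to compute End - Index first.
            Console.WriteLine(text.Substring(index, length));
        }
        // Prints:
        // Hello
        // world
    }
}
```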
265 changes: 130 additions & 135 deletions src/Microsoft.ML.Tokenizers/Model/BPE.cs

Large diffs are not rendered by default.
