Skip to content

Context Aware Tokenization #7372

Open
Open
@tjwald

Description

@tjwald

Is your feature request related to a problem? Please describe.
Continuing this thread from microsoft/semantic-kernel#9793 .
Migrating from python to C# is difficult enough, supporting the same feature set but being more efficient is important for the migration story.
One feature that is currently missing is context aware tokenization.

In python, HuggingFace tokenizers support this:

tokenizer = AutoTokenizer.from_pretrained('/path/to/tokenizer')

context = 'some context'
sentence = 'some sentence'

tokens = tokenizer(context, sentence)

Where the tokenizer will tokenize the sentence with the given context.

Describe the solution you'd like
Add new API (including batch API) that support this:

class Tokenizer
{
     public IReadOnlyList<int> EncodeToIds(string context, string text, bool considerPreTokenization = true, bool considerNormalization = true)
     ...
     // relating to: https://github.com/dotnet/machinelearning/issues/7371
     public void BatchTokenize<T, K>(ReadOnlySpan<string> contexts, ReadOnlySpan<string> texts,  int maxTokenCount, Tensor<T> tokenIds, Tensor<K> mask) 
              where T: INumber<T>, K: INumber<K>
}

Describe alternatives you've considered
Study the inner working of different tokenizers in the python implementation and see how to port the same functionality.

Additional context
Continuing the thread related to tokenizers from microsoft/semantic-kernel#9793

Metadata

Metadata

Assignees

No one assigned

    Type

    No type

    Projects

    No projects

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions