Description
Is your feature request related to a problem? Please describe.
Continuing this thread from microsoft/semantic-kernel#9793 .
Migrating from python to C# is difficult enough, supporting the same feature set but being more efficient is important for the migration story.
One feature that is currently missing is context aware tokenization.
In python, HuggingFace tokenizers support this:
tokenizer = AutoTokenizer.from_pretrained('/path/to/tokenizer')
context = 'some context'
sentence = 'some sentence'
tokens = tokenizer(context, sentence)
Where the tokenizer will tokenize the sentence with the given context.
Describe the solution you'd like
Add new API (including batch API) that support this:
class Tokenizer
{
public IReadOnlyList<int> EncodeToIds(string context, string text, bool considerPreTokenization = true, bool considerNormalization = true)
...
// relating to: https://github.com/dotnet/machinelearning/issues/7371
public void BatchTokenize<T, K>(ReadOnlySpan<string> contexts, ReadOnlySpan<string> texts, int maxTokenCount, Tensor<T> tokenIds, Tensor<K> mask)
where T: INumber<T>, K: INumber<K>
}
Describe alternatives you've considered
Study the inner working of different tokenizers in the python implementation and see how to port the same functionality.
Additional context
Continuing the thread related to tokenizers from microsoft/semantic-kernel#9793