
Batch Tokenization Support #7371

@tjwald

Is your feature request related to a problem? Please describe.
Most AI systems use batching for performance reasons, which requires all tokenized sentences in a batch to be padded to the same length and an accompanying mask indicating which values are padding.
In my project I had to implement this myself. The problems are mostly performance and API compatibility with the rest of the ecosystem.
With my solution there are megabytes of allocations:

[Screenshot: allocation profile]

- The Int64[] allocations are due to the widening that has to be done because the ONNX model takes 64-bit integer tensors as input.
- The Int32[] allocations are the actual token IDs.
- The string allocations are token strings that are never used and are thrown away.
- The other allocations are internal, and I don't know what they are.
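
For context, here is a minimal sketch of the kind of manual batching I do today (assuming Microsoft.ML.Tokenizers' Tokenizer.EncodeToIds and a model that takes 64-bit integer inputs; the helper name is illustrative):

using System;
using System.Collections.Generic;
using Microsoft.ML.Tokenizers;

static class ManualBatching
{
    // Illustrative only: shows where today's allocations come from.
    public static long[,] PadAndWiden(Tokenizer tokenizer, IReadOnlyList<string> texts, int maxTokenCount)
    {
        // One Int32 token list is allocated per text by the tokenizer.
        var encoded = new List<IReadOnlyList<int>>(texts.Count);
        foreach (string text in texts)
        {
            encoded.Add(tokenizer.EncodeToIds(text));
        }

        // A fresh Int64 buffer is allocated per batch because the ONNX model
        // expects 64-bit inputs, forcing a widening copy of every token ID.
        var inputIds = new long[texts.Count, maxTokenCount];
        for (int i = 0; i < encoded.Count; i++)
        {
            IReadOnlyList<int> ids = encoded[i];
            int count = Math.Min(ids.Count, maxTokenCount);
            for (int j = 0; j < count; j++)
            {
                inputIds[i, j] = ids[j];
            }
        }
        return inputIds;
    }
}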

Describe the solution you'd like
Enable a zero-allocation solution via an API like the following:

class Tokenizer
{
     ...
     public abstract void BatchTokenize<T>(ReadOnlySpan<string> texts, int maxTokenCount, Tensor<T> inputIds, Tensor<T> inputMask)
              where T : INumber<T>;

     public abstract void BatchTokenize<T>(ReadOnlySpan<string> texts, int maxTokenCount, Tensor<T> inputIds, Tensor<T> inputMask, Tensor<T> tokenTypeIds)
              where T : INumber<T>;
}

Maybe instead of Tensor<T> you would want to use TensorSpan<T>?

With this API, the string allocations are removed when they are not needed, and the other internal allocations can be optimized. It would also let me pool the tensors and avoid the int-to-long cast my models currently require.
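
A rough sketch of how the proposed overload might be consumed (hypothetical code: BatchTokenize is the method proposed above, Tensor<T> is System.Numerics.Tensors.Tensor<T>, and the pooled tensors are whatever buffers the caller manages):

using System;
using System.Numerics.Tensors;
using Microsoft.ML.Tokenizers;

static class PooledBatching
{
    // Hypothetical usage of the proposed BatchTokenize overload.
    // The tensors are rented once from a caller-owned pool and reused across
    // batches, and T = long matches the ONNX model's input type, so there is
    // no per-batch allocation and no int -> long widening copy.
    public static void TokenizeBatch(Tokenizer tokenizer, ReadOnlySpan<string> texts, int maxTokenCount,
                                     Tensor<long> pooledInputIds, Tensor<long> pooledInputMask)
    {
        tokenizer.BatchTokenize(texts, maxTokenCount, pooledInputIds, pooledInputMask);
        // ... feed pooledInputIds / pooledInputMask directly into the ONNX session ...
    }
}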

Describe alternatives you've considered
I have implemented my own batch tokenizer: https://github.com/tjwald/high-perf-ML/blob/develop/ML.Infra/Tokenization/PretrainedTokenizer.cs.

Additional context
This continues the tokenization part of microsoft/semantic-kernel#9793.
