
Batch Tokenization Support #7371

@tjwald

Is your feature request related to a problem? Please describe.
Most AI systems use batching for performance reasons, which requires all tokenized sentences in a batch to be padded to the same length and an accompanying mask indicating which values are padding.
In my project I had to implement this myself. The problems are mostly performance and API compatibility with the rest of the ecosystem.
With my solution there are megabytes of allocations:

[Screenshot: allocation profile]

- The Int64[] allocations are due to the widening that has to be done because the ONNX model takes 64-bit integer tensors as input.
- The Int32[] allocations are the actual token IDs.
- The string allocations are token strings that are never used and are thrown away.
- The other allocations are internal, and I don't know what they are.
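
For context, here is a minimal sketch of the kind of manual batching I do today (assuming Microsoft.ML.Tokenizers' Tokenizer.EncodeToIds and a model that takes 64-bit integer inputs; the helper name is illustrative):

using System;
using System.Collections.Generic;
using Microsoft.ML.Tokenizers;

static class ManualBatching
{
    // Illustrative only: shows where today's allocations come from.
    public static long[,] PadAndWiden(Tokenizer tokenizer, IReadOnlyList<string> texts, int maxTokenCount)
    {
        // One Int32 token list is allocated per text by the tokenizer.
        var encoded = new List<IReadOnlyList<int>>(texts.Count);
        foreach (string text in texts)
        {
            encoded.Add(tokenizer.EncodeToIds(text));
        }

        // A fresh Int64 buffer is allocated per batch because the ONNX model
        // expects 64-bit inputs, forcing a widening copy of every token ID.
        var inputIds = new long[texts.Count, maxTokenCount];
        for (int i = 0; i < encoded.Count; i++)
        {
            IReadOnlyList<int> ids = encoded[i];
            int count = Math.Min(ids.Count, maxTokenCount);
            for (int j = 0; j < count; j++)
            {
                inputIds[i, j] = ids[j];
            }
        }
        return inputIds;
    }
}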

Describe the solution you'd like
Enable a zero-allocation solution via an API like the following:

class Tokenizer
{
     ...
     public abstract void BatchTokenize<T>(ReadOnlySpan<string> texts, int maxTokenCount, Tensor<T> inputIds, Tensor<T> inputMask)
              where T : INumber<T>;

     public abstract void BatchTokenize<T>(ReadOnlySpan<string> texts, int maxTokenCount, Tensor<T> inputIds, Tensor<T> inputMask, Tensor<T> tokenTypeIds)
              where T : INumber<T>;
}

Maybe instead of Tensor<T> you would want to use TensorSpan<T>?

With this API, the string allocations are removed when they are not needed, and the other internal allocations can be optimized. It would also let me pool the tensors and avoid the int-to-long cast my models currently require.
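
A rough sketch of how the proposed overload might be consumed (hypothetical code: BatchTokenize is the method proposed above, Tensor<T> is System.Numerics.Tensors.Tensor<T>, and the pooled tensors are whatever buffers the caller manages):

using System;
using System.Numerics.Tensors;
using Microsoft.ML.Tokenizers;

static class PooledBatching
{
    // Hypothetical usage of the proposed BatchTokenize overload.
    // The tensors are rented once from a caller-owned pool and reused across
    // batches, and T = long matches the ONNX model's input type, so there is
    // no per-batch allocation and no int -> long widening copy.
    public static void TokenizeBatch(Tokenizer tokenizer, ReadOnlySpan<string> texts, int maxTokenCount,
                                     Tensor<long> pooledInputIds, Tensor<long> pooledInputMask)
    {
        tokenizer.BatchTokenize(texts, maxTokenCount, pooledInputIds, pooledInputMask);
        // ... feed pooledInputIds / pooledInputMask directly into the ONNX session ...
    }
}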

Describe alternatives you've considered
I have implemented my own batch tokenizer: https://github.com/tjwald/high-perf-ML/blob/develop/ML.Infra/Tokenization/PretrainedTokenizer.cs.

Additional context
This continues the tokenization part of microsoft/semantic-kernel#9793.
