Conversation

@jkrukowski (Owner)

Key Changes

  • Added attention mask support: Tokenizer now returns a BatchTokenizeResult struct containing both tokens and attention masks (sketched below)
  • Implemented masked mean pooling: Padding tokens are now correctly excluded from pooling calculations (also sketched below)
  • Updated all embedding models: Bert, CLIP, ModernBert, Roberta, and XLMRoberta now use attention masks in the forward pass
  • Refactored accuracy tests: Split the monolithic test file into focused per-model test suites with batch accuracy tests
  • Enhanced test infrastructure: Added shared utilities and updated the Python generation script for batch testing
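
A minimal sketch of the two pieces above, with illustrative names: the field names on BatchTokenizeResult and the maskedMeanPool helper are placeholders for this description and not necessarily the library's exact API.

```swift
// Hypothetical shape of the batch tokenization result: token ids padded to the
// longest text in the batch, plus a parallel attention mask (1 = real token, 0 = padding).
struct BatchTokenizeResult {
    var tokens: [[Int32]]
    var attentionMasks: [[Int32]]
}

// Masked mean pooling over one sequence's hidden states, shown with plain arrays:
// positions where the mask is 0 are excluded from both the sum and the divisor.
func maskedMeanPool(hiddenStates: [[Float]], attentionMask: [Int32]) -> [Float] {
    let hiddenSize = hiddenStates.first?.count ?? 0
    var sum = [Float](repeating: 0, count: hiddenSize)
    var count: Float = 0
    for (state, mask) in zip(hiddenStates, attentionMask) where mask == 1 {
        for i in 0..<hiddenSize {
            sum[i] += state[i]
        }
        count += 1
    }
    // Guard against an all-padding row (should not occur in practice).
    guard count > 0 else { return sum }
    return sum.map { $0 / count }
}
```

Excluding padding positions from both the numerator and the divisor is what keeps a text's embedding the same whether it is encoded alone or padded inside a batch.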

Models Updated

  • BertModel
  • ClipModel
  • ModernBertModel
  • RobertaModel
  • XLMRobertaModel

Breaking Changes

  • tokenizeTextsPaddingToLongest methods now return BatchTokenizeResult instead of [[Int32]]
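
A hedged before/after sketch of the migration for callers; the tokenizer value and the property names follow the sketch above and are assumptions, not verbatim library usage.

```swift
// Before: the method returned the padded token ids directly.
// let tokenIds: [[Int32]] = tokenizer.tokenizeTextsPaddingToLongest(texts)

// After: the method returns a BatchTokenizeResult; pull out both pieces.
// let result = tokenizer.tokenizeTextsPaddingToLongest(texts)
// let tokenIds = result.tokens
// let attentionMasks = result.attentionMasks
```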

co-author: @dang-hai

Co-authored-by: dang-hai <dan.duonghai@gmail.com>
@jkrukowski merged commit 63b5f21 into main on November 21, 2025.
@jkrukowski deleted the jankrukowski/fix-batch-processing branch on November 21, 2025 at 08:31.