Fix: Add attention mask to batchEncode() to prevent padding corruption #20
Problem

The batchEncode() method pads all texts to the same length using tokenizeTextsPaddingToLongest(), but does not create an attention mask to tell the model which tokens are padding. This causes the model to process padding tokens as if they were real content, corrupting the embeddings for shorter texts in a batch.

Root Cause
In BertModel.swift:393-406, the batchEncode() method creates input IDs from padded tokens but doesn't pass an attention mask:
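The original snippet is not reproduced in this description, so the following is only a rough reconstruction of the pre-fix shape of the flow; the literal token values and the commented-out model call are illustrative assumptions, not the project's actual API.

```swift
// Rough reconstruction of the pre-fix flow (illustrative values, not the
// actual code from BertModel.swift). Every text is padded to the longest
// sequence, but only the token IDs are handed to the model, so it has no
// way to distinguish padding from real content.

let paddedBatch: [[Int32]] = [
    [101, 7592, 102, 0, 0],        // short text, padded with trailing 0s
    [101, 7592, 2088, 2003, 102],  // longest text, no padding needed
]

// Pre-fix: only input IDs are built. The trailing 0s of the short text are
// attended to as if they were real tokens, corrupting its embedding.
let inputIds = paddedBatch
// let output = model(inputIds: inputIds)  // no attentionMask passed (assumed call shape)
```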

Solution

Create an attention mask tensor (1 for real tokens, 0 for padding) and pass it to the model:
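A minimal sketch of that mask construction, assuming the pad token ID is 0 and that the model's forward call accepts an attentionMask argument (both are assumptions; the real batchEncode() in BertModel.swift may differ):

```swift
/// Build a binary attention mask for a batch of already-padded token IDs:
/// 1 marks a real token, 0 marks a padding position.
/// Assumes the pad token ID is 0; adjust if the tokenizer uses a different one.
func makeAttentionMask(for paddedBatch: [[Int32]], padTokenId: Int32 = 0) -> [[Int32]] {
    paddedBatch.map { tokens in
        tokens.map { $0 == padTokenId ? Int32(0) : Int32(1) }
    }
}

// Inside batchEncode(), after tokenizeTextsPaddingToLongest():
// let padded = tokenizeTextsPaddingToLongest(texts)
// let attentionMask = makeAttentionMask(for: padded)
// let output = model(inputIds: padded, attentionMask: attentionMask)  // assumed signature
```

Deriving the mask from the pad token ID keeps the change local to batchEncode(); alternatively, tokenizeTextsPaddingToLongest() could return the original lengths and the mask could be built from those.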
Test Evidence
What the Tests Validate
Test 1: testBatchEncodeMatchesIndividualEncode()
batchEncode() produces identical embeddings to individual encode() calls (a rough sketch of this comparison follows the list of tests).

Test 2: testSemanticSearchRanking()
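The test bodies themselves are not shown in this description. As a rough illustration only, the batch-versus-individual comparison in Test 1 could be structured like the XCTest sketch below, where BertModel.load(), the return types of encode() and batchEncode(), and the tolerance are assumptions rather than the project's actual API.

```swift
import XCTest

final class BatchEncodeTests: XCTestCase {
    func testBatchEncodeMatchesIndividualEncode() throws {
        // Assumed initializer and signatures; the project's real API may differ.
        let model = try BertModel.load()
        let texts = [
            "short",
            "a considerably longer sentence that forces the short one to be padded",
        ]

        let batched = try model.batchEncode(texts)               // assumed: [[Float]]
        let individual = try texts.map { try model.encode($0) }  // assumed: [Float] each

        // Each batch-encoded vector should match its individually encoded counterpart.
        for (batchedVector, individualVector) in zip(batched, individual) {
            for (a, b) in zip(batchedVector, individualVector) {
                XCTAssertEqual(a, b, accuracy: 1e-5)
            }
        }
    }
}
```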
BEFORE Fix (Broken Behavior)

Shorter texts fail because padding is processed as real content:
Key observation: the severity of the bug correlates with the amount of padding; more padding means more corruption.
AFTER Fix (Correct Behavior)
All texts produce embeddings identical to their individually encoded counterparts, regardless of length: