Conversation

@dang-hai
Contributor

Problem

The batchEncode() method pads all texts to the same length using tokenizeTextsPaddingToLongest(), but does not create an attention mask to tell the model which tokens are padding. This causes the model to process padding tokens as if they were real content, corrupting the embeddings for shorter texts in a batch.

Root Cause

In BertModel.swift:393-406, the batchEncode() method creates input IDs from padded tokens but doesn't pass an attention mask:

public func batchEncode(
    _ texts: [String],
    padTokenId: Int = 0,
    maxLength: Int = 512
) throws -> MLTensor {
    // Pads every text in the batch to the length of the longest one
    let encodedTexts = try tokenizer.tokenizeTextsPaddingToLongest(
        texts, padTokenId: padTokenId, maxLength: maxLength)

    // batchSize and seqLength are derived from encodedTexts (derivation elided in this excerpt)
    let inputIds = MLTensor(
        shape: [batchSize, seqLength],
        scalars: encodedTexts.flatMap { $0 })

    let result = model(inputIds: inputIds)  // ❌ No attention mask!
    return result.sequenceOutput[0..., 0, 0...]
}

Solution

Create an attention mask tensor (1 for real tokens, 0 for padding) and pass it to the model:

// Create attention mask: 1 for real tokens, 0 for padding
var attentionMaskScalars: [Float] = []
for tokens in encodedTexts {
    for token in tokens {
        attentionMaskScalars.append(token == padTokenId ? 0.0 : 1.0)
    }
}
let attentionMask = MLTensor(
    shape: [batchSize, seqLength],
    scalars: attentionMaskScalars,
    scalarType: Float.self)

let result = model(inputIds: inputIds, attentionMask: attentionMask)  // ✅ With mask!
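
For reference, a minimal sketch of the full patched method could look like this (the batchSize/seqLength derivations are assumptions about the surrounding code, and the flatMap-based mask construction is just a compact equivalent of the loop above):

public func batchEncode(
    _ texts: [String],
    padTokenId: Int = 0,
    maxLength: Int = 512
) throws -> MLTensor {
    // Pad every text in the batch to the length of the longest one
    let encodedTexts = try tokenizer.tokenizeTextsPaddingToLongest(
        texts, padTokenId: padTokenId, maxLength: maxLength)

    let batchSize = encodedTexts.count
    let seqLength = encodedTexts.first?.count ?? 0

    let inputIds = MLTensor(
        shape: [batchSize, seqLength],
        scalars: encodedTexts.flatMap { $0 })

    // 1 for real tokens, 0 for padding
    let attentionMask = MLTensor(
        shape: [batchSize, seqLength],
        scalars: encodedTexts.flatMap { row in
            row.map { Float($0 == padTokenId ? 0 : 1) }
        },
        scalarType: Float.self)

    let result = model(inputIds: inputIds, attentionMask: attentionMask)
    return result.sequenceOutput[0..., 0, 0...]
}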

Test Evidence

What the Tests Validate

Test 1: testBatchEncodeMatchesIndividualEncode()

  • Validates: batchEncode() produces identical embeddings to individual encode() calls
  • Method: Compares the 384-dimensional embeddings element-by-element (the first 10 values are printed for inspection; see the comparison sketch below)
  • Test texts: 4 texts with varying lengths (1, 15, 3, 15 words) to test different padding amounts
  • Success criteria:
    • Max element difference < 0.0001 across all 384 dimensions
    • Cosine similarity > 0.9999 (vectors are nearly identical)
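
A rough sketch of the comparison the test performs (maxElementDifference and cosineSimilarity are hypothetical helper names, not the test's actual identifiers):

// Largest absolute per-element difference between two embeddings
func maxElementDifference(_ a: [Float], _ b: [Float]) -> Float {
    zip(a, b).map { abs($0 - $1) }.max() ?? 0
}

// Cosine similarity between two embeddings
func cosineSimilarity(_ a: [Float], _ b: [Float]) -> Float {
    let dot = zip(a, b).map { $0 * $1 }.reduce(0, +)
    let normA = a.map { $0 * $0 }.reduce(0, +).squareRoot()
    let normB = b.map { $0 * $0 }.reduce(0, +).squareRoot()
    return dot / (normA * normB)
}

// Per-text success criteria:
//   maxElementDifference(individual, batched) < 0.0001
//   cosineSimilarity(individual, batched) > 0.9999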

Test 2: testSemanticSearchRanking()

  • Validates: Semantic search returns correct ranking with batch encoding
  • Method: Encode query individually, encode documents in batch, rank by cosine similarity
  • Test case: Query about neural networks should rank "Neural networks learn by adjusting weights through backpropagation" first
  • Success criteria: The correct document is ranked #1 (see the ranking sketch below)
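
A rough sketch of that ranking step, reusing the cosineSimilarity helper above (documents, documentEmbeddings, and queryEmbedding are placeholder names, not the test's actual identifiers):

// queryEmbedding: [Float] from an individual encode() call
// documentEmbeddings: [[Float]] extracted from the batchEncode() output, one row per document
let ranked = zip(documents, documentEmbeddings)
    .map { (text: $0.0, score: cosineSimilarity(queryEmbedding, $0.1)) }
    .sorted { $0.score > $1.score }

// Expected: ranked.first?.text == "Neural networks learn by adjusting weights through backpropagation"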

BEFORE Fix (Broken Behavior)

Shorter texts fail because padding is processed as real content:

[0] "Hello" (1 word - heavily padded)
    Individual first 10: 0.024392, 0.006443, 0.010115, 0.033713, -0.018591, -0.047560, -0.009647, -0.031803, -0.011711, -0.025229
    Batch first 10:      0.044161, 0.027714, 0.061734, 0.017376, 0.005989, 0.006635, -0.007453, -0.027788, -0.012345, -0.012450
    Max element difference: 0.10064870
    Cosine similarity:      0.80521619
    ❌ FAIL - Elements differ by up to ~0.10

[2] "Good morning everyone" (3 words - moderately padded)
    Individual first 10: -0.002850, 0.065586, 0.037592, 0.014096, 0.033819, -0.020570, -0.001522, -0.009269, 0.012590, -0.046033
    Batch first 10:      0.016885, 0.045391, 0.059252, 0.012093, 0.026224, 0.045280, -0.008135, 0.000815, -0.007180, -0.023155
    Max element difference: 0.08880508
    Cosine similarity:      0.83390623
    ❌ FAIL - Elements differ by up to ~0.09

[1] "The quick brown fox..." (15 words - longest text, minimal padding)
    ✅ PASS - Cosine similarity 0.9999 (passes only because, as the longest text, it receives little or no padding)

Key observation: The bug's severity correlates with the amount of padding - more padding = more corruption.

AFTER Fix (Correct Behavior)

All texts produce identical embeddings regardless of length:

[0] "Hello" (1 word)
    Individual first 10: 0.024392, 0.006443, 0.010115, 0.033713, -0.018591, -0.047560, -0.009647, -0.031803, -0.011711, -0.025229
    Batch first 10:      0.024392, 0.006444, 0.010115, 0.033713, -0.018591, -0.047560, -0.009647, -0.031803, -0.011711, -0.025230
    Max element difference: 0.00000012
    Cosine similarity:      1.00000012
    ✅ PASS - Differences are only floating-point precision errors

[2] "Good morning everyone" (3 words)
    Individual first 10: -0.002850, 0.065586, 0.037592, 0.014096, 0.033819, -0.020570, -0.001522, -0.009269, 0.012590, -0.046033
    Batch first 10:      -0.002850, 0.065586, 0.037592, 0.014096, 0.033819, -0.020569, -0.001522, -0.009269, 0.012590, -0.046033
    Max element difference: 0.00000018
    Cosine similarity:      0.99999988
    ✅ PASS - Vectors match within numerical precision

✅ SUCCESS: All vectors match (all 4 test texts pass)

## The Bug

The `batchEncode()` method pads texts to the same length but doesn't create
an attention mask, causing the model to process padding tokens as real content.
This corrupts embeddings for shorter texts in a batch.

## The Fix

- Create attention mask tensor (1 for real tokens, 0 for padding)
- Pass attention mask to model so it ignores padding tokens
- Ensures batchEncode() produces identical results to individual encode() calls

## Test Evidence

Added comprehensive test suite demonstrating the fix with varied text lengths.
@jkrukowski
Owner

Hi @dang-hai. Thanks for spotting the issue. I've decided to create a more comprehensive PR #21, lmk what you think
