Conversation

@dang-hai
Contributor

Problem

The batchEncode() method pads all texts to the same length using tokenizeTextsPaddingToLongest(), but does not create an attention mask to tell the model which tokens are padding. This causes the model to process padding tokens as if they were real content, corrupting the embeddings for shorter texts in a batch.

Root Cause

In BertModel.swift:393-406, the batchEncode() method creates input IDs from padded tokens but doesn't pass an attention mask:

public func batchEncode(
    _ texts: [String],
    padTokenId: Int = 0,
    maxLength: Int = 512
) throws -> MLTensor {
    // Pads every text in the batch to the length of the longest one
    let encodedTexts = try tokenizer.tokenizeTextsPaddingToLongest(
        texts, padTokenId: padTokenId, maxLength: maxLength)

    // batchSize and seqLength are derived from encodedTexts (derivation elided in this excerpt)
    let inputIds = MLTensor(
        shape: [batchSize, seqLength],
        scalars: encodedTexts.flatMap { $0 })

    let result = model(inputIds: inputIds)  // ❌ No attention mask!
    return result.sequenceOutput[0..., 0, 0...]
}

Solution

Create an attention mask tensor (1 for real tokens, 0 for padding) and pass it to the model:

// Create attention mask: 1 for real tokens, 0 for padding
var attentionMaskScalars: [Float] = []
for tokens in encodedTexts {
    for token in tokens {
        attentionMaskScalars.append(token == padTokenId ? 0.0 : 1.0)
    }
}
let attentionMask = MLTensor(
    shape: [batchSize, seqLength],
    scalars: attentionMaskScalars,
    scalarType: Float.self)

let result = model(inputIds: inputIds, attentionMask: attentionMask)  // ✅ With mask!
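
For reference, a minimal sketch of the full patched method could look like this (the batchSize/seqLength derivations are assumptions about the surrounding code, and the flatMap-based mask construction is just a compact equivalent of the loop above):

public func batchEncode(
    _ texts: [String],
    padTokenId: Int = 0,
    maxLength: Int = 512
) throws -> MLTensor {
    // Pad every text in the batch to the length of the longest one
    let encodedTexts = try tokenizer.tokenizeTextsPaddingToLongest(
        texts, padTokenId: padTokenId, maxLength: maxLength)

    let batchSize = encodedTexts.count
    let seqLength = encodedTexts.first?.count ?? 0

    let inputIds = MLTensor(
        shape: [batchSize, seqLength],
        scalars: encodedTexts.flatMap { $0 })

    // 1 for real tokens, 0 for padding
    let attentionMask = MLTensor(
        shape: [batchSize, seqLength],
        scalars: encodedTexts.flatMap { row in
            row.map { Float($0 == padTokenId ? 0 : 1) }
        },
        scalarType: Float.self)

    let result = model(inputIds: inputIds, attentionMask: attentionMask)
    return result.sequenceOutput[0..., 0, 0...]
}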

Test Evidence

What the Tests Validate

Test 1: testBatchEncodeMatchesIndividualEncode()

  • Validates: batchEncode() produces identical embeddings to individual encode() calls
  • Method: Compares the 384-dimensional embeddings element-by-element (the first 10 values are printed for inspection; see the comparison sketch below)
  • Test texts: 4 texts with varying lengths (1, 15, 3, 15 words) to test different padding amounts
  • Success criteria:
    • Max element difference < 0.0001 across all 384 dimensions
    • Cosine similarity > 0.9999 (vectors are nearly identical)
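
A rough sketch of the comparison the test performs (maxElementDifference and cosineSimilarity are hypothetical helper names, not the test's actual identifiers):

// Largest absolute per-element difference between two embeddings
func maxElementDifference(_ a: [Float], _ b: [Float]) -> Float {
    zip(a, b).map { abs($0 - $1) }.max() ?? 0
}

// Cosine similarity between two embeddings
func cosineSimilarity(_ a: [Float], _ b: [Float]) -> Float {
    let dot = zip(a, b).map { $0 * $1 }.reduce(0, +)
    let normA = a.map { $0 * $0 }.reduce(0, +).squareRoot()
    let normB = b.map { $0 * $0 }.reduce(0, +).squareRoot()
    return dot / (normA * normB)
}

// Per-text success criteria:
//   maxElementDifference(individual, batched) < 0.0001
//   cosineSimilarity(individual, batched) > 0.9999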

Test 2: testSemanticSearchRanking()

  • Validates: Semantic search returns correct ranking with batch encoding
  • Method: Encode query individually, encode documents in batch, rank by cosine similarity
  • Test case: Query about neural networks should rank "Neural networks learn by adjusting weights through backpropagation" first
  • Success criteria: The correct document is ranked #1 (see the ranking sketch below)
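
A rough sketch of that ranking step, reusing the cosineSimilarity helper above (documents, documentEmbeddings, and queryEmbedding are placeholder names, not the test's actual identifiers):

// queryEmbedding: [Float] from an individual encode() call
// documentEmbeddings: [[Float]] extracted from the batchEncode() output, one row per document
let ranked = zip(documents, documentEmbeddings)
    .map { (text: $0.0, score: cosineSimilarity(queryEmbedding, $0.1)) }
    .sorted { $0.score > $1.score }

// Expected: ranked.first?.text == "Neural networks learn by adjusting weights through backpropagation"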

BEFORE Fix (Broken Behavior)

Shorter texts fail because padding is processed as real content:

[0] "Hello" (1 word - heavily padded)
    Individual first 10: 0.024392, 0.006443, 0.010115, 0.033713, -0.018591, -0.047560, -0.009647, -0.031803, -0.011711, -0.025229
    Batch first 10:      0.044161, 0.027714, 0.061734, 0.017376, 0.005989, 0.006635, -0.007453, -0.027788, -0.012345, -0.012450
    Max element difference: 0.10064870
    Cosine similarity:      0.80521619
    ❌ FAIL - Elements differ by up to ~0.10

[2] "Good morning everyone" (3 words - moderately padded)
    Individual first 10: -0.002850, 0.065586, 0.037592, 0.014096, 0.033819, -0.020570, -0.001522, -0.009269, 0.012590, -0.046033
    Batch first 10:      0.016885, 0.045391, 0.059252, 0.012093, 0.026224, 0.045280, -0.008135, 0.000815, -0.007180, -0.023155
    Max element difference: 0.08880508
    Cosine similarity:      0.83390623
    ❌ FAIL - Elements differ by up to ~0.09

[1] "The quick brown fox..." (15 words - longest text, minimal padding)
    ✅ PASS - Cosine similarity 0.9999 (passes only because, as the longest text, it receives little or no padding)

Key observation: The bug's severity correlates with the amount of padding - more padding = more corruption.

AFTER Fix (Correct Behavior)

All texts produce identical embeddings regardless of length:

[0] "Hello" (1 word)
    Individual first 10: 0.024392, 0.006443, 0.010115, 0.033713, -0.018591, -0.047560, -0.009647, -0.031803, -0.011711, -0.025229
    Batch first 10:      0.024392, 0.006444, 0.010115, 0.033713, -0.018591, -0.047560, -0.009647, -0.031803, -0.011711, -0.025230
    Max element difference: 0.00000012
    Cosine similarity:      1.00000012
    ✅ PASS - Differences are only floating-point precision errors

[2] "Good morning everyone" (3 words)
    Individual first 10: -0.002850, 0.065586, 0.037592, 0.014096, 0.033819, -0.020570, -0.001522, -0.009269, 0.012590, -0.046033
    Batch first 10:      -0.002850, 0.065586, 0.037592, 0.014096, 0.033819, -0.020569, -0.001522, -0.009269, 0.012590, -0.046033
    Max element difference: 0.00000018
    Cosine similarity:      0.99999988
    ✅ PASS - Vectors match within numerical precision

✅ SUCCESS: All vectors match (all 4 test texts pass)

## The Bug

The `batchEncode()` method pads texts to the same length but doesn't create
an attention mask, causing the model to process padding tokens as real content.
This corrupts embeddings for shorter texts in a batch.

## The Fix

- Create attention mask tensor (1 for real tokens, 0 for padding)
- Pass attention mask to model so it ignores padding tokens
- Ensures batchEncode() produces identical results to individual encode() calls

## Test Evidence

Added comprehensive test suite demonstrating the fix with varied text lengths.
@jkrukowski
Owner

Hi @dang-hai. Thanks for spotting the issue. I've decided to create a more comprehensive PR #21, lmk what you think
