Highlighter.getBestFragments() merges zero-scored fragments with scored fragments, polluting highlight results #15333

@tomjmul

Description

The Highlighter.getBestFragments() method merges contiguous fragments regardless of score, causing zero-scored (non-matching) fragments to be merged with scored fragments and returned as highlights. This results in large blocks of irrelevant text appearing in highlight results simply because they're adjacent to actual matches.

Environment

  • Lucene version: 10.2.2 (also present in 10.3.1)
  • Component: lucene-highlighter

Steps to Reproduce

  1. Create a document with a single term match ("credit") surrounded by substantial text
  2. Configure SimpleSpanFragmenter with fragmentSize=100
  3. Call highlighter.getBestFragments() with maxFragments=3
  4. Observe that 14 fragments are created, but only 1 has score > 0
  5. The FragmentQueue selects the top 3 fragments (fragment 0 with score 1.0, fragments 1 and 2 with score 0.0)
  6. mergeContiguousFragments() merges all three into a single ~300 char result

Actual Behaviour

With maxFragments=3, returns a single merged fragment of ~300 characters:

@meta name "Process Payment" @meta description "Process a payment for an order using <em>credit</em> card" @meta tags ["payments", "create", "checkout"] @meta collection "Payment Processing API" /* * Payment processing endpoint with PCI compliance * NOTE: All card data must be tokenised before

The result includes ~250 characters of zero-scored content merged with the ~50 characters containing the actual match.

Expected Behaviour

Highlights should only include fragments containing actual matches (score > 0). Zero-scored fragments should either:

  1. Not be selected by the FragmentQueue, or
  2. Not be merged with scored fragments, or
  3. Be filtered out before being returned

Expected result:

@meta description "Process a payment for an order using <em>credit</em> card"

Root Cause Analysis

In getBestTextFragments():

  1. The fragmenter creates 14 fragments across the document
  2. Only 1 fragment contains the search term and has score 1.0
  3. The remaining 13 fragments have score 0.0
  4. FragmentQueue(maxNumFragments) keeps the top N fragments by score
  5. Since the 13 zero-scored fragments tie, the remaining N-1 slots in the queue go to whichever of them win the tie; in this run those are the earliest ones encountered (fragments 1, 2, etc.)
  6. mergeContiguousFragments() merges any adjacent fragments regardless of score
  7. The merged fragment inherits the highest score (1.0), so it passes the score > 0 filter
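
The sequence above can be reproduced with a small standalone model. This is only a sketch, not Lucene code: the class FragmentMergeModel, the Frag type and the methods sample, topN and mergeContiguous are hypothetical stand-ins, and the tie-break (ties go to the earlier fragment) is an assumption taken from the observed selection of fragments 1 and 2.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of select-then-merge; a hypothetical stand-in, not Lucene code. */
final class FragmentMergeModel {

  /** Minimal fragment: sequence number, character span [start, end), score. */
  static final class Frag {
    final int num;
    final int start;
    final int end;
    final float score;

    Frag(int num, int start, int end, float score) {
      this.num = num;
      this.start = start;
      this.end = end;
      this.score = score;
    }
  }

  /** Builds count back-to-back fragments of size chars; only fragment 0 scores. */
  static List<Frag> sample(int count, int size) {
    List<Frag> all = new ArrayList<>();
    for (int i = 0; i < count; i++) {
      all.add(new Frag(i, i * size, (i + 1) * size, i == 0 ? 1.0f : 0.0f));
    }
    return all;
  }

  /** Keeps the n highest-scored fragments; ties go to the earlier fragment (assumed). */
  static List<Frag> topN(List<Frag> all, int n) {
    List<Frag> sorted = new ArrayList<>(all);
    sorted.sort((a, b) -> {
      int byScore = Float.compare(b.score, a.score); // higher score first
      return byScore != 0 ? byScore : Integer.compare(a.num, b.num);
    });
    return new ArrayList<>(sorted.subList(0, Math.min(n, sorted.size())));
  }

  /** Merges fragments whose spans touch, ignoring score; merged score is the max. */
  static List<Frag> mergeContiguous(List<Frag> selected) {
    List<Frag> byPos = new ArrayList<>(selected);
    byPos.sort((a, b) -> Integer.compare(a.start, b.start));
    List<Frag> out = new ArrayList<>();
    for (Frag f : byPos) {
      Frag prev = out.isEmpty() ? null : out.get(out.size() - 1);
      if (prev != null && prev.end == f.start) {
        // Adjacent: extend the previous fragment and keep the higher score
        out.set(out.size() - 1,
            new Frag(prev.num, prev.start, f.end, Math.max(prev.score, f.score)));
      } else {
        out.add(f);
      }
    }
    return out;
  }

  public static void main(String[] args) {
    List<Frag> merged = mergeContiguous(topN(sample(14, 100), 3));
    for (Frag f : merged) {
      System.out.println((f.end - f.start) + " chars, score " + f.score);
    }
  }
}
```

In this model the one scored fragment absorbs its two zero-scored neighbours, so the single result spans 300 characters and still carries score 1.0. Dropping the limit to 2 shrinks it to 200 characters, matching the behaviour reported for maxFragments.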

Problematic Code

In Highlighter.getBestTextFragments(), once merging is enabled it is applied to every selected fragment without regard to score:

if (mergeContiguousFragments) {
    mergeContiguousFragments(frag);
}

Impact

  • The maxFragments parameter behaves counterintuitively: raising it from 2 to 3 grows the single returned result from ~200 to ~300 chars instead of returning more separate highlights
  • Users get large blocks of irrelevant text in their highlights
  • No way to control this behaviour through configuration

Workaround

Call getBestTextFragments() directly with mergeContiguousFragments=false and manually filter:

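// Requires java.util.ArrayList and java.util.List, plus TextFragment
// from org.apache.lucene.search.highlight.
// Pass false for mergeContiguousFragments so no merging takes place.
TextFragment[] fragments = highlighter.getBestTextFragments(
    tokenStream, text, false, maxFragments);

// Keep only fragments that actually contain a match.
List<String> results = new ArrayList<>();
for (TextFragment frag : fragments) {
    if (frag != null && frag.getScore() > 0) {
        results.add(frag.toString());
    }
}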

Suggested Fix

Option 1: Only merge two adjacent fragments when both have a score above a threshold (e.g., 0.1)

Option 2: Add a configuration parameter to control merge behaviour, for example a new setter (proposed here, it does not exist today):

highlighter.setMergeScoreThreshold(0.1);

Option 3: Filter zero-scored fragments from the FragmentQueue before merging
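
A rough sketch of how Options 1 and 3 could combine, written as a standalone model and not as a patch against Highlighter. The class ScoreAwareMergeSketch, its Frag type and the mergeScored method are hypothetical names introduced for illustration, and the threshold semantics (strictly greater than) are an assumption.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical score-aware merge; illustrates the proposal, not Lucene's code. */
final class ScoreAwareMergeSketch {

  /** Minimal fragment: character span [start, end) and score. */
  static final class Frag {
    final int start;
    final int end;
    final float score;

    Frag(int start, int end, float score) {
      this.start = start;
      this.end = end;
      this.score = score;
    }
  }

  /** Drops fragments scoring at or below threshold, then merges touching survivors. */
  static List<Frag> mergeScored(List<Frag> selected, float threshold) {
    List<Frag> kept = new ArrayList<>();
    for (Frag f : selected) {
      if (f.score > threshold) {
        kept.add(f); // Option 3: non-matching fragments never reach the merge
      }
    }
    kept.sort((a, b) -> Integer.compare(a.start, b.start));

    List<Frag> out = new ArrayList<>();
    for (Frag f : kept) {
      Frag prev = out.isEmpty() ? null : out.get(out.size() - 1);
      if (prev != null && prev.end == f.start) {
        // Option 1: both sides are above the threshold, so merging is allowed
        out.set(out.size() - 1,
            new Frag(prev.start, f.end, Math.max(prev.score, f.score)));
      } else {
        out.add(f);
      }
    }
    return out;
  }
}
```

With one scored fragment followed by two zero-scored neighbours and a threshold of 0, this returns only the 100-character scored fragment. Two adjacent scored fragments are still merged, so legitimate multi-fragment matches keep their context.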

Questions

Is the current behaviour intentional? If so, would you consider adding configuration to control which fragments are eligible for merging based on their scores?
