Description
The Highlighter.getBestFragments() method merges contiguous fragments regardless of score, causing zero-scored (non-matching) fragments to be merged with scored fragments and returned as highlights. This results in large blocks of irrelevant text appearing in highlight results simply because they're adjacent to actual matches.
Environment
- Lucene version: 10.2.2 (also present in 10.3.1)
- Component: lucene-highlighter
Steps to Reproduce
- Create a document with a single term match ("credit") surrounded by substantial text
- Configure SimpleSpanFragmenter with fragmentSize=100
- Call highlighter.getBestFragments() with maxFragments=3
- Observe that 14 fragments are created, but only 1 has score > 0
- The FragmentQueue selects the top 3 fragments (fragment 0 with score 1.0, fragments 1 and 2 with score 0.0)
- mergeContiguousFragments() merges all three into a single ~300 char result
Actual Behaviour
With maxFragments=3, returns a single merged fragment of ~300 characters:
@meta name "Process Payment" @meta description "Process a payment for an order using <em>credit</em> card" @meta tags ["payments", "create", "checkout"] @meta collection "Payment Processing API" /* * Payment processing endpoint with PCI compliance * NOTE: All card data must be tokenised before
The result includes ~250 characters of zero-scored content merged with the ~50 characters containing the actual match.
Expected Behaviour
Highlights should only include fragments containing actual matches (score > 0). Zero-scored fragments should either:
- Not be selected by the FragmentQueue, or
- Not be merged with scored fragments, or
- Be filtered out before being returned
Expected result:
@meta description "Process a payment for an order using <em>credit</em> card"
Root Cause Analysis
In getBestTextFragments():
- The fragmenter creates 14 fragments across the document
- Only 1 fragment contains the search term and has score 1.0
- The remaining 13 fragments have score 0.0
- FragmentQueue(maxNumFragments) keeps the top N fragments by score
- Since 13 fragments have identical zero scores, the queue arbitrarily selects the first N-1 zero-scored fragments encountered (fragments 1, 2, etc.)
- mergeContiguousFragments() merges any adjacent fragments regardless of score
- The merged fragment inherits the highest score (1.0), so it passes the score > 0 filter
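The sequence above can be sketched as a small stand-alone model. This is not Lucene's code: the Frag record, the MergeModel class, and the tie-break rule (equal scores go to the earlier fragment) are assumptions made for illustration, chosen to match the behaviour described in this report.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Stand-alone model of the select-then-merge sequence; not Lucene's actual code. */
final class MergeModel {

    /** Hypothetical stand-in for a text fragment: index, character span, score. */
    record Frag(int num, int start, int end, float score) {}

    /** Keep the top n fragments by score; ties go to the earlier fragment (assumed). */
    static List<Frag> selectTop(List<Frag> all, int n) {
        List<Frag> sorted = new ArrayList<>(all);
        sorted.sort(Comparator.<Frag>comparingDouble(Frag::score).reversed()
                .thenComparingInt(Frag::num));
        List<Frag> top = new ArrayList<>(sorted.subList(0, Math.min(n, sorted.size())));
        top.sort(Comparator.comparingInt(Frag::start)); // restore document order
        return top;
    }

    /** Merge fragments whose spans touch; the merged fragment takes the higher score. */
    static List<Frag> mergeContiguous(List<Frag> inOrder) {
        List<Frag> out = new ArrayList<>();
        for (Frag f : inOrder) {
            int last = out.size() - 1;
            if (last >= 0 && out.get(last).end() == f.start()) {
                Frag prev = out.get(last);
                out.set(last, new Frag(prev.num(), prev.start(), f.end(),
                        Math.max(prev.score(), f.score())));
            } else {
                out.add(f);
            }
        }
        return out;
    }

    /** 14 fragments of 100 chars each; only fragment 0 has a non-zero score. */
    static List<Frag> sampleFragments() {
        List<Frag> frags = new ArrayList<>();
        for (int i = 0; i < 14; i++) {
            frags.add(new Frag(i, i * 100, (i + 1) * 100, i == 0 ? 1.0f : 0.0f));
        }
        return frags;
    }

    public static void main(String[] args) {
        // With n = 3 the two zero-scored neighbours are absorbed into one span.
        System.out.println(mergeContiguous(selectTop(sampleFragments(), 3)));
    }
}
```

In this model, asking for 3 fragments produces one merged span covering characters 0 to 300 with score 1.0, and asking for 2 produces a span covering 0 to 200, which mirrors the size change reported here.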
Problematic Code
In Highlighter.getBestTextFragments(), the merge happens unconditionally:
if (mergeContiguousFragments) {
    mergeContiguousFragments(frag);
}
Impact
- maxFragments parameter behaves counterintuitively: changing it from 2 to 3 changes the result size from ~200 to ~300 chars
- Users get large blocks of irrelevant text in their highlights
- No way to control this behaviour through configuration
Workaround
Call getBestTextFragments() directly with mergeContiguousFragments=false and manually filter:
TextFragment[] fragments = highlighter.getBestTextFragments(
    tokenStream, text, false, maxFragments);
List<String> results = new ArrayList<>();
for (TextFragment frag : fragments) {
    if (frag != null && frag.getScore() > 0) {
        results.add(frag.toString());
    }
}
Suggested Fix
Option 1: Only merge fragments where both fragments have score > threshold (e.g., 0.1)
Option 2: Add a configuration parameter to control merge behaviour:
highlighter.setMergeScoreThreshold(0.1);
Option 3: Filter zero-scored fragments from the FragmentQueue before merging
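As a rough illustration of what Options 1 and 3 could look like, here is a stand-alone sketch that drops fragments at or below a score threshold before merging the survivors. The Frag record, the ScoreAwareMerge class, and the minScore parameter are hypothetical names for this example only; none of them exist in Lucene, and an actual fix inside Highlighter might be structured quite differently.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of score-aware merging; not part of Lucene. */
final class ScoreAwareMerge {

    /** Hypothetical stand-in for a text fragment: character span and score. */
    record Frag(int start, int end, float score) {}

    /**
     * Drop fragments whose score is at or below minScore, then merge only the
     * remaining fragments whose spans touch. Input must be in document order.
     */
    static List<Frag> filterThenMerge(List<Frag> inOrder, float minScore) {
        List<Frag> out = new ArrayList<>();
        for (Frag f : inOrder) {
            if (f.score() <= minScore) {
                continue; // non-matching fragment: never merged, never returned
            }
            int last = out.size() - 1;
            if (last >= 0 && out.get(last).end() == f.start()) {
                Frag prev = out.get(last);
                out.set(last, new Frag(prev.start(), f.end(),
                        Math.max(prev.score(), f.score())));
            } else {
                out.add(f);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<Frag> selected = List.of(
                new Frag(0, 100, 1.0f),    // contains the match
                new Frag(100, 200, 0.0f),  // zero-scored neighbour
                new Frag(200, 300, 0.0f)); // zero-scored neighbour
        // Only the scored fragment survives, so the highlight stays ~100 chars.
        System.out.println(filterThenMerge(selected, 0.0f));
    }
}
```

Filtering before the merge keeps adjacent scored fragments mergeable while ensuring a zero-scored neighbour can never ride along on another fragment's score.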
Questions
Is the current behaviour intentional? If so, would you consider adding configuration to control which fragments are eligible for merging based on their scores?