Avoid negative scores returned from multi_match query with cross_fields
Under specific circumstances, when using `cross_fields` scoring on a
`multi_match` query, we can end up with negative scores from the inverse
document frequency calculation in the BM25 formula.

Specifically, the IDF is calculated as:

```
log(1 + (N - n + 0.5) / (n + 0.5))
```

where `N` is the number of documents containing the field and `n` is the
number of documents containing the given term in the field. Obviously,
`n` should always be less than or equal to `N`.
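
As a rough illustration of why the `n <= N` invariant matters, the formula above can be evaluated directly. This is a minimal sketch, not the Lucene implementation: the class name `Bm25IdfSketch`, the method `idf`, and the sample statistics are all made up for the example.

```java
final class Bm25IdfSketch {
    /** IDF as written in the commit message: log(1 + (N - n + 0.5) / (n + 0.5)). */
    static double idf(long docCount, long docFreq) {
        return Math.log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5));
    }

    public static void main(String[] args) {
        // Consistent statistics (n <= N): the logarithm's argument is above 1,
        // so the result is positive.
        System.out.println("N=3, n=3 -> " + idf(3, 3));

        // Inconsistent statistics (hypothetical values, n > N): the argument
        // drops below 1, so the result is negative.
        System.out.println("N=1, n=3 -> " + idf(1, 3));
    }
}
```

For integer counts, the argument of the logarithm falls below 1 exactly when `n > N`, which is the case the fix has to rule out.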

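Unfortunately, `cross_fields` makes up a new value for `n` and tries to
use it across all fields, so a field that appears in only a few documents
can end up with `n` greater than `N`. The argument of the logarithm then
drops below 1 and the IDF becomes negative.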

This change finds the minimum (nonzero) value of `N` and uses that as an
upper bound for the new value of `n`.
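
The bounding step can be sketched in isolation as follows. This is a simplified illustration under stated assumptions, not the code in the diff: `CrossFieldsClampSketch` and `clampDocFreq` are hypothetical names, and the per-field document counts are passed in as a plain array instead of being read from an index reader.

```java
final class CrossFieldsClampSketch {
    /**
     * Hypothetical helper: caps the blended document frequency at the
     * smallest nonzero per-field document count.
     */
    static int clampDocFreq(int blendedDocFreq, int[] fieldDocCounts) {
        int minDocCount = Integer.MAX_VALUE;
        for (int docCount : fieldDocCounts) {
            if (docCount > 0) { // skip fields that no document contains
                minDocCount = Math.min(minDocCount, docCount);
            }
        }
        // If every count was zero, minDocCount stays at MAX_VALUE and the
        // blended value passes through unchanged.
        return Math.min(blendedDocFreq, minDocCount);
    }

    public static void main(String[] args) {
        // Assumed statistics: one field present in 3 documents, another in 1,
        // and a blended document frequency of 3.
        System.out.println(clampDocFreq(3, new int[] { 3, 1 }));
    }
}
```

With the value capped this way, `n` can no longer exceed `N` for any field that takes part in the blend, so the logarithm's argument stays at or above 1.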

Signed-off-by: Michael Froh <froh@amazon.com>
msfroh committed May 25, 2024
1 parent 56d8dc6 commit a50898c
Showing 2 changed files with 37 additions and 0 deletions.
@@ -0,0 +1,32 @@
+"Cross fields do not return negative scores":
+  - do:
+      index:
+        index: test
+        id: 1
+        body: { "color" : "orange red yellow" }
+  - do:
+      index:
+        index: test
+        id: 2
+        body: { "color": "orange red purple", "shape": "red square" }
+  - do:
+      index:
+        index: test
+        id: 3
+        body: { "color" : "orange red yellow purple" }
+  - do:
+      indices.refresh: { }
+  - do:
+      search:
+        index: test
+        body:
+          query:
+            multi_match:
+              query: "red"
+              type: "cross_fields"
+              fields: [ "color", "shape^100"]
+              tie_breaker: 0.1
+          explain: true
+  - match: { hits.total.value: 3 }
+  - match: { hits.hits.0._id: "2" }
+  - gt: { hits.hits.2._score: 0.0 }
@@ -120,6 +120,7 @@ protected void blend(final TermStates[] contexts, int maxDoc, IndexReader reader
         }
         int max = 0;
         long minSumTTF = Long.MAX_VALUE;
+        int minDocCount = Integer.MAX_VALUE;
         for (int i = 0; i < contexts.length; i++) {
             TermStates ctx = contexts[i];
             int df = ctx.docFreq();
@@ -133,11 +134,15 @@ protected void blend(final TermStates[] contexts, int maxDoc, IndexReader reader
                 // we need to find out the minimum sumTTF to adjust the statistics
                 // otherwise the statistics don't match
                 minSumTTF = Math.min(minSumTTF, reader.getSumTotalTermFreq(terms[i].field()));
+                minDocCount = Math.min(minDocCount, reader.getDocCount(terms[i].field()));
             }
         }
         if (maxDoc > minSumTTF) {
             maxDoc = (int) minSumTTF;
         }
+        if (maxDoc > minDocCount) {
+            maxDoc = minDocCount;
+        }
         if (max == 0) {
             return; // we are done that term doesn't exist at all
         }
