Context
Lucene 8 introduces optimizations that make it possible to compute top hits more efficiently by skipping documents that do not produce competitive scores. We would like to enable this behavior by default, so that users opt in if they need accurate total hit counts, which are costly, rather than the other way around.
Not returning a hit count at all is problematic for traditional search UIs with pagination: say that you want to display up to 5 pages with 10 hits per page, you need to know whether the hit count is between 0 and 10, between 11 and 20, between 21 and 30, between 31 and 40, or greater than 40 in order to know how many pages need to be displayed. In order to address this issue, Lucene takes a configurable threshold: if the hit count is less than this threshold, then you will get an accurate hit count, otherwise you will get a lower bound of the hit count.
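To make the pagination example concrete, here is a minimal Python sketch. The function names and defaults (`pages_to_display`, `counting_threshold`, 10 hits per page, 5 pages) are illustrative and not part of any API. It assumes the threshold semantics described above: counts below the threshold are accurate, and anything else is reported as a lower bound.

```python
import math


def counting_threshold(hits_per_page: int = 10, max_pages: int = 5) -> int:
    """Smallest threshold that still lets the UI pick the right page count.

    Counts up to hits_per_page * (max_pages - 1) must be accurate; anything
    above that always fills the last page, so a lower bound is enough.
    """
    return hits_per_page * (max_pages - 1) + 1


def pages_to_display(hit_count: int, hits_per_page: int = 10, max_pages: int = 5) -> int:
    """Number of pages to show for an accurate count or a lower bound."""
    return min(max_pages, math.ceil(hit_count / hits_per_page))


if __name__ == "__main__":
    print(counting_threshold())
    for count in (0, 10, 11, 40, 41, 1000):
        print(count, pages_to_display(count))
```

With these defaults the threshold is 41: a reported count of 41 or more, whether accurate or a lower bound, always maps to the maximum of 5 pages.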
We don't want to discuss backward compatibility for now. Let's focus on what we want to have eventually, and only then discuss how we get there in a way that makes the change easy to digest for users. It's fine if we need multiple steps and the whole change is only available in 8 rather than 7.
Response format
We need a way to tell users whether the hit count is accurate or a lower bound. Multiple ideas have been mentioned:
- not modify the response format: if the user asked to count up to X and the hit count is greater than X, just return X as the hit count
- use a string that ends with a + as a way to say that the hit count is a lower bound, e.g. { "hits": { "total": "1234+" } } when the hit count is a lower bound and { "hits": { "total": 1234 } } like today otherwise
- use another field that tells whether the hit count is accurate or a lower bound, e.g. { "hits": { "total": 1234, "total_hits_relation": "gte" } }
- make hits.total an object with a value and a relation, e.g. { "hits": { "total": { "value": 1234, "relation": "gte" } } }
- make hits.total an object that has two possible keys but only one is ever set, e.g. { "hits": { "total": { "gte": 1234 } } } or { "hits": { "total": { "eq": 1234 } } }
- don't reuse hits.total at all and return a different field instead, e.g. hits.min or some better name
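To compare what each candidate format would ask of a client, the following Python sketch normalizes the hits.total shapes listed above into a (value, is_lower_bound) pair. The function `parse_total` is hypothetical, none of these formats has been chosen, and the variant that uses a sibling total_hits_relation field is left out because it cannot be interpreted from hits.total alone.

```python
from typing import Any, Tuple


def parse_total(total: Any) -> Tuple[int, bool]:
    """Return (value, is_lower_bound) for one of the candidate formats."""
    if isinstance(total, bool):
        # bool is a subclass of int in Python, so reject it explicitly.
        raise ValueError("hits.total must not be a boolean")
    if isinstance(total, int):
        # Today's format: a plain number, always accurate.
        return total, False
    if isinstance(total, str):
        # String format: a trailing "+" marks a lower bound.
        if total.endswith("+"):
            return int(total[:-1]), True
        return int(total), False
    if isinstance(total, dict):
        if "value" in total:
            # Object with a value and a relation ("eq" or "gte").
            return int(total["value"]), total.get("relation", "eq") == "gte"
        if "gte" in total:
            # Object with exactly one of the keys "gte" / "eq".
            return int(total["gte"]), True
        if "eq" in total:
            return int(total["eq"]), False
    raise ValueError(f"unrecognized hits.total: {total!r}")


if __name__ == "__main__":
    for candidate in (1234, "1234+", {"value": 1234, "relation": "gte"}, {"eq": 1234}):
        print(candidate, parse_total(candidate))
```

The string variant is the only one where the client has to handle two JSON types for the same field and strip a suffix before converting to a number.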
When we discussed these options, we felt that the string format ("1234+") would make parsing more complicated, and that with a separate relation field it would be too easy to miss the fact that you need to look up another field in order to know how to interpret the hit count.
Implementation options
Option 1: Make track_total_hits take a number
We already have a track_total_hits switch, which we added for index sorting, but it currently takes a boolean. It would be easy to make it take a number instead, which would be the minimum number of hits to count accurately.
We could ease the transition to the new response format by using the current format when track_total_hits is unset or set to a boolean, and the new format when track_total_hits is set to a number, and then deprecate support for booleans.
Users who want accurate hit counts could set a very high value for this parameter. We could potentially allow a special value like -1 as a way to mean "be accurate".
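The transition rules of Option 1 can be sketched as a small Python function. Everything here is hypothetical (`TotalHitsPolicy`, `resolve_track_total_hits`, the `ACCURATE` constant) and only mirrors the proposal as worded above. Treating an unset value the same as true, and false as "count nothing accurately", are assumptions of this sketch.

```python
from dataclasses import dataclass
from typing import Optional, Union

ACCURATE = -1  # special value floated in the proposal, meaning "be accurate"


@dataclass(frozen=True)
class TotalHitsPolicy:
    threshold: Optional[int]  # None: count every hit; N: count accurately up to N
    new_format: bool          # True: respond with the new hits.total format


def resolve_track_total_hits(setting: Union[None, bool, int]) -> TotalHitsPolicy:
    """Map a track_total_hits value to a counting policy and response format."""
    if setting is None or setting is True:
        # Assumption: unset behaves like true and keeps the current format.
        return TotalHitsPolicy(threshold=None, new_format=False)
    if setting is False:
        # Assumption: false means no hits need to be counted accurately.
        return TotalHitsPolicy(threshold=0, new_format=False)
    if isinstance(setting, int):
        if setting == ACCURATE:
            return TotalHitsPolicy(threshold=None, new_format=True)
        if setting < 0:
            raise ValueError("track_total_hits must be -1 or non-negative")
        return TotalHitsPolicy(threshold=setting, new_format=True)
    raise TypeError("track_total_hits must be a boolean or an integer")


if __name__ == "__main__":
    for value in (None, True, False, 1000, ACCURATE):
        print(value, resolve_track_total_hits(value))
```

Under this sketch only numeric values switch the response to the new format, so requests that leave track_total_hits unset or pass a boolean would keep getting today's response shape.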
Option 2: Hardcoded number of hits to count
Always count accurately up to index.max_result_window hits. If users need accurate hit counts, they will need to use an aggregation (we need to add such an aggregation that counts docs).
Ping @elastic/es-clients to get opinions about the above thoughts, especially the response format.