Do not compute hit counts by default #33028

Closed
@jpountz


Context
Lucene 8 introduces optimizations that make it possible to compute top hits more efficiently by skipping documents that cannot produce competitive scores. We would like to enable this behavior by default, so that users opt in if they need accurate total hit counts, which are costly, rather than the other way around.

Not returning a hit count at all is problematic for traditional search UIs with pagination: say that you want to display up to 5 pages with 10 hits per page, you need to know whether the hit count is between 0 and 10, between 11 and 20, between 21 and 30, between 31 and 40, or greater than 40 in order to know how many pages need to be displayed. In order to address this issue, Lucene takes a configurable threshold: if the hit count is less than this threshold, then you will get an accurate hit count, otherwise you will get a lower bound of the hit count.
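The threshold idea can be illustrated with a minimal Python sketch. It is not Lucene's implementation: `count_up_to` and `pages_to_display` are hypothetical helpers, and the list of matching documents is simulated with a plain iterable. With 5 pages of 10 hits, a threshold of 41 is enough to tell "more than 40" apart from every smaller case.

```python
import math
from itertools import islice


def count_up_to(matching_docs, threshold):
    """Count matches, but stop once `threshold` of them have been seen.

    Returns (count, is_lower_bound). Below the threshold the count is
    accurate; once the threshold is reached it is only a lower bound.
    """
    counted = sum(1 for _ in islice(matching_docs, threshold))
    return counted, counted >= threshold


def pages_to_display(counted, hits_per_page=10, max_pages=5):
    """Number of pagination links needed, capped at `max_pages`."""
    return min(max_pages, math.ceil(counted / hits_per_page))


# 25 matches: the count is accurate and 3 pages are needed.
count, is_lower_bound = count_up_to(range(25), threshold=41)
print(count, is_lower_bound, pages_to_display(count))

# 1000 matches: counting stops at 41, which is enough to show all 5 pages.
count, is_lower_bound = count_up_to(range(1000), threshold=41)
print(count, is_lower_bound, pages_to_display(count))
```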

We don't want to discuss backward compatibility for now. Let's focus on what we want to have eventually, and only then discuss how we get there in a way that makes the change easy to digest for users. It's fine if we need multiple steps and the whole change is only available in 8 rather than 7.

Response format
We need a way to tell users whether the hit count is accurate or a lower bound. Multiple ideas have been mentioned:

  1. don't modify the response format: if the user asked to count up to X and the hit count is greater than X, just return X as the hit count
  2. use a string that ends with a + to indicate that the hit count is a lower bound, e.g. { "hits": { "total": "1234+" } } when the hit count is a lower bound and { "hits": { "total": 1234 } } like today otherwise
  3. use another field that tells whether the hit count is accurate or a lower bound, e.g. { "hits": { "total": 1234, "total_hits_relation": "gte" } }
  4. make hits.total an object with a value and a relation, e.g. { "hits": { "total": { "value": 1234, "relation": "gte" } } }
  5. make hits.total an object that has two possible keys, only one of which is ever set, e.g. { "hits": { "total": { "gte": 1234 } } } or { "hits": { "total": { "eq": 1234 } } }
  6. don't reuse hits.total at all and return a different field instead, e.g. hits.min or some better name.

When we discussed these options, we felt that 2 would make parsing more complicated, and that with 3 it would be too easy to miss the fact that you need to look up another field in order to know how to interpret the hit count.
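To get a feel for what option 4 means for clients, here is a small Python sketch of a reader that accepts both today's integer form and the object form. The helper names are hypothetical, and the "eq" relation used for accurate counts is an assumption borrowed from the key name in option 5; only "gte" appears in the option 4 example.

```python
import json


def read_total(response):
    """Return (value, relation) from a search response body.

    Accepts the current format, where hits.total is an integer, and the
    object format of option 4. "eq" for accurate counts is an assumption.
    """
    total = response["hits"]["total"]
    if isinstance(total, dict):
        return total["value"], total["relation"]
    return total, "eq"


def describe_total(response):
    """Render the hit count for display, e.g. "1234+" for a lower bound."""
    value, relation = read_total(response)
    return f"{value}+" if relation == "gte" else str(value)


new_format = json.loads('{ "hits": { "total": { "value": 1234, "relation": "gte" } } }')
old_format = json.loads('{ "hits": { "total": 1234 } }')
print(describe_total(new_format))
print(describe_total(old_format))
```

A client that ignores the relation still gets a usable number from `value`, but it has to go through the object to get it, which makes the relation hard to overlook.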

Implementation options

Option 1: Make track_total_hits take a number

We already have a track_total_hits switch, which we added for index sorting, but it currently takes a boolean. It would be easy to make it take a number instead, which would be the minimum number of hits to count accurately.

We could ease the transition to the new response format by using the current format when track_total_hits is unset or set to a boolean, and the new format when track_total_hits is set to a number, and then deprecate support for booleans.

Users who want accurate hit counts could set a very high value for this parameter. We could potentially allow a special value like -1 as a way to mean "be accurate".
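One way to picture how the parameter could be interpreted under this option is the Python sketch below. It is illustrative only, not Elasticsearch code: `resolve_track_total_hits` is a hypothetical helper, `None` stands for "no limit, count accurately", and treating an unset value like `true` and `false` as a threshold of 0 are assumptions that this proposal does not settle.

```python
ACCURATE = None  # sentinel meaning "no limit, count every hit"


def resolve_track_total_hits(value=None):
    """Map a track_total_hits setting to (threshold, response_format).

    Unset or boolean keeps the current response format; a number switches
    to the new format, with -1 meaning "be accurate".
    """
    # bool is a subclass of int in Python, so booleans are handled first.
    if value is None or value is True:
        return ACCURATE, "current"  # assumption: unset behaves like true
    if value is False:
        return 0, "current"  # assumption: false means no counting
    if isinstance(value, int):
        if value == -1:
            return ACCURATE, "new"
        if value < 0:
            raise ValueError("track_total_hits must be -1 or non-negative")
        return value, "new"
    raise TypeError("track_total_hits must be a boolean or an integer")


print(resolve_track_total_hits())       # unset
print(resolve_track_total_hits(True))   # boolean
print(resolve_track_total_hits(10000))  # count accurately up to 10000
print(resolve_track_total_hits(-1))     # special value for accurate counts
```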

Option 2: Hardcoded number of hits to count

Always count accurately up to index.max_result_window hits. If users need accurate hit counts, they will need to use an aggregation (we would need to add such an aggregation that counts docs).

Ping @elastic/es-clients to get opinions about the above thoughts, especially the response format.
