Provide true real-time indexing for Lucene based text index #13504

Open
@itschrispeck

Description

Problem

Currently, Pinot's RealtimeLuceneTextIndex uses Lucene's near real-time (NRT) indexing functionality. Some effort has already been made to reduce the delay. However, because newly ingested docs only become searchable after the index is refreshed, true real-time indexing is still missing.

This behavior shows up in a few ways:

  1. text_match(col, '"abcd"') -> forward match misses the most recent docs
  2. NOT text_match(col, '"abcd"') -> inverse match fails to exclude the most recent docs, so users will see docs containing abcd
  3. Missing results for upsert, for example:
    t0: doc A ingested/doc A is the valid doc based on upsert latest docs
    t1: doc A text indexed, doc A searchable w/ text index
    t2: doc B ingested/doc B is the valid doc based on upsert latest docs
    <text_match query returns doc A, but upsert invalidated doc A, no results>
    t3: doc B text indexed, doc B searchable w/ text index
    <text_match query returns doc B, doc B is searchable w/ text index and a valid doc, expected results>
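
The first two cases can be reproduced with a small model. This is only a sketch under stated assumptions: `LaggingTextIndexModel` and `numSearchableDocs` are hypothetical names, not Pinot classes, and plain substring matching stands in for Lucene. It assumes the inverse match is computed as the complement of the forward match over every ingested doc.

```java
import java.util.BitSet;

// Hypothetical model (not Pinot code): a segment whose text index lags behind ingestion.
final class LaggingTextIndexModel {
  private final String[] docs;          // every doc ingested into the segment
  private final int numSearchableDocs;  // docs visible to the text index after its last refresh

  LaggingTextIndexModel(String[] docs, int numSearchableDocs) {
    this.docs = docs;
    this.numSearchableDocs = Math.min(numSearchableDocs, docs.length);
  }

  // Forward match: only docs the text index has already refreshed can match.
  BitSet textMatch(String term) {
    BitSet result = new BitSet(docs.length);
    for (int docId = 0; docId < numSearchableDocs; docId++) {
      if (docs[docId].contains(term)) {
        result.set(docId);
      }
    }
    return result;
  }

  // Inverse match: complement of the forward match over ALL ingested docs,
  // so docs that are not yet searchable are never excluded.
  BitSet notTextMatch(String term) {
    BitSet result = textMatch(term);
    result.flip(0, docs.length);
    return result;
  }
}
```

With docs `{"abcd first", "efgh", "abcd latest"}` and only the first two searchable, `textMatch("abcd")` misses doc 2, while `notTextMatch("abcd")` returns doc 2 even though it contains abcd.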
    

With the delay minimized, we can use Lucene primitives to provide a small, in-memory, true real-time index that bridges the gap between what the NRT index has made searchable and the docs already ingested in Pinot.
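
A rough sketch of the bridging idea, under assumptions: `BridgedTextIndexModel`, `pending`, and `refresh()` are hypothetical names, and substring matching stands in for both the NRT index and the small in-memory index. The actual proposal would build the in-memory part from Lucene primitives. The point is only that a query consults the refreshed docs plus a small buffer of docs ingested since the last refresh.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

// Hypothetical model (not Pinot code): refreshed docs plus a small buffer of
// docs that were ingested after the last refresh.
final class BridgedTextIndexModel {
  private final List<String> refreshed = new ArrayList<>(); // visible to the NRT reader
  private final List<String> pending = new ArrayList<>();   // ingested, not yet refreshed

  // Ingest a doc; it is immediately searchable through the pending buffer.
  void add(String text) {
    pending.add(text);
  }

  // Simulates an NRT refresh: pending docs become visible to the main index.
  void refresh() {
    refreshed.addAll(pending);
    pending.clear();
  }

  // Union of matches from the refreshed docs and the pending buffer.
  // Doc ids follow ingestion order, so pending docs start at refreshed.size().
  BitSet textMatch(String term) {
    BitSet result = new BitSet();
    for (int i = 0; i < refreshed.size(); i++) {
      if (refreshed.get(i).contains(term)) {
        result.set(i);
      }
    }
    int offset = refreshed.size();
    for (int i = 0; i < pending.size(); i++) {
      if (pending.get(i).contains(term)) {
        result.set(offset + i);
      }
    }
    return result;
  }
}
```

Because the buffer only holds docs ingested since the last refresh, it stays small as long as the refresh delay is short, which is why minimizing the delay comes first.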

Alternatives considered:

  • bound the most recent doc considered during query execution based on index refresh delay

    • For the V1 query engine, I think this can be done in FilterOperatorUtils by 'adjusting' numDocs if the data source has a text index.
    • This does not solve the freshness issues (inconsistencies w/ query response metadata), but will avoid the correctness issues seen by inverse match.
    • This does not solve the upsert case, but changes the scope of the issue from results = {correct, extraneous, missing} to results = {correct, missing}
  • rewrite NOT text_match(col, '"abcd"') to text_match(col, '/.*/ AND NOT "abcd"')

    • this carries some unwanted performance implications, but could be used to guarantee query correctness (i.e. don't include results that should be excluded)
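
Both alternatives can be compared in a simplified model. This is a sketch under assumptions: `InverseMatchAlternativesModel` and `numSearchableDocs` are hypothetical names, substring matching stands in for Lucene, and it does not use `FilterOperatorUtils` or any real Pinot API. The first method bounds the complement to the docs the text index can already see. The second evaluates the whole negation inside the index, which is what the rewritten query does.

```java
import java.util.BitSet;

// Hypothetical model (not Pinot code) of the two alternatives for NOT text_match.
final class InverseMatchAlternativesModel {

  // Forward match as the lagging text index sees it.
  static BitSet textMatch(String[] docs, int numSearchableDocs, String term) {
    int bound = Math.min(numSearchableDocs, docs.length);
    BitSet result = new BitSet(docs.length);
    for (int docId = 0; docId < bound; docId++) {
      if (docs[docId].contains(term)) {
        result.set(docId);
      }
    }
    return result;
  }

  // Alternative 1: complement over [0, numSearchableDocs) instead of all ingested docs.
  static BitSet boundedNotTextMatch(String[] docs, int numSearchableDocs, String term) {
    BitSet result = new BitSet(docs.length);
    result.set(0, Math.min(numSearchableDocs, docs.length));
    result.andNot(textMatch(docs, numSearchableDocs, term));
    return result;
  }

  // Alternative 2: "match everything AND NOT term", evaluated entirely inside the
  // index, so only searchable docs can ever be returned.
  static BitSet rewrittenNotTextMatch(String[] docs, int numSearchableDocs, String term) {
    int bound = Math.min(numSearchableDocs, docs.length);
    BitSet result = new BitSet(docs.length);
    for (int docId = 0; docId < bound; docId++) {
      if (!docs[docId].contains(term)) {
        result.set(docId);
      }
    }
    return result;
  }
}
```

In this model both alternatives return the same docs: recent docs that are not yet searchable are left out rather than wrongly included, which matches the {correct, missing} outcome described above.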
