Problem
Currently, Pinot's RealtimeLuceneTextIndex uses Lucene's near real-time (NRT) indexing functionality. Some effort has already been made to reduce the index refresh delay. However, due to the nature of the implementation, true real-time indexing is still missing.
This behavior presents in a couple of ways:
- text_match(col, '"abcd"') -> forward match misses the most recent docs
- NOT text_match(col, '"abcd"') -> inverse match fails to exclude the most recent docs, so users will see docs containing abcd
- Missing results for upsert, for example:
  - t0: doc A ingested / doc A is the valid doc based on upsert latest docs
  - t1: doc A text indexed, doc A searchable w/ text index
  - t2: doc B ingested / doc B is the valid doc based on upsert latest docs
    <text_match query returns doc A, but upsert invalidated doc A, no results>
  - t3: doc B text indexed, doc B searchable w/ text index
    <text_match query returns doc B, doc B is searchable w/ text index and a valid doc, expected results>
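The three symptoms above can be reproduced with a toy model. This is only a sketch, not Pinot or Lucene code: `ToySegment`, `searchable`, and the method names are hypothetical, and the model assumes the text index sees only docs that existed at the last refresh while everything else in the segment sees docs as soon as they are ingested.

```python
class ToySegment:
    """Hypothetical stand-in for a consuming segment with a lagging text index."""

    def __init__(self):
        self.docs = []        # doc id -> (primary key, text), visible immediately
        self.searchable = 0   # doc ids below this bound are visible to the text index
        self.valid = {}       # primary key -> latest doc id (upsert validity)

    def ingest(self, key, text):
        doc_id = len(self.docs)
        self.docs.append((key, text))
        self.valid[key] = doc_id  # a newer doc invalidates the older one
        return doc_id

    def refresh(self):
        # Models the periodic index refresh: the index catches up with ingestion.
        self.searchable = len(self.docs)

    def text_match(self, term):
        return {i for i in range(self.searchable) if term in self.docs[i][1]}

    def not_text_match(self, term):
        # The inverse is taken against every ingested doc, not every indexed doc.
        return set(range(len(self.docs))) - self.text_match(term)

    def upsert_text_match(self, term):
        # Only docs that are still valid under upsert are returned.
        return self.text_match(term) & set(self.valid.values())


seg = ToySegment()
seg.ingest("k", "abcd v1")                    # t0: doc A (id 0) ingested
seg.refresh()                                 # t1: doc A searchable
at_t1 = seg.upsert_text_match("abcd")
seg.ingest("k", "abcd v2")                    # t2: doc B (id 1) ingested, A invalidated
forward_t2 = seg.text_match("abcd")           # misses doc B
inverse_t2 = seg.not_text_match("abcd")       # wrongly includes doc B
at_t2 = seg.upsert_text_match("abcd")         # empty: A is invalid, B is not indexed yet
seg.refresh()                                 # t3: doc B searchable
at_t3 = seg.upsert_text_match("abcd")
```

In this model, the window between t2 and t3 is exactly the refresh delay, which is why shrinking the delay narrows the problem without removing it.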
With the refresh delay minimized, we can use Lucene primitives to provide a small, in-memory, true real-time index that bridges the gap between what the NRT index has made searchable and the docs already ingested in Pinot.
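One way to picture the idea is a small "tail" structure that covers only the docs ingested since the last refresh and is consulted alongside the refreshed index. The sketch below is an illustration under that assumption, not the proposed implementation: `ToyHybridTextIndex` and its fields are hypothetical, whitespace tokenization stands in for a real analyzer, and no Lucene classes are used.

```python
from collections import defaultdict


class ToyHybridTextIndex:
    """Hypothetical model: a lagging refreshed index plus an in-memory tail."""

    def __init__(self):
        self._docs = []                 # doc id -> text
        self._refreshed = 0             # doc ids below this are in the refreshed index
        self._tail = defaultdict(set)   # token -> doc ids ingested since last refresh

    def add(self, text):
        doc_id = len(self._docs)
        self._docs.append(text)
        # The tail is updated synchronously, so the doc is searchable at once.
        for token in text.lower().split():
            self._tail[token].add(doc_id)
        return doc_id

    def refresh(self):
        # After a refresh, the main index covers everything and the tail can be dropped,
        # which keeps the tail proportional to the refresh delay.
        self._refreshed = len(self._docs)
        self._tail.clear()

    def text_match(self, token):
        token = token.lower()
        refreshed_hits = {
            i for i in range(self._refreshed)
            if token in self._docs[i].lower().split()
        }
        return refreshed_hits | self._tail.get(token, set())


idx = ToyHybridTextIndex()
idx.add("abcd first")
idx.refresh()
idx.add("abcd second")                      # not yet covered by the refreshed index
hits_before_refresh = idx.text_match("abcd")
idx.refresh()
hits_after_refresh = idx.text_match("abcd")
```

Because the tail only ever holds docs from one refresh interval, a shorter delay directly means a smaller in-memory structure.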
Alternatives considered:
- bound the most recent doc considered during query execution based on index refresh delay
  - For the V1 query engine, I think this can be done in FilterOperatorUtils by 'adjusting' numDocs if the data source has a text index.
  - This does not solve the freshness issues (inconsistencies w/ query response metadata), but will avoid the correctness issues seen by inverse match.
  - This does not solve the upsert case, but changes the scope of the issue from results = {correct, extraneous, missing} to results = {correct, missing}
- rewrite NOT text_match(col, '"abcd"') to text_match(col, '/.*/ AND NOT "abcd"')
  - this carries some unwanted performance implications, but could be used to guarantee query correctness (i.e. don't include results that should be excluded)
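The effect of both alternatives on an inverse match can be compared in a toy model. This is a sketch only: the function names and the `searchable` bound are hypothetical, substring matching stands in for Lucene query evaluation, and `/.*/` is assumed to match every doc visible to the index (that is, every indexed doc has at least one term in the column).

```python
def index_hits(docs, searchable, term):
    # The text index only sees doc ids below the searchable bound.
    return {i for i in range(searchable) if term in docs[i]}


def naive_inverse(docs, searchable, term):
    # Current behavior: negate against all ingested docs.
    return set(range(len(docs))) - index_hits(docs, searchable, term)


def bounded_inverse(docs, searchable, term):
    # Alternative 1: cap the doc count at what the index has made searchable.
    return set(range(searchable)) - index_hits(docs, searchable, term)


def in_index_inverse(docs, searchable, term):
    # Alternative 2: '/.*/ AND NOT "term"' is evaluated entirely inside the index,
    # so only docs the index can see are candidates.
    all_indexed = set(range(searchable))  # stands in for the /.*/ clause
    return all_indexed - index_hits(docs, searchable, term)


docs = ["abcd one", "efgh two", "abcd three", "ijkl four"]
searchable = 2  # docs 2 and 3 are ingested but not yet refreshed into the index

correct = {i for i, text in enumerate(docs) if "abcd" not in text}
naive = naive_inverse(docs, searchable, "abcd")
bounded = bounded_inverse(docs, searchable, "abcd")
in_index = in_index_inverse(docs, searchable, "abcd")
```

Here the naive inverse returns doc 2 even though it contains abcd, while both alternatives return a strict subset of the correct answer: doc 3 is missing until the next refresh, but nothing extraneous is returned.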