Provide true real-time indexing for Lucene based text index

### Problem
Currently, Pinot's `RealtimeLuceneTextIndex` uses Lucene's near real-time (NRT) indexing functionality. Some [effort](https://github.com/apache/pinot/pull/13503) has already been made to reduce the refresh delay. However, because NRT only makes docs searchable after the next index refresh, true real-time indexing is still missing.

This shows up in a few ways:
1. `text_match(col, '"abcd"')` -> forward match misses the most recent docs
2. `NOT text_match(col, '"abcd"')` -> inverse match fails to exclude the most recent docs, so users will see docs containing `abcd`
3. Missing results for upsert, for example:
   ```
   t0: doc A ingested/doc A is the valid doc based on upsert latest docs
   t1: doc A text indexed, doc A searchable w/ text index
   t2: doc B ingested/doc B is the valid doc based on upsert latest docs
   <text_match query returns doc A, but upsert invalidated doc A, no results>
   t3: doc B text indexed, doc B searchable w/ text index
   <text_match query returns doc B, doc B is searchable w/ text index and a valid doc, expected results>
   ```
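
To make the three cases concrete, here is a toy model of a segment whose text index lags behind ingestion. It is only a sketch: the class and method names are made up for illustration (none of them exist in Pinot or Lucene), matching is a plain substring check, and `refresh()` stands in for the NRT refresh.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy model only: none of these names exist in Pinot or Lucene.
class LaggingTextIndexDemo {
  private final List<String> docs = new ArrayList<>();                 // docId = position
  private final Map<String, Integer> validDocByKey = new HashMap<>();  // upsert: key -> latest docId
  private int indexedDocs = 0; // docs [0, indexedDocs) are visible to the text index

  // Ingestion makes the doc queryable and (for upsert) valid immediately.
  void ingest(String key, String text) {
    docs.add(text);
    validDocByKey.put(key, docs.size() - 1);
  }

  // Stand-in for the periodic NRT refresh that makes new docs searchable.
  void refresh() {
    indexedDocs = docs.size();
  }

  // text_match: only sees docs that were indexed before the last refresh.
  List<Integer> textMatch(String term) {
    List<Integer> out = new ArrayList<>();
    for (int docId = 0; docId < indexedDocs; docId++) {
      if (docs.get(docId).contains(term)) {
        out.add(docId);
      }
    }
    return out;
  }

  // NOT text_match: complement over ALL ingested docs, so unindexed docs leak in.
  List<Integer> notTextMatch(String term) {
    List<Integer> matched = textMatch(term);
    List<Integer> out = new ArrayList<>();
    for (int docId = 0; docId < docs.size(); docId++) {
      if (!matched.contains(docId)) {
        out.add(docId);
      }
    }
    return out;
  }

  // text_match on an upsert table: keep only docs that are still valid.
  List<Integer> textMatchValid(String term) {
    List<Integer> out = new ArrayList<>();
    for (int docId : textMatch(term)) {
      if (validDocByKey.containsValue(docId)) {
        out.add(docId);
      }
    }
    return out;
  }

  public static void main(String[] args) {
    LaggingTextIndexDemo seg = new LaggingTextIndexDemo();
    seg.ingest("k1", "abcd v1"); // t0: doc 0 ingested
    seg.refresh();               // t1: doc 0 searchable
    seg.ingest("k1", "abcd v2"); // t2: doc 1 ingested, invalidates doc 0
    System.out.println(seg.textMatch("abcd"));      // case 1: doc 1 is missed
    System.out.println(seg.notTextMatch("abcd"));   // case 2: doc 1 is wrongly returned
    System.out.println(seg.textMatchValid("abcd")); // case 3: no valid doc matches
    seg.refresh();               // t3: doc 1 searchable
    System.out.println(seg.textMatchValid("abcd")); // doc 1 is now returned
  }
}
```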

With the delay minimized, we can use Lucene primitives to provide a small, in-memory, true real-time index that bridges the gap between what the NRT index can search and the docs already ingested in Pinot.
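
A rough sketch of the bridging idea, under heavy assumptions: the names below are hypothetical, and both the NRT index and the in-memory bridge are modeled with plain substring matching rather than Lucene. Which Lucene primitive would back the bridge is left open here (something like `MemoryIndex` might be a candidate, but that is only a guess). The point is just that a query consults the NRT index for docs it has already seen and the bridge for the tail it has not.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model only: hypothetical names, substring matching instead of Lucene.
class BridgedTextIndexSketch {
  private final List<String> docs = new ArrayList<>(); // docId = position
  private int nrtVisibleDocs = 0; // docs [0, nrtVisibleDocs) are searchable via NRT

  void ingest(String text) {
    docs.add(text);
  }

  // Stand-in for the NRT refresh; afterwards the bridge covers nothing.
  void refreshNrt() {
    nrtVisibleDocs = docs.size();
  }

  // Number of docs that only the in-memory bridge can answer for.
  int bridgeSize() {
    return docs.size() - nrtVisibleDocs;
  }

  // Union of NRT results and bridge results; the two docId ranges never overlap.
  List<Integer> textMatch(String term) {
    List<Integer> out = new ArrayList<>();
    collect(term, 0, nrtVisibleDocs, out);           // part answered by the NRT index
    collect(term, nrtVisibleDocs, docs.size(), out); // part answered by the bridge
    return out;
  }

  private void collect(String term, int from, int to, List<Integer> out) {
    for (int docId = from; docId < to; docId++) {
      if (docs.get(docId).contains(term)) {
        out.add(docId);
      }
    }
  }
}
```

Because the bridge only has to cover docs ingested since the last refresh, it stays small as long as the refresh delay stays small.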

### Alternatives considered:
- bound the most recent doc considered during query execution based on index refresh delay
   - For the V1 query engine, I think this can be done in `FilterOperatorUtils` by 'adjusting' `numDocs` if the data source has a text index.
   - This does not solve the freshness issues (inconsistencies w/ query response metadata), but will avoid the correctness issues seen by inverse match.
   - This does not solve the upsert case, but changes the scope of the issue from results = {correct, extraneous, missing} to results = {correct, missing}

- rewrite `NOT text_match(col, '"abcd"')` to `text_match(col, '/.*/ AND NOT "abcd"')`
   - this carries some unwanted performance implications, but could be used to guarantee query correctness (i.e. don't include results that should be excluded)
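
Both alternatives can be sketched in a few lines. This is illustrative only: the class, the method names, and the `numDocsIndexed` parameter are hypothetical, and nothing here reflects how `FilterOperatorUtils` is actually structured. The first method bounds the inverse match to docs the text index has already seen; the second builds the rewritten Lucene query string from the example above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Illustrative only: hypothetical names, not Pinot APIs.
class InverseMatchAlternativesSketch {

  // Alternative 1: compute NOT text_match over [0, bound) instead of [0, numDocs),
  // where bound excludes docs the text index has not refreshed yet.
  static List<Integer> boundedNotTextMatch(Set<Integer> textMatches, int numDocs, int numDocsIndexed) {
    int bound = Math.min(numDocs, numDocsIndexed);
    List<Integer> out = new ArrayList<>();
    for (int docId = 0; docId < bound; docId++) {
      if (!textMatches.contains(docId)) {
        out.add(docId);
      }
    }
    return out;
  }

  // Alternative 2: push the negation into Lucene so only indexed docs can match.
  static String rewriteInverse(String luceneQuery) {
    return "/.*/ AND NOT " + luceneQuery;
  }
}
```

With three ingested docs of which only the first two are indexed, the bounded version returns at most docs 0 and 1, so an unindexed doc containing `abcd` can no longer slip into the inverse match. It is still missing from the forward match, which is the {correct, missing} trade-off described above.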
