Analytics engine - add support for index patterns, aliases, multi-index search #21822
mch2 wants to merge 5 commits into
Enables queries against aliases, wildcard patterns, and comma-separated index expressions. The planner resolves these to concrete indices, validates schema compatibility, and builds a union row type. At the data node, Rust widens the registered ListingTable schema from the plan's base_schema so DataFusion null-fills columns this shard doesn't have.

Key components:
- IndexResolution: expands aliases/patterns to concrete indices, validates field type compatibility, rejects filter aliases and data streams
- FieldStorageResolver.merged(): unions per-field storage across backing indices
- ShardTargetResolver: fans out shard routing across all concrete indices
- widen_schema_from_plan (Rust): appends missing nullable columns to the ListingTable using from_substrait_named_struct for type conversion
- UnifiedQueryService: preserves lazy table resolution for wildcard support
- Indexed execution (filter delegation): now passes plan bytes to Rust, enabling multi-index support on the delegation path

Signed-off-by: Marc Handalian <marc.handalian@gmail.com>
Codecov Report
✅ All modified and coverable lines are covered by tests.
@@ Coverage Diff @@
## main #21822 +/- ##
============================================
+ Coverage 73.34% 73.42% +0.08%
+ Complexity 75417 75404 -13
============================================
Files 6032 6032
Lines 342404 342404
Branches 49235 49235
============================================
+ Hits 251142 251425 +283
+ Misses 71272 70915 -357
- Partials 19990 20064 +74
❌ Gradle check result for 576c8e8: FAILURE
Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?
sandeshkr419 left a comment
Thanks for the changes @mch2
Let's add test cases for these as well:
- case-insensitive table resolution
- test for findTableName on a multi-input join/union shape
- Rust tests for widen_schema_from_plan, first_named_table_name/base_schema_for_table with multi-scan plans
- nested field schema mismatch
- IndexResolutionTests tests for the closed-index case on the concrete-index path
- IndexResolutionTests.testMissingNameThrows doesn't pass a resolver, so it can't actually exercise the wildcard fallback; please check once
- OpenSearchSchemaBuilderTests.testCommaSeparatedSourcesResolveToUnionedTable doesn't test field-conflict behavior
(Some of these tests may already exist and I may have overlooked them in the code; please just validate that they do.)
- Case-insensitive table resolution test (OpenSearchSchemaBuilderTests)
- findTableName on join/union shapes (RelNodeUtilsTests) + generalize findTableName to match any TableScan, not just OpenSearchTableScan
- Field-conflict test for comma-separated sources (first-wins semantics)
- IndexResolution: concrete name with resolver, exclusion pattern test
- Rust: empty/garbage input tests for first_named_table_name
- Rust: widen_schema_from_plan noop tests (empty plan, all cols present)

Signed-off-by: Marc Handalian <marc.handalian@gmail.com>
Signed-off-by: Marc Handalian <marc.handalian@gmail.com>
❌ Gradle check result for bd5a25f: null
Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?
    for (String name : clusterState.metadata().getIndicesLookup().keySet()) {
        Table table = resolveTable(clusterState, resolver, name);
        if (table != null) {
            super.put(name, table);
        }
This might be pretty expensive right?
Description
Summary
Adds support for alias, index pattern, and multi-index queries to the analytics engine. A query like source=my_alias now fans out across all backing indices, with schema widening to null-fill columns that individual shards don't have.

Planning: When the planner encounters a table name that resolves to multiple indices, it validates that any field shared across indices has a compatible type (e.g. text and keyword both map to VARCHAR, which is fine; long vs keyword is rejected with a clear error). It then builds a single scan node whose row type is the union of all fields across all backing indices, and routes the query to shards on every backing index.

Execution: Each shard registers its local parquet table with the schema inferred from its files, but the Substrait plan references the full union schema. To bridge the gap, session creation extracts the plan's declared schema (base_schema) and appends any columns this shard doesn't have as nullable. DataFusion's built-in adapter then produces null values for those columns at read time.

For single-index queries this widening is a no-op: the plan's schema matches the shard's schema, so the field-name comparison exits immediately with no work done.
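As a rough illustration of the planning and execution steps described above, here is a minimal, self-contained sketch. It is not the code in this PR: the class UnionSchemaSketch, its method names, and the use of plain strings for types are all hypothetical, and the long-to-BIGINT mapping is an assumption (the description only states that text and keyword both map to VARCHAR).

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch; none of these names come from the PR itself.
final class UnionSchemaSketch {

    // Assumed mapping from index field types to planner types.
    // Only text/keyword -> VARCHAR is stated in the description; BIGINT is an assumption.
    static String plannerType(String fieldType) {
        switch (fieldType) {
            case "text":
            case "keyword":
                return "VARCHAR";
            case "long":
                return "BIGINT";
            default:
                throw new IllegalArgumentException("unmapped field type: " + fieldType);
        }
    }

    // Planning: build the union row type across all backing indices and
    // reject any shared field whose planner types disagree.
    static Map<String, String> unionRowType(Map<String, Map<String, String>> mappingsByIndex) {
        Map<String, String> union = new LinkedHashMap<>();
        for (Map.Entry<String, Map<String, String>> index : mappingsByIndex.entrySet()) {
            for (Map.Entry<String, String> field : index.getValue().entrySet()) {
                String type = plannerType(field.getValue());
                String existing = union.putIfAbsent(field.getKey(), type);
                if (existing != null && !existing.equals(type)) {
                    throw new IllegalArgumentException(
                        "field [" + field.getKey() + "] has incompatible types across indices: "
                            + existing + " vs " + type + " (index " + index.getKey() + ")");
                }
            }
        }
        return union;
    }

    // Execution: keep the shard's own columns and append any plan columns the
    // shard lacks. In the real engine those appended columns would be nullable
    // and null-filled at read time; here we only track the names.
    static List<String> widen(List<String> shardColumns, List<String> planColumns) {
        List<String> widened = new ArrayList<>(shardColumns);
        for (String column : planColumns) {
            if (!shardColumns.contains(column)) {
                widened.add(column);
            }
        }
        return widened;
    }
}
```

With a single index, the plan columns equal the shard columns, so widen returns the shard's columns unchanged, which mirrors the no-op case described above.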
Table name binding: The plan references the logical name (e.g. "my_alias"), but each shard knows itself by its concrete index name (e.g. "idx_a"). Session creation extracts the logical name from the plan's NamedTable and registers the table under it, so the Substrait consumer can bind. For single-index queries, the logical name equals the concrete name, so the same code path applies with no branching.

Wildcard support (test frontend): The PPL frontend's schema lookup was flattening tables into a static map, losing the lazy resolution that handles wildcards. Fixed by delegating get() to the underlying schema, which resolves expressions on demand.
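To make the difference between a static map and on-demand resolution concrete, here is a small sketch of a lookup that defers to a resolver function at get() time. Everything in it is hypothetical: LazyTableMapSketch, the resolver signature, and the resolverCalls counter are illustration only, and the per-expression cache is an added assumption rather than something the PR describes.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Hypothetical sketch of lazy table lookup; not the PR's schema classes.
final class LazyTableMapSketch<T> {
    private final Function<String, T> resolver;
    // Assumed optimization: remember expressions that were already resolved.
    private final Map<String, T> cache = new HashMap<>();
    // Exposed only so the example can show when resolution actually happens.
    int resolverCalls = 0;

    LazyTableMapSketch(Function<String, T> resolver) {
        this.resolver = resolver;
    }

    // Resolve the expression (concrete name, alias, or wildcard) only when asked,
    // instead of enumerating every known index up front.
    T get(String expression) {
        T cached = cache.get(expression);
        if (cached != null) {
            return cached;
        }
        resolverCalls++;
        T resolved = resolver.apply(expression);
        if (resolved != null) {
            cache.put(expression, resolved);
        }
        return resolved;
    }
}
```

A static map can only answer for names it was populated with, so an expression such as logs-* has nothing to match. Deferring to a resolver lets the expression itself be interpreted at lookup time. If a cache like the one sketched here were used, it would need invalidating when cluster state changes.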
Filter delegation (indexed path): The indexed execution path (filter delegation to Lucene) now also receives plan bytes, enabling multi-index support for queries with MATCH predicates across aliases.

Related Issues
Resolves #[Issue number to be closed when this PR is merged]
Check List
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.