
Query rewrite with secondary index and materialized view #1435

Closed
Tracked by #3
dai-chen opened this issue Mar 13, 2023 · 3 comments
Comments

@dai-chen
Collaborator

dai-chen commented Mar 13, 2023

Is your feature request related to a problem?

In #1379 and #1407, we're able to build a secondary index and a materialized view on a Maximus table. However, some work remains for both forms of query acceleration:

  1. For a secondary index, streaming queries won't be rewritten and accelerated
  2. For a materialized view, the user has to query the materialized view explicitly instead of the original base table

What solution would you like?

Rewrite the user's query against an available secondary index or materialized view on the Maximus table to accelerate its execution.
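To make the idea concrete, here is a minimal, purely illustrative sketch of such a rewrite rule: if a materialized view covers every column a query references on the base table, redirect the scan to the MV. All names (`QueryPlan`, `MaterializedView`, `rewrite_with_mv`) are hypothetical and not part of the actual Spark extension.

```python
from dataclasses import dataclass

@dataclass
class MaterializedView:
    name: str              # hypothetical: OpenSearch index backing the MV
    base_table: str        # source table the MV was built from
    columns: frozenset     # columns the MV materializes

@dataclass
class QueryPlan:
    table: str             # table the query scans
    columns: frozenset     # columns the query references

def rewrite_with_mv(plan, views):
    """Redirect the scan to an MV when it covers all referenced columns."""
    for mv in views:
        if mv.base_table == plan.table and plan.columns <= mv.columns:
            return QueryPlan(table=mv.name, columns=plan.columns)
    return plan  # no covering MV: fall back to the base table

# Example: a query on the base table is transparently redirected to the MV.
mv = MaterializedView("mv_http_logs", "http_logs", frozenset({"status", "ts"}))
plan = rewrite_with_mv(QueryPlan("http_logs", frozenset({"status"})), [mv])
```

In a real implementation this would be a Catalyst optimizer rule matching logical plans rather than a table-name substitution, but the covering check is the same idea.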

What alternatives have you considered?

Treat query rewrite as a known limitation and keep it out of scope for now.

Do you have any additional context?

N/A

@dai-chen dai-chen self-assigned this Mar 13, 2023
@dai-chen dai-chen changed the title Query rewrite with materialized view Query rewrite with secondary index and materialized view Mar 14, 2023
@muralikpbhat

Nice. Will this be capable of acceleration (across fields) in the case of a covering index with selective fields?

Say we index only field 'a' into OpenSearch and the query is something like 'avg(b) where a startswith "blah"'. OpenSearch will answer the prefix query on 'blah*' since it is indexed, but are we planning to use those results in the Spark scan for b while aggregating?
Also, given that the results for blah* can come from multiple shards, how are we planning to parallelize the Spark scan? I am assuming Spark treats OS as a whole and does not integrate at the shard level. It seems like the results from blah* would have to be sent to each Spark partition, which would then have to filter for those doc IDs...? Will this really be faster than the scenario where both a and b are processed by Spark?

@dai-chen
Collaborator Author

@muralikpbhat Thanks for the comment!

I think your example is more about a fine-grained filtering index. Actually we're focused on the following index data structures:

  1. Fine-grained covering index or MV: the index or MV can answer the query by itself without needing to look at the source data again. In other words, once it's loaded into OpenSearch, the OS index can work alone for search or visualization.
  2. Coarse-grained skipping index: after filtering by a startswith "blah", rather than telling us which source rows match, the skipping index tells us which source file(s) may contain the answer. Given that file list, Spark does the rest (planning the job, fetching data, aggregating, joining, and then loading into OS if the query comes from an MV)
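The skipping-index behavior in point 2 can be sketched as follows. This is a toy illustration, not the real Flint implementation: each source file keeps a small per-file summary (here, min/max of column 'a'), and the index conservatively returns the files that *may* contain matches, so Spark scans only those.

```python
def candidate_files(file_stats, prefix):
    """Return files whose [min, max] range of column 'a' could contain
    values starting with `prefix`. Conservative: false positives are
    allowed (the file is scanned anyway), false negatives are not."""
    hi = prefix + "\uffff"  # upper bound of the prefix range
    return [f for f, (lo, up) in file_stats.items()
            if not (up < prefix or lo > hi)]

# Hypothetical per-file min/max summaries for column 'a'.
stats = {
    "part-0001": ("aaa", "bzz"),   # range overlaps 'blah*' -> scanned
    "part-0002": ("cat", "dog"),   # cannot contain 'blah*' -> skipped
    "part-0003": ("bl",  "bm"),    # range overlaps 'blah*' -> scanned
}
files = candidate_files(stats, "blah")
```

Note the asymmetry with a covering index: the skipping index never answers the query itself, it only prunes the file list that Spark then scans, filters, and aggregates.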

@dai-chen
Collaborator Author

dai-chen commented Mar 28, 2023

I'm closing this because query rewrite for MV has been deprioritized.

The main reason is the performance overhead introduced by the current logical integration between OpenSearch and Spark. Thinking of an MV as a managed Maximus table and rewriting queries in Spark makes sense. However, the cost of transferring data between Spark and OpenSearch may be high. Meanwhile, there are limitations in Spark SQL support for OpenSearch Dashboards that would require an extension.

So for the current initial phase, we'd like to make the MV data a regular OpenSearch index so that OpenSearch can access it directly.
