[FEATURE] Object Storage (S3) Data Ingestion through Streaming Query #948

@dai-chen

Is your feature request related to a problem?
One of the key technical challenges in #719 is how to maintain consistency between the base table (S3 data) and the derived table (OpenSearch index or materialized view).

What solution would you like?
One solution is to refresh new data from S3 to OpenSearch incrementally. We propose enhancing our query engine by unifying batch processing and stream processing in a single architecture, as existing solutions such as Apache Flink and Spark do. In particular, the enhancement includes changes to query planning, the query execution engine, and the query plan itself.

PoC branch: https://github.com/opensearch-project/sql/tree/poc/maximus-m1. A user manual and a detailed design doc will be published later, as planned below.

What alternatives have you considered?
The alternative is to rebuild the derived table (full refresh) on user demand or on a regular basis. The current batch processing architecture can already do this, but it introduces significant overhead for large S3 datasets.
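To make the trade-off between the two approaches concrete, here is a minimal Python sketch. It is not taken from the PoC branch: `DerivedTable`, `full_refresh`, `incremental_refresh`, and the in-memory `bucket` dict standing in for S3 are all hypothetical names chosen for illustration. It only shows the general idea that an incremental refresh ingests objects it has not seen before, while a full refresh rewrites everything each time.

```python
from dataclasses import dataclass, field


@dataclass
class DerivedTable:
    """Hypothetical stand-in for an OpenSearch index or materialized view."""
    rows: list = field(default_factory=list)
    rows_written: int = 0  # total rows written across all refreshes

    def append(self, new_rows):
        self.rows.extend(new_rows)
        self.rows_written += len(new_rows)

    def clear(self):
        self.rows.clear()


def full_refresh(bucket, table):
    """Batch-style refresh: rebuild the derived table from every object."""
    table.clear()
    for key in sorted(bucket):
        table.append(bucket[key])


def incremental_refresh(bucket, table, seen):
    """Streaming-style micro-batch: ingest only objects not seen before."""
    new_keys = sorted(k for k in bucket if k not in seen)
    for key in new_keys:
        table.append(bucket[key])
        seen.add(key)
    return new_keys


# The dict plays the role of an S3 bucket: object key -> rows in that object.
bucket = {"logs/001.json": [{"status": 200}, {"status": 404}]}
full, incr, seen = DerivedTable(), DerivedTable(), set()

full_refresh(bucket, full)
incremental_refresh(bucket, incr, seen)

# A new object arrives; refresh both derived tables again.
bucket["logs/002.json"] = [{"status": 500}]
full_refresh(bucket, full)
incremental_refresh(bucket, incr, seen)

assert full.rows == incr.rows  # both tables end up consistent with the bucket
print(full.rows_written, incr.rows_written)  # 5 3
```

Both tables end up with the same rows, but the full refresh rewrites the old object on every run, so its write volume grows with the size of the dataset rather than with the amount of new data. Note that the `seen` set here lives only in memory; persisting that progress is part of the failure recovery work listed as missing in Phase 1.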

Do you have any additional context?

Phase 1

Goal:

  • Ready for performance evaluation
  • Ready for feature evaluation
  • Missing
    • Failure recovery
    • Security

Tasks

Phase 2

Goal:

  • Ready for experimental release
  • Missing
    • Pipeline Execution
    • Distributed Execution

Tasks

Phase 3

Goal:

  • Ready for production deployment

Tasks

  • Pipeline Execution
  • Distributed Execution

Metadata

Labels

Meta (Meta issue, not directly linked to a PR), feature

Projects

Status

In Progress
