Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[RFC]Cross index (cross cluster) OpenSearch Join using OpenSearch hadoop plugin #753

Open
YANG-DB opened this issue Oct 8, 2024 · 2 comments
Assignees
Labels
enhancement New feature or request Roadmap:Search Project-wide roadmap label

Comments

@YANG-DB
Copy link
Member

YANG-DB commented Oct 8, 2024

Is your feature request related to a problem?
Currently, OpenSearch does not provide a way to perform cross-index or cross-cluster joins using the OpenSearch DSL. This RFC proposes to extend OpenSearch's capabilities by leveraging Spark's MPP engine through the OpenSearch Flint API and Hadoop OpenSearch Library. This will enable users to execute cross-index joins natively via Spark, while abstracting away the underlying complexity.

What solution would you like?

The Problem Statement

The OpenSearch Query engine lacks native support for cross-index (or cross-cluster) joins. This limitation hinders scenarios where users need to merge data residing in different indexes (or clusters) without manually combining the results at the application level or (in the SQL plugin case) in the OpenSearch coordinating node running the two sides of the join.

In large-scale data environments, this becomes a bottleneck for performing analytics or relational-style queries across distributed datasets.

Proposed Solution

We propose using Apache Spark's engine via the OpenSearch Flint API (using PPL Join commands) and Hadoop OpenSearch integration, to allow users to perform cross-index joins. The solution will:

  • Abstract the complexity of Spark: Users will be able to write a query in OpenSearch's PPL, but behind the scenes, the query will be translated into a Spark SQL query that can perform the join.

  • Support for relational joins: The system will allow different types of joins (INNER, LEFT OUTER, etc.) between indexes residing on the same or different clusters.

  • Leverage Spark’s distributed processing: By utilizing Spark's distributed architecture, the solution will ensure scalability and performance in handling large datasets.


Architecture

[OpenSearch Client (SQL Async API)] 
        ↓
[PPL Engine] → [Join Operation]
        ↓
[Catalyst AST Translator]
        ↓
[Spark Execution Engine]
        ↓
[Hadoop OpenSearch Connector]
        ↓
[Data Retrieval from Multiple Indexes/Clusters]

Point to Consider

Hive - Table

Today Spark / Flint can use to query Hive tables using OpenSearch hadoop .

Spark - SQL

Using spark SQL to read indices directly from Spark as shown here:

SQLContext sql = new SQLContext(sc);
DataFrame df = sql.read().format("opensearch").load("buckethead/albums");
DataFrame playlist = df.filter(df.col("category").equalTo("pikes").and(df.col("year").geq(2024)))

Do you have any additional context?

@YANG-DB YANG-DB added enhancement New feature or request Roadmap:Search Project-wide roadmap label labels Oct 8, 2024
@YANG-DB YANG-DB self-assigned this Oct 8, 2024
@YANG-DB
Copy link
Member Author

YANG-DB commented Oct 8, 2024

@penghuo @LantaoJin can you plz review and comment ?
thanks

@LantaoJin
Copy link
Member

LantaoJin commented Oct 10, 2024

IMO, there are two kinds of cases which will leverage opensearch-hadoop (haven't deep dive, from the description, it sounds a connector):

  1. Fundamental usage of opensearch-hadoop: An OpenSearch index which is mapped as a Hive external table can be queried from Hadoop ecosystem (MR, Hive, Spark, Presto).
  2. External usage: OpenSearch query (DSL/SQL/PPL) can query data crossing HDFS and OpenSearch index.

In the case 1, the target query is from Hadoop ecosystem (Hive, Spark, Presto). Sounds there is no requirement for enhancement here. It's what the opensearch-hadoop project did.
In the case 2, the target query is from OpenSearch (DSL/SQL/PPL). My question is what is the user story of this kind of crossing query. Why not just run SQL via case 1? What problem this solution resolve? Did you hear any request from community now?

@YANG-DB YANG-DB removed the untriaged label Oct 21, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request Roadmap:Search Project-wide roadmap label
Projects
Status: New
Development

No branches or pull requests

2 participants