Support RelationSubquery PPL #775

Merged
merged 6 commits into opensearch-project:main on Oct 17, 2024

Conversation

LantaoJin
Member

@LantaoJin LantaoJin commented Oct 12, 2024

Description

(Relation) Subquery usage

InSubquery, ExistsSubquery and ScalarSubquery are all subquery expressions. RelationSubquery, however, is not a subquery expression; it is a subquery plan, commonly used in a Join or Search/From clause.

  • source = table1 | join left = l right = r [ source = table2 | where d > 10 | head 5 ] (subquery in join right side)
  • source = [ source = table1 | join left = l right = r [ source = table2 | where d > 10 | head 5 ] | stats count(a) by b ] as outer | head 1

SQL Migration examples with Subquery PPL:

tpch q13

select
    c_count,
    count(*) as custdist
from
    (
        select
            c_custkey,
            count(o_orderkey) as c_count
        from
            customer left outer join orders on
                c_custkey = o_custkey
                and o_comment not like '%special%requests%'
        group by
            c_custkey
    ) as c_orders
group by
    c_count
order by
    custdist desc,
    c_count desc

Rewritten by PPL (Relation) Subquery:

SEARCH source = [
  SEARCH source = customer
  | LEFT OUTER JOIN left = c right = o ON c_custkey = o_custkey
    [
      SEARCH source = orders
      | WHERE not like(o_comment, '%special%requests%')
    ]
  | STATS COUNT(o_orderkey) AS c_count BY c_custkey
] AS c_orders
| STATS COUNT(1) AS custdist BY c_count
| SORT - custdist, - c_count

Issues Resolved

Resolve #713 as a sub-task of #661

Check List

  • Updated documentation (ppl-spark-integration/README.md)
  • Implemented unit tests
  • Implemented tests for combination with other commands
  • Commits are signed per the DCO using --signoff

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

Signed-off-by: Lantao Jin <ltjin@amazon.com>
Signed-off-by: Lantao Jin <ltjin@amazon.com>
@LantaoJin LantaoJin added Lang:PPL Pipe Processing Language support 0.6 labels Oct 12, 2024
@LantaoJin LantaoJin marked this pull request as ready for review October 12, 2024 12:36
@@ -55,9 +55,9 @@ commands
;

searchCommand
Member

@LantaoJin can you please share more details of the usage of FROM here?
Maybe give some examples of when to use SEARCH vs FROM?
Thanks

Member Author

There is no specific difference between the keywords SEARCH and FROM from a syntax perspective. I just thought we might need an alias for the SEARCH keyword, since PPL should be more ambitious in its query semantics. PPL will offer more powerful functionality for data exploration and analytics beyond searching text or operating on time-series data. Similar things are happening in the industry, such as Google BigQuery's pipe syntax and the KQL tabular operators.
Even in Splunk SPL2, the from command is separate from the search command.
A query can be naturally rewritten in a piped query language by users who are familiar with SQL:
ANSI SQL:

SELECT sum(bytes) AS sum, host
FROM main WHERE earliest=-5m@m
GROUP BY host

SPL2 from query:

| FROM main
WHERE earliest=-5m@m GROUP BY host
SELECT sum(bytes) AS sum, host
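
For comparison, a rough PPL sketch of the same aggregation (only a sketch: it assumes FROM works as a drop-in alias for SEARCH, which is what this grammar change would allow, and it omits the earliest=-5m@m time filter):

SEARCH source = main | STATS sum(bytes) AS sum BY host

FROM source = main | STATS sum(bytes) AS sum BY host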

But for now I think I will revert this keyword alias, since it should be fully discussed before delivering it to our customers. I hope we can expand the vision of PPL to cover more data analytics semantics.

Member

@LantaoJin this sounds like a great idea and a very clean and clear vision going forward.
IMO please add this as a separate PR (under the doc/planning folder) including the explanations, concepts and comparative analysis you've shown here...
thanks again!!

@@ -32,7 +32,7 @@ The example show fetch all the document from accounts index with .

PPL query:

os> source=accounts account_number=1 or gender="F";
os> SEARCH source=accounts account_number=1 or gender="F";
Member Author

There are two example queries here. One of them omits the SEARCH keyword, so this one keeps the SEARCH keyword.

tableSourceClause
: tableSource (COMMA tableSource)*
: tableSource (COMMA tableSource)* (AS alias = qualifiedName)?
Member Author

@LantaoJin LantaoJin Oct 15, 2024

A table alias is useful in a query which contains a subquery, for example:

select a, (  
             select sum(b) 
             from catalog.schema.table1 as t1 
             where t1.a = t2.a
          )  sum_b
 from catalog.schema.table2 as t2

t1 and t2 are table aliases used in the correlated subquery; sum_b is the subquery alias.
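
For reference, a rough PPL sketch of the same pattern, combining the new table alias support with a correlated scalar subquery (only a sketch: the column names and the eval-based ScalarSubquery form are illustrative, following the ppl-subquery-command doc):

source = catalog.schema.table2 as t2
| eval sum_b = [
    source = catalog.schema.table1 as t1 | where t1.a = t2.a | stats sum(b)
  ]
| fields a, sum_b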

Member

thanks for the detailed review - can you also add this explanation to the ppl-subquery-command doc ?
thanks

Member Author

Yes. I will give more examples in the doc.


// One tableSourceClause will generate one Relation node, with or without an alias,
// even if the relation contains more than one table source.
// These table sources in one relation will be read one by one in OpenSearch.
Collaborator

Do we support this: source = tb1, tb2, tb3 as tbl?
Should we treat tb1, tb2, tb3 as a single table and let the datasource connector handle it? For instance,

source=`tb1, tb2, tb3` as tbl

Member Author

Yes. The current valid syntax is source = tb1, tb2, tb3 as tbl or source = `tb1`, `tb2`, `tb3` as tbl.
Indexes tb1, tb2 and tb3 will be converted into one relation in OpenSearch, and tbl is the alias of this relation.
Spark doesn't support comma-separated tables as a relation: Catalyst will throw Table not found for an UnresolvedRelation with a comma-named table identifier.
For Spark, source=`tb1, tb2, tb3` as tbl is equivalent to source = tb1, tb2, tb3 as tbl and fails in resolution for the reason I just mentioned.
For an OpenSearch index, source=`tb1, tb2, tb3` as tbl cannot work either, since "tb1, tb2, tb3" is not a valid index name.

Member Author

@LantaoJin LantaoJin Oct 16, 2024

source = tb1, tb2, tb3 as tbl and source = `tb1`, `tb2`, `tb3` as tbl are valid because we never need the pattern
source = tb1 as t1, tb2 as t2, tb3 as t3: it is not meaningful, since tb1, tb2 and tb3 are combined into one relation in the plan.
That's why source=`tb1, tb2, tb3` as tbl will be treated as the relation name "tb1, tb2, tb3" and should fail with table not found.

Collaborator

@penghuo penghuo Oct 21, 2024

The current valid syntax is source = tb1, tb2, tb3 as tbl or source = `tb1`, `tb2`, `tb3` as tbl

Is it a valid grammar in spark-sql? If not, does it confuse users?

OpenSearch index, source=`tb1, tb2, tb3` as tbl cannot work either since "tb1, tb2, tb3" is not a valid index name.

PPL on OpenSearch supports it; it means multiple OpenSearch indexes.

Should we let the Catalog handle table name resolution? For the OpenSearch catalog, it can resolve a table name into multiple indexes properly, for instance catalog.namespace.index-2024*,index-2023-12.

Member Author

@LantaoJin LantaoJin Oct 23, 2024

The current valid syntax is source = tb1, tb2, tb3 as tbl or source = `tb1`, `tb2`, `tb3` as tbl

Is it a valid grammar in spark-sql? If not, does it confuse users?

Yes. It is a valid grammar in opensearch-spark. For example,

search source=test1, test2 or search source=`test1`, `test2`

generates a Spark plan with a Union:

'Union
:- 'Project [*]
:  +- 'UnresolvedRelation [spark_catalog, default, flint_ppl_test1], [], false
+- 'Project [*]
   +- 'UnresolvedRelation [spark_catalog, default, flint_ppl_test2], [], false

OpenSearch index, source=`tb1, tb2, tb3` as tbl cannot work either since "tb1, tb2, tb3" is not a valid index name.

PPL on OpenSearch supports it; it means multiple OpenSearch indexes.

Oh, that is the key difference: opensearch-spark can't handle it, since "tb1, tb2, tb3" in backticks will be handled as a whole, and a name containing a comma is invalid in Spark.

Member Author

@LantaoJin LantaoJin Oct 23, 2024

PPL on OpenSearch supports:

  1. source=accounts, account2
  2. source=`accounts`,`account2`
  3. source=`accounts, account2`

But PPL on Spark supports only the first two. I would suggest marking the third as invalid, since users usually treat the content in backticks as a whole. `accounts, account2` seems specific to the OpenSearch domain. For the instance you provided above, my suggestion is to treat content in backticks as a whole. @penghuo

  • √ source=`catalog`.`namespace`.`index-2024*`, `catalog`.`namespace`.`index-2023-12`
  • √ source=`catalog`.`namespace`.index-2024*, index-2023-12
  • × source=`catalog`.`namespace`.`index-2024*, index-2023-12`

Member Author

@LantaoJin LantaoJin Oct 23, 2024

Any different thoughts? I think it's worth opening a meta issue in the sql repo for further discussion if we can't get aligned here; this context in a closed PR could easily be lost.

@LantaoJin
Member Author

LantaoJin commented Oct 16, 2024

@penghuo @YANG-DB In commit 1a4451a, I added an integ-test and documentation for the case where a subquery appears in a search filter. It explains how a subquery can rewrite an SPL subsearch, as we discussed offline.
SPL:

sourcetype=access_* status=200 action=purchase [search sourcetype=access_* status=200 action=purchase | top limit=1 clientip | table clientip] | stats count, dc(productId), values(productId) by clientip

PPL:

sourcetype=access_* status=200 action=purchase clientip=[search sourcetype=access_* status=200 action=purchase | top limit=1 clientip | fields clientip] | stats count, dc(productId), values(productId) by clientip

Member

@YANG-DB YANG-DB left a comment

@LantaoJin excellent documentation - thanks !!

@YANG-DB YANG-DB merged commit 9d3909b into opensearch-project:main Oct 17, 2024
4 checks passed
Labels
0.6 Lang:PPL Pipe Processing Language support
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[FEATURE] Support RelationSubquery PPL
3 participants