Support RelationSubquery PPL #775

Merged
merged 6 commits into opensearch-project:main on Oct 17, 2024

Conversation

LantaoJin
Member

@LantaoJin LantaoJin commented Oct 12, 2024

Description

(Relation) Subquery usage

InSubquery, ExistsSubquery and ScalarSubquery are all subquery expressions. RelationSubquery, however, is not a subquery expression; it is a subquery plan, commonly used in a Join or Search/From clause.

  • source = table1 | join left = l right = r [ source = table2 | where d > 10 | head 5 ] (subquery in join right side)
  • source = [ source = table1 | join left = l right = r [ source = table2 | where d > 10 | head 5 ] | stats count(a) by b ] as outer | head 1

SQL Migration examples with Subquery PPL:

tpch q13

select
    c_count,
    count(*) as custdist
from
    (
        select
            c_custkey,
            count(o_orderkey) as c_count
        from
            customer left outer join orders on
                c_custkey = o_custkey
                and o_comment not like '%special%requests%'
        group by
            c_custkey
    ) as c_orders
group by
    c_count
order by
    custdist desc,
    c_count desc

Rewritten by PPL (Relation) Subquery:

SEARCH source = [
  SEARCH source = customer
  | LEFT OUTER JOIN left = c right = o ON c_custkey = o_custkey
    [
      SEARCH source = orders
      | WHERE not like(o_comment, '%special%requests%')
    ]
  | STATS COUNT(o_orderkey) AS c_count BY c_custkey
] AS c_orders
| STATS COUNT(1) AS custdist BY c_count
| SORT - custdist, - c_count

Issues Resolved

Resolve #713 as a sub-task of #661

Check List

  • Updated documentation (ppl-spark-integration/README.md)
  • Implemented unit tests
  • Implemented tests for combination with other commands
  • Commits are signed per the DCO using --signoff

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

Signed-off-by: Lantao Jin <ltjin@amazon.com>
Signed-off-by: Lantao Jin <ltjin@amazon.com>
@LantaoJin LantaoJin added Lang:PPL Pipe Processing Language support 0.6 labels Oct 12, 2024
@LantaoJin LantaoJin marked this pull request as ready for review October 12, 2024 12:36
@@ -55,9 +55,9 @@ commands
;

searchCommand
Member

@LantaoJin can you please share more details of the usage of FROM here?
Maybe give some examples of when to use SEARCH vs FROM?
Thanks

Member Author

There is no specific difference between the keywords SEARCH and FROM from a syntax perspective. I just thought we might need an alias for the SEARCH keyword, since PPL should be more ambitious in its query semantics. PPL will offer more powerful functionality for data exploration and analytics beyond searching text or operating on time-series data. Similar things are happening in the industry, such as Google BigQuery's pipe syntax and the KQL tabular operators.
Even in Splunk SPL2, the from command is separate from the search command.
A query can be naturally rewritten in a piped query language by users who are familiar with SQL:
ANSI SQL:

SELECT sum(bytes) AS sum, host
FROM main WHERE earliest=-5m@m
GROUP BY host

SPL2 from query:

| FROM main
WHERE earliest=-5m@m GROUP BY host
SELECT sum(bytes) AS sum, host
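
For comparison, a rough PPL sketch of the same aggregation (only a sketch: it assumes FROM works as a drop-in alias for SEARCH, which is what this grammar change would allow, and it omits the earliest=-5m@m time filter):

SEARCH source = main | STATS sum(bytes) AS sum BY host

FROM source = main | STATS sum(bytes) AS sum BY host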

But for now I think I will revert this keyword alias, since it should be fully discussed before delivering it to our customers. I hope we can expand the vision of PPL to cover more data analytics semantics.

Member

@LantaoJin this sounds like a great idea and a very clean and clear vision going forward.
IMO please add this as a separate PR (under the doc/planning folder) including the explanations, concepts and comparative analysis you've shown here...
thanks again!!

@@ -32,7 +32,7 @@ The example show fetch all the document from accounts index with .

PPL query:

os> source=accounts account_number=1 or gender="F";
os> SEARCH source=accounts account_number=1 or gender="F";
Member Author

There are two example queries here. One of them omits the SEARCH keyword, so this one keeps the SEARCH keyword.

tableSourceClause
: tableSource (COMMA tableSource)*
: tableSource (COMMA tableSource)* (AS alias = qualifiedName)?
Member Author

@LantaoJin LantaoJin Oct 15, 2024

A table alias is useful in a query which contains a subquery, for example:

select a, (  
             select sum(b) 
             from catalog.schema.table1 as t1 
             where t1.a = t2.a
          )  sum_b
 from catalog.schema.table2 as t2

t1 and t2 are table aliases used in the correlated subquery; sum_b is the subquery alias.
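
For reference, a rough PPL sketch of the same pattern, combining the new table alias support with a correlated scalar subquery (only a sketch: the column names and the eval-based ScalarSubquery form are illustrative, following the ppl-subquery-command doc):

source = catalog.schema.table2 as t2
| eval sum_b = [
    source = catalog.schema.table1 as t1 | where t1.a = t2.a | stats sum(b)
  ]
| fields a, sum_b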

Member

thanks for the detailed review - can you also add this explanation to the ppl-subquery-command doc ?
thanks

Member Author

Yes. I will give more examples in the doc.


// One tableSourceClause will generate one Relation node, with or without an alias,
// even if the relation contains more than one table source.
// These table sources in one relation will be read one by one in OpenSearch.
Collaborator

Do we support this: source = tb1, tb2, tb3 as tbl?
Should we treat tb1, tb2, tb3 as a single table and let the datasource connector handle it? For instance,

source=`tb1, tb2, tb3` as tbl

Member Author

Yes. The current valid syntax is source = tb1, tb2, tb3 as tbl or source = `tb1`, `tb2`, `tb3` as tbl.
Indexes tb1, tb2 and tb3 will be converted into one relation in OpenSearch, and tbl is the alias of this relation.
Spark doesn't support comma-separated tables as a relation: Catalyst will throw Table not found for an UnresolvedRelation with a comma-named table identifier.
For Spark, source=`tb1, tb2, tb3` as tbl is equivalent to source = tb1, tb2, tb3 as tbl and fails in resolution for the reason I just mentioned.
For an OpenSearch index, source=`tb1, tb2, tb3` as tbl cannot work either, since "tb1, tb2, tb3" is not a valid index name.

Member Author

@LantaoJin LantaoJin Oct 16, 2024

source = tb1, tb2, tb3 as tbl and source = `tb1`, `tb2`, `tb3` as tbl are valid because we never need the pattern
source = tb1 as t1, tb2 as t2, tb3 as t3: it is not meaningful, since tb1, tb2 and tb3 are combined into one relation in the plan.
That's why source=`tb1, tb2, tb3` as tbl will be treated as the relation name "tb1, tb2, tb3" and should fail with table not found.

Collaborator

@penghuo penghuo Oct 21, 2024

The current valid syntax is source = tb1, tb2, tb3 as tbl or source = `tb1`, `tb2`, `tb3` as tbl

Is it a valid grammar in spark-sql? If not, does it confuse users?

OpenSearch index, source=`tb1, tb2, tb3` as tbl cannot work either since "tb1, tb2, tb3" is not a valid index name.

PPL on OpenSearch supports it; it means multiple OpenSearch indexes.

Should we let the Catalog handle table name resolution? For the OpenSearch catalog, it can resolve a table name into multiple indexes properly, for instance catalog.namespace.index-2024*,index-2023-12.

Member Author

@LantaoJin LantaoJin Oct 23, 2024

The current valid syntax is source = tb1, tb2, tb3 as tbl or source = `tb1`, `tb2`, `tb3` as tbl

Is it a valid grammar in spark-sql? If not, does it confuse users?

Yes. It is a valid grammar in opensearch-spark. For example,

search source=test1, test2 or search source=`test1`, `test2`

generates a Spark plan with a Union:

'Union
:- 'Project [*]
:  +- 'UnresolvedRelation [spark_catalog, default, flint_ppl_test1], [], false
+- 'Project [*]
   +- 'UnresolvedRelation [spark_catalog, default, flint_ppl_test2], [], false

OpenSearch index, source=`tb1, tb2, tb3` as tbl cannot work either since "tb1, tb2, tb3" is not a valid index name.

PPL on OpenSearch supports it; it means multiple OpenSearch indexes.

Oh, that is the key difference: opensearch-spark can't handle it, since "tb1, tb2, tb3" in backticks will be handled as a whole, and a name containing a comma is invalid in Spark.

Member Author

@LantaoJin LantaoJin Oct 23, 2024

PPL on OpenSearch supports:

  1. source=accounts, account2
  2. source=`accounts`,`account2`
  3. source=`accounts, account2`

But PPL on Spark supports only the first two. I would suggest marking the third as invalid, since users usually treat the content in backticks as a whole. `accounts, account2` seems specific to the OpenSearch domain. For the instance you provided above, my suggestion is to treat content in backticks as a whole. @penghuo

  • √ source=`catalog`.`namespace`.`index-2024*`, `catalog`.`namespace`.`index-2023-12`
  • √ source=`catalog`.`namespace`.index-2024*, index-2023-12
  • × source=`catalog`.`namespace`.`index-2024*, index-2023-12`

Member Author

@LantaoJin LantaoJin Oct 23, 2024

Any different thoughts? I think it's worth opening a meta issue in the sql repo for further discussion if we can't get aligned here; this context in a closed PR could easily be lost.

@LantaoJin
Member Author

LantaoJin commented Oct 16, 2024

@penghuo @YANG-DB In commit 1a4451a, I added an integ-test and documentation for the case where a subquery appears in a search filter. It explains how a subquery can rewrite an SPL subsearch, as we discussed offline.
SPL:

sourcetype=access_* status=200 action=purchase [search sourcetype=access_* status=200 action=purchase | top limit=1 clientip | table clientip] | stats count, dc(productId), values(productId) by clientip

PPL:

sourcetype=access_* status=200 action=purchase clientip=[search sourcetype=access_* status=200 action=purchase | top limit=1 clientip | fields clientip] | stats count, dc(productId), values(productId) by clientip

Member

@YANG-DB YANG-DB left a comment

@LantaoJin excellent documentation - thanks !!

@YANG-DB YANG-DB merged commit 9d3909b into opensearch-project:main Oct 17, 2024
4 checks passed
Labels
0.6 Lang:PPL Pipe Processing Language support
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[FEATURE] Support RelationSubquery PPL
3 participants