Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve Robustness of Unparser Testing and Implementation #9623

Merged
merged 15 commits into from
Mar 18, 2024

Conversation

devinjdangelo
Copy link
Contributor

@devinjdangelo devinjdangelo commented Mar 15, 2024

Which issue does this PR close?

Follow on to #9606

Rationale for this change

Based on review comment from @seddonm1 I wanted to see if I could make examples like the following pass in unparser:

Sort: COUNT(*) ASC NULLS LAST
  Projection: person.id, COUNT(*), person.first_name
    Filter: COUNT(*) > Int64(5) AND COUNT(*) < Int64(10)
      Aggregate: groupBy=[[person.first_name, person.id]], aggr=[[COUNT(*)]]
        Filter: person.id != Int64(3) AND person.first_name = Utf8("test")
          TableScan: person
select id, count(*), first_name 
from person 
where id!=3 and first_name=='test'
group by first_name, id 
having count(*)>5 and count(*)<10
order by count(*)

While working on this, I found that the implementation in #9606 was not doing what I thought it was doing. I was misled due to the string representation of two different AST's being the same. Namely the literal column expression "COUNT(*)" vs the aggregate expression COUNT(*) were represented by the same strings. This led me to believe the implementation was correct, when in fact is was not and would fail on more complex examples (such as aggregate expressions nested in the expression tree or with quoted identifiers).

I have modified the round-trip testing strategy to avoid similar confusion in the future. Now, instead of looking at the round trip SQL string and comparing, we logically plan the round tripped AST and ensure it generates the same logical plan as the original sql string. E.g.

sql: &str -> s1: ast::Statement -> plan1: LogicalPlan -> s2: ast::Statement -> plan2: LogicalPlan

assert_eq!(plan1, plan2)

The new test condition ensures not that the sql string is literally the same post round trip, but they are logically equivalent. I believe this is the more meaningful condition to test.

I have also spent time trying to come up with more complex queries to break the implementation. The most complete/complex test in the suite is:

select p.id, count("First Name") as count_first_name,
"Last Name", sum(qp.id/p.id - (select sum(id) from person_quoted_cols) ) / (select count(*) from person) 
from (select id, "First Name", "Last Name" from person_quoted_cols) qp
inner join (select * from person) p
on p.id = qp.id
where p.id!=3 and "First Name"=='test' and qp.id in 
(select id from (select id, count(*) from person group by id having count(*) > 0))
group by "Last Name", p.id 
having count_first_name>5 and count_first_name<10
order by count_first_name, "Last Name"

This seed sql string results in the following plan:

Projection: p.id, COUNT(qp.First Name) AS count_first_name, qp.Last Name, SUM(qp.id / p.id - SUM(person_quoted_cols.id)) / (<subquery>)
    Subquery:
      Projection: COUNT(*)
        Aggregate: groupBy=[[]], aggr=[[COUNT(*)]]
          TableScan: person
    Filter: COUNT(qp.First Name) > Int64(5) AND COUNT(qp.First Name) < Int64(10)
      Aggregate: groupBy=[[qp.Last Name, p.id]], aggr=[[COUNT(qp.First Name), SUM(qp.id / p.id - (<subquery>))]]
        Subquery:
          Projection: SUM(person_quoted_cols.id)
            Aggregate: groupBy=[[]], aggr=[[SUM(person_quoted_cols.id)]]
              TableScan: person_quoted_cols
        Filter: p.id != Int64(3) AND qp.First Name = Utf8("test") AND qp.id IN (<subquery>)
          Subquery:
            Projection: person.id
              Projection: person.id, COUNT(*)
                Filter: COUNT(*) > Int64(0)
                  Aggregate: groupBy=[[person.id]], aggr=[[COUNT(*)]]
                    TableScan: person
          Inner Join:  Filter: p.id = qp.id
            SubqueryAlias: qp
              Projection: person_quoted_cols.id, person_quoted_cols.First Name, person_quoted_cols.Last Name
                TableScan: person_quoted_cols
            SubqueryAlias: p
              Projection: person.id, person.first_name, person.last_name, person.age, person.state, person.salary, person.birth_date, person.😀
                TableScan: person

The plan is then transformed back into an ast::Statement and then that statement is planned and compared to be identical to the above plan. This verifies we have preserved all logical query information from the original SQL string in our round trip. Differences in syntax/quoting which do not affect the logic of the query are ignored.

What changes are included in this PR?

  1. Changes testing strategy as described above
  2. Adds utility function for scanning for an Aggregate node that applies to a given Query under construction
  3. Adds utility function for "unprojecting" aggregate expressions, potentially nested deeply within an expression tree.
  4. Adds support for HAVING clause
  5. Adds support for subquery and insubquery expressions
  6. Always quote identifiers (so identifiers with whitespace/special chars work without special handling).

Are these changes tested?

Yes

Are there any user-facing changes?

More complex SQL works with unparser now

@github-actions github-actions bot added the sql SQL Planner label Mar 15, 2024
@devinjdangelo devinjdangelo marked this pull request as draft March 15, 2024 14:44
@devinjdangelo devinjdangelo changed the title Support Roundtrip Logical Planning with HAVING clause wip: Support Roundtrip Logical Planning with HAVING clause Mar 15, 2024
@devinjdangelo devinjdangelo changed the title wip: Support Roundtrip Logical Planning with HAVING clause Improve Robustness of Unparser Testing and Implementation Mar 17, 2024
@devinjdangelo devinjdangelo marked this pull request as ready for review March 17, 2024 13:45
Copy link
Contributor Author

@devinjdangelo devinjdangelo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cc @backkem, @alamb, @seddonm1 if you have time to review

@@ -151,6 +151,36 @@ impl Unparser<'_> {
order_by: vec![],
}))
}
Expr::ScalarSubquery(subq) => {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have found that all of the information required to preserve a subquery is encoded in the subquery expression. This means we can actually completely ignore subquery LogicalPlan nodes and not lose anything...

At least this is true in every example I have come up with so far 😄

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not quite sure what you mean by "ignore LogicalPlan" as the Expr has a LogicaPlan in it, doesn't it (sub.subquery)?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you look at Unparser::select_to_sql_recursively, we have not implemented a match condition for LogicalPlan::Subquery. This node is never actually encountered as I have not found a case where you can get to it via the "input" of another node. We do encounter the Expr::ScalarSubquery expression and we encode the information from the LogicalPlan found there instead.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Interesting, I wonder if LogicalPlan:Subquery is ever actually used 🤔 I took a quick look around and didn't see an obvious creation in the planner.

@@ -169,7 +199,7 @@ impl Unparser<'_> {
pub(super) fn new_ident(&self, str: String) -> ast::Ident {
ast::Ident {
value: str,
quote_style: self.dialect.identifier_quote_style(),
quote_style: Some(self.dialect.identifier_quote_style().unwrap_or('"')),
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Identifiers are now always quoted. This makes unparser work for columns with spaces or other strangeness without any other special handling required.

/// into an actual aggregate expression COUNT(*) as identified in the aggregate node.
pub(crate) fn unproject_agg_exprs(expr: &Expr, agg: &Aggregate) -> Result<Expr> {
expr.clone()
.transform(&|sub_expr| {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here we are borrowing the machinery of logical planning / optimization to unwind logical planning. 😆

for (query, expected) in tests {
let actual = roundtrip(query).unwrap();
assert_eq!(actual, expected);
assert_eq!(plan, plan_roundtrip);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As mentioned in the description, the key condition tested is that the logical plan derived from the original AST and recovered AST are the same logical plan.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it makes sense

Can you please add a comment to this test that explains the rationale (that you so eloquently explained in the PR' description)? I think it will help future readers understand the choice which may not be obvious

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

✔️

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks pretty great to me -- thank you @devinjdangelo

@@ -151,6 +151,36 @@ impl Unparser<'_> {
order_by: vec![],
}))
}
Expr::ScalarSubquery(subq) => {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not quite sure what you mean by "ignore LogicalPlan" as the Expr has a LogicaPlan in it, doesn't it (sub.subquery)?

),
];
"select id, sum(age), first_name from person group by first_name, id",
"select id, count(*), first_name
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤯

for (query, expected) in tests {
let actual = roundtrip(query).unwrap();
assert_eq!(actual, expected);
assert_eq!(plan, plan_roundtrip);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it makes sense

Can you please add a comment to this test that explains the rationale (that you so eloquently explained in the PR' description)? I think it will help future readers understand the choice which may not be obvious

@seddonm1
Copy link
Contributor

I think your change to the roundtrip test is good as it is testing only the single concern of Statement -> AST rather than the dual concerns of Statement -> AST + sqlparser formatting behavior. I wonder if it makes sense to change expr_to_sql_ok to behave similarly? Theoretically the sqlparser-rs crate has serialisation tests?

@alamb alamb merged commit 269563a into apache:main Mar 18, 2024
23 checks passed
@alamb
Copy link
Contributor

alamb commented Mar 18, 2024

Thanks again @devinjdangelo

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
sql SQL Planner
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants