Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: Support SQL filter clause for aggregate expressions, add SQL dialect support #5868

Merged
merged 15 commits into from
Apr 11, 2023

Conversation

yjshen
Copy link
Member

@yjshen yjshen commented Apr 4, 2023

Which issue does this PR close?

Closes #5873.
Closes #5608
Closes #2214

Rationale for this change

This pull request introduces support for the FILTER (WHERE) clause in aggregate expressions. This feature enables users to filter the rows that are considered for aggregation, similar to how it is done in popular SQL databases such as PostgreSQL, SQLite, Spark, and Hive.

SELECT
  department,
  SUM(revenue) AS total_revenue,
  SUM(revenue) FILTER (WHERE product_type = 'Electronics') AS electronics_revenue,
  SUM(revenue) FILTER (WHERE product_type = 'Furniture') AS furniture_revenue
FROM
  sales_data
GROUP BY
  department;

What changes are included in this PR?

  1. The physical_plan/aggregate module is where the majority of the work for this project was completed.
  2. Additionally, there were changes made to the mechanism that routes the optional filter through the optimizer and execution code path.

Are these changes tested?

New tests were added in group_by.rs to cover various scenarios using the FILTER (WHERE) clause with different situations.

Are there any user-facing changes?

Yes.

  1. Users can now use the FILTER (WHERE) clause in aggregate expressions in their SQL queries, providing more flexibility and control over the aggregation. This change is backward compatible, and existing queries without the FILTER (WHERE) clause should continue to work as expected.
  2. Users can now choose the SQL dialect to be used by the SQL parser.

@github-actions github-actions bot added core Core DataFusion crate logical-expr Logical plan and expressions physical-expr Physical Expressions sql SQL Planner labels Apr 4, 2023
@github-actions github-actions bot added the optimizer Optimizer rules label Apr 4, 2023
@github-actions github-actions bot added the sqllogictest SQL Logic Tests (.slt) label Apr 4, 2023
@alamb
Copy link
Contributor

alamb commented Apr 5, 2023

I plan to review this PR tomorrow. Thank you @yjshen

@alamb alamb changed the title feat: Support SQL filter clause for aggregate expressions feat: Support SQL filter clause for aggregate expressions, add SQL dialect support Apr 6, 2023
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @yjshen -- I think this is looking quite nice 👌

I think prior to merge this PR needs:

  1. Some more test cases ( I left comments about which ones)
  2. We should reconsider changing the Aggregator trait ( I left comments about why)

Again, thank you very much -- this is very neat.

cc @poonai

datafusion/common/src/config.rs Show resolved Hide resolved
datafusion/core/src/execution/context.rs Outdated Show resolved Hide resolved
datafusion/core/src/execution/context.rs Outdated Show resolved Hide resolved
datafusion/core/src/execution/context.rs Show resolved Hide resolved
datafusion/core/tests/sql/group_by.rs Outdated Show resolved Hide resolved
datafusion/core/tests/sql/group_by.rs Outdated Show resolved Hide resolved
datafusion/core/tests/sql/group_by.rs Outdated Show resolved Hide resolved
datafusion/physical-expr/src/aggregate/mod.rs Outdated Show resolved Hide resolved
@alamb
Copy link
Contributor

alamb commented Apr 6, 2023

cc @andygrove @Dandandan and @jdye64

@jdye64
Copy link
Contributor

jdye64 commented Apr 6, 2023

@alamb thank you for the heads up. Created an issue to make sure we accommodate for these changes. 316

@github-actions github-actions bot removed the physical-expr Physical Expressions label Apr 7, 2023
@yjshen
Copy link
Member Author

yjshen commented Apr 7, 2023

Thank you @alamb for the detailed review! I have made the following updates to the PR based on your feedback:

  1. The optional filter has been moved from aggregate expressions into AggregateExec.
  2. All related tests have been moved into .slt. I added additional tests according to your review comments and tested against PostgreSQL results.
  3. Miscellaneous dialect configuration changes were made per your comments. By the way, it would be great if create_dialect_from_str could be moved upstream.

This PR is now ready for further review. Thank you again, @alamb!

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great to me @yjshen -- thank you. I will upstream the parsing for dialect names now.

The only other thing I really want to do prior to merging this PR is to verify it doesn't change performance. I don't expect that it will but I want to double check to be sure

// 1.2
let batch = match filter {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

@alamb
Copy link
Contributor

alamb commented Apr 10, 2023

cc @tustvold @mustafasrepo @crepererum and @Dandandan who I think are all interested in grouping performance

@yjshen
Copy link
Member Author

yjshen commented Apr 11, 2023

I benched aggregate_query_sql.rs and confirmed that this has no noticeable effect on the current benchmarks.

@alamb
Copy link
Contributor

alamb commented Apr 11, 2023

I ran this branch against main using https://github.com/alamb/datafusion-benchmarking and I see no performance difference 👍

┏━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━━━━┓
┃ Query        ┃ /home/alamb… ┃ /home/alamb… ┃    Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━╇━━━━━━━━━━━┩
│ QQuery 1     │    1700.14ms │    1674.54ms │ no change │
│ QQuery 2     │     495.96ms │     476.32ms │ no change │
│ QQuery 3     │     566.94ms │     574.48ms │ no change │
│ QQuery 4     │     238.05ms │     233.97ms │ no change │
│ QQuery 5     │     776.22ms │     747.68ms │ no change │
│ QQuery 6     │     429.90ms │     434.80ms │ no change │
│ QQuery 7     │    1366.64ms │    1346.81ms │ no change │
│ QQuery 8     │     756.20ms │     758.10ms │ no change │
│ QQuery 9     │    1244.81ms │    1243.00ms │ no change │
│ QQuery 10    │     879.18ms │     851.17ms │ no change │
│ QQuery 11    │     412.81ms │     421.18ms │ no change │
│ QQuery 12    │     346.56ms │     340.06ms │ no change │
│ QQuery 13    │    1390.35ms │    1373.72ms │ no change │
│ QQuery 14    │     455.99ms │     447.49ms │ no change │
│ QQuery 15    │     462.05ms │     446.96ms │ no change │
│ QQuery 16    │     356.55ms │     350.45ms │ no change │
│ QQuery 17    │    6630.39ms │    6682.32ms │ no change │
│ QQuery 18    │    4040.30ms │    4029.81ms │ no change │
│ QQuery 19    │     753.73ms │     780.20ms │ no change │
│ QQuery 20    │    1948.00ms │    1903.14ms │ no change │
│ QQuery 21    │    1865.89ms │    1935.56ms │ no change │
│ QQuery 22    │     211.15ms │     218.02ms │ no change │
└──────────────┴──────────────┴──────────────┴───────────┘

@yjshen yjshen merged commit dafe997 into apache:main Apr 11, 2023
korowa pushed a commit to korowa/arrow-datafusion that referenced this pull request Apr 13, 2023
@alamb
Copy link
Contributor

alamb commented Jun 9, 2023

#6616 -- PR to use upstreamed version of parse_sql_dialect

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
core Core DataFusion crate logical-expr Logical plan and expressions optimizer Optimizer rules sql SQL Planner sqllogictest SQL Logic Tests (.slt)
Projects
None yet
3 participants