fix: added parameter to optionally rename aliases of aggregated columns used in GROUP BY statments #28444

fhyy · 2024-05-12T18:01:00Z

SUMMARY

Fixed issue #28443
Renamed db_engine_specs parameter allows_alias_to_source_column to order_by_allows_alias_to_source_column and added the new parameter group_by_allows_alias_to_source_column.

The previous parameter is used to tell the SQLA generator to rename aliases used in ORDER BY statements with aggregations, to ensure that the source column is referenced. Some engines (e.g. Drill) need to be able to do the same thing for aliases in GROUP BY statements.

The new parameter is used to tell the SQLA generator to rename any alias of a source column that is used in an aggregation in a GROUP BY statement, to ensure that the source column is referenced.
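As a rough sketch of how an engine spec might declare the two parameters: the class names, the helper function, and the default values below are assumptions for illustration only; just the two attribute names come from this PR.

```python
class EngineSpecSketch:
    """Hypothetical stand-in for a base engine spec (not Superset's real class)."""

    # Assumed default: the engine resolves names in ORDER BY to the source column.
    order_by_allows_alias_to_source_column = True
    # Assumed default: the engine resolves names in GROUP BY to the source column.
    group_by_allows_alias_to_source_column = True


class DrillSpecSketch(EngineSpecSketch):
    """Hypothetical Drill spec: GROUP BY picks up the alias instead."""

    group_by_allows_alias_to_source_column = False


def should_rename_group_by_aliases(spec: type) -> bool:
    """Return True when aliases that shadow a source column must be renamed."""
    return not spec.group_by_allows_alias_to_source_column
```

With these assumed defaults, only the Drill-like spec opts in to renaming, and every other engine keeps its current behaviour.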

Added documentation of the new group_by_allows_alias_to_source_column parameter, and fixed errors in the documentation of order_by_allows_alias_to_source_column (previously allows_alias_to_source_column).

For example, this query

SELECT length(n_name) AS n_name
...
GROUP BY length(n_name)

becomes

SELECT length(n_name) AS n_name__
...
GROUP BY length(n_name)
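The renaming shown above can be sketched as follows. This is a simplified model working on plain strings, not the actual SQLA generator code in this PR; `SelectItem`, `rename_conflicting_aliases`, and the `suffix` argument are hypothetical names.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class SelectItem:
    expression: str  # e.g. "length(n_name)"
    alias: str       # e.g. "n_name"


def rename_conflicting_aliases(
    select_items,
    source_columns,
    group_by_expressions,
    group_by_allows_alias_to_source_column,
    suffix="__",
):
    """Append `suffix` to aliases that shadow a source column in GROUP BY."""
    if group_by_allows_alias_to_source_column:
        # The engine already resolves to the source column: nothing to do.
        return list(select_items)

    source = set(source_columns)
    grouped = set(group_by_expressions)
    renamed = []
    for item in select_items:
        conflicts = (
            item.alias in source               # alias shadows a source column
            and item.expression in grouped     # and the expression is grouped on
            and item.expression != item.alias  # and it is not the bare column
        )
        renamed.append(replace(item, alias=item.alias + suffix) if conflicts else item)
    return renamed
```

Plain columns that are grouped on keep their alias; only expressions whose alias collides with a source column are renamed.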

BEFORE/AFTER SCREENSHOTS OR ANIMATED GIF

TESTING INSTRUCTIONS

Start Superset
Navigate to the Superset web application and log in
Connect to a Drill database
Create a dataset of the Drill database
Create a chart from that dataset
Select visualization type Table with query mode Aggregate
Add two columns in the dimensions
Aggregate the data of one of the columns (e.g. length(column_a))
Press "view query"
The query should now contain a GROUP BY statement over the aggregation, and the alias of that aggregation should end with "__".
Example:

SELECT length(n_name) AS n_name__
...
GROUP BY length(n_name)

ADDITIONAL INFORMATION

Has associated issue: Fixes #28443 ("Drill uses aliases in group by statements generated by SQLA resulting in no data when aggregating columns")
Required feature flags:
Changes UI
Includes DB Migration (follow approval process in SIP-59)
- Migration is atomic, supports rollback & is backwards-compatible
- Confirm DB migration upgrade and downgrade tested
- Runtime estimates and downtime expectations provided
Introduces new feature or API
Removes existing feature or API

github-actions

Congrats on making your first PR and thank you for contributing to Superset! 🎉 ❤️

We hope to see you in our Slack community too! Not signed up? Use our Slack App to self-register.

villebro · 2024-05-12T18:41:45Z

@fhyy thanks for the PR! I believe some other engines have tackled this with the _mutate_label function. For instance, you can check how the ClickHouse spec does this (the method being private is slightly confusing, and this should probably be fixed). Can you check if adding similar logic to the Drill spec would solve this issue?
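For readers unfamiliar with the label-mutation approach mentioned above, here is a minimal standalone sketch. The class name and the hash-suffix scheme are assumptions for illustration; this is not code copied from the ClickHouse spec or from Superset's base class.

```python
import hashlib


class LabelMutatingSpecSketch:
    """Hypothetical engine spec that rewrites every column label."""

    @staticmethod
    def _mutate_label(label: str) -> str:
        # Append a short, deterministic suffix so the label can no longer
        # collide with a source column of the same name.
        digest = hashlib.sha256(label.encode("utf-8")).hexdigest()
        return f"{label}_{digest[:6]}"
```

Note that a hook like this changes every label, whereas the parameter proposed in this PR only renames aliases that conflict with a source column.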

codecov · 2024-05-12T18:42:46Z

Codecov Report

Attention: Patch coverage is 62.22222% with 17 lines in your changes missing coverage. Please review.

Project coverage is 70.15%. Comparing base (4720b4f) to head (851f5e5).
Report is 567 commits behind head on master.

Files	Patch %	Lines
superset/models/helpers.py	6.25%	15 Missing ⚠️
superset/connectors/sqla/models.py	83.33%	2 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##           master   #28444      +/-   ##
==========================================
+ Coverage   69.76%   70.15%   +0.39%     
==========================================
  Files        1911     1921      +10     
  Lines       74994    76157    +1163     
  Branches     8353     8353              
==========================================
+ Hits        52316    53427    +1111     
- Misses      20629    20681      +52     
  Partials     2049     2049

Flag	Coverage Δ
hive	`49.01% <42.22%> (?)`
mysql	`?`
postgres	`77.14% <48.88%> (-0.88%)`	⬇️
presto	`53.59% <60.00%> (?)`
python	`83.30% <62.22%> (+0.39%)`	⬆️
sqlite	`76.59% <48.88%> (-0.87%)`	⬇️
unit	`58.74% <46.66%> (+1.96%)`	⬆️

john-bodley · 2024-05-13T16:58:51Z

Thanks @fhyy for the PR. My thinking was the underlying SQLAlchemy dialect should handle the various nuances in terms of how/where aliases are defined.

rusackas · 2024-05-13T21:10:45Z

CC @cgivre

fhyy · 2024-05-15T12:30:46Z

Thanks for the quick feedback!

@villebro I tried the _mutate_label function and it also solves the issue, but I'm not sure if I agree that this is the solution to the actual problem. If I understand that function correctly, it is used to modify labels so that they are compatible with the different engines. An engine could require lowercase letters, or starting with an underscore, etc.

Drill supports the aliases as they are; it just behaves unexpectedly with GROUP BY statements because Drill added support for using aliases there.

I agree with @john-bodley that the dialects should handle it instead. A boolean parameter isn't as flexible as it may need to be. In this case I would also propose updating the previous allows_alias_to_source_column parameter.

Thoughts on this?

cgivre · 2024-05-15T13:18:50Z

I know I'm a little late to this conversation, and I'm happy to help out where I can, but I'm very unclear as to what the actual issue is. Is the fact that Drill supports aliases in GROUP BY breaking things?

fhyy · 2024-05-15T14:08:40Z

Yes. The fact that Drill supports aliases in GROUP BY makes this example query ambiguous:

SELECT length(n_name) AS n_name
FROM
  (select * from cp.`tpch/nation.parquet`)
GROUP BY length(n_name)
LIMIT 10;

Drill resolves n_name in GROUP BY length(n_name) to the alias rather than the source column, which results in the actual statement GROUP BY length(length(n_name)).

Renaming such aliases solves the problem, but not all aliases need to be renamed.
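The ambiguity can be illustrated with a toy model of alias resolution. This is only a sketch of the behaviour described in this comment, not how Drill's planner is implemented; `resolve_aliases` is a hypothetical helper.

```python
import re


def resolve_aliases(expression: str, aliases: dict[str, str]) -> str:
    """Replace each identifier that matches a SELECT alias by its expression.

    Substituted text is not rescanned, so each alias is expanded only once.
    """

    def substitute(match: re.Match) -> str:
        name = match.group(0)
        return aliases.get(name, name)

    return re.sub(r"[A-Za-z_][A-Za-z0-9_]*", substitute, expression)


# Alias shadows the source column: the GROUP BY expression gets wrapped twice.
shadowing = resolve_aliases("length(n_name)", {"n_name": "length(n_name)"})

# Renamed alias: the GROUP BY expression still refers to the source column.
renamed = resolve_aliases("length(n_name)", {"n_name__": "length(n_name)"})
```

Here `shadowing` is `length(length(n_name))`, while `renamed` stays `length(n_name)`.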

You can find more information in issue #28443

mbrannstrom · 2024-06-03T05:54:37Z

@cgivre: See my comment on issue #20349 for what the problem is with GROUP BY in Apache Drill.

The value is true if the engine is able to pick the source column for aggregation clauses used in ORDER BY when a column in SELECT has an alias that is the same as a source column.

* Renamed attribute allows_alias_to_source_column to order_by_allows_alias_to_source_column
* Added attribute group_by_allows_alias_to_source_column
* Rename aliases for source columns if used in GROUP BY and if group_by_allows_alias_to_source_column is false

fhyy · 2024-06-30T14:37:46Z

Closing this for now. I'm working on fixing the issues with the tests and also updating my fork to the latest version. I managed to break this PR in the process...

fhyy · 2024-07-03T08:03:43Z

I created a new pull request with a slightly different implementation, based on the newer version:
#29455

pull-request-size bot added the size/L label May 12, 2024

fhyy changed the title ~~fix: added parameer to optionally rename aliases of aggregated columns used in GROUP BY statments (#28443)~~ fix: added parameter to optionally rename aliases of aggregated columns used in GROUP BY statments (#28443) May 12, 2024

fhyy changed the title ~~fix: added parameter to optionally rename aliases of aggregated columns used in GROUP BY statments (#28443)~~ fix: added parameter to optionally rename aliases of aggregated columns used in GROUP BY statments May 12, 2024

github-actions bot reviewed May 12, 2024

mbrannstrom mentioned this pull request Jun 3, 2024

Time granularity generates invalid query for Dremio #20349

Open

3 tasks

rusackas requested review from betodealmeida, john-bodley and villebro June 3, 2024 18:12

fhyy force-pushed the master branch from 13bdcb6 to 851f5e5 Compare June 12, 2024 08:59

Fredrik Hyyrynen added 4 commits June 30, 2024 16:23

Corrected documentation of attribute allows_alias_to_source_column

91fb78b

The value is true if the engine is able to pick the source column for aggregation clauses used in ORDER BY when a column in SELECT has an alias that is the same as a source column.

Corrected documentation of attribute allows_alias_in_orderby

2281a73

Removed trailing whitespace

3dd2ec8

fhyy force-pushed the master branch from 851f5e5 to 3dd2ec8 Compare June 30, 2024 14:29

fhyy closed this Jun 30, 2024

fhyy mentioned this pull request Jul 2, 2024

fix: added parameter to rename aliases of aggregated columns used in GROUP BY statments #29455

Open

9 tasks
