Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: added parameter to optionally rename aliases of aggregated columns used in GROUP BY statments #28444

Closed
wants to merge 4 commits into from

Conversation

fhyy
Copy link

@fhyy fhyy commented May 12, 2024

SUMMARY

Fixed issue #28443
Renamed db_engine_specs parameter allows_alias_to_source_column to order_by_allows_alias_to_source_column and added the new parameter group_by_allows_alias_to_source_column.

The previous parameter is used to tell the SQLA generator to rename aliases used in ORDER BY statements with aggregations, to ensure that the source column is referenced. Some engines (e.g. Drill) needs to be able to do the same thing for aliases in GROUP BY statements.

The new parameter is used to tell the SQLA generator to rename any alias of a source column that is used in an aggregation in a GROUP BY statement, to ensure that the source column is referenced.

Added documentation of the new group_by_allows_alias_to_source_column parameter, and fixed errors in the documentation of order_by_allows_alias_to_source_column/(previously)allows_alias_to_source_column

For example this query

SELECT length(n_name) AS n_name
...
GROUP BY length(n_name)

becomes

SELECT length(n_name) AS n_name__
...
GROUP BY length(n_name)

BEFORE/AFTER SCREENSHOTS OR ANIMATED GIF

Screenshot from 2024-05-12 19-37-24
Screenshot from 2024-05-12 19-39-17

TESTING INSTRUCTIONS

  1. Start Superset
  2. Navigate to the Superset web application and login
  3. Connect to a Drill database
  4. Create a dataset of the Drill database
  5. Create a chart from that dataset
  6. Select visualization type Table with query mode Aggregate
  7. Add two columns in the dimensions
  8. Aggregate the data of one of the columns (e.g. length(column_a))
  9. Press "view query"
  10. The query should now contain a GROUP BY statement of an aggregation, and the alias of that aggregation should have "__" at the end of the name.
    Example:
SELECT length(n_name) AS n_name__
...
GROUP BY length(n_name)

ADDITIONAL INFORMATION

@fhyy fhyy changed the title fix: added parameer to optionally rename aliases of aggregated columns used in GROUP BY statments (#28443) fix: added parameter to optionally rename aliases of aggregated columns used in GROUP BY statments (#28443) May 12, 2024
@fhyy fhyy changed the title fix: added parameter to optionally rename aliases of aggregated columns used in GROUP BY statments (#28443) fix: added parameter to optionally rename aliases of aggregated columns used in GROUP BY statments May 12, 2024
Copy link
Contributor

@github-actions github-actions bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Congrats on making your first PR and thank you for contributing to Superset! 🎉 ❤️

We hope to see you in our Slack community too! Not signed up? Use our Slack App to self-register.

@villebro
Copy link
Member

@fhyy thanks for the PR! I believe some other engines have tackled this with the _mutate_label function. For instance, you can check how the ClickHouse spec does this (the method being private is slightly confusing, and this should probably be fixed). Can you check if adding similar logic to the Drill spec would solve this issue?

Copy link

codecov bot commented May 12, 2024

Codecov Report

Attention: Patch coverage is 62.22222% with 17 lines in your changes missing coverage. Please review.

Project coverage is 70.15%. Comparing base (4720b4f) to head (851f5e5).
Report is 567 commits behind head on master.

Files Patch % Lines
superset/models/helpers.py 6.25% 15 Missing ⚠️
superset/connectors/sqla/models.py 83.33% 2 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master   #28444      +/-   ##
==========================================
+ Coverage   69.76%   70.15%   +0.39%     
==========================================
  Files        1911     1921      +10     
  Lines       74994    76157    +1163     
  Branches     8353     8353              
==========================================
+ Hits        52316    53427    +1111     
- Misses      20629    20681      +52     
  Partials     2049     2049              
Flag Coverage Δ
hive 49.01% <42.22%> (?)
mysql ?
postgres 77.14% <48.88%> (-0.88%) ⬇️
presto 53.59% <60.00%> (?)
python 83.30% <62.22%> (+0.39%) ⬆️
sqlite 76.59% <48.88%> (-0.87%) ⬇️
unit 58.74% <46.66%> (+1.96%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@john-bodley
Copy link
Member

Thanks @fhyy for the PR. My thinking was the underlying SQLAlchemy dialect should handle the various nuances in terms of how/where aliases are defined.

@rusackas
Copy link
Member

CC @cgivre

@fhyy
Copy link
Author

fhyy commented May 15, 2024

Thanks for the quick feedback!

@villebro I tried the _mutate_label function and it also solves the issue, but I'm not sure if I agree that this is the solution to the actual problem. If I understand that function correctly, it is used to modify labels so that they are compatible with the different engines. An engine could require lowercase letters, or starting with an underscore, etc.

Drill supports the aliases as they are, it just behaves unexpectedly with GROUP BY statements as Drill added support for the use of aliases here.

I agree with @john-bodley that the dialects should handle it instead. A boolean parameter isn't as flexible as it may need to be. In this case I would also propose updating the previous allows_alias_to_source_column parameter as well.

Thoughts on this?

@cgivre
Copy link
Contributor

cgivre commented May 15, 2024

I know I'm a little late to this conversation, and I'm happy to help out where I can, but I'm very unclear as to what the actual issue is. Is the fact that Drill supports aliases in GROUP BY breaking things?

@fhyy
Copy link
Author

fhyy commented May 15, 2024

Yes. The fact that Drill supports aliases in GROUP BY makes this example query ambiguous:

SELECT length(n_name) AS n_name
FROM
  (select * from cp.`tpch/nation.parquet`)
GROUP BY length(n_name)
LIMIT 10;

Drill will use the alias n_name in GROUP BY length(n_name) which results in the actual statement GROUP BY length(length(n_name)).

Renaming such aliases solves the problem, but not all aliases needs to be renamed.

You can find more information in issue #28443

@mbrannstrom
Copy link

@cgivre : See my comment on issue 20349 for what the problem is with GROUP BY in Apache Drill.

Fredrik Hyyrynen added 4 commits June 30, 2024 16:23
The value is true if the engine is able to pick the source column
for aggregation clauses used in ORDER BY when a column in SELECT has
an alias that is the same as a source column.
* Renamed attribute allows_alias_to_source_column to
  order_by_allows_alias_to_source_column

* Added attribute group_by_allows_alias_to_source_column

* Rename aliases for source columns if used in GROUP BY and if
  group_by_allows_alias_to_source_column is false
@fhyy
Copy link
Author

fhyy commented Jun 30, 2024

Closing this for now. I'm working on fixing the issues with the tests and also updating my fork to the latest version. I managed to break this PR in the process...

@fhyy
Copy link
Author

fhyy commented Jul 3, 2024

I created a new pull request with a slightly different implementation and based of the newer version:
#29455

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Drill uses aliases in group by statements generated by SQLA resulting in no data when aggregating columns
6 participants