Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: added parameter to rename aliases of aggregated columns used in GROUP BY statments #29455

Open
wants to merge 7 commits into
base: master
Choose a base branch
from

Conversation

fhyy
Copy link

@fhyy fhyy commented Jul 2, 2024

SUMMARY

Follow up PR after #28444.
The only notable change after that PR is that I shortened the parameter names to properly reflect the usage and to fit within the 30 character limit. This name change also inverts the meaning of the parameter and thus the usage has also been inverted everywhere.

Fixes issue #28443
Renamed db_engine_specs parameter allows_alias_to_source_column to order_by_require_unique_alias and added the new parameter group_by_require_unique_alias. This new name also inverts the meaning of the parameter, but properly reflects its function.

The previous parameter is used to tell the SQLA generator to rename aliases used in ORDER BY statements with aggregations, to ensure that the source column is referenced. Some engines (e.g. Drill) needs to be able to do the same thing for aliases in GROUP BY statements.

The new parameter is used to tell the SQLA generator to rename any alias of a source column that is used in an aggregation in a GROUP BY statement, to ensure that the source column is referenced.

For example this query

SELECT length(n_name) AS n_name
...
GROUP BY length(n_name)

becomes

SELECT length(n_name) AS n_name__
...
GROUP BY length(n_name)

This ensures that the source column n_name is used in the length(n_name) statement.

BEFORE/AFTER SCREENSHOTS OR ANIMATED GIF

Screenshot from 2024-07-02 16-20-05
Screenshot from 2024-07-02 15-48-59

The documentation of the new group_by_require_unique_alias parameter has been added, and errors in the documentation of order_by_require_unique_alias/(previously)allows_alias_to_source_column has been fixed.

TESTING INSTRUCTIONS

  1. Start Superset
  2. Navigate to the Superset web application and login
  3. Connect to a Drill database
  4. Create a dataset of the Drill database
  5. Create a chart from that dataset
  6. Select visualization type Table with query mode Aggregate
  7. Add two columns in the dimensions
  8. Aggregate the data of one of the columns (e.g. length(column_a))
  9. Press "view query"
  10. The query should now contain a GROUP BY statement of an aggregation, and the alias of that aggregation should have "__" at the end of the name.
    Example:
SELECT length(n_name) AS n_name__
...
GROUP BY length(n_name)

ADDITIONAL INFORMATION

Fredrik Hyyrynen and others added 7 commits June 30, 2024 16:23
The value is true if the engine is able to pick the source column
for aggregation clauses used in ORDER BY when a column in SELECT has
an alias that is the same as a source column.
* Renamed attribute allows_alias_to_source_column to
  order_by_allows_alias_to_source_column

* Added attribute group_by_allows_alias_to_source_column

* Rename aliases for source columns if used in GROUP BY and if
  group_by_allows_alias_to_source_column is false
The new name is shorter but also more true to the actual usage.
Updated documentation and implementation.
@dosubot dosubot bot added the sqllab Namespace | Anything related to the SQL Lab label Jul 2, 2024
@fhyy fhyy changed the title [WIP] fix: added parameter to rename aliases of aggregated columns used in GROUP BY statments fix: added parameter to rename aliases of aggregated columns used in GROUP BY statments Jul 3, 2024
@rusackas
Copy link
Member

rusackas commented Jul 3, 2024

Thanks for this! Running CI 🤞

Copy link

codecov bot commented Jul 3, 2024

Codecov Report

Attention: Patch coverage is 64.70588% with 12 lines in your changes missing coverage. Please review.

Project coverage is 83.85%. Comparing base (76d897e) to head (aa5a308).
Report is 408 commits behind head on master.

Files Patch % Lines
superset/models/helpers.py 15.38% 11 Missing ⚠️
superset/connectors/sqla/models.py 91.66% 1 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff             @@
##           master   #29455       +/-   ##
===========================================
+ Coverage   60.48%   83.85%   +23.36%     
===========================================
  Files        1931      519     -1412     
  Lines       76236    37440    -38796     
  Branches     8568        0     -8568     
===========================================
- Hits        46114    31395    -14719     
+ Misses      28017     6045    -21972     
+ Partials     2105        0     -2105     
Flag Coverage Δ
hive 49.13% <32.35%> (-0.03%) ⬇️
javascript ?
mysql 77.15% <44.11%> (?)
postgres 77.23% <44.11%> (?)
presto 53.78% <61.76%> (-0.02%) ⬇️
python 83.85% <64.70%> (+20.36%) ⬆️
sqlite 76.71% <44.11%> (?)
unit 59.74% <44.11%> (+2.12%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@joeyJsonar
Copy link

joeyJsonar commented Oct 24, 2024

Just the comment that this works after merging to our branch (we have Apache Drill setup to call our proprietary db that can interface in Mongo syntax).

I know there are pending 340 PR, want to add some plus points on merging this.

Edit:

  • Heatmap is not working
SELECT
  NEARESTDATE(`Timestamp`, 'DAY') AS `Timestamp__`,
  `DB User Name` AS `DB User Name`,
  count(DISTINCT `Timestamp`) AS `COUNT_DISTINCT(Timestamp)`
FROM `mongo`.`sonar_log`.`audit_log`
GROUP BY
  NEARESTDATE(`Timestamp`, 'DAY'),
  `DB User Name`
ORDER BY
  `Timestamp` ASC,
  `DB User Name` ASC
LIMIT 10000;

@fhyy
Copy link
Author

fhyy commented Oct 28, 2024

Edit:

* [ ]  Heatmap is not working
SELECT
  NEARESTDATE(`Timestamp`, 'DAY') AS `Timestamp__`,
  `DB User Name` AS `DB User Name`,
  count(DISTINCT `Timestamp`) AS `COUNT_DISTINCT(Timestamp)`
FROM `mongo`.`sonar_log`.`audit_log`
GROUP BY
  NEARESTDATE(`Timestamp`, 'DAY'),
  `DB User Name`
ORDER BY
  `Timestamp` ASC,
  `DB User Name` ASC
LIMIT 10000;

Hi @joeyJsonar, can you clarify what you mean by "Heatmap is not working" in this example?

The query does not seem to use any aliases that would break it.
GROUP BY NEARESTDATE('Timestamp', 'DAY') instead of GROUP BY 'Timestamp__' is a bit silly but should not be a problem, and it is not an issue this PR intended to fix (nor cause).

Thanks!

Copy link
Member

@villebro villebro left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We already have a mechanism for this in the BaseEngineSpec._mutate_label method Apologies for the name of it, but it does precisely this. For instance, ClickHouse has the same issue, and it's solved by suffixing a snippet of the MD5 of the original label name.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
size/L sqllab Namespace | Anything related to the SQL Lab
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Drill uses aliases in group by statements generated by SQLA resulting in no data when aggregating columns
4 participants