fix: added parameter to rename aliases of aggregated columns used in GROUP BY statments #29455

fhyy · 2024-07-02T14:48:34Z

SUMMARY

Follow-up PR to #28444.
The only notable change after that PR is that I shortened the parameter names to properly reflect the usage and to fit within the 30 character limit. This name change also inverts the meaning of the parameter and thus the usage has also been inverted everywhere.

Fixes issue #28443
Renamed db_engine_specs parameter allows_alias_to_source_column to order_by_require_unique_alias and added the new parameter group_by_require_unique_alias. This new name also inverts the meaning of the parameter, but properly reflects its function.

The previous parameter is used to tell the SQLA generator to rename aliases used in ORDER BY statements with aggregations, to ensure that the source column is referenced. Some engines (e.g. Drill) need to be able to do the same thing for aliases in GROUP BY statements.

The new parameter is used to tell the SQLA generator to rename any alias of a source column that is used in an aggregation in a GROUP BY statement, to ensure that the source column is referenced.

For example this query

SELECT length(n_name) AS n_name
...
GROUP BY length(n_name)

becomes

SELECT length(n_name) AS n_name__
...
GROUP BY length(n_name)

This ensures that length(n_name) in the GROUP BY clause resolves to the source column n_name rather than to the alias of the same name.
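
A minimal sketch of the renaming rule described above, not the actual Superset implementation. The `EngineFlags` class and the `unique_alias` helper are hypothetical names made up for illustration; only the two attribute names and the "__" suffix come from this PR.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EngineFlags:
    """Hypothetical stand-in for the two engine-spec attributes named in this PR."""

    order_by_require_unique_alias: bool = False
    group_by_require_unique_alias: bool = False


def unique_alias(label: str, source_columns: set[str], required: bool) -> str:
    """Return a label that cannot be confused with a source column.

    If the engine requires unique aliases and the label collides with a
    source column name, append "__" (the suffix shown in the example query).
    Otherwise the label is returned unchanged.
    """
    if required and label in source_columns:
        return f"{label}__"
    return label


# Assumed setup: a Drill-like engine that needs unique aliases in GROUP BY.
drill_like = EngineFlags(group_by_require_unique_alias=True)
source_columns = {"n_name", "n_regionkey"}

alias = unique_alias(
    "n_name", source_columns, drill_like.group_by_require_unique_alias
)
print(f"SELECT length(n_name) AS {alias} ... GROUP BY length(n_name)")
```

The sketch does not handle the case where the suffixed name is itself an existing source column; a real implementation would need to decide how to deal with that.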

BEFORE/AFTER SCREENSHOTS OR ANIMATED GIF

The documentation of the new group_by_require_unique_alias parameter has been added, and errors in the documentation of order_by_require_unique_alias (previously allows_alias_to_source_column) have been fixed.

TESTING INSTRUCTIONS

Start Superset
Navigate to the Superset web application and login
Connect to a Drill database
Create a dataset of the Drill database
Create a chart from that dataset
Select visualization type Table with query mode Aggregate
Add two columns in the dimensions
Aggregate the data of one of the columns (e.g. length(column_a))
Press "view query"
The query should now contain a GROUP BY clause on the aggregated expression, and the alias of that expression should have "__" at the end of its name.
Example:

SELECT length(n_name) AS n_name__
...
GROUP BY length(n_name)

ADDITIONAL INFORMATION

Has associated issue: Fixes Drill uses aliases in group by statements generated by SQLA resulting in no data when aggregating columns #28443
Required feature flags:
Changes UI
Includes DB Migration (follow approval process in SIP-59)
- Migration is atomic, supports rollback & is backwards-compatible
- Confirm DB migration upgrade and downgrade tested
- Runtime estimates and downtime expectations provided
Introduces new feature or API
Removes existing feature or API

The value is true if the engine is able to pick the source column for aggregation clauses used in ORDER BY when a column in SELECT has an alias that is the same as a source column.

* Renamed attribute allows_alias_to_source_column to order_by_allows_alias_to_source_column
* Added attribute group_by_allows_alias_to_source_column
* Rename aliases for source columns if used in GROUP BY and if group_by_allows_alias_to_source_column is false

The new name is shorter but also more true to the actual usage. Updated documentation and implementation.

rusackas · 2024-07-03T16:35:19Z

Thanks for this! Running CI 🤞

codecov · 2024-07-03T16:40:19Z

Codecov Report

Attention: Patch coverage is 64.70588% with 12 lines in your changes missing coverage. Please review.

Project coverage is 83.85%. Comparing base (76d897e) to head (aa5a308).
Report is 408 commits behind head on master.

Files	Patch %	Lines
superset/models/helpers.py	15.38%	11 Missing ⚠️
superset/connectors/sqla/models.py	91.66%	1 Missing ⚠️

Additional details and impacted files

@@             Coverage Diff             @@
##           master   #29455       +/-   ##
===========================================
+ Coverage   60.48%   83.85%   +23.36%     
===========================================
  Files        1931      519     -1412     
  Lines       76236    37440    -38796     
  Branches     8568        0     -8568     
===========================================
- Hits        46114    31395    -14719     
+ Misses      28017     6045    -21972     
+ Partials     2105        0     -2105

Flag	Coverage Δ
hive	`49.13% <32.35%> (-0.03%)`	⬇️
javascript	`?`
mysql	`77.15% <44.11%> (?)`
postgres	`77.23% <44.11%> (?)`
presto	`53.78% <61.76%> (-0.02%)`	⬇️
python	`83.85% <64.70%> (+20.36%)`	⬆️
sqlite	`76.71% <44.11%> (?)`
unit	`59.74% <44.11%> (+2.12%)`	⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.


joeyJsonar · 2024-10-24T02:43:17Z

Just a comment that this works after merging it into our branch (we have Apache Drill set up to call our proprietary DB, which can interface using Mongo syntax).

I know there are 340 pending PRs, but I want to add a vote in favor of merging this.

Edit:

Heatmap is not working

SELECT
  NEARESTDATE(`Timestamp`, 'DAY') AS `Timestamp__`,
  `DB User Name` AS `DB User Name`,
  count(DISTINCT `Timestamp`) AS `COUNT_DISTINCT(Timestamp)`
FROM `mongo`.`sonar_log`.`audit_log`
GROUP BY
  NEARESTDATE(`Timestamp`, 'DAY'),
  `DB User Name`
ORDER BY
  `Timestamp` ASC,
  `DB User Name` ASC
LIMIT 10000;

fhyy · 2024-10-28T08:18:05Z

Edit:

* [ ]  Heatmap is not working

SELECT
  NEARESTDATE(`Timestamp`, 'DAY') AS `Timestamp__`,
  `DB User Name` AS `DB User Name`,
  count(DISTINCT `Timestamp`) AS `COUNT_DISTINCT(Timestamp)`
FROM `mongo`.`sonar_log`.`audit_log`
GROUP BY
  NEARESTDATE(`Timestamp`, 'DAY'),
  `DB User Name`
ORDER BY
  `Timestamp` ASC,
  `DB User Name` ASC
LIMIT 10000;

Hi @joeyJsonar, can you clarify what you mean by "Heatmap is not working" in this example?

The query does not seem to use any aliases that would break it.
GROUP BY NEARESTDATE('Timestamp', 'DAY') instead of GROUP BY 'Timestamp__' is a bit silly but should not be a problem, and it is not an issue this PR intended to fix (nor cause).

Thanks!

villebro

We already have a mechanism for this in the BaseEngineSpec._mutate_label method. Apologies for the name of it, but it does precisely this. For instance, ClickHouse has the same issue, and it's solved by suffixing the label with a snippet of the MD5 hash of the original label name.
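
For illustration, the MD5-suffix approach mentioned here could look roughly like the sketch below. This is not the code of BaseEngineSpec._mutate_label or of the ClickHouse engine spec; the function name `mutate_label_with_md5`, the underscore separator and the six-character digest length are all assumptions.

```python
import hashlib


def mutate_label_with_md5(label: str, digest_chars: int = 6) -> str:
    """Append a short, deterministic MD5 snippet to a label.

    The suffix is derived from the original label, so the same input always
    yields the same mutated label, and the result is unlikely to collide
    with a source column name.
    """
    # Hash the original label and keep only the first few hex characters.
    digest = hashlib.md5(label.encode("utf-8")).hexdigest()
    return f"{label}_{digest[:digest_chars]}"


mutated = mutate_label_with_md5("n_name")
print(f"SELECT length(n_name) AS {mutated} ... GROUP BY length(n_name)")
```

Compared with a fixed "__" suffix, a hash-based suffix makes an accidental clash with a real column name less likely, at the cost of a less readable alias.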

Fredrik Hyyrynen and others added 7 commits June 30, 2024 16:23

Corrected documentation of attribute allows_alias_to_source_column

91fb78b

The value is true if the engine is able to pick the source column for aggregation clauses used in ORDER BY when a column in SELECT has an alias that is the same as a source column.

Corrected documentation of attribute allows_alias_in_orderby

2281a73

Removed trailing whitespace

3dd2ec8

Corrected type of input for groupby expressions

aaf72cd

Shortened the name and inverted meaning of config parameters.

627c2ee

The new name is shorter but also more true to the actual usage. Updated documentation and implementation.

Merge branch 'apache:master' into master

aa5a308

pull-request-size bot added the size/L label Jul 2, 2024

dosubot bot added the sqllab Namespace | Anything related to the SQL Lab label Jul 2, 2024

michael-s-molina requested review from betodealmeida, kgabryje and villebro July 2, 2024 17:19

fhyy changed the title ~~[WIP] fix: added parameter to rename aliases of aggregated columns used in GROUP BY statments~~ fix: added parameter to rename aliases of aggregated columns used in GROUP BY statments Jul 3, 2024

This was referenced Jul 3, 2024

fix: added parameter to optionally rename aliases of aggregated columns used in GROUP BY statments #28444

Closed

Drill uses aliases in group by statements generated by SQLA resulting in no data when aggregating columns #28443

Open

villebro requested changes Oct 28, 2024
