[enhancement](Nereids) check multiple distinct functions that cannot be transformed into muti_distinct #21626

Merged
merged 1 commit into from
Jul 24, 2023

Contributor

@keanji-x keanji-x commented Jul 7, 2023

Proposed changes

This commit adds a rewrite for SQL queries that contain multiple distinct aggregate functions. When those functions are applied to more than one distinct set of arguments, they are converted into multi_distinct functions for more efficient handling.

Example:

SELECT COUNT(DISTINCT c1), SUM(DISTINCT c2) FROM tbl GROUP BY c3
-- Transformed to
SELECT MULTI_DISTINCT_COUNT(c1), MULTI_DISTINCT_SUM(c2) FROM tbl GROUP BY c3

The following functions can be transformed:

  • COUNT
  • SUM
  • AVG
  • GROUP_CONCAT

If such a query uses a distinct aggregate function that is not in this list, an error is now reported during the optimization phase.

To guarantee that no such case slips through, a final check runs after the rewriting phase.
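
The rewrite and the final check described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `MultiDistinctSketch`, `AggCall`, `rewrite` and `finalCheck` are hypothetical stand-ins for the Nereids classes, the `MULTI_DISTINCT_` prefix appears in the description only for COUNT and SUM (using it for AVG and GROUP_CONCAT is an assumption), and the all-or-nothing rewrite is a simplification chosen so that the check only needs to look at the rewritten functions.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch; none of these types are the real Nereids classes.
class MultiDistinctSketch {
    /** Simplified aggregate call, e.g. COUNT(DISTINCT c1). */
    static final class AggCall {
        final String name;
        final boolean distinct;
        final List<String> args;

        AggCall(String name, boolean distinct, String... args) {
            this.name = name;
            this.distinct = distinct;
            this.args = Arrays.asList(args);
        }
    }

    // Function list taken from the PR description.
    static final Set<String> SUPPORTED =
            new HashSet<>(Arrays.asList("COUNT", "SUM", "AVG", "GROUP_CONCAT"));

    /** Collects the argument lists used by DISTINCT aggregate calls. */
    static Set<List<String>> distinctArguments(List<AggCall> calls) {
        Set<List<String>> result = new HashSet<>();
        for (AggCall c : calls) {
            if (c.distinct) {
                result.add(c.args);
            }
        }
        return result;
    }

    /** Rewrites DISTINCT calls to MULTI_DISTINCT_* when more than one argument list is involved. */
    static List<AggCall> rewrite(List<AggCall> calls) {
        if (distinctArguments(calls).size() <= 1) {
            return calls; // a single distinct argument list needs no rewrite
        }
        for (AggCall c : calls) {
            if (c.distinct && !SUPPORTED.contains(c.name)) {
                return calls; // assumed all-or-nothing: leave the calls for the final check
            }
        }
        List<AggCall> out = new ArrayList<>();
        for (AggCall c : calls) {
            out.add(c.distinct
                    ? new AggCall("MULTI_DISTINCT_" + c.name, false, c.args.toArray(new String[0]))
                    : c);
        }
        return out;
    }

    /** Final check: after rewriting, at most one distinct argument list may remain. */
    static void finalCheck(List<AggCall> calls) {
        if (distinctArguments(calls).size() > 1) {
            throw new IllegalStateException(
                    "multiple distinct functions could not be transformed into multi_distinct");
        }
    }

    public static void main(String[] args) {
        List<AggCall> rewritten = rewrite(Arrays.asList(
                new AggCall("COUNT", true, "c1"),
                new AggCall("SUM", true, "c2")));
        finalCheck(rewritten);
        for (AggCall c : rewritten) {
            // prints MULTI_DISTINCT_COUNT(c1) then MULTI_DISTINCT_SUM(c2)
            System.out.println(c.name + "(" + String.join(", ", c.args) + ")");
        }
    }
}
```

In this sketch a query that mixes `COUNT(DISTINCT c1)` with an unsupported distinct function on another column is left unchanged by `rewrite`, so `finalCheck` raises the error.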

@keanji-x
Contributor Author

keanji-x commented Jul 7, 2023

run buildall

keanji-x force-pushed the check_multi_distinct branch from 91e691f to 6a398f8 on July 7, 2023 09:42
keanji-x changed the title from "Multiple distinct aggregate functions that cannot be transformed into…" to "[enhancement](Nereids) check multiple distinct functions that cannot be transformed into muti_distinct" on Jul 7, 2023
.filter(AggregateFunction::isDistinct)
.collect(Collectors.toList());

Set<Expression> arguments = distinctFuncs.stream().flatMap(expr -> expr.children().stream())
Contributor

we have a helper function to do this: agg.getDistinctArguments();

Contributor Author

fixed
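
For readers without the codebase at hand, the review suggestion amounts to moving the inline stream pipeline into a helper on the aggregate node. The sketch below is hypothetical: `DistinctArgsSketch`, `Func` and `Agg` are stand-in types, and only the name `getDistinctArguments` comes from the review comment; its return type and behavior here are assumptions.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

// Hypothetical sketch; Func and Agg are not the real Nereids classes.
class DistinctArgsSketch {
    /** Stand-in for an aggregate function expression. */
    static final class Func {
        final String name;
        final boolean distinct;
        final List<String> children;

        Func(String name, boolean distinct, String... children) {
            this.name = name;
            this.distinct = distinct;
            this.children = Arrays.asList(children);
        }

        boolean isDistinct() {
            return distinct;
        }

        List<String> children() {
            return children;
        }
    }

    /** Stand-in for the aggregate plan node that owns the functions. */
    static final class Agg {
        final List<Func> functions;

        Agg(Func... functions) {
            this.functions = Arrays.asList(functions);
        }

        /** Assumed behavior: the arguments of all DISTINCT functions, de-duplicated. */
        Set<String> getDistinctArguments() {
            return functions.stream()
                    .filter(Func::isDistinct)
                    .flatMap(f -> f.children().stream())
                    .collect(Collectors.toCollection(LinkedHashSet::new));
        }
    }

    public static void main(String[] args) {
        Agg agg = new Agg(
                new Func("COUNT", true, "c1"),
                new Func("SUM", true, "c2"),
                new Func("MAX", false, "c4"));
        System.out.println(agg.getDistinctArguments()); // prints [c1, c2]
    }
}
```

Callers then only need `agg.getDistinctArguments().size() > 1` instead of repeating the filter and flatMap steps at each use site.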

keanji-x force-pushed the check_multi_distinct branch from 6a398f8 to 86b010b on July 7, 2023 10:01
@keanji-x
Contributor Author

keanji-x commented Jul 7, 2023

run buildall

keanji-x force-pushed the check_multi_distinct branch 3 times, most recently from 768764c to c684e87, on July 7, 2023 10:12
@keanji-x
Contributor Author

keanji-x commented Jul 7, 2023

run buildall

keanji-x force-pushed the check_multi_distinct branch from c684e87 to ba4dc7a on July 19, 2023 03:59
@keanji-x
Contributor Author

run buildall

@hello-stephen
Contributor

(From new machine)TeamCity pipeline, clickbench performance test result:
the sum of best hot time: 53.49 seconds
stream load tsv: 509 seconds loaded 74807831229 Bytes, about 140 MB/s
stream load json: 19 seconds loaded 2358488459 Bytes, about 118 MB/s
stream load orc: 65 seconds loaded 1101869774 Bytes, about 16 MB/s
stream load parquet: 32 seconds loaded 861443392 Bytes, about 25 MB/s
insert into select: 29.9 seconds inserted 10000000 Rows, about 334K ops/s
storage size: 17168841502 Bytes

keanji-x force-pushed the check_multi_distinct branch from ba4dc7a to e9978f8 on July 20, 2023 06:19
@keanji-x
Contributor Author

run buildall

@hello-stephen
Contributor

(From new machine)TeamCity pipeline, clickbench performance test result:
the sum of best hot time: 46.11 seconds
stream load tsv: 512 seconds loaded 74807831229 Bytes, about 139 MB/s
stream load json: 20 seconds loaded 2358488459 Bytes, about 112 MB/s
stream load orc: 65 seconds loaded 1101869774 Bytes, about 16 MB/s
stream load parquet: 31 seconds loaded 861443392 Bytes, about 26 MB/s
insert into select: 29.5 seconds inserted 10000000 Rows, about 338K ops/s
storage size: 17166027937 Bytes

@keanji-x
Contributor Author

run buildall

@hello-stephen
Contributor

(From new machine)TeamCity pipeline, clickbench performance test result:
the sum of best hot time: 45.35 seconds
stream load tsv: 511 seconds loaded 74807831229 Bytes, about 139 MB/s
stream load json: 20 seconds loaded 2358488459 Bytes, about 112 MB/s
stream load orc: 65 seconds loaded 1101869774 Bytes, about 16 MB/s
stream load parquet: 30 seconds loaded 861443392 Bytes, about 27 MB/s
insert into select: 29.4 seconds inserted 10000000 Rows, about 340K ops/s
storage size: 17165683922 Bytes

@keanji-x
Contributor Author

run buildall

@hello-stephen
Contributor

(From new machine)TeamCity pipeline, clickbench performance test result:
the sum of best hot time: 46.36 seconds
stream load tsv: 508 seconds loaded 74807831229 Bytes, about 140 MB/s
stream load json: 20 seconds loaded 2358488459 Bytes, about 112 MB/s
stream load orc: 65 seconds loaded 1101869774 Bytes, about 16 MB/s
stream load parquet: 31 seconds loaded 861443392 Bytes, about 26 MB/s
insert into select: 29.4 seconds inserted 10000000 Rows, about 340K ops/s
storage size: 17166526601 Bytes

morrySnow added the dev/2.0.0 (2.0.0 release) label on Jul 24, 2023
github-actions bot added the approved (Indicates a PR has been approved by one committer.) label on Jul 24, 2023
@github-actions
Contributor

PR approved by at least one committer and no changes requested.

@github-actions
Contributor

PR approved by anyone and no changes requested.

morrySnow merged commit 68bd4a1 into apache:master on Jul 24, 2023
xiaokang added the dev/2.0.0-merged label and removed the dev/2.0.0 (2.0.0 release) label on Jul 24, 2023
xiaokang pushed a commit that referenced this pull request Jul 24, 2023
…formed into muti_distinct (#21626)

Labels
approved (Indicates a PR has been approved by one committer.), area/nereids, dev/2.0.0-merged, kind/test, reviewed
5 participants