[Feature] Group concat support order by and distinct #28778

Merged on Aug 24, 2023 (22 commits)

@fzhedu fzhedu commented Aug 7, 2023

Support group_concat(distinct x1, x2 order by y1, y2 separator s).

The arguments are passed in the order x1, x2, s, y1, y2; only x1, x2, and s contribute to the final output.
DISTINCT applies only to x1 and x2, and NULL values in x1 and x2 are rejected.

mysql> select group_concat(name), group_concat(distinct name order by 1 separator '/') from ss group by id order by 1;
+------------------------------------------------+-------------------------------------------------------------+
| group_concat(name SEPARATOR ',')               | group_concat(DISTINCT name ORDER BY name ASC SEPARATOR '/') |
+------------------------------------------------+-------------------------------------------------------------+
| NULL                                           | NULL                                                        |
| May,Ti,欧阳诸葛方程                            | May/Ti/欧阳诸葛方程                                         |
| Ti                                             | Ti                                                          |
| Tom,Tom                                        | Tom                                                         |
| Tom,Tom,王武程咬金                             | Tom/王武程咬金                                              |
| 张三此地无银三百两,张三掩耳盗铃                | 张三掩耳盗铃/张三此地无银三百两                             |
| 李四大闹天空                                   | 李四大闹天空                                                |
+------------------------------------------------+-------------------------------------------------------------+
7 rows in set (0.08 sec)
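
To make the semantics in the example above concrete, here is a minimal single-column sketch in C++. It is not the StarRocks implementation: the function name `group_concat_sketch` and its parameters are hypothetical, and it assumes that ORDER BY sorts on the concatenated value itself (as in `order by 1` above) using byte-wise string comparison.

```cpp
#include <algorithm>
#include <cstddef>
#include <optional>
#include <string>
#include <vector>

// Hypothetical helper illustrating group_concat(distinct x order by x separator s)
// for one group. A NULL input is modeled as std::nullopt.
std::optional<std::string> group_concat_sketch(const std::vector<std::optional<std::string>>& values,
                                               bool distinct, bool order_by_value,
                                               const std::string& separator) {
    std::vector<std::string> kept;
    for (const auto& v : values) {
        if (!v.has_value()) continue;  // NULL inputs are rejected
        if (distinct && std::find(kept.begin(), kept.end(), *v) != kept.end()) continue;
        kept.push_back(*v);
    }
    // A group with no non-NULL values yields NULL, not an empty string.
    if (kept.empty()) return std::nullopt;
    if (order_by_value) std::sort(kept.begin(), kept.end());

    std::string out;
    for (std::size_t i = 0; i < kept.size(); ++i) {
        if (i > 0) out += separator;
        out += kept[i];
    }
    return out;
}
```

With the inputs `Tom`, `Tom` and separator `,` (no DISTINCT) this produces `Tom,Tom`, while DISTINCT collapses the duplicates, which mirrors the `Tom,Tom` versus `Tom` row in the result table.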

@fzhedu fzhedu requested a review from a team as a code owner August 7, 2023 14:58
@mergify mergify bot assigned fzhedu Aug 7, 2023
@wanpengfei-git wanpengfei-git added the documentation Improvements or additions to documentation label Aug 9, 2023
Comment on lines 147 to 149
-- result:
-5711937174881
-5714598445053
-- !result

Contributor Author

The group_concat result changed from '4, 4' to '4,4', so its size changed from 4 to 3.

Returns a VARCHAR value.
Returns a string value for each group, but returns NULL if there are no non-NULL values.

Set `group_concat_max_len` to limit the length of the output string for each group. Its default value is 1024 and its minimum value is 4.

Contributor

Could you give an example that explains how to use this?

Contributor Author

added
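
As a rough illustration of the limit described in the documentation line quoted in this thread, the sketch below applies a maximum length to one group's output. Only the default (1024) and the minimum (4) come from that line. The names `effective_max_len` and `limit_output` are hypothetical, and two behaviors are assumptions that this thread does not confirm: that the length is counted in bytes, and that a requested value below the minimum is raised to 4 rather than rejected.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>

constexpr std::size_t kDefaultMaxLen = 1024;  // default stated in the docs line
constexpr std::size_t kMinMaxLen = 4;         // minimum stated in the docs line

// Assumption: values below the minimum are raised to the minimum.
std::size_t effective_max_len(std::size_t requested) {
    return std::max(requested, kMinMaxLen);
}

// Assumption: the limit is applied to the byte length of the group's output.
std::string limit_output(const std::string& group_output, std::size_t max_len = kDefaultMaxLen) {
    return group_output.substr(0, effective_max_len(max_len));
}
```

Under these assumptions, a limit of 4 keeps only the first four bytes of a longer group result, and outputs shorter than the limit pass through unchanged.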

DCHECK(state.output_col_num > 0);
for (auto i = 0; i < state.output_col_num; ++i) {
    if (UNLIKELY(!is_string_type(ctx->get_arg_type(i)->type))) {
        ctx->set_error(fmt::format("{}-th input of group_concat should be string type.", i + 1).c_str(), false);

Contributor

What is the behavior of this? Should this check be done here?

Contributor Author

It is a safety check at creation time; if it fails, the aggregation reports an error and stops.

// redundant columns in intermediate results. For example, group_concat(a,b order by 1,2) is rewritten to
// group_concat(cast(a to string), cast(b to string) order by a, b), which keeps 4 columns, although only 2 columns
// need to be kept in intermediate results.
// 3. refactor the order-by and distinct functions into a combinator to clean up the code.

Contributor

  1. Could the order by on a be skipped if a is already sorted?
    group_concat(a order by 1) c

Contributor Author

That may be impossible in hash partition mode, because a is distributed across several nodes.

class GroupConcatAggregateFunctionV2
        : public AggregateFunctionBatchHelper<GroupConcatAggregateStateV2, GroupConcatAggregateFunctionV2> {
public:
    // group_concat(a, b order by c, d), the arguments are a,b,',',c,d

Contributor

Why is the extra ',' column needed?

Contributor Author

',' is the separator, which we support as a variable. If it were not stored as a column, we would need some other new mechanism.
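
The argument order discussed in this thread (output columns first, then the separator, then the ORDER BY keys) can be sketched as a small index calculation. This is only an illustration under that stated ordering: the struct `ArgLayoutSketch` and the function `split_args` are hypothetical names, and only `output_col_num` echoes an identifier that appears in the reviewed code.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical description of where each kind of argument sits.
struct ArgLayoutSketch {
    std::size_t output_col_num;   // value columns plus the separator column
    std::size_t separator_index;  // position of the separator argument
    std::size_t first_order_by;   // position of the first ORDER BY key
};

// For group_concat(a, b order by c, d) the arguments are a, b, ',', c, d.
ArgLayoutSketch split_args(std::size_t num_args, std::size_t order_by_num) {
    // Precondition: at least one value column and the separator are present.
    assert(num_args >= order_by_num + 2);
    ArgLayoutSketch layout{};
    layout.output_col_num = num_args - order_by_num;
    layout.separator_index = layout.output_col_num - 1;
    layout.first_order_by = layout.output_col_num;
    return layout;
}
```

For five arguments with two ORDER BY keys, the first three positions are output columns, the separator sits at index 2, and the sort keys start at index 3.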

if (ctx->get_is_distinct()) {
    for (auto row_id = 0; row_id < elem_size; row_id++) {
        bool is_duplicated = false;
        for (auto next_id = row_id + 1; next_id < elem_size; next_id++) {

Contributor

  1. Maybe use a hash set to avoid repeated comparisons?
  2. What if the distinct column has already been sorted above?

Contributor Author

The final result is usually not large after the global distinct, so I left it as a TODO.
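
For reference, the hash-set idea raised by the reviewer could look roughly like the sketch below. It is not code from this PR: `distinct_keep_first` is a hypothetical name, rows are simplified to single strings, and it keeps the first occurrence of each value, which may differ from which duplicate the nested loop in the PR retains.

```cpp
#include <string>
#include <unordered_set>
#include <vector>

// Removes duplicates in expected O(n) time instead of comparing every pair of rows.
// Input order of the surviving rows is preserved.
std::vector<std::string> distinct_keep_first(const std::vector<std::string>& rows) {
    std::unordered_set<std::string> seen;
    std::vector<std::string> out;
    for (const auto& row : rows) {
        // insert() reports whether the value was newly added.
        if (seen.insert(row).second) {
            out.push_back(row);
        }
    }
    return out;
}
```

The trade-off is extra memory for the set, which matters little when the per-group result is small, as the author notes.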

state_impl.release_order_by_columns();
DCHECK(ctx->state()->cancelled_ref() || st.ok());
for (auto i = 0; i < output_col_num; ++i) {
    materialize_column_by_permutation(outputs[i].get(), {(*state_impl.data_columns)[i]}, perm);

Contributor

Is it possible to late-materialize the columns in the final output?

Contributor Author

It is determined by the ratio of repeated tuples in a chunk: if there are many repeated tuples, doing the distinct first is better; otherwise, sorting first is better.
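
The two orderings mentioned in the reply can be sketched as follows. Both functions are hypothetical and assume the sort key is the distinct value itself; they return the same result and differ only in how much data each step has to process.

```cpp
#include <algorithm>
#include <string>
#include <unordered_set>
#include <vector>

// Distinct first: with many repeated tuples, the sort only sees the unique values.
std::vector<std::string> distinct_then_sort(const std::vector<std::string>& rows) {
    std::unordered_set<std::string> unique_rows(rows.begin(), rows.end());
    std::vector<std::string> out(unique_rows.begin(), unique_rows.end());
    std::sort(out.begin(), out.end());
    return out;
}

// Sort first: with few repeated tuples, duplicates end up adjacent and are
// removed in one linear pass, with no hash set to build.
std::vector<std::string> sort_then_distinct(std::vector<std::string> rows) {
    std::sort(rows.begin(), rows.end());
    rows.erase(std::unique(rows.begin(), rows.end()), rows.end());
    return rows;
}
```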

@@ -134,6 +134,12 @@ static const AggregateFunction* get_function(const std::string& name, LogicalTyp
        }
    }

    if (func_version > 6) {
        if (name == "group_concat") {
            func_name = "group_concat2";

Contributor

Will the performance get worse when there is no order by?

Contributor Author

The main difference lies in the intermediate results: the previous version, V1, just concatenates all strings per group, whereas the new version, V2, stores the intermediate strings in struct{array[]}, which adds the cost of the array offsets, one offset per group. So the cost may not be large if the group is not large; otherwise it can be.
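
A minimal sketch of the array-with-offsets layout described in the reply is shown below. The type `ArrayColumnSketch` and its members are hypothetical stand-ins rather than StarRocks classes; the point is only that each group adds one offset entry on top of the stored strings.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical flattened array column: all groups share one element vector,
// and offsets[g] .. offsets[g + 1] delimit the elements of group g.
struct ArrayColumnSketch {
    std::vector<std::string> elements;
    std::vector<std::size_t> offsets{0};

    // Appends one group's strings and records where the group ends.
    void append_group(const std::vector<std::string>& values) {
        elements.insert(elements.end(), values.begin(), values.end());
        offsets.push_back(elements.size());
    }

    std::size_t group_count() const { return offsets.size() - 1; }

    // Copies out the strings of group g (g must be less than group_count()).
    std::vector<std::string> group(std::size_t g) const {
        auto first = elements.begin() + static_cast<std::ptrdiff_t>(offsets[g]);
        auto last = elements.begin() + static_cast<std::ptrdiff_t>(offsets[g + 1]);
        return std::vector<std::string>(first, last);
    }
};
```

Keeping the strings separate until the final step is what allows sorting and de-duplication later, at the price of the offsets vector.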

@wanpengfei-git
Collaborator

[FE PR Coverage Check]

😍 pass : 56 / 62 (90.32%)

file detail

| path | covered_line | new_line | coverage | not_covered_line_detail |
|---|---|---|---|---|
| 🔵 com/starrocks/sql/optimizer/rule/transformation/SplitAggregateRule.java | 3 | 5 | 60.00% | [320, 321] |
| 🔵 com/starrocks/sql/analyzer/FunctionAnalyzer.java | 4 | 6 | 66.67% | [126, 135] |
| 🔵 com/starrocks/sql/optimizer/rule/transformation/RewriteMultiDistinctByCTERule.java | 3 | 4 | 75.00% | [289] |
| 🔵 com/starrocks/sql/analyzer/ExpressionAnalyzer.java | 20 | 21 | 95.24% | [1246] |
| 🔵 com/starrocks/catalog/AggregateFunction.java | 7 | 7 | 100.00% | [] |
| 🔵 com/starrocks/qe/SessionVariable.java | 1 | 1 | 100.00% | [] |
| 🔵 com/starrocks/catalog/FunctionSet.java | 2 | 2 | 100.00% | [] |
| 🔵 com/starrocks/analysis/FunctionCallExpr.java | 3 | 3 | 100.00% | [] |
| 🔵 com/starrocks/analysis/FunctionParams.java | 1 | 1 | 100.00% | [] |
| 🔵 com/starrocks/sql/analyzer/AstToStringBuilder.java | 11 | 11 | 100.00% | [] |
| 🔵 com/starrocks/sql/optimizer/operator/AggType.java | 1 | 1 | 100.00% | [] |

packy92 previously approved these changes Aug 21, 2023

fzhedu commented Aug 22, 2023

The admit test failed because the output format changed; it will be fixed by https://github.com/StarRocks/StarRocksTest/pull/3738

satanson previously approved these changes Aug 23, 2023

void update_batch_single_state(FunctionContext* ctx, size_t chunk_size, const Column** columns,
                               AggDataPtr __restrict state) const override {
    GroupConcatAggregateStateV2& state_impl = this->data(state);

Contributor

Use the derived template type parameter instead of the concrete type.

Contributor Author

ok

// group_concat(a, b order by c, d), the arguments are a,b,',',c,d
void create_impl(FunctionContext* ctx, GroupConcatAggregateStateV2& state) const {
    auto num = ctx->get_num_args();
    state.data_columns = new Columns;

Contributor

Use unique_ptr or shared_ptr instead of a raw pointer.

Contributor Author

ok
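
One way the suggested ownership change might look is sketched below. `Columns` here is a stand-in alias (a vector of string vectors), not the StarRocks type, and `StateSketch` and `create_sketch` are hypothetical names; the sketch only shows the allocation moving from `new` to `std::make_unique`.

```cpp
#include <cstddef>
#include <memory>
#include <string>
#include <vector>

// Stand-in for the real column container; an assumption for illustration only.
using Columns = std::vector<std::vector<std::string>>;

struct StateSketch {
    // The state owns its columns; they are released when the state is destroyed,
    // so no matching delete is needed on any code path.
    std::unique_ptr<Columns> data_columns;
};

// Allocates one empty column per argument.
void create_sketch(StateSketch& state, std::size_t num_columns) {
    state.data_columns = std::make_unique<Columns>(num_columns);
}
```

A `std::shared_ptr` would also work, but `std::unique_ptr` is usually enough when a single state object owns the columns.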

    // just copy the first const value.
    data_col = down_cast<const ConstColumn*>(columns[i])->data_column().get();
    tmp_row_num = 0;
}

Contributor

A branch for processing NullableColumn is missing.

Contributor Author

A nullable column can be updated in the state, because the data columns in the state are nullable.

fzhedu added 15 commits August 23, 2023 11:11
Signed-off-by: Zhuhe Fang <fzhedu@gmail.com>
…t col
fzhedu added 2 commits August 23, 2023 11:13
fzhedu added 2 commits August 23, 2023 11:27
@sonarqubecloud

SonarCloud Quality Gate failed.

Bug: A (0 Bugs)
Vulnerability: A (0 Vulnerabilities)
Security Hotspot: A (0 Security Hotspots)
Code Smell: B (24 Code Smells)

0.0% Coverage
0.0% Duplication

Warning: The version of Java (11.0.20) you have used to run this analysis is deprecated and we will stop accepting it soon. Please update to at least Java 17.

@wanpengfei-git
Collaborator

[FE Incremental Coverage Report]

😍 pass : 59 / 62 (95.16%)

file detail

| path | covered_line | new_line | coverage | not_covered_line_detail |
|---|---|---|---|---|
| 🔵 com/starrocks/sql/optimizer/rule/transformation/RewriteMultiDistinctByCTERule.java | 3 | 4 | 75.00% | [289] |
| 🔵 com/starrocks/sql/analyzer/FunctionAnalyzer.java | 5 | 6 | 83.33% | [126] |
| 🔵 com/starrocks/sql/analyzer/ExpressionAnalyzer.java | 20 | 21 | 95.24% | [1246] |
| 🔵 com/starrocks/catalog/AggregateFunction.java | 7 | 7 | 100.00% | [] |
| 🔵 com/starrocks/sql/optimizer/rule/transformation/SplitAggregateRule.java | 5 | 5 | 100.00% | [] |
| 🔵 com/starrocks/qe/SessionVariable.java | 1 | 1 | 100.00% | [] |
| 🔵 com/starrocks/catalog/FunctionSet.java | 2 | 2 | 100.00% | [] |
| 🔵 com/starrocks/analysis/FunctionCallExpr.java | 3 | 3 | 100.00% | [] |
| 🔵 com/starrocks/analysis/FunctionParams.java | 1 | 1 | 100.00% | [] |
| 🔵 com/starrocks/sql/analyzer/AstToStringBuilder.java | 11 | 11 | 100.00% | [] |
| 🔵 com/starrocks/sql/optimizer/operator/AggType.java | 1 | 1 | 100.00% | [] |

@wanpengfei-git
Collaborator

[BE Incremental Coverage Report]

😞 fail : 189 / 275 (68.73%)

file detail

| path | covered_line | new_line | coverage | not_covered_line_detail |
|---|---|---|---|---|
| 🔵 src/exprs/agg/nullable_aggregate.h | 1 | 7 | 14.29% | [751, 779, 780, 781, 875, 876] |
| 🔵 src/exprs/agg/factory/aggregate_factory.cpp | 1 | 3 | 33.33% | [138, 139] |
| 🔵 src/exprs/agg/group_concat.h | 175 | 251 | 69.72% | [316, 358, 359, 363, 364, 367, 376, 377, 378, 379, 380, 391, 432, 435, 436, 437, 439, 440, 441, 444, 445, 450, 451, 452, 453, 456, 457, 459, 460, 461, 462, 465, 466, 467, 470, 471, 472, 473, 474, 516, 517, 520, 534, 594, 595, 601, 602, 632, 633, 643, 644, 645, 646, 647, 648, 649, 650, 653, 654, 655, 658, 664, 681, 688, 689, 697, 698, 699, 700, 701, 702, 704, 706, 707, 708, 716] |
| 🔵 src/exec/aggregator.cpp | 5 | 7 | 71.43% | [139, 140] |
| 🔵 src/exprs/agg/factory/aggregate_factory.hpp | 2 | 2 | 100.00% | [] |
| 🔵 src/exprs/function_context.h | 3 | 3 | 100.00% | [] |
| 🔵 src/exprs/function_context.cpp | 1 | 1 | 100.00% | [] |
| 🔵 src/exprs/agg/factory/aggregate_resolver_others.cpp | 1 | 1 | 100.00% | [] |

@fzhedu fzhedu merged commit 34b655c into StarRocks:main Aug 24, 2023

fzhedu commented Aug 24, 2023

@mergify backport branch-3.1

mergify bot commented Aug 24, 2023

backport branch-3.1

✅ Backports have been created

fzhedu commented Aug 24, 2023

@mergify backport branch-3.0

mergify bot commented Aug 24, 2023

backport branch-3.0

✅ Backports have been created

fzhedu commented Aug 24, 2023

@mergify backport branch-2.5

mergify bot commented Aug 24, 2023

backport branch-2.5

✅ Backports have been created

fzhedu added a commit to fzhedu/starrocks that referenced this pull request Aug 25, 2023
[Feature] Group concat support order by and distinct
fzhedu added a commit to fzhedu/starrocks that referenced this pull request Aug 25, 2023
[Feature] Group concat support order by and distinct

Signed-off-by: Zhuhe Fang <fzhedu@gmail.com>
fzhedu added a commit that referenced this pull request Aug 25, 2023
[Feature] Group concat support order by and distinct (backport #28778)
fzhedu added a commit to fzhedu/starrocks that referenced this pull request Aug 26, 2023
[Feature] Group concat support order by and distinct
fzhedu added a commit to fzhedu/starrocks that referenced this pull request Aug 26, 2023
[Feature] Group concat support order by and distinct

Signed-off-by: Zhuhe Fang <fzhedu@gmail.com>
fzhedu added a commit that referenced this pull request Aug 31, 2023
#29927)

* Merge pull request #28778 from fzhedu/groupConcat

[Feature] Group concat support order by and distinct

Signed-off-by: Zhuhe Fang <fzhedu@gmail.com>
@jaogoy jaogoy requested a review from wangsimo0 September 13, 2023 12:03
Labels: behavior_changed, documentation
7 participants