-
Notifications
You must be signed in to change notification settings - Fork 1.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Feature] Group concat support order by and distinct #28778
Conversation
-- result: | ||
-5711937174881 | ||
-5714598445053 | ||
-- !result |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
group_concat 'result ‘4, 4’ -> '4,4' size is changed from 4 to 3.
Returns a VARCHAR value. | ||
Returns a string value for each group, but returns NULL if there are no non-NULL values. | ||
|
||
set `group_concat_max_len` to limit the length of output string from a group, its default value is 1024, minimal value is 4. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Give an example to explain how to use this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
added
be/src/exprs/agg/group_concat.h
Outdated
DCHECK(state.output_col_num > 0); | ||
for (auto i = 0; i < state.output_col_num; ++i) { | ||
if (UNLIKELY(!is_string_type(ctx->get_arg_type(i)->type))) { | ||
ctx->set_error(fmt::format("{}-th input of group_concat should be string type.", i + 1).c_str(), false); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What's the behavior of this? this should not check here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
safety check at create, if error, the agg will report error and stop.
// redundancy columns in intermediate results. For example, group_concat(a,b order by 1,2) is rewritten to | ||
// group_concat(cast(a to string), cast(b to string) order by a, b), resulting to keeping 4 columns, but it only needs | ||
// keep 2 columns in intermediate results. | ||
// 3. refactor order-by and distinct function to a combinator to clean the code. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- skip to order by a if a is already sorted?
group_concat(a order by 1) c
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it may be impossible in hash partition mode, as a
is distributed on several node.
class GroupConcatAggregateFunctionV2 | ||
: public AggregateFunctionBatchHelper<GroupConcatAggregateStateV2, GroupConcatAggregateFunctionV2> { | ||
public: | ||
// group_concat(a, b order by c, d), the arguments are a,b,',',c,d |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why need extra ,
column ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
,
is the separator, we support is as a varable. If not store as a column, we may need other new way.
if (ctx->get_is_distinct()) { | ||
for (auto row_id = 0; row_id < elem_size; row_id++) { | ||
bool is_duplicated = false; | ||
for (auto next_id = row_id + 1; next_id < elem_size; next_id++) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- Maybe use
hashset
to avoid repeat compare? - What if the distinct column has been sorted above?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the final resut usually is not large after the global distinct, so I let it as a TODO.
state_impl.release_order_by_columns(); | ||
DCHECK(ctx->state()->cancelled_ref() || st.ok()); | ||
for (auto i = 0; i < output_col_num; ++i) { | ||
materialize_column_by_permutation(outputs[i].get(), {(*state_impl.data_columns)[i]}, perm); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is that possible late materialize column in the final output?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it is determited by the repeated ratio for a chunk. if more repeated tuples, do distinct first is better, otherwise sort first is better.
fe/fe-core/src/main/java/com/starrocks/sql/analyzer/FunctionAnalyzer.java
Show resolved
Hide resolved
@@ -134,6 +134,12 @@ static const AggregateFunction* get_function(const std::string& name, LogicalTyp | |||
} | |||
} | |||
|
|||
if (func_version > 6) { | |||
if (name == "group_concat") { | |||
func_name = "group_concat2"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the performance will get worse when has none orderby?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The main difference lay at the intermediate results, previous way V1 just concat all strings per group, but the new way V2 store intermediate strings in struct{array[]}, with extra array's offsets costs, one offset per group. So the cost may be not large if group is not large, otherwise not.
[FE PR Coverage Check]😍 pass : 56 / 62 (90.32%) file detail
|
admit test failed due to changing output formats, will fixed by https://github.com/StarRocks/StarRocksTest/pull/3738 |
be/src/exprs/agg/group_concat.h
Outdated
|
||
void update_batch_single_state(FunctionContext* ctx, size_t chunk_size, const Column** columns, | ||
AggDataPtr __restrict state) const override { | ||
GroupConcatAggregateStateV2& state_impl = this->data(state); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
use derived template type parameter instead concrete type
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok
be/src/exprs/agg/group_concat.h
Outdated
// group_concat(a, b order by c, d), the arguments are a,b,',',c,d | ||
void create_impl(FunctionContext* ctx, GroupConcatAggregateStateV2& state) const { | ||
auto num = ctx->get_num_args(); | ||
state.data_columns = new Columns; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
use unique_ptr or shared_ptr instead raw pointer
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok
// just copy the first const value. | ||
data_col = down_cast<const ConstColumn*>(columns[i])->data_column().get(); | ||
tmp_row_num = 0; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Missing branch for processing NullableColumn
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nullable column can be update in the state, as the data columns in state are nullable.
Signed-off-by: Zhuhe Fang <fzhedu@gmail.com>
Signed-off-by: Zhuhe Fang <fzhedu@gmail.com>
Signed-off-by: Zhuhe Fang <fzhedu@gmail.com>
Signed-off-by: Zhuhe Fang <fzhedu@gmail.com>
Signed-off-by: Zhuhe Fang <fzhedu@gmail.com>
Signed-off-by: Zhuhe Fang <fzhedu@gmail.com>
Signed-off-by: Zhuhe Fang <fzhedu@gmail.com>
Signed-off-by: Zhuhe Fang <fzhedu@gmail.com>
…t col Signed-off-by: Zhuhe Fang <fzhedu@gmail.com>
Signed-off-by: Zhuhe Fang <fzhedu@gmail.com>
Signed-off-by: Zhuhe Fang <fzhedu@gmail.com>
Signed-off-by: Zhuhe Fang <fzhedu@gmail.com>
Signed-off-by: Zhuhe Fang <fzhedu@gmail.com>
Signed-off-by: Zhuhe Fang <fzhedu@gmail.com>
Signed-off-by: Zhuhe Fang <fzhedu@gmail.com>
Signed-off-by: Zhuhe Fang <fzhedu@gmail.com>
Signed-off-by: Zhuhe Fang <fzhedu@gmail.com>
SonarCloud Quality Gate failed.
|
[FE Incremental Coverage Report]😍 pass : 59 / 62 (95.16%) file detail
|
[BE Incremental Coverage Report]😞 fail : 189 / 275 (68.73%) file detail
|
@mergify backport branch-3.1 |
✅ Backports have been created
|
@mergify backport branch-3.0 |
✅ Backports have been created
|
@mergify backport branch-2.5 |
✅ Backports have been created
|
[Feature] Group concat support order by and distinct
[Feature] Group concat support order by and distinct Signed-off-by: Zhuhe Fang <fzhedu@gmail.com>
[Feature] Group concat support order by and distinct (backport #28778)
[Feature] Group concat support order by and distinct
[Feature] Group concat support order by and distinct Signed-off-by: Zhuhe Fang <fzhedu@gmail.com>
support group_concat(distinct x1, x2 order by y1,y2, separator s)
the arguments are listed as : x1, x2, s, y1, y2, output x1, x2, s at last.
the distinct just works on x1, x2, and reject null on x1, x2.
What type of PR is this:
Checklist:
Bugfix cherry-pick branch check: