Nn/spark 31620 #690

Nnicolini · 2020-06-11T20:26:31Z

Upstream SPARK-31620 ticket and PR link (if not applicable, explain)

apache#28496 for bug fix
apache#27368 for abstraction needed to apply bug fix

What changes were proposed in this pull request?

(Please fill in changes proposed in this fix)

How was this patch tested?

(Please explain how this patch was tested. E.g. unit tests, integration tests, manual tests)
(If this patch involves UI changes, please attach a screenshot; otherwise, remove this)

Please review http://spark.apache.org/contributing.html before opening a pull request.

…n EXPLAIN FORMATTED Currently `EXPLAIN FORMATTED` only report input attributes of HashAggregate/ObjectHashAggregate/SortAggregate, while `EXPLAIN EXTENDED` provides more information of Keys, Functions, etc. This PR enhanced `EXPLAIN FORMATTED` to sync with original explain behavior. The newly added `EXPLAIN FORMATTED` got less information comparing to the original `EXPLAIN EXTENDED` Yes, taking HashAggregate explain result as example. **SQL** ``` EXPLAIN FORMATTED SELECT COUNT(val) + SUM(key) as TOTAL, COUNT(key) FILTER (WHERE val > 1) FROM explain_temp1; ``` **EXPLAIN EXTENDED** ``` == Physical Plan == *(2) HashAggregate(keys=[], functions=[count(val#6), sum(cast(key#5 as bigint)), count(key#5)], output=[TOTAL#62L, count(key) FILTER (WHERE (val > 1))#71L]) +- Exchange SinglePartition, true, [id=#89] +- HashAggregate(keys=[], functions=[partial_count(val#6), partial_sum(cast(key#5 as bigint)), partial_count(key#5) FILTER (WHERE (val#6 > 1))], output=[count#75L, sum#76L, count#77L]) +- *(1) ColumnarToRow +- FileScan parquet default.explain_temp1[key#5,val#6] Batched: true, DataFilters: [], Format: Parquet, Location: InMemoryFileIndex[file:/Users/XXX/spark-dev/spark/spark-warehouse/explain_temp1], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<key:int,val:int> ``` **EXPLAIN FORMATTED - BEFORE** ``` == Physical Plan == * HashAggregate (5) +- Exchange (4) +- HashAggregate (3) +- * ColumnarToRow (2) +- Scan parquet default.explain_temp1 (1) ... ... (5) HashAggregate [codegen id : 2] Input: [count#91L, sum#92L, count#93L] ... ... ``` **EXPLAIN FORMATTED - AFTER** ``` == Physical Plan == * HashAggregate (5) +- Exchange (4) +- HashAggregate (3) +- * ColumnarToRow (2) +- Scan parquet default.explain_temp1 (1) ... ... (5) HashAggregate [codegen id : 2] Input: [count#91L, sum#92L, count#93L] Keys: [] Functions: [count(val#6), sum(cast(key#5 as bigint)), count(key#5)] Results: [(count(val#6)#84L + sum(cast(key#5 as bigint))#85L) AS TOTAL#78L, count(key#5)#86L AS count(key) FILTER (WHERE (val > 1))#87L] Output: [TOTAL#78L, count(key) FILTER (WHERE (val > 1))#87L] ... ... ``` Three tests added in explain.sql for HashAggregate/ObjectHashAggregate/SortAggregate. Closes apache#27368 from Eric5553/ExplainFormattedAgg. Authored-by: Eric Wu <492960551@qq.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>

…agg contains subquery Instead of using `child.output` directly, we should use `inputAggBufferAttributes` from the current agg expression for `Final` and `PartialMerge` aggregates to bind references for their `mergeExpression`. When planning aggregates, the partial aggregate uses agg fucs' `inputAggBufferAttributes` as its output, see https://github.com/apache/spark/blob/v3.0.0-rc1/sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/AggUtils.scala#L105 For final `HashAggregateExec`, we need to bind the `DeclarativeAggregate.mergeExpressions` with the output of the partial aggregate operator, see https://github.com/apache/spark/blob/v3.0.0-rc1/sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/HashAggregateExec.scala#L348 This is usually fine. However, if we copy the agg func somehow after agg planning, like `PlanSubqueries`, the `DeclarativeAggregate` will be replaced by a new instance with new `inputAggBufferAttributes` and `mergeExpressions`. Then we can't bind the `mergeExpressions` with the output of the partial aggregate operator, as it uses the `inputAggBufferAttributes` of the original `DeclarativeAggregate` before copy. Note that, `ImperativeAggregate` doesn't have this problem, as we don't need to bind its `mergeExpressions`. It has a different mechanism to access buffer values, via `mutableAggBufferOffset` and `inputAggBufferOffset`. Yes, user hit error previously but run query successfully after this change. Added a regression test. Closes apache#28496 from Ngone51/spark-31620. Authored-by: yi.wu <yi.wu@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>

Nnicolini · 2020-06-12T22:46:12Z

@rshkv this is the change we talked about earlier this week

rshkv · 2020-06-16T16:39:27Z

When cherry-picking, we're trying to rebase cherry-picks like this one. And have each commit resemble an upstream change as closely as possible. Makes it easy to reconstruct which upstream changes we have and which we don't.

Would you mind taking the full set of changes, even if apparently unrelated?
If there are merge conflicts, would you mind resolving them in the same commit?

On the failing tests: Tests are quite flaky so it's worth re-running once or twice (sorry about that). On quick glance the failures don't seem related but not sure.

Eric5553 and others added 4 commits June 11, 2020 15:58

remove unneeded changes from first commit

96cc52f

remove uneeded tests

68d9dcb

Nnicolini requested a review from rshkv June 12, 2020 22:45

rshkv closed this Jul 15, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Nn/spark 31620 #690

Nn/spark 31620 #690

Uh oh!

Nnicolini commented Jun 11, 2020 •

edited

Loading

Uh oh!

Nnicolini commented Jun 12, 2020

Uh oh!

rshkv commented Jun 16, 2020

Uh oh!

Uh oh!

Nn/spark 31620 #690

Nn/spark 31620 #690

Uh oh!

Conversation

Nnicolini commented Jun 11, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Upstream SPARK-31620 ticket and PR link (if not applicable, explain)

What changes were proposed in this pull request?

How was this patch tested?

Uh oh!

Nnicolini commented Jun 12, 2020

Uh oh!

rshkv commented Jun 16, 2020

Uh oh!

Uh oh!

Nnicolini commented Jun 11, 2020 •

edited

Loading