[SPARK-24935][SQL][followup] support INIT -> UPDATE -> MERGE -> FINISH in Hive UDAF adapter #24459

cloud-fan · 2019-04-25T13:44:10Z

What changes were proposed in this pull request?

This is a followup of #24144. #24144 missed one case: when hash aggregate falls back to sort aggregate, the life cycle of the UDAF is INIT -> UPDATE -> MERGE -> FINISH.

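However, not all Hive UDAFs support this life cycle. A Hive UDAF knows the aggregation mode when it creates the aggregation buffer, so it can create different buffers for different inputs: the original data or the aggregation buffer. Please see an example in the sketches library (https://github.com/DataSketches/sketches-hive/blob/7f9e76e9e03807277146291beb2c7bec40e8672b/src/main/java/com/yahoo/sketches/hive/cpc/DataToSketchUDAF.java#L107). The buffer created for UPDATE may not support MERGE.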

This PR updates the Hive UDAF adapter in Spark to support INIT -> UPDATE -> MERGE -> FINISH by turning it into INIT -> UPDATE -> FINISH + INIT -> MERGE -> FINISH.
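
The splitting described above can be illustrated with a small, self-contained toy model. This is only a sketch under stated assumptions: `ToyMode`, `ToyAggBuffer`, `ToySumUdaf`, `ToyUdafBuffer`, `ToyAdapter` and `FallbackDemo` are hypothetical names invented for illustration and are not Hive or Spark classes. The only idea borrowed from this PR is the wrapper that pairs a buffer with a `canDoMerge` flag.

```scala
// Toy model only: every type below is hypothetical, not a Hive or Spark class.
sealed trait ToyMode
case object RawInput extends ToyMode     // buffer will receive UPDATE calls
case object PartialInput extends ToyMode // buffer will receive MERGE calls

sealed trait ToyAggBuffer { def sum: Long }
final class UpdateOnlyBuffer(var sum: Long = 0L) extends ToyAggBuffer
final class MergeOnlyBuffer(var sum: Long = 0L) extends ToyAggBuffer

// A UDAF that picks its buffer type from the mode, so an UPDATE buffer cannot MERGE.
object ToySumUdaf {
  def newBuffer(mode: ToyMode): ToyAggBuffer = mode match {
    case RawInput     => new UpdateOnlyBuffer()
    case PartialInput => new MergeOnlyBuffer()
  }
  def update(buf: ToyAggBuffer, value: Long): Unit = buf match {
    case b: UpdateOnlyBuffer => b.sum += value
    case _ => throw new IllegalStateException("buffer was not created for UPDATE")
  }
  def merge(buf: ToyAggBuffer, partial: Long): Unit = buf match {
    case b: MergeOnlyBuffer => b.sum += partial
    case _ => throw new IllegalStateException("buffer was not created for MERGE")
  }
  def finish(buf: ToyAggBuffer): Long = buf.sum
}

// Adapter-side wrapper: the flag records whether `buf` may receive MERGE calls.
final case class ToyUdafBuffer(buf: ToyAggBuffer, canDoMerge: Boolean)

object ToyAdapter {
  def init(): ToyUdafBuffer =
    ToyUdafBuffer(ToySumUdaf.newBuffer(RawInput), canDoMerge = false)

  def update(b: ToyUdafBuffer, value: Long): ToyUdafBuffer = {
    ToySumUdaf.update(b.buf, value)
    b
  }

  def merge(b: ToyUdafBuffer, partial: Long): ToyUdafBuffer = {
    val target =
      if (b.canDoMerge) b
      else {
        // Close the INIT -> UPDATE phase with FINISH, then open INIT -> MERGE
        // on a fresh merge-capable buffer seeded with that partial result.
        val fresh = ToySumUdaf.newBuffer(PartialInput)
        ToySumUdaf.merge(fresh, ToySumUdaf.finish(b.buf))
        ToyUdafBuffer(fresh, canDoMerge = true)
      }
    ToySumUdaf.merge(target.buf, partial)
    target
  }

  def finish(b: ToyUdafBuffer): Long = ToySumUdaf.finish(b.buf)
}

object FallbackDemo {
  // INIT -> UPDATE(1) -> UPDATE(2) -> MERGE(10) -> FINISH
  def run(): Long = {
    val updated = ToyAdapter.update(ToyAdapter.update(ToyAdapter.init(), 1L), 2L)
    val merged = ToyAdapter.merge(updated, 10L)
    ToyAdapter.finish(merged)
  }
}
```

In this toy, calling `ToySumUdaf.merge` directly on a buffer created for `RawInput` throws, while going through `ToyAdapter.merge` swaps in a merge-capable buffer first and only pays that cost once per buffer.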

How was this patch tested?

a new test case

cloud-fan · 2019-04-25T13:44:59Z

cc @pgandhi999 @m44444

SparkQA · 2019-04-25T15:29:16Z

Test build #104904 has finished for PR 24459 at commit 3d523b6.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
case class HiveUDAFBuffer(buf: AggregationBuffer, canDoMerge: Boolean)

m44444 · 2019-04-25T22:43:54Z

sql/hive/src/main/scala/org/apache/spark/sql/hive/hiveUDFs.scala

    // Deserializes an `AggregationBuffer` from the shuffled partial aggregation phase to prepare
    // for global aggregation by merging multiple partial aggregation results within a single group.
-    aggBufferSerDe.deserialize(bytes)
+    HiveUDAFBuffer(aggBufferSerDe.deserialize(bytes), false)


Since canDoMerge is always set to false after deserialization, merge() will always re-create the aggregation buffer, even when the buffer passed in is actually in a Partial2 or Final state. Correct me if I am wrong, but this looks like a flaw that degrades performance.
Fixing it may require non-trivial work in serialize() to include the state.

cloud-fan · 2019-04-26

The deserialized buffer can only appear as the second parameter of merge, so canDoMerge doesn't matter here.

m44444 · 2019-04-26

I see: the only exception is the case of falling back from hash aggregation, which is what you want to address here, and it does not impact Spark UDAFs. The logic looks clear and good to me, thanks!
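
The point about the second parameter of merge can be shown with a minimal sketch. `WrappedSum` and `MergeFlagSketch` are hypothetical stand-ins invented for illustration, not Spark's `HiveUDAFBuffer` or its serializer. The assumption is that merge only reads the partial result from its second argument and only consults the flag on its first.

```scala
import java.nio.ByteBuffer

// Hypothetical stand-in for a wrapped buffer; not Spark's HiveUDAFBuffer.
final case class WrappedSum(sum: Long, canDoMerge: Boolean)

object MergeFlagSketch {
  // Only the partial result is shuffled; the flag is not part of the bytes.
  def serialize(b: WrappedSum): Array[Byte] =
    ByteBuffer.allocate(java.lang.Long.BYTES).putLong(b.sum).array()

  // A deserialized buffer always comes back with canDoMerge = false.
  def deserialize(bytes: Array[Byte]): WrappedSum =
    WrappedSum(ByteBuffer.wrap(bytes).getLong, canDoMerge = false)

  // Only `buffer.canDoMerge` is consulted. `input` is read-only here,
  // so its flag is never inspected.
  def merge(buffer: WrappedSum, input: WrappedSum): WrappedSum = {
    val target = if (buffer.canDoMerge) buffer else buffer.copy(canDoMerge = true)
    target.copy(sum = target.sum + input.sum)
  }
}
```

Under that assumption, merging a deserialized input gives the same result whatever its flag says, which is why hard-coding `false` at deserialization costs nothing on the normal shuffle path.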

pgandhi999 · 2019-04-29T14:58:46Z

Logic looks good to me, will perform a few quick tests of my own on the PR and get back to you soon @cloud-fan

pgandhi999 · 2019-04-29T19:05:27Z

@cloud-fan The fix LGTM. I had a question though: after this change, do we still need to fix the initialization of the aggregate buffer for SortBasedAggregate that #24149 originally addressed in the first commit 400db3d?

cloud-fan · 2019-04-30T02:30:43Z

@pgandhi999 Yes, we still need it, but your patch can be simplified once the Hive UDAF issue is fixed.

…H in Hive UDAF adapter

## What changes were proposed in this pull request?

This is a followup of #24144. #24144 missed one case: when hash aggregate falls back to sort aggregate, the life cycle of the UDAF is INIT -> UPDATE -> MERGE -> FINISH.

However, not all Hive UDAFs support this life cycle. A Hive UDAF knows the aggregation mode when it creates the aggregation buffer, so it can create different buffers for different inputs: the original data or the aggregation buffer. Please see an example in the [sketches library](https://github.com/DataSketches/sketches-hive/blob/7f9e76e9e03807277146291beb2c7bec40e8672b/src/main/java/com/yahoo/sketches/hive/cpc/DataToSketchUDAF.java#L107). The buffer created for UPDATE may not support MERGE.

This PR updates the Hive UDAF adapter in Spark to support INIT -> UPDATE -> MERGE -> FINISH by turning it into INIT -> UPDATE -> FINISH + INIT -> MERGE -> FINISH.

## How was this patch tested?

a new test case

Closes #24459 from cloud-fan/hive-udaf.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit 7432e7d)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>

cloud-fan · 2019-04-30T02:36:08Z

thanks, merging to master/2.4!



## What changes were proposed in this pull request?

backport #24144 and #24459 to 2.3.

## How was this patch tested?

existing tests

Closes #24539 from cloud-fan/backport.

Lead-authored-by: pgandhi <pgandhi@verizonmedia.com>
Co-authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>


support INIT -> UPDATE -> MERGE -> FINISH in Hive UDAF adapter

3d523b6

cloud-fan mentioned this pull request Apr 25, 2019

[SPARK-27207][SQL] : Ensure aggregate buffers are initialized again for So… #24149

Closed

m44444 reviewed Apr 25, 2019


cloud-fan closed this in 7432e7d Apr 30, 2019

pgandhi999 pushed a commit to pgandhi999/spark that referenced this pull request Apr 30, 2019

[SPARK-27207] : Reverting the two buffer logic and simplifying the code

8f5c6b0

Since apache#24459 fixes the init-update-merge issue, the fix here is reverted.

cloud-fan mentioned this pull request May 6, 2019

[SPARK-24935][SQL][2.3] fix Hive UDAF with two aggregation buffers #24539

Closed
