[SPARK-42034] QueryExecutionListener and Observation API do not work with `foreach` / `reduce` / `foreachPartition` action. #39976

zzzzming95 · 2023-02-11T16:59:49Z

What changes were proposed in this pull request?

Pass the `name` parameter to `Dataset#withNewRDDExecutionId` for the `foreach`, `reduce`, and `foreachPartition` operators, because the QueryExecutionListener and Observation APIs are triggered only when the operator has a name:

spark/sql/core/src/main/scala/org/apache/spark/sql/util/QueryExecutionListener.scala

Line 181 in 84ddd40

e.executionName.isDefined && e.qe.sparkSession.sessionUUID == sessionUUID

Why are the changes needed?

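The QueryExecutionListener and Observation APIs are triggered only when the operator has the name parameter, so they are never triggered for `foreach`, `reduce`, or `foreachPartition`, which do not currently pass one.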
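
To make the user-visible symptom concrete, here is a minimal sketch of the Observation API combined with a `foreach` action. It assumes Spark 3.3 or later on the classpath and a local master; the object name, metric name, and `noop` helper are illustrative and not part of this PR. On a build without this change, the call to `observation.get` may block because the metrics are never delivered.

```scala
import org.apache.spark.sql.{Observation, SparkSession}
import org.apache.spark.sql.functions.{count, lit}

// Hypothetical driver object, for illustration only.
object ObservationForeachSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("observation-foreach-sketch")
      .getOrCreate()

    try {
      // Attach a named observation that counts the rows flowing through the plan.
      val observation = Observation("rows")
      val observed = spark.range(100).observe(observation, count(lit(1)).as("row_count"))

      // Explicitly typed function value, which avoids overload ambiguity
      // with the Java-friendly ForeachFunction variant of foreach.
      val noop: java.lang.Long => Unit = _ => ()

      // The action under discussion: a Dataset-level foreach.
      observed.foreach(noop)

      // Blocks until the metrics of the finished action are available.
      val metrics = observation.get
      println(s"observed metrics: $metrics")
    } finally {
      spark.stop()
    }
  }
}
```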

Does this PR introduce any user-facing change?

No

How was this patch tested?

Added two unit tests.

HyukjinKwon · 2023-02-13T00:50:38Z

sql/core/src/test/scala/org/apache/spark/sql/DatasetSuite.scala

@@ -960,6 +960,19 @@ class DatasetSuite extends QueryTest
    observe(spark.range(1, 10, 1, 11), Map("percentile_approx_val" -> 5))
  }

+  test("observation on datasets when a DataSet trigger foreach action") {


Suggested change

test("observation on datasets when a DataSet trigger foreach action") {

test("SPARK-42034: observation on datasets when a DataSet trigger foreach action") {

HyukjinKwon · 2023-02-13T00:51:01Z

sql/core/src/test/scala/org/apache/spark/sql/util/DataFrameCallbackSuite.scala

@@ -96,6 +96,34 @@ class DataFrameCallbackSuite extends QueryTest
    spark.listenerManager.unregister(listener)
  }

+  test("execute callback functions when a DataSet trigger foreach action finished") {


Suggested change

test("execute callback functions when a DataSet trigger foreach action finished") {

test("SPARK-42034: execute callback functions when a DataSet trigger foreach action finished") {

HyukjinKwon · 2023-02-13T00:51:27Z

sql/core/src/test/scala/org/apache/spark/sql/util/DataFrameCallbackSuite.scala

+    assert(metrics(0)._1 == "foreach")
+    assert(metrics(1)._1 == "reduce")
+
+    spark.listenerManager.unregister(listener)


I would move this into a `finally` block so that a failure in this test doesn't affect other tests.

I know the other tests don't do this, but let's at least do it here.
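
The suggestion above can be sketched roughly as follows. This is a standalone illustration rather than the test code from the PR: the object name, the latch, and the `names` list are assumptions, and the `onFailure` signature shown is the one used in Spark 3.x. The point is only that `unregister` sits in `finally`, so the listener is removed even if the body throws.

```scala
import java.util.concurrent.{CopyOnWriteArrayList, CountDownLatch, TimeUnit}

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.execution.QueryExecution
import org.apache.spark.sql.util.QueryExecutionListener

// Hypothetical driver object, for illustration only.
object ListenerCleanupSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("listener-cleanup-sketch")
      .getOrCreate()

    // Listener callbacks arrive asynchronously, so use thread-safe state
    // and a latch to wait for the first callback.
    val names = new CopyOnWriteArrayList[String]()
    val done = new CountDownLatch(1)

    val listener = new QueryExecutionListener {
      override def onSuccess(funcName: String, qe: QueryExecution, durationNs: Long): Unit = {
        names.add(funcName)
        done.countDown()
      }
      override def onFailure(funcName: String, qe: QueryExecution, exception: Exception): Unit =
        done.countDown()
    }

    spark.listenerManager.register(listener)
    try {
      // Explicitly typed to avoid ambiguity with the ForeachFunction overload.
      val noop: java.lang.Long => Unit = _ => ()
      spark.range(10).foreach(noop)

      // Give the listener bus a bounded amount of time to deliver the event.
      done.await(10, TimeUnit.SECONDS)
      println(s"callbacks seen: $names")
    } finally {
      // Always runs, so a failure above cannot leak the listener.
      spark.listenerManager.unregister(listener)
      spark.stop()
    }
  }
}
```

With this PR applied, the assertions in the added test expect the callback name for this action to be `"foreach"`.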

HyukjinKwon

LGTM from my end. cc @hvanhovell and @beliefer FYI

beliefer · 2023-02-13T04:59:29Z

@HyukjinKwon Thank you for pinging me.

beliefer

LGTM too.

HyukjinKwon · 2023-02-13T05:09:33Z

Merged to master.

zzzzming95 · 2023-02-13T13:05:16Z

Thanks @HyukjinKwon @beliefer

vidma · 2023-10-09T12:36:03Z

This wouldn't fix `.observe()` not working with `df.write.jdbc()`, because that path uses `repartitionedDF.rdd.foreachPartition` (i.e. with an extra `.rdd`)!
I think we need to remove the `.rdd` usage from JdbcUtils.scala and make any other changes that turn out to be necessary.

@zzzzming95 @HyukjinKwon @beliefer

https://github.com/zzzzming95/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JdbcUtils.scala#L901-L903
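
The distinction raised in this comment can be sketched as below. This is a simplified illustration, not the JdbcUtils code: the object name, the `consume` helper, and the sample DataFrame are assumptions. According to the PR description, the Dataset-level action goes through `withNewRDDExecutionId`, while the comment above points out that the `.rdd` variant does not benefit from this fix.

```scala
import org.apache.spark.sql.{Row, SparkSession}

// Hypothetical driver object, for illustration only.
object ForeachPartitionSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("foreach-partition-sketch")
      .getOrCreate()

    try {
      val df = spark.range(10).toDF("id")

      // Explicitly typed to avoid ambiguity with the Java-friendly
      // ForeachPartitionFunction overload on Dataset.
      val consume: Iterator[Row] => Unit = rows => rows.foreach(_ => ())

      // Dataset-level action: the path this PR gives a name to, so listeners
      // and observations attached to df are expected to be triggered.
      df.foreachPartition(consume)

      // RDD-level action: df.rdd drops down to an RDD[Row], and the action
      // is then a plain RDD action that this PR does not touch.
      df.rdd.foreachPartition(consume)
    } finally {
      spark.stop()
    }
  }
}
```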

HyukjinKwon · 2023-10-10T03:23:02Z

Thanks. Made a followup: #43304

…eachPartition in JdbcUtils

### What changes were proposed in this pull request?

This PR is kind of a followup for #39976 that addresses #39976 (comment).

### Why are the changes needed?

In order to properly assign the SQL execution ID so `df.observe` works with this.

### Does this PR introduce _any_ user-facing change?

Yes. `df.observe` will work with JDBC connectors.

### How was this patch tested?

Manually tested.

### Was this patch authored or co-authored using generative AI tooling?

Unit test was added.

Closes #43304 from HyukjinKwon/foreachbatch.

Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>

…eachPartition in JdbcUtils

This PR is kind of a followup for #39976 that addresses #39976 (comment).

In order to properly assign the SQL execution ID so `df.observe` works with this.

Yes. `df.observe` will work with JDBC connectors.

Manually tested.

Unit test was added.

Closes #43304 from HyukjinKwon/foreachbatch.

Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(cherry picked from commit 39cc4ab)
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>

…D.foreachPartition in JdbcUtils

This PR cherry-picks #43304 to branch-3.5

---

### What changes were proposed in this pull request?

This PR is kind of a followup for #39976 that addresses #39976 (comment).

### Why are the changes needed?

In order to properly assign the SQL execution ID so `df.observe` works with this.

### Does this PR introduce _any_ user-facing change?

Yes. `df.observe` will work with JDBC connectors.

### How was this patch tested?

Manually tested.

### Was this patch authored or co-authored using generative AI tooling?

Unit test was added.

Closes #43322 from HyukjinKwon/SPARK-45475-3.5.

Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>

SPARK-42034

fcd9f9a

github-actions bot added the SQL label Feb 11, 2023

HyukjinKwon reviewed Feb 13, 2023

HyukjinKwon approved these changes Feb 13, 2023

beliefer approved these changes Feb 13, 2023

HyukjinKwon closed this in a1649ad Feb 13, 2023

HyukjinKwon mentioned this pull request Oct 10, 2023

[SPARK-45475][SQL] Uses DataFrame.foreachPartition instead of RDD.foreachPartition in JdbcUtils #43304

Closed

HyukjinKwon mentioned this pull request Oct 11, 2023

[SPARK-45475][SQL][3.5] Uses DataFrame.foreachPartition instead of RDD.foreachPartition in JdbcUtils #43322

Closed
