[SPARK-32148][SS] Fix stream-stream join issue on missing to copy reused unsafe row #28975
Conversation
It's reported as left-outer join specific, but given the possible scope of impact I used the wording "stream-stream join" instead of "left/right outer stream-stream join".
Test build #124869 has finished for PR 28975 at commit
retest this, please
Test build #124875 has finished for PR 28975 at commit
// NOTE: we need to make sure `outerOutputIter` is evaluated "after" exhausting all of
// elements in `innerOutputIter`, because evaluation of `innerOutputIter` may update
// the match flag which the logic for outer join is relying on.
Just to clarify: this comment is not related to the bug and just to document an existing assumption?
Yes, right.
TBH I suspected this first and crafted a patch that added a new iterator to explicitly run the logic after evaluating innerOutputIter, but later realized the current logic already deals with this properly, because removeOldState() doesn't materialize the candidates and evaluates them lazily. This patch contains the minimal change.
It's worth mentioning how it works for anyone who may need to touch this code.
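To make the lazy-evaluation point concrete, here is a minimal standalone sketch (toy names and data, not the actual join code): the second iterator's predicate only runs while that iterator is drained, which in a concatenation happens after the first iterator is exhausted.

```scala
object LazyOuterIterDemo extends App {
  var matched = Set.empty[Int] // stands in for the per-row match flag

  // Draining the "inner join" output sets match flags as a side effect.
  val innerOutputIter: Iterator[Int] = Iterator(1, 2, 3).map { key =>
    matched += key
    key
  }

  // Lazily filtered: the predicate reads `matched` only when elements are
  // pulled, i.e. after innerOutputIter is exhausted in the concatenation.
  val outerOutputIter: Iterator[Int] =
    Iterator(2, 3, 4).filterNot(key => matched.contains(key))

  // Prints List(1, 2, 3, 4): keys 2 and 3 were matched while draining the
  // inner output, so only the unmatched key 4 survives as an "outer" row.
  println((innerOutputIter ++ outerOutputIter).toList)
}
```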
...ain/scala/org/apache/spark/sql/execution/streaming/state/SymmetricHashJoinStateManager.scala
Test build #124932 has finished for PR 28975 at commit
retest this, please
Test build #125017 has finished for PR 28975 at commit
@@ -259,6 +269,9 @@ class SymmetricHashJoinStateManager(
        return null
      }

      // Make a copy on value row, as below cleanup logic may update the value row silently.
      currentValue = currentValue.copy(value = currentValue.value.copy())
so this is the only place to do copy?
Yes. That wasn't necessary for format V1, as the original row was stored into the state store, and the state store (strictly speaking, the implementation of the HDFS state store provider) makes sure these rows are copied.
For other places, the row can propagate to callers outside of the state manager, and it looks like those callers don't need to copy the row. (It's super tricky for me to determine whether the copy is necessary or not when the code is not in a simple loop or stream.)
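For context, a minimal sketch of the reuse behavior under discussion (illustrative only, using Spark's internal catalyst API, not the patched code): an UnsafeProjection returns the same mutable UnsafeRow instance on every call, so a reference held across calls silently observes later values unless copy() is taken.

```scala
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.catalyst.expressions.{UnsafeProjection, UnsafeRow}
import org.apache.spark.sql.types.{IntegerType, StructField, StructType}

val schema = StructType(Seq(StructField("v", IntegerType)))
val proj = UnsafeProjection.create(schema)

val first: UnsafeRow = proj(InternalRow(1)) // holds the reused result row
proj(InternalRow(2))                        // same instance, now overwritten
println(first.getInt(0))                    // prints 2, not 1

val safe = proj(InternalRow(3)).copy()      // defensive copy detaches the value
proj(InternalRow(4))
println(safe.getInt(0))                     // prints 3, as expected
```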
After seeing the new changes, I think the first version looks better. The caller side is nested and we still have unnecessary copies for v1 format. What do you think? @viirya
Yep, I also prefer the first approach personally. As the issue was in the v2 format, the first version is the straightforward way.
@cloud-fan typo? ... unnecessary copies for v1 format
OK, I'll roll back the change. I'll also leave a commit sha so we can go back and forth depending on the decision.
Just reverted the latest commit to preserve the history, so we can pick the commit selectively according to the decision.
retest this, please
Test build #125051 has finished for PR 28975 at commit
retest this, please
Test build #125121 has finished for PR 28975 at commit
retest this, please
#29017 was created based on @HeartSaVioR's report in #28975 (comment).
Test build #125126 has finished for PR 28975 at commit
retest this, please
Test build #125151 has finished for PR 28975 at commit
retest this please
…ts to all callers" This reverts commit be34258.
Test build #125257 has finished for PR 28975 at commit
Test build #125258 has finished for PR 28975 at commit
retest this please
Test build #125271 has finished for PR 28975 at commit
retest this, please
retest this please
 * Convert the value row to (actual value, match) pair.
 *
 * NOTE: implementations should ensure the result row is NOT reused during execution, as
 * caller may use the value to store without copy().
We need to update the comment. It's not because of storing; it's because the caller side updates the row.
I'm not sure I get it. The problem occurs when the caller reads the value late and there is "another" interaction with the method in the middle. I agree the sentence in the source code comment is not clear as well, though.
Would it be better to rephrase it as "... during execution, so that the caller can safely read the value at any time"?
SGTM. I was referring to #28975 (comment)
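As a side note, the hazard the reworded comment guards against can be shown with a tiny toy (hypothetical converter, not Spark code): a caller that reads a returned value "late", after another call to the same method in between, sees the wrong data when the method hands out a reused buffer.

```scala
// Hypothetical converter that reuses one buffer, like a reused UnsafeRow.
class ReusingConverter {
  private val buffer = new Array[Int](1)
  def convertValue(v: Int): Array[Int] = { buffer(0) = v; buffer }
}

val converter = new ReusingConverter
val first = converter.convertValue(1)
converter.convertValue(2) // "another interaction in the middle"
println(first(0))         // prints 2: the late read observes the newer value
// Copying inside convertValue (the approach this patch takes for the V2
// converter) makes reads safe at any time.
```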
Test build #125332 has finished for PR 28975 at commit
retest this please
Test build #125317 has finished for PR 28975 at commit
The change itself looks good; some minor issues found.
test("SPARK-32148 stream-stream join regression on Spark 3.0.0") { | ||
val input1 = MemoryStream[(Timestamp, String, String)] | ||
val df1 = input1.toDF | ||
.selectExpr("_1 as eventTime", "_2 as id", "_3 as comment") |
Any specific reason why not use select? I don't see any expression here.
I guess it's much simpler and more readable than select('_1.as("eventTime"), '_2.as("id"), '_3.as("comment")) (or even with col(...) if the ' notation doesn't work for _1, _2, _3).
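For reference, the two spellings are equivalent; a sketch below, assuming the test's input1 MemoryStream and a started SparkSession with its implicits in scope:

```scala
import org.apache.spark.sql.functions.col

// SQL-expression flavor, as used in the test:
val viaSelectExpr = input1.toDF
  .selectExpr("_1 as eventTime", "_2 as id", "_3 as comment")

// Column-API flavor; col("_1") sidesteps any issue with the '_1 symbol notation:
val viaSelect = input1.toDF
  .select(col("_1").as("eventTime"), col("_2").as("id"), col("_3").as("comment"))
```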
val input2 = MemoryStream[(Timestamp, String, String)]
val df2 = input2.toDF
  .selectExpr("_1 as eventTime", "_2 as id", "_3 as name")
Same here.
Same here as well.
val joined = df1.as("left")
  .join(df2.as("right"),
    expr(s"""
      |left.id = right.id AND left.eventTime BETWEEN
Nit: indent
The indentation of """ varies across the codebase, and I can find the same indentation elsewhere in it.
val joined = df1.as("left")
  .join(df2.as("right"),
    expr(s"""
Why is string interpolation needed here?
Ah that's not necessary. Will remove.
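After dropping the interpolator, the condition becomes a plain multi-line string. A sketch below; the interval bounds and join type are hypothetical, since the rest of the condition is elided in the excerpt above:

```scala
import org.apache.spark.sql.functions.expr

val joined = df1.as("left")
  .join(df2.as("right"),
    expr("""
      |left.id = right.id AND left.eventTime BETWEEN
      |  right.eventTime - INTERVAL 30 seconds AND
      |  right.eventTime + INTERVAL 30 seconds
      """.stripMargin),
    "leftOuter") // hypothetical join type
```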
I'm also suffering from flaky executions; I hope this round will pass.
Test build #125352 has finished for PR 28975 at commit
retest this please
Test build #125389 has started for PR 28975 at commit
### What changes were proposed in this pull request?
This PR aims to disable SBT `unidoc` generation testing in the Jenkins environment because it's flaky there and not used for the official documentation generation. Also, GitHub Action has the correct test coverage for the official documentation generation.
- #28848 (comment) (amp-jenkins-worker-06)
- #28926 (comment) (amp-jenkins-worker-06)
- #28969 (comment) (amp-jenkins-worker-06)
- #28975 (comment) (amp-jenkins-worker-05)
- #28986 (comment) (amp-jenkins-worker-05)
- #28992 (comment) (amp-jenkins-worker-06)
- #28993 (comment) (amp-jenkins-worker-05)
- #28999 (comment) (amp-jenkins-worker-04)
- #29010 (comment) (amp-jenkins-worker-03)
- #29013 (comment) (amp-jenkins-worker-04)
- #29016 (comment) (amp-jenkins-worker-05)
- #29025 (comment) (amp-jenkins-worker-04)
- #29042 (comment) (amp-jenkins-worker-03)

### Why are the changes needed?
Apache Spark `release-build.sh` generates the official document by using the following command.
- https://github.com/apache/spark/blob/master/dev/create-release/release-build.sh#L341

```bash
PRODUCTION=1 RELEASE_VERSION="$SPARK_VERSION" jekyll build
```

And, this is executed by the following `unidoc` command for Scala/Java API doc.
- https://github.com/apache/spark/blob/master/docs/_plugins/copy_api_dirs.rb#L30

```ruby
system("build/sbt -Pkinesis-asl clean compile unidoc") || raise("Unidoc generation failed")
```

However, the PR builder disabled `Jekyll build` and instead has a different test coverage.

```python
# determine if docs were changed and if we're inside the amplab environment
# note - the below commented out until *all* Jenkins workers can get `jekyll` installed
# if "DOCS" in changed_modules and test_env == "amplab_jenkins":
#     build_spark_documentation()
```

```
Building Unidoc API Documentation
========================================================================
[info] Building Spark unidoc using SBT with these arguments: -Phadoop-3.2 -Phive-2.3 -Pspark-ganglia-lgpl -Pkubernetes -Pmesos -Phadoop-cloud -Phive -Phive-thriftserver -Pkinesis-asl -Pyarn unidoc
```

### Does this PR introduce _any_ user-facing change?
No. (This is used only for testing and not used in the official doc generation.)

### How was this patch tested?
Pass the Jenkins without doc generation invocation.

Closes #29017 from dongjoon-hyun/SPARK-DOC-GEN.

Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
Test build #125403 has finished for PR 28975 at commit
LGTM
LGTM
thanks, merging to master/3.0!
…sed unsafe row

### What changes were proposed in this pull request?
This patch fixes the odd join results occurring in stream-stream join for state store format V2.

There are some spots on the V2 path which leverage UnsafeProjection. As the result row is reused, the row should be copied to avoid its value changing while being read (or it should be verified that the caller is not affected by such behavior), but `SymmetricHashJoinStateManager.removeByValueCondition` violates this.

This patch makes `KeyWithIndexToValueRowConverterV2.convertValue` copy the row by itself so that callers don't need to take care of it. This patch doesn't change the behavior of `KeyWithIndexToValueRowConverterV2.convertToValueRow`, to avoid double-copying, as the caller is expected to store the row, on which the state store implementation will call `copy()`.

This patch adds such behavior into each method doc in `KeyWithIndexToValueRowConverter`, so that further contributors can read through and make sure a change or new addition doesn't break the contract.

### Why are the changes needed?
Stream-stream join with state store format V2 (newly added in Spark 3.0.0) has a serious correctness bug which produces nondeterministic results.

### Does this PR introduce _any_ user-facing change?
Yes, some Spark 3.0.0 users using stream-stream join from a new checkpoint (as the bug exists only in the v2 format path) may encounter wrong join results. This patch will fix it.

### How was this patch tested?
The reported case was converted to a new UT, which is confirmed to pass. All UTs in StreamingInnerJoinSuite and StreamingOuterJoinSuite passed as well.

Closes #28975 from HeartSaVioR/SPARK-32148.

Authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit 526cb2d)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
Thanks all for reviewing and merging!
What changes were proposed in this pull request?
This patch fixes the odd join results occurring in stream-stream join for state store format V2.

There are some spots on the V2 path which leverage UnsafeProjection. As the result row is reused, the row should be copied to avoid its value changing while being read (or it should be verified that the caller is not affected by such behavior), but SymmetricHashJoinStateManager.removeByValueCondition violates this.

This patch makes KeyWithIndexToValueRowConverterV2.convertValue copy the row by itself so that callers don't need to take care of it. This patch doesn't change the behavior of KeyWithIndexToValueRowConverterV2.convertToValueRow, to avoid double-copying, as the caller is expected to store the row, on which the state store implementation will call copy().

This patch adds such behavior into each method doc in KeyWithIndexToValueRowConverter, so that further contributors can read through and make sure a change or new addition doesn't break the contract.

Why are the changes needed?
Stream-stream join with state store format V2 (newly added in Spark 3.0.0) has a serious correctness bug which produces nondeterministic results.

Does this PR introduce any user-facing change?
Yes, some Spark 3.0.0 users using stream-stream join from a new checkpoint (as the bug exists only in the v2 format path) may encounter wrong join results. This patch will fix it.

How was this patch tested?
The reported case was converted to a new UT, which is confirmed to pass. All UTs in StreamingInnerJoinSuite and StreamingOuterJoinSuite passed as well.