[SPARK-23408][SS][BRANCH-2.3] Synchronize successive AddData actions in Streaming*JoinSuite #23757

HeartSaVioR · 2019-02-10T00:21:49Z

What changes were proposed in this pull request?

The best way to review this PR is to ignore whitespace/indent changes. Use this link - https://github.com/apache/spark/pull/20650/files?w=1

The stream-stream join tests add data to multiple sources and expect it all to show up in the next batch. But there's a race condition; the new batch might trigger when only one of the AddData actions has been reached.

A prior attempt by jose-torres in #20646 tried to synchronize on all memory sources together whenever consecutive AddData actions were found. However, this carries the risk of deadlock as well as unintended modification of stress tests (see the above PR for a detailed explanation). Instead, this PR attempts the following.

- A new action called StreamProgressBlockedActions that allows multiple actions to be executed while the streaming query is blocked from making progress. This allows data to be added to multiple sources that are made visible simultaneously in the next batch.
- An alias of StreamProgressBlockedActions called MultiAddData is explicitly used in the Streaming*JoinSuites to add data to two memory sources simultaneously.
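
As a rough illustration of the idea only (this is a toy Python model, not the StreamTest code touched by this PR), the sketch below blocks a micro-batch loop from making progress while data is added to two sources. All names in it (MemorySource, ToyQuery, progress_lock, multi_add_data) are hypothetical and chosen for the example; the real implementation lives in Spark's Scala test harness and may differ in every detail.

```python
import threading
import time


class MemorySource:
    """Hypothetical stand-in for an in-memory streaming source."""

    def __init__(self):
        self.rows = []
        self.consumed = 0  # number of rows already handed to a batch

    def add(self, *values):
        self.rows.extend(values)

    def take_new(self):
        new = self.rows[self.consumed:]
        self.consumed += len(new)
        return new


class ToyQuery:
    """Toy micro-batch loop; every batch reads the new rows of all sources."""

    def __init__(self, *sources):
        self.sources = sources
        self.progress_lock = threading.Lock()  # held while a batch is built
        self.batches = []
        self._stopped = threading.Event()
        self._thread = threading.Thread(target=self._loop, daemon=True)

    def _run_batch(self):
        with self.progress_lock:
            batch = [s.take_new() for s in self.sources]
        if any(batch):  # skip batches in which no source had new rows
            self.batches.append(batch)

    def _loop(self):
        while not self._stopped.is_set():
            self._run_batch()
            time.sleep(0.001)

    def start(self):
        self._thread.start()

    def stop(self):
        self._stopped.set()
        self._thread.join()
        self._run_batch()  # drain rows that arrived after the last batch


def multi_add_data(query, *pairs):
    """Add data to several sources while the query cannot start a batch."""
    with query.progress_lock:
        for source, values in pairs:
            source.add(*values)


def demo(rounds=50):
    left, right = MemorySource(), MemorySource()
    query = ToyQuery(left, right)
    query.start()
    for i in range(rounds):
        multi_add_data(query, (left, [i]), (right, [i]))
    query.stop()
    return query.batches


if __name__ == "__main__":
    # Each batch holds matching rows from both sources, never just one side.
    for left_rows, right_rows in demo(5):
        print(left_rows, right_rows)
```

Because both appends happen while the lock is held, a batch in this model sees either both new rows or neither. Calling `left.add(i)` and `right.add(i)` without the lock would let a batch start in between, which is the kind of race described above.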

This should avoid unintentional modification of the stress tests (or any other test for that matter) while making sure that the flaky tests are deterministic.

NOTE: This patch is modified slightly from the original PR (#20650) to cover a DSv2 incompatibility between Spark 2.3 and 2.4: StreamingDataSourceV2Relation is a class in 2.3, whereas it is a case class in 2.4.

How was this patch tested?

Modified test cases in Streaming*JoinSuites where there are consecutive AddData actions.

HeartSaVioR · 2019-02-10T00:24:23Z

retest this, please

HeartSaVioR · 2019-02-10T00:30:16Z

retest this, please

SparkQA · 2019-02-10T02:55:41Z

Test build #102133 has finished for PR 23757 at commit be41fa7.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2019-02-10T02:58:10Z

Test build #102131 has finished for PR 23757 at commit be41fa7.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2019-02-10T03:04:15Z

Test build #102132 has finished for PR 23757 at commit be41fa7.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

HeartSaVioR · 2019-02-10T09:29:09Z

NOTE: e08bf2e (my own commit) should be squashed into be41fa7 - I'll sort that out later. I would like to confirm the flakiness is resolved first.

HeartSaVioR · 2019-02-10T09:29:15Z

Retest this, please

HeartSaVioR · 2019-02-10T09:29:20Z

Retest this, please

HeartSaVioR · 2019-02-10T10:02:34Z

Retest this, please

SparkQA · 2019-02-10T12:08:04Z

Test build #102144 has finished for PR 23757 at commit e08bf2e.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2019-02-10T12:45:14Z

Test build #102143 has finished for PR 23757 at commit e08bf2e.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2019-02-10T13:20:11Z

Test build #102145 has finished for PR 23757 at commit e08bf2e.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

HeartSaVioR · 2019-02-10T14:38:06Z

I hope d5afd92 is the last commit needed to resolve the flaky tests. Only one build failed for e08bf2e, and that test failure is related to d5afd92.

HeartSaVioR · 2019-02-10T14:38:18Z

Retest this, please

HeartSaVioR · 2019-02-10T14:45:36Z

Retest this, please

SparkQA · 2019-02-10T17:46:38Z

Test build #102151 has finished for PR 23757 at commit d5afd92.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2019-02-10T17:48:35Z

Test build #102152 has finished for PR 23757 at commit d5afd92.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2019-02-10T18:00:35Z

Test build #102153 has finished for PR 23757 at commit d5afd92.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

HeartSaVioR · 2019-02-10T20:36:15Z

retest this, please

HeartSaVioR · 2019-02-10T20:38:34Z

retest this, please

HeartSaVioR · 2019-02-10T20:44:36Z

retest this, please

HeartSaVioR · 2019-02-10T20:46:53Z

cc. @maropu @srowen
This PR is a set of commits being ported back, so IMO we can just pull the branch to apply them (changing the fix version of each issue per commit accordingly) and close this manually. Please let me know if you would like them squashed into one - I'll update the PR's title and description.

HeartSaVioR · 2019-02-10T21:02:17Z

To be clearer, this PR intends to address https://issues.apache.org/jira/browse/SPARK-24211, but addressing it triggers https://issues.apache.org/jira/browse/SPARK-24239 (we may want to change its affected version?), so the fix for SPARK-24239 is also included in the PR. My understanding is that we could resolve both by setting the fix version to 2.3.4.

SparkQA · 2019-02-10T23:11:50Z

Test build #102161 has finished for PR 23757 at commit d9fcb66.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2019-02-10T23:46:49Z

Test build #102159 has finished for PR 23757 at commit d9fcb66.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

srowen · 2019-02-10T23:53:59Z

BTW is this a 'port' of a similar fix from later branches? it would be good to cross-reference those.
You can add the JIRAs that this addresses to the title, unless we're not sure this fully fixes it.

HeartSaVioR · 2019-02-11T08:13:54Z

retest this, please

HeartSaVioR · 2019-02-11T08:14:08Z

retest this, please

SparkQA · 2019-02-11T10:45:13Z

Test build #102186 has finished for PR 23757 at commit 6f26689.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

HeartSaVioR · 2019-02-11T11:31:52Z

The test failure came from org.apache.spark.sql.FileBasedDataSourceSuite.(It is not a test it is a sbt.testing.SuiteSelector)

Retest this, please

HeartSaVioR · 2019-02-11T12:56:38Z

Retest this, please

SparkQA · 2019-02-11T15:23:40Z

Test build #102199 has finished for PR 23757 at commit 6f26689.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

srowen · 2019-02-11T16:22:46Z

Just to make sure I understand the status:

This is a straight back-port of https://github.com/apache/spark/pull/20650/files?w=1 , and indeed it looks like the same change we see here with https://github.com/apache/spark/pull/23757/files?w=1 , OK. It can be merged first without other changes, pending tests?

Maybe someone can help me understand this. The diff at
https://github.com/apache/spark/pull/23757/files?w=1
looks pretty different from
https://github.com/apache/spark/pull/23757/files

I know the difference is supposed to be just whitespace in the diff, but the latter seems to show a much more significant reordering of code. I can't figure out what to make of it. Am I missing something obvious?

HeartSaVioR · 2019-02-11T19:59:08Z

> This is a straight back-port of https://github.com/apache/spark/pull/20650/files?w=1 , and indeed it looks like the same change we see here with https://github.com/apache/spark/pull/23757/files?w=1 , OK. It can be merged first without other changes, pending tests?

Yes (and you can also resolve SPARK-24211). After that, SPARK-23491 and SPARK-23416 can be cherry-picked from their original PRs, since they are clean cherry-picks and we already ran tests with them in this PR before revising it (then you can also resolve SPARK-24239).

> Maybe someone can help me understand this. The diff at
> https://github.com/apache/spark/pull/23757/files?w=1
> looks pretty different from
> https://github.com/apache/spark/pull/23757/files

It is just the same as it was in #20650. You can check the difference between https://github.com/apache/spark/pull/20650/files?w=1 and https://github.com/apache/spark/pull/20650/files

The code change adds indentation to a huge code block: the w=1 option ignores indentation differences, whereas without the option they all show up in the diff. You can see the difference by searching for logInfo(s"Processing test stream action: $action") in the PR with and without the option.
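
To see in miniature why the two views differ so much, here is a small Python sketch using the standard difflib module. It is only an approximation under stated assumptions: the old and new snippets are made up for the example (including the withLock wrapper), and stripping leading/trailing whitespace before diffing is a stand-in for a whitespace-ignoring diff, not a description of how GitHub computes the w=1 view.

```python
import difflib

# Made-up "before" snippet.
old = [
    "actions.foreach { action =>",
    "  logInfo(action)",
    "  run(action)",
    "}",
]

# Made-up "after" snippet: the body is wrapped in a block, so it is re-indented.
new = [
    "actions.foreach { action =>",
    "  withLock {",
    "    logInfo(action)",
    "    run(action)",
    "  }",
    "}",
]


def changed_lines(a, b, ignore_indent=False):
    """Return the added/removed lines of a unified diff between a and b."""
    if ignore_indent:
        # Compare lines with surrounding whitespace removed.
        a = [line.strip() for line in a]
        b = [line.strip() for line in b]
    diff = difflib.unified_diff(a, b, lineterm="", n=0)
    return [
        d for d in diff
        if d.startswith(("+", "-")) and not d.startswith(("+++", "---"))
    ]


if __name__ == "__main__":
    print(len(changed_lines(old, new)))                      # indentation counted
    print(len(changed_lines(old, new, ignore_indent=True)))  # indentation ignored
```

With indentation counted, every re-indented line shows up as a removal plus an addition; with it ignored, only the wrapper's opening and closing lines remain as changes.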

BTW, after fixing this I'm seeing a very high chance of failures from FileBasedDataSourceSuite. As I commented earlier, fixing some of the related issues might need an ORC version upgrade; while we seem to have upgraded the ORC dependency in a bugfix release (2.4.0 -> 2.4.1), I'm not sure we could do the same in the 2.3 version line.
cc. @dongjoon-hyun Could you help explain how we dealt with FileBasedDataSourceSuite and the relevant test failures? I may be missing something.

HeartSaVioR · 2019-02-11T20:10:25Z

IMHO once we know we're not adding more code to branch-2.3, the builds that already passed could stand in for the pending test. If we need to run tests just before merging, I'd propose triggering 5 builds concurrently due to the high chance of failure of FileBasedDataSourceSuite.

srowen · 2019-02-11T20:56:37Z

OK, after staring at it longer, I see that the difference between the diffs is actually whitespace. It looks much bigger because it's mismatching lines that occur several times identically in different parts of the code, so it renders here as a big delete and add. OK, disregard that, just wanted to make sure I'm not crazy.

This is looking good pending tests. Once this passes and is merged, go ahead with any other changes as you see fit.

HeartSaVioR · 2019-02-11T21:16:56Z

I'm not a committer, so I need a committer's help with cherry-picking the remaining two commits (SPARK-23491 and SPARK-23416). I'm not sure we want PRs for clean cherry-picks; if you would like to have them explicitly, please let me know - I'll submit PRs.

HeartSaVioR · 2019-02-11T21:17:25Z

retest this, please

HeartSaVioR · 2019-02-11T21:20:07Z

retest this, please

HeartSaVioR · 2019-02-11T21:21:35Z

Let me run three builds concurrently: as the build history in this PR shows, there's another flaky test with a high chance of failure as well.

SparkQA · 2019-02-12T00:16:58Z

Test build #4551 has finished for PR 23757 at commit 6f26689.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2019-02-12T00:21:54Z

Test build #102211 has finished for PR 23757 at commit 6f26689.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2019-02-12T00:42:17Z

Test build #102212 has finished for PR 23757 at commit 6f26689.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

maropu · 2019-02-12T01:26:25Z

Is the single line below different from the original one?
https://github.com/apache/spark/pull/23757/files?w=1#diff-77e3d95970339f518398191d11e3bb8dR652

cc: @dongjoon-hyun could you help?

HeartSaVioR · 2019-02-12T01:37:12Z

Yes, right. Possibly one other line as well, but that would be just an import.

maropu · 2019-02-12T01:43:00Z

ok, thanks! Looks ok to me.
(For reviews, it would probably be better to separate a modified commit from the original one next time)

srowen

Looks good to me. Just so I'm 100% clear, this is intended to be merged first right?

HeartSaVioR · 2019-02-12T04:19:46Z

Yes right.

…in Streaming*JoinSuite

## What changes were proposed in this pull request?

**The best way to review this PR is to ignore whitespace/indent changes. Use this link - https://github.com/apache/spark/pull/20650/files?w=1**

The stream-stream join tests add data to multiple sources and expect it all to show up in the next batch. But there's a race condition; the new batch might trigger when only one of the AddData actions has been reached.

Prior attempt to solve this issue by jose-torres in #20646 attempted to simultaneously synchronize on all memory sources together when consecutive AddData was found in the actions. However, this carries the risk of deadlock as well as unintended modification of stress tests (see the above PR for a detailed explanation). Instead, this PR attempts the following.

- A new action called `StreamProgressBlockedActions` that allows multiple actions to be executed while the streaming query is blocked from making progress. This allows data to be added to multiple sources that are made visible simultaneously in the next batch.
- An alias of `StreamProgressBlockedActions` called `MultiAddData` is explicitly used in the `Streaming*JoinSuites` to add data to two memory sources simultaneously.

This should avoid unintentional modification of the stress tests (or any other test for that matter) while making sure that the flaky tests are deterministic.

NOTE: This patch is modified a bit from origin PR (#20650) to cover DSv2 incompatibility between Spark 2.3 and 2.4: StreamingDataSourceV2Relation is a class for 2.3, whereas it is a case class for 2.4

## How was this patch tested?

Modified test cases in `Streaming*JoinSuites` where there are consecutive `AddData` actions.

Closes #23757 from HeartSaVioR/fix-streaming-join-test-flakiness-branch-2.3.

Lead-authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan@gmail.com>
Co-authored-by: Tathagata Das <tathagata.das1565@gmail.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>

srowen · 2019-02-12T21:19:53Z

Merged to 2.3.

HeartSaVioR · 2019-02-12T23:11:44Z

Thanks for merging!

As a follow-up, we may want to consider doing these things to resolve flaky tests specifically in branch-2.3:

1. port back SPARK-23491 and SPARK-23416 (these don't seem to need PRs)
2. discuss porting back SPARK-23390 and the relevant issues to resolve FileBasedDataSourceSuite (related to the ORC dependency upgrade, if I'm not missing something)

1) is pretty easy, since they would be clean cherry-picks, but it requires a committer to drive the effort. So if any committer is interested in this, please take it up.
2) is a hard one, though the suite carries a high chance of test failure. It seems to need some kind of decision, so as with 1) I can't drive the effort. Please take it up if any committer is interested.

maropu · 2019-02-13T03:38:09Z

I'll cherry-pick SPARK-23491/23416 into branch-2.3.

maropu · 2019-02-13T03:55:45Z

@srowen @HeartSaVioR ok, done. Let's keep watching jenkins tests in branch-2.3.

HeartSaVioR · 2019-02-13T04:35:25Z

Thanks @maropu!

HeartSaVioR changed the title ~~[DO-NOT-MERGE][SQL][BRANCH-2.3] Fix streaming join test flakiness on branch-2.3~~ [DO-NOT-MERGE][SQL][BRANCH-2.3] Fix streaming join/kafka continuous mode test flakiness on branch-2.3 Feb 10, 2019

HeartSaVioR force-pushed the fix-streaming-join-test-flakiness-branch-2.3 branch from d5afd92 to d9fcb66 Compare February 10, 2019 20:31

HeartSaVioR changed the title ~~[DO-NOT-MERGE][SQL][BRANCH-2.3] Fix streaming join/kafka continuous mode test flakiness on branch-2.3~~ [SQL][BRANCH-2.3] Fix streaming join/kafka continuous mode test flakiness on branch-2.3 Feb 10, 2019

HeartSaVioR changed the title ~~[SQL][BRANCH-2.3] Fix streaming join/kafka continuous mode test flakiness on branch-2.3~~ [SPARK-24211][SPARK-24239][SQL][BRANCH-2.3] Fix streaming join/kafka continuous mode test flakiness on branch-2.3 Feb 10, 2019

srowen approved these changes Feb 12, 2019

srowen closed this Feb 12, 2019

HeartSaVioR deleted the fix-streaming-join-test-flakiness-branch-2.3 branch February 12, 2019 23:11
