Skip to content

Conversation

@HeartSaVioR
Copy link
Contributor

@HeartSaVioR HeartSaVioR commented Feb 10, 2019

What changes were proposed in this pull request?

The best way to review this PR is to ignore whitespace/indent changes. Use this link - https://github.com/apache/spark/pull/20650/files?w=1

The stream-stream join tests add data to multiple sources and expect it all to show up in the next batch. But there's a race condition; the new batch might trigger when only one of the AddData actions has been reached.

Prior attempt to solve this issue by jose-torres in #20646 attempted to simultaneously synchronize on all memory sources together when consecutive AddData was found in the actions. However, this carries the risk of deadlock as well as unintended modification of stress tests (see the above PR for a detailed explanation). Instead, this PR attempts the following.

  • A new action called StreamProgressBlockedActions that allows multiple actions to be executed while the streaming query is blocked from making progress. This allows data to be added to multiple sources that are made visible simultaneously in the next batch.
  • An alias of StreamProgressBlockedActions called MultiAddData is explicitly used in the Streaming*JoinSuites to add data to two memory sources simultaneously.

This should avoid unintentional modification of the stress tests (or any other test for that matter) while making sure that the flaky tests are deterministic.

NOTE: This patch is modified a bit from origin PR (#20650) to cover DSv2 incompatibility between Spark 2.3 and 2.4: StreamingDataSourceV2Relation is a class for 2.3, whereas it is a case class for 2.4

How was this patch tested?

Modified test cases in Streaming*JoinSuites where there are consecutive AddData actions.

@HeartSaVioR
Copy link
Contributor Author

retest this, please

1 similar comment
@HeartSaVioR
Copy link
Contributor Author

retest this, please

@SparkQA
Copy link

SparkQA commented Feb 10, 2019

Test build #102133 has finished for PR 23757 at commit be41fa7.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Feb 10, 2019

Test build #102131 has finished for PR 23757 at commit be41fa7.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Feb 10, 2019

Test build #102132 has finished for PR 23757 at commit be41fa7.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HeartSaVioR
Copy link
Contributor Author

HeartSaVioR commented Feb 10, 2019

NOTE: e08bf2e (my own commit) should be squashed into be41fa7 - I'll sort out later. I would like to confirm flakiness is resolved first.

@HeartSaVioR
Copy link
Contributor Author

Retest this, please

2 similar comments
@HeartSaVioR
Copy link
Contributor Author

Retest this, please

@HeartSaVioR
Copy link
Contributor Author

Retest this, please

@SparkQA
Copy link

SparkQA commented Feb 10, 2019

Test build #102144 has finished for PR 23757 at commit e08bf2e.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Feb 10, 2019

Test build #102143 has finished for PR 23757 at commit e08bf2e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Feb 10, 2019

Test build #102145 has finished for PR 23757 at commit e08bf2e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HeartSaVioR
Copy link
Contributor Author

I wish d5afd92 would be the last one needed for resolving flakiness tests. Only one failed for e08bf2e and the test failure is related to d5afd92.

@HeartSaVioR
Copy link
Contributor Author

Retest this, please

1 similar comment
@HeartSaVioR
Copy link
Contributor Author

Retest this, please

@HeartSaVioR HeartSaVioR changed the title [DO-NOT-MERGE][SQL][BRANCH-2.3] Fix streaming join test flakiness on branch-2.3 [DO-NOT-MERGE][SQL][BRANCH-2.3] Fix streaming join/kafka continuous mode test flakiness on branch-2.3 Feb 10, 2019
@SparkQA
Copy link

SparkQA commented Feb 10, 2019

Test build #102151 has finished for PR 23757 at commit d5afd92.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Feb 10, 2019

Test build #102152 has finished for PR 23757 at commit d5afd92.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Feb 10, 2019

Test build #102153 has finished for PR 23757 at commit d5afd92.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HeartSaVioR HeartSaVioR force-pushed the fix-streaming-join-test-flakiness-branch-2.3 branch from d5afd92 to d9fcb66 Compare February 10, 2019 20:31
@HeartSaVioR
Copy link
Contributor Author

retest this, please

2 similar comments
@HeartSaVioR
Copy link
Contributor Author

retest this, please

@HeartSaVioR
Copy link
Contributor Author

retest this, please

@HeartSaVioR HeartSaVioR changed the title [DO-NOT-MERGE][SQL][BRANCH-2.3] Fix streaming join/kafka continuous mode test flakiness on branch-2.3 [SQL][BRANCH-2.3] Fix streaming join/kafka continuous mode test flakiness on branch-2.3 Feb 10, 2019
@HeartSaVioR
Copy link
Contributor Author

HeartSaVioR commented Feb 10, 2019

cc. @maropu @srowen
This PR is a set of commits for porting back so IMO we can just pull the branch to apply (with changing fix versions to the each issue per commit accordingly) and close this manually: please let me know when we would like to squash into one - I'll update PR's title and description.

@HeartSaVioR
Copy link
Contributor Author

More clearly, this PR intends to address https://issues.apache.org/jira/browse/SPARK-24211 but while addressing it triggers https://issues.apache.org/jira/browse/SPARK-24239 (may want to change affect version?) so also the fix of SPARK-24239 is included to the PR. My understanding is we could resolve them with setting version to 2.3.4.

@SparkQA
Copy link

SparkQA commented Feb 10, 2019

Test build #102161 has finished for PR 23757 at commit d9fcb66.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Feb 10, 2019

Test build #102159 has finished for PR 23757 at commit d9fcb66.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@srowen
Copy link
Member

srowen commented Feb 10, 2019

BTW is this a 'port' of a similar fix from later branches? it would be good to cross-reference those.
You can add the JIRAs that this addresses to the title, unless we're not sure this fully fixes it.

@HeartSaVioR HeartSaVioR changed the title [SQL][BRANCH-2.3] Fix streaming join/kafka continuous mode test flakiness on branch-2.3 [SPARK-24211][SPARK-24239][SQL][BRANCH-2.3] Fix streaming join/kafka continuous mode test flakiness on branch-2.3 Feb 10, 2019
@HeartSaVioR
Copy link
Contributor Author

retest this, please

1 similar comment
@HeartSaVioR
Copy link
Contributor Author

retest this, please

@SparkQA
Copy link

SparkQA commented Feb 11, 2019

Test build #102186 has finished for PR 23757 at commit 6f26689.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HeartSaVioR
Copy link
Contributor Author

Test failed from org.apache.spark.sql.FileBasedDataSourceSuite.(It is not a test it is a sbt.testing.SuiteSelector)

Retest this, please

@HeartSaVioR
Copy link
Contributor Author

Retest this, please

@SparkQA
Copy link

SparkQA commented Feb 11, 2019

Test build #102199 has finished for PR 23757 at commit 6f26689.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@srowen
Copy link
Member

srowen commented Feb 11, 2019

Just to make sure I understand the status:

This is a straight back-port of https://github.com/apache/spark/pull/20650/files?w=1 , and indeed it looks like the same change we see here with https://github.com/apache/spark/pull/23757/files?w=1 , OK. It can be merged first without other changes, pending tests?

Maybe someone can help me understand this. The diff at
https://github.com/apache/spark/pull/23757/files?w=1
looks pretty different from
https://github.com/apache/spark/pull/23757/files

I know the difference is supposed to be just whitespace in the diff, but the latter seems to show a much more significant reordering of code. I can't figure out what to make of it. Am I missing something obvious?

@HeartSaVioR
Copy link
Contributor Author

HeartSaVioR commented Feb 11, 2019

This is a straight back-port of https://github.com/apache/spark/pull/20650/files?w=1 , and indeed it looks like the same change we see here with https://github.com/apache/spark/pull/23757/files?w=1 , OK. It can be merged first without other changes, pending tests?

Yes (and you can also resolve SPARK-24211). And then SPARK-23491 and SPARK-23416 can be cherry-picked from origin PRs since they're clean cherry-pick and we already ran tests in this PR before revising.(and you can also resolve SPARK-24239)

Maybe someone can help me understand this. The diff at
https://github.com/apache/spark/pull/23757/files?w=1
looks pretty different from
https://github.com/apache/spark/pull/23757/files

It is just same as what #20650 was. You could check the difference between https://github.com/apache/spark/pull/20650/files?w=1 vs https://github.com/apache/spark/pull/20650/files

The code change add indentation(s) to huge code block: w=1 option ignores differences of indentation(s) whereas no option takes them all diff. You could find the difference from searching logInfo(s"Processing test stream action: $action") from PR with and without option.

BTW, after fixing this I'm seeing very high change of failures from FileBasedDataSourceSuite - as I commented earlier it might need ORC version upgrade to fix some of related issues, so while we seemed to upgrade the ORC dependency from bugfix version (2.4.0 -> 2.4.1) I'm not sure we could do in 2.3 version line.
cc. @dongjoon-hyun Could you help explaining how we dealt with FileBasedDataSourceSuite and relevant test failures? I'm not sure I'm not missing something.

@HeartSaVioR
Copy link
Contributor Author

IMHO once we know we're not adding code to branch-2.3, passed builds could be a complement of pending test. If we need to run test just before merging, I'd propose triggering 5 builds concurrently due to high chance of failure of FileBasedDataSourceSuite.

@srowen
Copy link
Member

srowen commented Feb 11, 2019

OK after staring at it longer, I see that the difference between the diffs is actually whitespace. It looks much bigger because it's mismatching lines that occur several time identically in different parts of the code, so it renders here as a big delete and add. OK disregard that, just wanted to make sure I'm not crazy.

This is looking good pending tests. Once this passes and is merged, go ahead with any other changes as you see fit.

@HeartSaVioR
Copy link
Contributor Author

I'm not a committer so I need committer's help on cherry-picking remaining two commits (SPARK-23491 and SPARK-23416). Not sure we would want to have PRs for clean cherry-pick: if we would like to have them explicitly please let me know - I'll submit PRs.

@HeartSaVioR
Copy link
Contributor Author

retest this, please

1 similar comment
@HeartSaVioR
Copy link
Contributor Author

retest this, please

@HeartSaVioR
Copy link
Contributor Author

HeartSaVioR commented Feb 11, 2019

Let me run three builds concurrently: as build history in this PR, there's other flaky test which have high chance of failure as well.

@SparkQA
Copy link

SparkQA commented Feb 12, 2019

Test build #4551 has finished for PR 23757 at commit 6f26689.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Feb 12, 2019

Test build #102211 has finished for PR 23757 at commit 6f26689.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Feb 12, 2019

Test build #102212 has finished for PR 23757 at commit 6f26689.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@maropu
Copy link
Member

maropu commented Feb 12, 2019

Is the single line below different from the original one?
https://github.com/apache/spark/pull/23757/files?w=1#diff-77e3d95970339f518398191d11e3bb8dR652

cc: @dongjoon-hyun could you help?

@HeartSaVioR
Copy link
Contributor Author

Yes right. Maybe possibly other line but it will be just an import.

@maropu
Copy link
Member

maropu commented Feb 12, 2019

ok, thanks! Looks ok to me.
(Probably, for reviews, you'd better to separate a modified commit from an original one next time)

Copy link
Member

@srowen srowen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me. Just so I'm 100% clear, this is intended to be merged first right?

@HeartSaVioR
Copy link
Contributor Author

Yes right.

srowen pushed a commit that referenced this pull request Feb 12, 2019
…in Streaming*JoinSuite

## What changes were proposed in this pull request?

**The best way to review this PR is to ignore whitespace/indent changes. Use this link - https://github.com/apache/spark/pull/20650/files?w=1**

The stream-stream join tests add data to multiple sources and expect it all to show up in the next batch. But there's a race condition; the new batch might trigger when only one of the AddData actions has been reached.

Prior attempt to solve this issue by jose-torres in #20646 attempted to simultaneously synchronize on all memory sources together when consecutive AddData was found in the actions. However, this carries the risk of deadlock as well as unintended modification of stress tests (see the above PR for a detailed explanation). Instead, this PR attempts the following.

- A new action called `StreamProgressBlockedActions` that allows multiple actions to be executed while the streaming query is blocked from making progress. This allows data to be added to multiple sources that are made visible simultaneously in the next batch.
- An alias of `StreamProgressBlockedActions` called `MultiAddData` is explicitly used in the `Streaming*JoinSuites` to add data to two memory sources simultaneously.

This should avoid unintentional modification of the stress tests (or any other test for that matter) while making sure that the flaky tests are deterministic.

NOTE: This patch is modified a bit from origin PR (#20650) to cover DSv2 incompatibility between Spark 2.3 and 2.4: StreamingDataSourceV2Relation is a class for 2.3, whereas it is a case class for 2.4

## How was this patch tested?

Modified test cases in `Streaming*JoinSuites` where there are consecutive `AddData` actions.

Closes #23757 from HeartSaVioR/fix-streaming-join-test-flakiness-branch-2.3.

Lead-authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan@gmail.com>
Co-authored-by: Tathagata Das <tathagata.das1565@gmail.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
@srowen
Copy link
Member

srowen commented Feb 12, 2019

Merged to 2.3.

@srowen srowen closed this Feb 12, 2019
@HeartSaVioR
Copy link
Contributor Author

Thanks for merging!

As a follow-up, we may want to consider doing these things to resolve flaky tests specifically in branch-2.3:

  1. port back SPARK-23491 and SPARK-23416 (don't seem to need to have PRs)
  2. discuss regarding porting back SPARK-23390 and relevant issues for resolving FileBasedDataSourceSuite (related to ORC dependency upgrade if I'm not missing here)
  1. is pretty easy - cause they would be clean cherry-pick - but requires committer to drive efforts. So if someone who is committer interested on this, please take this up.
  2. is a hard one though it incurs high chance of test failure. This seems to need some kind of decision so same as 1) I can't drive efforts. Please take this up if someone who is committer interested on this.

@HeartSaVioR HeartSaVioR deleted the fix-streaming-join-test-flakiness-branch-2.3 branch February 12, 2019 23:11
@maropu
Copy link
Member

maropu commented Feb 13, 2019

I'll cherry-pick SPARK-23491/23416 in brach-2.3.

@maropu
Copy link
Member

maropu commented Feb 13, 2019

@srowen @HeartSaVioR ok, done. Let's keep watching jenkins tests in branch-2.3.

@HeartSaVioR
Copy link
Contributor Author

Thanks @maropu!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants