
Conversation

@HeartSaVioR
Contributor

@HeartSaVioR HeartSaVioR commented Dec 24, 2019

What changes were proposed in this pull request?

This patch fixes the flaky test failure on MasterSuite, "SPARK-27510: Master should avoid dead loop while launching executor failed in Worker".

The culprit of the test failure was, ironically, that the test ran too fast; the default interval of eventually is 15 ms, but it took only 8 ms from submitting the driver to removing the app from the master.

19/12/23 15:45:06.533 dispatcher-event-loop-6 INFO Master: Registering worker localhost:9999 with 10 cores, 3.6 GiB RAM
19/12/23 15:45:06.534 dispatcher-event-loop-6 INFO Master: Driver submitted org.apache.spark.FakeClass
19/12/23 15:45:06.535 dispatcher-event-loop-6 INFO Master: Launching driver driver-20191223154506-0000 on worker 10001
19/12/23 15:45:06.536 dispatcher-event-loop-9 INFO Master: Registering app name
19/12/23 15:45:06.537 dispatcher-event-loop-9 INFO Master: Registered app name with ID app-20191223154506-0000
19/12/23 15:45:06.537 dispatcher-event-loop-9 INFO Master: Launching executor app-20191223154506-0000/0 on worker 10001
19/12/23 15:45:06.537 dispatcher-event-loop-10 INFO Master: Removing executor app-20191223154506-0000/0 because it is FAILED
...
19/12/23 15:45:06.542 dispatcher-event-loop-19 ERROR Master: Application name with ID app-20191223154506-0000 failed 10 times; removing it

Given the interval is already tiny, instead of lowering it further, the patch also takes the above case into account when verifying the status.
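
For context, a minimal sketch (illustration only, not the actual test code) of how ScalaTest's eventually retry loop is configured; the AtomicBoolean flag below is a made-up stand-in for the master state being polled:

import org.scalatest.concurrent.Eventually._
import org.scalatest.time.SpanSugar._

// `eventually` retries the block until it passes or the timeout expires.
// The default retry interval is 15 ms; both values can be overridden explicitly.
val flag = new java.util.concurrent.atomic.AtomicBoolean(false)
new Thread(() => { Thread.sleep(50); flag.set(true) }).start()
eventually(timeout(10.seconds), interval(15.milliseconds)) {
  assert(flag.get(), "condition not met yet; eventually retries until the timeout")
}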

Why are the changes needed?

We observed intermittent test failures in Jenkins builds, which should be fixed.
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/115664/testReport/

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Modified UT.

@srowen
Member

srowen commented Dec 24, 2019

Is there any other way to just put a short delay in the test? 1 second isn't a big deal. It could avoid having to open up internals for testing. But I don't think it's a big deal if there's no easy way to avoid it.

// We found a case where the test ran too fast and all steps were done within
// one interval - in this case, we have to check that the app is either still available in master
// or already marked as completed. See SPARK-30348 for details.
assert(master.idToApp.contains(appId) || master.completedApps.exists(_.id == appId))
Member

I think hitting assert(master.completedApps.exists(_.id == appId)) isn't what we want to see for this test.

Would a short delay above assert(master.idToApp.contains(appId)) or increasing MAX_EXECUTOR_RETRIES work? I'd also prefer not to expose internal data for such a fix.

Contributor Author

@HeartSaVioR HeartSaVioR Dec 24, 2019

I'd say the test was wrong, as I noted in the code comment; the test observes a state change happening on a different thread, which should really rely on a listener. If we can't receive updates via a listener, we should take into account that we may not be able to capture the state changes sequentially.

Regarding adding a delay, I'll comment again below, as Sean also raised that point.

Member

My concern is, in this test:

The 1st assert is to make sure that the application exists in Master.idToApp.

The 2nd assert is to check that the application has been removed from Master.idToApp after MAX_EXECUTOR_RETRIES.

But if the 1st assert finds the application has already completed, which means it doesn't exist in Master.idToApp, I think that breaks the original assumption.

How about setting up a CountDownLatch in MockExecutorLaunchFailWorker and counting it down before we reach MAX_EXECUTOR_RETRIES? That way we can be sure the application hasn't been removed from Master.idToApp.
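
For illustration, a rough sketch of the CountDownLatch idea; the names here (MockExecutorLaunchFailWorkerSketch, onExecutorLaunchFailed, launchFailedLatch) are made up, not the actual PR code:

import java.util.concurrent.{CountDownLatch, TimeUnit}

// The mock worker counts the latch down one failure before the master can reach
// MAX_EXECUTOR_RETRIES, so once the test's await returns, the application is
// still guaranteed to be present in Master.idToApp.
class MockExecutorLaunchFailWorkerSketch(maxRetries: Int) {
  val launchFailedLatch = new CountDownLatch(1)
  private var failedCnt = 0

  // Called for every simulated executor launch attempt that fails.
  def onExecutorLaunchFailed(): Unit = {
    failedCnt += 1
    if (failedCnt == maxRetries - 1) launchFailedLatch.countDown()
  }
}

// In the test body (sketch):
// assert(worker.launchFailedLatch.await(10, TimeUnit.SECONDS))
// assert(master.idToApp.contains(appId))  // retries not exhausted yet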

Contributor Author

Makes sense, and the suggestion sounds good to me. Let me apply the approach. Thanks!

@SparkQA

SparkQA commented Dec 24, 2019

Test build #115741 has finished for PR 27004 at commit 50b1d05.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HeartSaVioR
Contributor Author

HeartSaVioR commented Dec 24, 2019

I don't think adding a delay to adjust the timing would work; 9 ms may not be the fastest run, and it can also be much slower than that. For example, in one of the passing runs, going from submitting the driver to removing the app from the master took more than 30 ms, more than 3 times slower than the failure case. As you might notice, we can't predict how long the execution will take when we increase MAX_EXECUTOR_RETRIES, so that's not an option.

There have been many timing issues while investigating flaky test failures, and in my experience adding a delay without an exact calculation of the timing (which only works when the timing is predictable) doesn't fix the issue. I've seen a couple of issues being reopened when the fix was just adding a sleep. If a test fails due to a timing issue, we should try not to rely on timing. (Ideally no test should rely on timing, but unfortunately sometimes we have to.)

Exposing a field as package private for testing might not be cool, though we have allowed it in various spots. We could leverage PrivateMethodTester if we don't want to expose it.

Btw, it looks like Master already exposes some fields, including idToApp, as public (not even package private) which are mutable - worse than this change. Would we want to clean these up as well? Either by just making them package private or by leveraging PrivateMethodTester (which might be doable in this PR), or by finding a better way to expose them safely, like only exposing a method that returns a copy of the instance.
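
For reference, a minimal sketch of ScalaTest's PrivateMethodTester, using a made-up Holder class and secretState member rather than Master.idToApp:

import org.scalatest.PrivateMethodTester

object PrivateAccessSketch extends PrivateMethodTester {
  class Holder {
    private def secretState(): Int = 42
  }

  def main(args: Array[String]): Unit = {
    // Look up the private method by name and invoke it reflectively,
    // without widening its visibility in the production class.
    val secretState = PrivateMethod[Int](Symbol("secretState"))
    val holder = new Holder
    val value = holder invokePrivate secretState()
    assert(value == 42)
  }
}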

@HeartSaVioR
Contributor Author

Test build 115741 encountered SPARK-30345, which has a fix in #27001.

@SparkQA

SparkQA commented Dec 25, 2019

Test build #115752 has finished for PR 27004 at commit ccca5f5.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@Ngone51
Member

Ngone51 commented Dec 25, 2019

I totally agree that adding a delay isn't always a good idea, while exposing internal state is also not encouraged.

The code around idToApp is quite old, so it's not surprising such problems remain. But now that we've found them, I think it's better to improve them now.

@SparkQA

SparkQA commented Dec 25, 2019

Test build #115770 has finished for PR 27004 at commit ceeccfc.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • class MockExecutorLaunchFailWorker(master: Master, conf: SparkConf = new SparkConf)

@SparkQA

SparkQA commented Dec 25, 2019

Test build #115771 has finished for PR 27004 at commit 0af3a59.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • class MockExecutorLaunchFailWorker(master: Master, conf: SparkConf = new SparkConf)

// The code below doesn't make the driver get stuck, as newDriver opens another rpc endpoint
// for handling driver related messages. It guarantees that registering the application is done
// before handling the LaunchExecutor message.
eventually(timeout(10.seconds)) {
Contributor Author

There are two different event dispatchers in MockExecutorLaunchFailWorker: 'worker' and 'driver' (once it receives the LaunchDriver message). We shouldn't assume messages will be handled sequentially across event dispatchers - LaunchExecutor and RegisteredApplication in this case.

That's why I just injected the verification code here; this would block handling LaunchExecutor until we verify application registration is done successfully.

Member

this would block handling LaunchExecutor until we verify application registration is done successfully.

How does this block handling LaunchExecutor? I believe Worker is not a thread-safe RpcEndpoint. Am I missing something?

Contributor Author

Yeah, you're right, my bad. I assumed it was a ThreadSafeRpcEndpoint, but it's not. Just fixed it by applying a count down latch between them.

@SparkQA

SparkQA commented Dec 26, 2019

Test build #115777 has finished for PR 27004 at commit 3ee2cc2.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Dec 26, 2019

Test build #115778 has finished for PR 27004 at commit 07d78cc.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Comment on lines 111 to 112
drivers += driverId
driverResources(driverId) = resources_.map(r => (r._1, r._2.addresses.toSet))
Member

Looks like drivers/driverResources are useless here.

Member

@Ngone51 Ngone51 left a comment

LGTM if all tests pass. BTW, we need to add the [TEST] tag to the title.

var failedCnt = 0

override def receive: PartialFunction[Any, Unit] = {
case LaunchDriver(driverId, desc, resources_) =>
Member

nit: Personally, I'd prefer _ for those unused fields.
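
For illustration, the pattern the nit suggests, using a made-up message type rather than Spark's actual LaunchDriver:

object UnusedFieldsSketch {
  // Hypothetical message class standing in for the real LaunchDriver message.
  case class LaunchDriverMsg(driverId: String, desc: String, resources: Map[String, Int])

  // Bind only the fields the handler uses; mark the unused ones with `_`.
  val receiveSketch: PartialFunction[Any, Unit] = {
    case LaunchDriverMsg(driverId, _, _) =>
      println(s"pretending to launch driver $driverId")
  }
}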

Contributor Author

Good point. Will address.

@HeartSaVioR HeartSaVioR changed the title [SPARK-30348][CORE] Fix flaky test failure on "MasterSuite.SPARK-27510: Master should avoid ..." [SPARK-30348][CORE][TEST] Fix flaky test failure on "MasterSuite.SPARK-27510: Master should avoid ..." Dec 26, 2019
@SparkQA

SparkQA commented Dec 26, 2019

Test build #115784 has finished for PR 27004 at commit c4775fc.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Dec 26, 2019

Test build #115792 has finished for PR 27004 at commit 8b7b1a1.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HeartSaVioR
Contributor Author

retest this, please

@Ngone51
Member

Ngone51 commented Dec 26, 2019

I have seen this error -9 across builds within a short time...

@SparkQA

SparkQA commented Dec 26, 2019

Test build #115799 has finished for PR 27004 at commit 8b7b1a1.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HeartSaVioR
Contributor Author

I have seen this error -9 across builds within a short time...

Here's the magic: all running builds get terminated at 0:00 AM PST - I guess that's done for maintenance purposes.

@HeartSaVioR
Contributor Author

cc. @srowen @cloud-fan @jiangxb1987

Member

@srowen srowen left a comment

The logic otherwise looks reasonable

@cloud-fan
Contributor

thanks, merging to master!

@cloud-fan cloud-fan closed this in 8092d63 Dec 30, 2019
@HeartSaVioR
Contributor Author

Thanks for reviewing and merging!

@HeartSaVioR HeartSaVioR deleted the SPARK-30348 branch December 30, 2019 06:57