
Conversation

@HeartSaVioR
Contributor

@HeartSaVioR HeartSaVioR commented Dec 24, 2019

What changes were proposed in this pull request?

This patch fixes the flaky test failure on MasterSuite, "SPARK-27510: Master should avoid dead loop while launching executor failed in Worker".

The culprit of the test failure was, ironically, that the test ran too fast; the default interval of eventually is 15 ms, but it took only 8 ms from submitting the driver to removing the app from the master.

19/12/23 15:45:06.533 dispatcher-event-loop-6 INFO Master: Registering worker localhost:9999 with 10 cores, 3.6 GiB RAM
19/12/23 15:45:06.534 dispatcher-event-loop-6 INFO Master: Driver submitted org.apache.spark.FakeClass
19/12/23 15:45:06.535 dispatcher-event-loop-6 INFO Master: Launching driver driver-20191223154506-0000 on worker 10001
19/12/23 15:45:06.536 dispatcher-event-loop-9 INFO Master: Registering app name
19/12/23 15:45:06.537 dispatcher-event-loop-9 INFO Master: Registered app name with ID app-20191223154506-0000
19/12/23 15:45:06.537 dispatcher-event-loop-9 INFO Master: Launching executor app-20191223154506-0000/0 on worker 10001
19/12/23 15:45:06.537 dispatcher-event-loop-10 INFO Master: Removing executor app-20191223154506-0000/0 because it is FAILED
...
19/12/23 15:45:06.542 dispatcher-event-loop-19 ERROR Master: Application name with ID app-20191223154506-0000 failed 10 times; removing it

Given the interval is already tiny, instead of lowering it further, the patch also takes the above case into account when verifying the status.
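
For context, a minimal sketch (illustration only, not the actual test code) of how ScalaTest's eventually retry loop is configured; the AtomicBoolean flag below is a made-up stand-in for the master state being polled:

import org.scalatest.concurrent.Eventually._
import org.scalatest.time.SpanSugar._

// `eventually` retries the block until it passes or the timeout expires.
// The default retry interval is 15 ms; both values can be overridden explicitly.
val flag = new java.util.concurrent.atomic.AtomicBoolean(false)
new Thread(() => { Thread.sleep(50); flag.set(true) }).start()
eventually(timeout(10.seconds), interval(15.milliseconds)) {
  assert(flag.get(), "condition not met yet; eventually retries until the timeout")
}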

Why are the changes needed?

We observed intermittent test failures in Jenkins builds, which should be fixed.
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/115664/testReport/

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Modified UT.

@srowen
Member

srowen commented Dec 24, 2019

Is there any other way to just put a short delay in the test? 1 second isn't a big deal. It could avoid having to open up internals for testing. But I don't think it's a big deal if there's no easy way to avoid it.

// We found a case where the test ran too fast and all steps were done within
// one interval - in this case, we have to check that the app is either still available in master
// or already marked as completed. See SPARK-30348 for details.
assert(master.idToApp.contains(appId) || master.completedApps.exists(_.id == appId))
Member

I think hitting assert(master.completedApps.exists(_.id == appId)) isn't what we want to see for this test.

Would a short delay above assert(master.idToApp.contains(appId)) or increasing MAX_EXECUTOR_RETRIES work? I'd also prefer not to expose internal data for such a fix.

Contributor Author

@HeartSaVioR HeartSaVioR Dec 24, 2019

I'd say the test was wrong, as I noted in the code comment; the test observes a state change happening on a different thread, which should really rely on a listener. If we can't receive updates via a listener, we should take into account that we may not be able to capture the state changes sequentially.

Regarding adding a delay, I'll comment again below, as Sean also raised that point.

Member

My concern is, in this test:

The 1st assert is to make sure that the application exists in Master.idToApp.

The 2nd assert is to check that the application has been removed from Master.idToApp after MAX_EXECUTOR_RETRIES.

But if the 1st assert finds the application has already completed, which means it doesn't exist in Master.idToApp, I think that breaks the original assumption.

How about setting up a CountDownLatch in MockExecutorLaunchFailWorker and counting it down before we reach MAX_EXECUTOR_RETRIES? That way we can be sure the application hasn't been removed from Master.idToApp.
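
For illustration, a rough sketch of the CountDownLatch idea; the names here (MockExecutorLaunchFailWorkerSketch, onExecutorLaunchFailed, launchFailedLatch) are made up, not the actual PR code:

import java.util.concurrent.{CountDownLatch, TimeUnit}

// The mock worker counts the latch down one failure before the master can reach
// MAX_EXECUTOR_RETRIES, so once the test's await returns, the application is
// still guaranteed to be present in Master.idToApp.
class MockExecutorLaunchFailWorkerSketch(maxRetries: Int) {
  val launchFailedLatch = new CountDownLatch(1)
  private var failedCnt = 0

  // Called for every simulated executor launch attempt that fails.
  def onExecutorLaunchFailed(): Unit = {
    failedCnt += 1
    if (failedCnt == maxRetries - 1) launchFailedLatch.countDown()
  }
}

// In the test body (sketch):
// assert(worker.launchFailedLatch.await(10, TimeUnit.SECONDS))
// assert(master.idToApp.contains(appId))  // retries not exhausted yet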

Contributor Author

Makes sense, and the suggestion sounds good to me. Let me apply the approach. Thanks!

@SparkQA

SparkQA commented Dec 24, 2019

Test build #115741 has finished for PR 27004 at commit 50b1d05.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HeartSaVioR
Contributor Author

HeartSaVioR commented Dec 24, 2019

I don't think adding a delay to adjust the timing would work; 9 ms may not be the fastest run, and it can also be much slower than that. For example, in one of the passing runs, going from submitting the driver to removing the app from the master took more than 30 ms, more than 3 times slower than the failure case. As you might notice, we can't predict how long the execution will take when we increase MAX_EXECUTOR_RETRIES, so that's not an option.

There have been many timing issues while investigating flaky test failures, and in my experience adding a delay without an exact calculation of the timing (which only works when the timing is predictable) doesn't fix the issue. I've seen a couple of issues being reopened when the fix was just adding a sleep. If a test fails due to a timing issue, we should try not to rely on timing. (Ideally no test should rely on timing, but unfortunately sometimes we have to.)

Exposing a field as package private for testing might not be cool, though we have allowed it in various spots. We could leverage PrivateMethodTester if we don't want to expose it.

Btw, it looks like Master already exposes some fields, including idToApp, as public (not even package private) which are mutable - worse than this change. Would we want to clean these up as well? Either by just making them package private or by leveraging PrivateMethodTester (which might be doable in this PR), or by finding a better way to expose them safely, like only exposing a method that returns a copy of the instance.
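
For reference, a minimal sketch of ScalaTest's PrivateMethodTester, using a made-up Holder class and secretState member rather than Master.idToApp:

import org.scalatest.PrivateMethodTester

object PrivateAccessSketch extends PrivateMethodTester {
  class Holder {
    private def secretState(): Int = 42
  }

  def main(args: Array[String]): Unit = {
    // Look up the private method by name and invoke it reflectively,
    // without widening its visibility in the production class.
    val secretState = PrivateMethod[Int](Symbol("secretState"))
    val holder = new Holder
    val value = holder invokePrivate secretState()
    assert(value == 42)
  }
}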

@HeartSaVioR
Contributor Author

Test build 115741 encountered SPARK-30345, which has a fix in #27001.

@SparkQA

SparkQA commented Dec 25, 2019

Test build #115752 has finished for PR 27004 at commit ccca5f5.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@Ngone51
Member

Ngone51 commented Dec 25, 2019

I totally agree that adding a delay isn't always a good idea, while exposing internal state is also not encouraged.

The code around idToApp is quite old, so it's not surprising such problems remain. But now that we've found them, I think it's better to improve them now.

@SparkQA

SparkQA commented Dec 25, 2019

Test build #115770 has finished for PR 27004 at commit ceeccfc.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • class MockExecutorLaunchFailWorker(master: Master, conf: SparkConf = new SparkConf)

@SparkQA

SparkQA commented Dec 25, 2019

Test build #115771 has finished for PR 27004 at commit 0af3a59.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • class MockExecutorLaunchFailWorker(master: Master, conf: SparkConf = new SparkConf)

// The code below doesn't make the driver get stuck, as newDriver opens another rpc endpoint
// for handling driver related messages. It guarantees that registering the application is done
// before handling the LaunchExecutor message.
eventually(timeout(10.seconds)) {
Contributor Author

There are two different event dispatchers in MockExecutorLaunchFailWorker: 'worker' and 'driver' (once it receives the LaunchDriver message). We shouldn't assume messages will be handled sequentially across event dispatchers - LaunchExecutor and RegisteredApplication in this case.

That's why I just injected the verification code here; this would block handling LaunchExecutor until we verify application registration is done successfully.

Member

this would block handling LaunchExecutor until we verify application registration is done successfully.

How does this block handling LaunchExecutor? I believe Worker is not a thread-safe RpcEndpoint. Am I missing something?

Contributor Author

Yeah, you're right, my bad. I assumed it was a ThreadSafeRpcEndpoint, but it's not. Just fixed it by applying a count down latch between them.

@SparkQA

SparkQA commented Dec 26, 2019

Test build #115777 has finished for PR 27004 at commit 3ee2cc2.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Dec 26, 2019

Test build #115778 has finished for PR 27004 at commit 07d78cc.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Comment on lines 111 to 112
drivers += driverId
driverResources(driverId) = resources_.map(r => (r._1, r._2.addresses.toSet))
Member

Looks like drivers/driverResources are useless here.

Member

@Ngone51 Ngone51 left a comment

LGTM if all tests pass. BTW, we need to add the [TEST] tag to the title.

var failedCnt = 0

override def receive: PartialFunction[Any, Unit] = {
case LaunchDriver(driverId, desc, resources_) =>
Member

nit: Personally, I'd prefer _ for those unused fields.
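
For illustration, the pattern the nit suggests, using a made-up message type rather than Spark's actual LaunchDriver:

object UnusedFieldsSketch {
  // Hypothetical message class standing in for the real LaunchDriver message.
  case class LaunchDriverMsg(driverId: String, desc: String, resources: Map[String, Int])

  // Bind only the fields the handler uses; mark the unused ones with `_`.
  val receiveSketch: PartialFunction[Any, Unit] = {
    case LaunchDriverMsg(driverId, _, _) =>
      println(s"pretending to launch driver $driverId")
  }
}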

Contributor Author

Good point. Will address.

@HeartSaVioR HeartSaVioR changed the title [SPARK-30348][CORE] Fix flaky test failure on "MasterSuite.SPARK-27510: Master should avoid ..." [SPARK-30348][CORE][TEST] Fix flaky test failure on "MasterSuite.SPARK-27510: Master should avoid ..." Dec 26, 2019
@SparkQA

SparkQA commented Dec 26, 2019

Test build #115784 has finished for PR 27004 at commit c4775fc.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Dec 26, 2019

Test build #115792 has finished for PR 27004 at commit 8b7b1a1.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HeartSaVioR
Contributor Author

retest this, please

@Ngone51
Member

Ngone51 commented Dec 26, 2019

I have seen this error -9 across builds within a short time...

@SparkQA

SparkQA commented Dec 26, 2019

Test build #115799 has finished for PR 27004 at commit 8b7b1a1.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HeartSaVioR
Contributor Author

I have seen this error -9 across builds within a short time...

Here's the magic: all running builds get terminated at 0:00 AM PST - I guess that's done for maintenance purposes.

@HeartSaVioR
Contributor Author

cc. @srowen @cloud-fan @jiangxb1987

Member

@srowen srowen left a comment

The logic otherwise looks reasonable

@cloud-fan
Contributor

thanks, merging to master!

@cloud-fan cloud-fan closed this in 8092d63 Dec 30, 2019
@HeartSaVioR
Contributor Author

Thanks for reviewing and merging!

@HeartSaVioR HeartSaVioR deleted the SPARK-30348 branch December 30, 2019 06:57