[SPARK-30348][CORE][TEST] Fix flaky test failure on "MasterSuite.SPARK-27510: Master should avoid ..." #27004
Conversation
Is there any other way to just put a short delay in the test? 1 second isn't a big deal. It could avoid having to open up internals for testing. But I don't think it's a big deal if there's no easy way to avoid it.
// We found a case where the test was too fast: all steps were done within
// one interval. In this case, we have to check whether the app is either still
// available in master or already marked as completed. See SPARK-30348 for details.
assert(master.idToApp.contains(appId) || master.completedApps.exists(_.id == appId))
I think hitting assert(master.completedApps.exists(_.id == appId)) isn't what we want to see in this test.
Would a short delay above assert(master.idToApp.contains(appId)), or increasing MAX_EXECUTOR_RETRIES, work? I'd also prefer not to expose internal data for such a fix.
I'd say the test was wrong, as I noted in the code comment: it observes state changes made on a different thread, which should really rely on a listener. If we can't receive updates via a listener, we have to account for the possibility that we can't capture the state changes sequentially.
Regarding adding a delay, I'll comment again, since Sean also raised that.
My concern is that, in this test:
The 1st assert is to make sure that the application exists in Master.idToApp.
The 2nd assert is to check that the application has been removed from Master.idToApp after MAX_EXECUTOR_RETRIES.
But if we find the application has already completed at the 1st assert, which means it doesn't exist in Master.idToApp, I think that breaks the original assumption.
How about setting up a CountDownLatch in MockExecutorLaunchFailWorker and counting down before we reach MAX_EXECUTOR_RETRIES? That way we can be sure the application hasn't been removed from Master.idToApp.
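A minimal sketch of the CountDownLatch idea suggested here, in plain Java (the class and method names below are hypothetical, not the actual MasterSuite code): the mock worker counts down on its first executor-launch failure, well before MAX_EXECUTOR_RETRIES is reached, and the test awaits the latch before asserting that the application is still registered.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch: the worker signals the test as soon as the first
// executor launch fails, so the test can assert the app is still in
// Master.idToApp before the retry limit removes it.
class LaunchFailSync {
    private final CountDownLatch firstFailure = new CountDownLatch(1);

    // Called from the mock worker's message handler on launch failure.
    void onExecutorLaunchFailed() {
        firstFailure.countDown();
    }

    // Called from the test body before the first assertion; returns false
    // if no launch failure was observed within the timeout.
    boolean awaitFirstFailure(long timeoutSeconds) throws InterruptedException {
        return firstFailure.await(timeoutSeconds, TimeUnit.SECONDS);
    }
}
```

The point of the latch is that the test no longer guesses at timing: it blocks until the worker has provably reached the failing state, then asserts.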
Makes sense, and the suggestion sounds good to me. Let me apply the approach. Thanks!
Test build #115741 has finished for PR 27004 at commit
I don't think adding a delay to adjust the timing would work; 9 ms may not be the fastest run, and it can be much slower than that. For example, in one of the passing cases, submitting the driver to removing the app from master took more than 30 ms, more than 3 times slower than the failure case. As you might notice, we can't predict how long the execution will take when we increase MAX_EXECUTOR_RETRIES, so that's not an option either. There have been many timing issues while investigating flaky test failures, and in my experience adding a delay without an exact calculation of the timing (which only works when the timing is predictable) doesn't fix the issue. I've seen a couple of issues being reopened when the fix was just adding a sleep. If the test fails due to a timing issue, we should try not to rely on timing. (Ideally no test should rely on timing, but unfortunately sometimes we have to.) Exposing a field as package-private for testing might not be cool, though we have allowed it in various spots. We could leverage PrivateMethodTester if we don't want to expose it. Btw, it looks like Master already exposes some fields, including
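For the broader idea of inspecting private test-only state without widening its visibility (ScalaTest's PrivateMethodTester offers a similar facility for private methods in Scala), a reflective read can be sketched like this; MasterLike, its idToApp field, and the app id are made up for illustration:

```java
import java.lang.reflect.Field;
import java.util.HashMap;
import java.util.Map;

// Stand-in for a class with private state a test wants to assert on.
class MasterLike {
    private final Map<String, String> idToApp = new HashMap<>();
    MasterLike() {
        idToApp.put("app-0001", "RUNNING");
    }
}

class PrivateFieldReader {
    // Reads the private map reflectively, for tests only; this avoids
    // changing the field's visibility in production code.
    @SuppressWarnings("unchecked")
    static Map<String, String> readIdToApp(MasterLike m) throws Exception {
        Field f = MasterLike.class.getDeclaredField("idToApp");
        f.setAccessible(true);
        return (Map<String, String>) f.get(m);
    }
}
```

The trade-off is the usual one for reflection-based test access: no compile-time checking of the field name, in exchange for keeping the production API untouched.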
Test build #115741 encountered SPARK-30345, which has a fix in #27001.
Test build #115752 has finished for PR 27004 at commit
I totally agree that always adding a delay isn't a good idea, while exposing internal status is also not encouraged. Code around
Force-pushed 14c7976 to ceeccfc
Test build #115770 has finished for PR 27004 at commit
Force-pushed ceeccfc to 0af3a59
Test build #115771 has finished for PR 27004 at commit
// The code below doesn't make the driver stuck, as newDriver opens another RPC
// endpoint for handling driver-related messages. It guarantees that registering
// the application is done before handling the LaunchExecutor message.
eventually(timeout(10.seconds)) {
There are two different event dispatchers in MockExecutorLaunchFailWorker: 'worker' and 'driver' (once it receives the LaunchDriver message). We shouldn't assume messages will be handled sequentially across event dispatchers, LaunchExecutor and RegisteredApplication in this case.
That's why I just inject the verification code here; this would block handling LaunchExecutor until we verify application registering is done successfully.
this would block handling LaunchExecutor until we verify application registering is done successfully.
How does this block handling LaunchExecutor? I believe Worker is not a thread-safe RpcEndpoint. Am I missing something?
Yeah, you're right, my bad. I assumed that it's a ThreadSafeRpcEndpoint, and it's not. Just fixed it by applying a CountDownLatch between them.
Test build #115777 has finished for PR 27004 at commit
Test build #115778 has finished for PR 27004 at commit
drivers += driverId
driverResources(driverId) = resources_.map(r => (r._1, r._2.addresses.toSet))
Looks like drivers/driverResources are unused here.
Ngone51 left a comment
LGTM if all tests pass. BTW, we need to add the [TEST] tag to the title.
var failedCnt = 0

override def receive: PartialFunction[Any, Unit] = {
  case LaunchDriver(driverId, desc, resources_) =>
nit: Personally, I'd prefer _ for those unused fields.
Good point. Will address.
Test build #115784 has finished for PR 27004 at commit
Test build #115792 has finished for PR 27004 at commit
retest this, please
I have seen this error (-9) globally within a short time...
Test build #115799 has finished for PR 27004 at commit
So that's the magic: all running builds are terminated at 0:00 AM PST. I guess that's done for maintenance purposes.
srowen left a comment
The logic otherwise looks reasonable.
Thanks, merging to master!
Thanks for reviewing and merging!
What changes were proposed in this pull request?
This patch fixes the flaky test failure on MasterSuite, "SPARK-27510: Master should avoid dead loop while launching executor failed in Worker".
The culprit of the test failure was, ironically, that the test ran too fast: the interval of eventually is "15 ms" by default, but it took only "8 ms" from submitting the driver to removing the app from master. Given the interval is already tiny, instead of lowering the interval, the patch considers the above case as well when verifying the status.
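For readers unfamiliar with ScalaTest's eventually, its polling behavior can be sketched in plain Java as follows (the timeout and interval values in the test are illustrative, not ScalaTest's actual defaults):

```java
import java.util.function.BooleanSupplier;

class Poller {
    // Poll a condition every `intervalMs` until it holds or `timeoutMs`
    // elapses. If the system under test passes through the expected state
    // entirely within one interval, the poll can miss that state; that is
    // exactly the race this patch works around.
    static boolean eventually(BooleanSupplier condition, long timeoutMs, long intervalMs)
            throws InterruptedException {
        long deadline = System.currentTimeMillis() + timeoutMs;
        while (true) {
            if (condition.getAsBoolean()) {
                return true;  // condition observed within the timeout
            }
            if (System.currentTimeMillis() >= deadline) {
                return false;  // timed out without observing the condition
            }
            Thread.sleep(intervalMs);
        }
    }
}
```

This makes the failure mode concrete: a transient state that appears and disappears between two polls is invisible to the loop, so the assertion must accept either the transient state or its successor.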
Why are the changes needed?
We observed intermittent test failures in Jenkins builds, which should be fixed.
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/115664/testReport/
Does this PR introduce any user-facing change?
No.
How was this patch tested?
Modified UT.