[SPARK-29950][k8s] Blacklist deleted executors in K8S with dynamic allocation. #26586
Conversation
The issue here is that when Spark is downscaling the application and deletes a few pod requests that aren't needed anymore, it may actually race with the K8S scheduler, which may already be bringing up those executors. They can have enough time to connect back to the driver and register, only to be deleted soon after. This wastes resources and causes misleading entries in the driver log.
The change (ab)uses the blacklisting mechanism to consider the deleted excess pods as blacklisted, so that if they try to connect back, the driver will deny the registration.
It also changes the executor registration slightly, since even with the above change there were misleading logs. That was because the executor registration message was an RPC that always succeeded (bar network issues), so the executor would always try to send an unregistration message to the driver, which would then log several messages about not knowing anything about the executor. The change makes the registration RPC succeed or fail directly, instead of using the separate failure message that would lead to this issue.
Note that the last change required some changes in a standalone test suite related to dynamic allocation, since it relied on the driver not throwing exceptions when a duplicate executor registration happened.
Tested with existing unit tests, and on a live cluster with dynamic allocation on.
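As a rough illustration of the approach described above (an editorial sketch only; the identifiers DownscaleSketch, onExecutorsDeleted and registerExecutor are made up and are not the actual Spark code), the allocator remembers which executor IDs it has asked K8S to delete, and the registration path fails the RPC for those IDs instead of accepting them and cleaning up later:

```scala
import scala.util.Try

// Sketch only: simplified stand-ins for the driver-side allocator and the
// driver's registration handling. None of these identifiers are the real ones.
object DownscaleSketch {

  // Executors deleted by this allocator but whose pods may still come up in K8S.
  @volatile private var deletedExecutorIds = Set.empty[Long]

  def onExecutorsDeleted(ids: Seq[Long]): Unit = synchronized {
    deletedExecutorIds ++= ids
    // (the real allocator would also issue the pod deletions to the API server here)
  }

  // Registration fails the RPC directly for deleted executors, instead of
  // replying success and sending a separate failure message afterwards.
  def registerExecutor(executorId: Long): Boolean = {
    if (deletedExecutorIds.contains(executorId)) {
      throw new IllegalStateException(s"Executor is blacklisted: $executorId")
    }
    true
  }

  def main(args: Array[String]): Unit = {
    onExecutorsDeleted(Seq(7L))
    println(registerExecutor(3L))        // true
    println(Try(registerExecutor(7L)))   // Failure(java.lang.IllegalStateException: ...)
  }
}
```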
Kubernetes integration test starting
Kubernetes integration test status failure
Test build #114038 has finished for PR 26586 at commit
test this please
Kubernetes integration test starting
Kubernetes integration test status failure
Test build #114743 has finished for PR 26586 at commit
test this please
Kubernetes integration test starting
Kubernetes integration test status success
Test build #114817 has finished for PR 26586 at commit
couple small nits, I'm not a k8s expert, but overall logic makes sense
if (snapshots.nonEmpty) {
  logDebug(s"Pod allocation status: $currentRunningCount running, " +
    s"${currentPendingExecutors.size} pending, " +
    s"${newlyCreatedExecutors.size} unacknowledged.")

val existingExecs = snapshots.last.executorPods.keySet
I think you could use lastSnapshot instead of snapshots.last
// Executors that have been deleted by this allocator but not yet detected as deleted in
// a snapshot from the API server. This is used to deny registration from these executors
// if they happen to come up before the deletion takes effect.
@volatile private var excessExecutors = Set.empty[Long]
nit: perhaps deletedExecutorIds
test this please
Kubernetes integration test starting
Kubernetes integration test status success
Test build #115055 has finished for PR 26586 at commit
changes LGTM pending Jenkins
Kubernetes integration test starting
retest this please
Test build #115125 has finished for PR 26586 at commit
Kubernetes integration test starting
Kubernetes integration test status failure
Test build #115131 has finished for PR 26586 at commit
retest this please
Kubernetes integration test starting
Kubernetes integration test status failure
Test build #115138 has finished for PR 26586 at commit
retest this please
Kubernetes integration test starting
Kubernetes integration test status failure
Test build #115267 has finished for PR 26586 at commit
K8s integration failure is consistent, but it seems to be irrelevant to this PR.
Yeah, but I'd be more comfortable if I saw that same failure in another PR, and I don't remember seeing it. (These integration tests are also becoming pretty hard to run locally with all the external dependencies...)
Ok, the following is a different PR and fails the same test: (Also seems to have run around the same time as the run that failed for this one.)
case (_, PodDeleted(_)) => false
case _ => true
}
currentSnapshot = ExecutorPodsSnapshot(nonDeleted)
You could call replaceSnapshot() here.
I know it is unrelated to this change, but it is strange to me that notifySubscribers clears the snapshotsBuffer but does not set currentSnapshot to an empty ExecutorPodsSnapshot(). This way, a notifySubscribers call followed by an updatePod could keep some pods which would then be notified again with the next notifySubscribers.
replaceSnapshot takes a seq, nonDeleted is a map.
As for not clearing the current snapshot, that's because snapshots are cumulative. Each update from the k8s server just adds to the previous snapshot (until a periodic full sync replaces it with replaceSnapshot).
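A tiny sketch of that cumulative behavior, using made-up types rather than the real ExecutorPodsSnapshot classes:

```scala
// Illustrative types only, not the real ExecutorPodsSnapshot/store classes.
case class PodState(name: String, phase: String)

case class Snapshot(pods: Map[Long, PodState]) {
  // Incremental update from a watch event: merges into the previous snapshot.
  def withUpdate(id: Long, state: PodState): Snapshot = Snapshot(pods + (id -> state))
}

object SnapshotSketch {
  def main(args: Array[String]): Unit = {
    var current = Snapshot(Map.empty)
    // Watch events accumulate on top of whatever the snapshot already holds.
    current = current.withUpdate(1L, PodState("exec-1", "Pending"))
    current = current.withUpdate(1L, PodState("exec-1", "Running"))
    current = current.withUpdate(2L, PodState("exec-2", "Running"))
    println(current.pods.keySet)   // Set(1, 2)
    // Periodic full sync (the replaceSnapshot-style path): the fresh listing wins.
    current = Snapshot(Map(2L -> PodState("exec-2", "Running")))
    println(current.pods.keySet)   // Set(2)
  }
}
```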
scheduler.driverEndpoint.ask[Boolean](message)
eventually(timeout(10.seconds), interval(100.millis)) {
verify(endpointRef).send(RegisterExecutorFailed(any()))
intercept[Exception] {
What about saving the intercepted exception into a val and checking its content with asserts? Like:
assert(exception.getCause.getMessage === "Executor is blacklisted: one")
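Spelled out, the suggestion would look roughly like the following (ScalaTest; registerBlacklistedExecutor is a hypothetical stand-in for the registration call the suite exercises):

```scala
import org.scalatest.Assertions._

object InterceptSketch {
  // Hypothetical stand-in for the registration call exercised by the suite.
  def registerBlacklistedExecutor(): Unit =
    throw new IllegalStateException("Executor is blacklisted: one")

  def main(args: Array[String]): Unit = {
    // intercept returns the thrown exception, so its contents can be asserted on.
    val exception = intercept[IllegalStateException] {
      registerBlacklistedExecutor()
    }
    assert(exception.getMessage === "Executor is blacklisted: one")
  }
}
```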
I very much dislike checking error messages; they are not part of the API contract. But I can try to check for a more specific exception.
Kubernetes integration test starting
Kubernetes integration test status failure
Test build #115582 has finished for PR 26586 at commit
general comment, but otherwise this lgtm. would love to see this change in
// If the cluster manager gives us an executor on a blacklisted node (because it
// already started allocating those resources before we informed it of our blacklist,
// or if it ignored our blacklist), then we reject that executor immediately.
logInfo(s"Rejecting $executorId as it has been blacklisted.")
executorRef.send(RegisterExecutorFailed(s"Executor is blacklisted: $executorId"))
context.reply(true)
context.sendFailure(new IllegalStateException(s"Executor is blacklisted: $executorId"))
Is there a reason we would rather sendFailure(_) instead of exiting the executor with a RegisterExecutorFailed message?
Explained in the PR description.
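For readers following the thread: the practical difference is that sendFailure fails the executor's ask itself, instead of replying success and following up with a separate RegisterExecutorFailed message. A minimal sketch with plain Scala futures (a Promise stands in for the RPC call context; these are not Spark's RPC classes):

```scala
import scala.concurrent.{Await, Future, Promise}
import scala.concurrent.duration._
import scala.util.{Failure, Success, Try}

object RegistrationReplySketch {
  // A Promise stands in for the driver's RPC call context: the handler either
  // replies with success or fails the call outright.
  def handleRegistration(executorId: String, blacklisted: Boolean): Future[Boolean] = {
    val context = Promise[Boolean]()
    if (blacklisted) {
      context.failure(new IllegalStateException(s"Executor is blacklisted: $executorId"))
    } else {
      context.success(true)
    }
    context.future
  }

  def main(args: Array[String]): Unit = {
    // The executor's registration ask fails directly, so it never "registers"
    // and then has to send a separate unregister message that the driver
    // would log confusing warnings about.
    Try(Await.result(handleRegistration("one", blacklisted = true), 1.second)) match {
      case Success(_) => println("registered")
      case Failure(e) => println(s"registration failed: ${e.getMessage}")
    }
  }
}
```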
retest this please
Kubernetes integration test starting
Kubernetes integration test status failure
Test build #116864 has finished for PR 26586 at commit
Merging to master.
wasn't there a failure in the k8s tests? was this not a by-product of this PR?
That's the same "Launcher client dependencies" test that seems super flaky, and has failed with the same error in other PRs before this one was merged.