[SPARK-3796] Create external service which can serve shuffle files #3001


Closed
wants to merge 12 commits into from

Conversation

aarondav
Contributor

This patch introduces the tooling necessary to construct an external shuffle service which is independent of Spark executors, and then use this service inside Spark. An example (just for the sake of this PR) of the service creation can be found in Worker, and the service itself is used by plugging in the StandaloneShuffleClient as Spark's ShuffleClient (setup in BlockManager).

This PR continues the work from #2753, which extracted out the transport layer of Spark's block transfer into an independent package within Spark. A new package was created which contains the Spark business logic necessary to retrieve the actual shuffle data, which is completely independent of the transport layer introduced in the previous patch. Similar to the transport layer, this package must not depend on Spark as we anticipate plugging this service as a lightweight process within, say, the YARN NodeManager, and do not wish to include Spark's dependencies (including Scala itself).
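The separation described above can be sketched as a narrow, transport-agnostic client interface that the block manager programs against, with the external implementation free of Spark (and Scala) dependencies. The names below are illustrative only, not the actual classes in this patch:

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch: a minimal client interface the block manager can
// depend on, with the external implementation living outside the Spark
// process. Class and method names here are hypothetical.
interface BlockFetcher {
    void fetchBlocks(String host, int port, String execId, String[] blockIds);
}

class ExternalShuffleClientSketch implements BlockFetcher {
    private final List<String> requests = new ArrayList<>();

    @Override
    public void fetchBlocks(String host, int port, String execId, String[] blockIds) {
        // A real client would open a transport-layer connection here; this
        // stub records each request to show how small the API surface is.
        for (String blockId : blockIds) {
            requests.add(host + ":" + port + "/" + execId + "/" + blockId);
        }
    }

    List<String> requests() {
        return requests;
    }
}
```

Because the interface carries only host, port, executor id, and block ids, the same calls can be served either by a peer executor or by a standalone service process.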

There are several outstanding tasks which must be complete before this PR can be merged:

  • Complete unit testing of network/shuffle package.
  • Performance and correctness testing on a real cluster.
  • Remove example service instantiation from Worker.scala.

There are even more shortcomings of this PR which should be addressed in followup patches:

  • Don't use Java serializer for RPC layer! It is not cross-version compatible.
  • Handle shuffle file cleanup for dead executors once the application terminates or the ContextCleaner triggers.
  • Documentation of the feature in the Spark docs.
  • Improve behavior if the shuffle service itself goes down (right now we don't blacklist it, and new executors cannot spawn on that machine).
  • SSL and SASL integration
  • Nice to have: Handle shuffle file consolidation (this would require changes to Spark's implementation).

@aarondav
Contributor Author

cc @rxin @andrewor14

@SparkQA

SparkQA commented Oct 29, 2014

Test build #22466 has started for PR 3001 at commit 52cca65.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Oct 29, 2014

Test build #22466 has finished for PR 3001 at commit 52cca65.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/22466/
Test FAILed.

@SparkQA

SparkQA commented Oct 29, 2014

Test build #22467 has started for PR 3001 at commit bf6c2dd.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Oct 29, 2014

Test build #22468 has started for PR 3001 at commit 6055df6.

  • This patch merges cleanly.

@@ -78,7 +78,7 @@ private[spark] class Executor(
   val executorSource = new ExecutorSource(this, executorId)

   // Initialize Spark environment (using system properties read above)
-  conf.set("spark.executor.id", "executor." + executorId)
+  conf.set("spark.executor.id", executorId)
Contributor

what is this change about?

Contributor Author

This was introduced recently, and I was planning on using it, but ended up not. Still, I was inclined to keep the seemingly more sensible semantics of "spark.executor.id" being the executorId rather than being prefixed. It is currently only used by the "MetricsSystem".
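The semantics in question can be shown with a plain map standing in for SparkConf (purely illustrative, not Spark's API):

```java
import java.util.HashMap;
import java.util.Map;

// Plain map standing in for SparkConf, purely to illustrate the change:
// "spark.executor.id" now holds the raw executorId, not a prefixed form.
class ExecutorIdConfSketch {
    static Map<String, String> configure(String executorId) {
        Map<String, String> conf = new HashMap<>();
        conf.put("spark.executor.id", executorId); // was: "executor." + executorId
        return conf;
    }
}
```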

Contributor

Yeah that makes sense. This was introduced in a patch that was merged not long ago (middle of 1.2 window) so it's OK to change it.

@SparkQA

SparkQA commented Oct 29, 2014

Test build #22467 has finished for PR 3001 at commit bf6c2dd.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/22467/
Test FAILed.

@SparkQA

SparkQA commented Oct 29, 2014

Test build #22468 has finished for PR 3001 at commit 6055df6.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/22468/
Test FAILed.

@SparkQA

SparkQA commented Oct 29, 2014

Test build #22481 has started for PR 3001 at commit 54af871.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Oct 29, 2014

Test build #22481 has finished for PR 3001 at commit 54af871.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/22481/
Test PASSed.

port: Int,
blockIds: Seq[String],
execId: String,
Contributor

execId should be part of the connection establishment / registration and not part of fetchBlocks

Contributor Author

The tricky part here is that "execId" is actually part of the request. I am fetching Executor 6's blocks, while I am myself Executor 4. So there is no API that is exposed at a lower layer to transfer the execId.

Contributor

I see - does each executor have its own path for shuffle files?

Contributor Author

Yes, each executor registers its ExecutorShuffleInfo, which includes its own localDirs (created by the Executor on initialization).
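The registration flow described in this thread can be sketched as follows; the class and method names are hypothetical stand-ins, not the actual network/shuffle code. Each executor registers its shuffle directories under a composite app+executor key, so a request from executor 4 for executor 6's blocks resolves against 6's localDirs:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch of executor registration in an external shuffle
// service: shuffle directories are keyed by "appId_execId" so fetches can
// name whichever executor wrote the blocks.
class ShuffleRegistrySketch {
    static final class ExecutorInfo {
        final String[] localDirs;
        ExecutorInfo(String[] localDirs) { this.localDirs = localDirs; }
    }

    private final Map<String, ExecutorInfo> executors = new ConcurrentHashMap<>();

    private static String key(String appId, String execId) {
        return appId + "_" + execId;
    }

    void registerExecutor(String appId, String execId, ExecutorInfo info) {
        executors.put(key(appId, execId), info);
    }

    ExecutorInfo lookup(String appId, String execId) {
        return executors.get(key(appId, execId));
    }
}
```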

@andrewor14
Contributor

Hey @aarondav quick correction for your description. The shuffle service is intended to run in Yarn's NodeManager, not the ApplicationManager (doesn't exist). It's great that the API exposed is so narrow. I think at least for Yarn we'll need to find a way to communicate the port back to the application from the server. We might have to include this info through some metadata when an application is registered.

String execId,
ExecutorShuffleConfig executorConfig) {
String fullId = getAppExecId(appId, execId);
executors.put(fullId, executorConfig);
Contributor

we should log something here

@SparkQA

SparkQA commented Nov 1, 2014

Test build #22682 has started for PR 3001 at commit 3d62679.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Nov 1, 2014

Test build #22683 has started for PR 3001 at commit 9883918.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Nov 1, 2014

Test build #22682 has finished for PR 3001 at commit 3d62679.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/22682/
Test PASSed.

@SparkQA

SparkQA commented Nov 1, 2014

Test build #22683 has finished for PR 3001 at commit 9883918.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/22683/
Test PASSed.

@SparkQA

SparkQA commented Nov 1, 2014

Test build #22687 has started for PR 3001 at commit fd3928b.

  • This patch merges cleanly.

@aarondav
Contributor Author

aarondav commented Nov 1, 2014

@rxin Please take a look at my last commit, fd3928b. This is some critical code which handles a couple cases I describe below.

I have completed performance and correctness testing on a real cluster. Performance-wise, I saw no regression from the in-executor version. Additionally, I saw minimal memory usage from the Worker, where I put the server -- I ran several medium-sized shuffles using Workers with 512MB max heap sizes (the default) without noticeable garbage collection or heap growth.

During testing, I noticed that we were dropping map outputs if I killed an executor in the middle of a map or reduce phase (in local testing, my queries ran so quickly that I always killed the executor after the completion of the job). This caused us to unnecessarily recompute the map tasks. I have added code which only drops map outputs if (1) external shuffle is disabled or (2) we're responding to a fetch failure specifically.
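The retention rule just described boils down to a two-flag condition. The sketch below is a hedged restatement of that logic, not Spark's actual scheduler code; names are illustrative:

```java
// Hedged sketch of the retention rule described above: when an executor is
// lost, drop its map outputs only if no external shuffle service can still
// serve them, or a fetch failure shows they are already unreachable.
class MapOutputRetentionSketch {
    static boolean shouldDropMapOutputs(boolean externalShuffleEnabled,
                                        boolean respondingToFetchFailure) {
        return !externalShuffleEnabled || respondingToFetchFailure;
    }
}
```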

@rxin
Contributor

rxin commented Nov 1, 2014

The very last commit looks fine.

@SparkQA

SparkQA commented Nov 1, 2014

Test build #22687 has finished for PR 3001 at commit fd3928b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/22687/
Test PASSed.

@aarondav
Contributor Author

aarondav commented Nov 1, 2014

I have renamed Standalone* to External*, removed the example instantiation in Worker.scala, and have pushed the documentation back to a later PR (as that can certainly be done during the QA period).

From my end, this PR is ready to merge.

@SparkQA

SparkQA commented Nov 1, 2014

Test build #22697 has started for PR 3001 at commit 4d1f8c1.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Nov 1, 2014

Test build #22697 has finished for PR 3001 at commit 4d1f8c1.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/22697/
Test PASSed.

@rxin
Contributor

rxin commented Nov 1, 2014

This looks good overall. I'm going to merge it. There are some nit comments that you can address in followup PRs.

@@ -49,11 +49,11 @@
   private ChannelFuture channelFuture;
   private int port = -1;

-  public TransportServer(TransportContext context) {
+  public TransportServer(TransportContext context, int portToBind) {
Contributor

add javadoc defining the behavior for portToBind == 0
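A javadoc along the lines the reviewer asks for might read as below. The method here is a stand-alone illustration using the conventional socket semantics for a zero port (the OS assigns an ephemeral one); the exact wording and behavior in TransportServer itself are the author's to confirm:

```java
import java.io.IOException;
import java.net.ServerSocket;

// Illustrative stand-in for the bind behavior, not the TransportServer code.
class PortBindSketch {
    /**
     * Binds to {@code portToBind}. If {@code portToBind == 0}, the operating
     * system assigns an ephemeral port; the chosen port is returned so that
     * callers can advertise it to clients.
     */
    static int bind(int portToBind) {
        try (ServerSocket socket = new ServerSocket(portToBind)) {
            return socket.getLocalPort();
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }
}
```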

@asfgit asfgit closed this in f55218a Nov 1, 2014
@pwendell
Contributor

pwendell commented Nov 2, 2014

Build changes LGTM - I realize this was already merged.


asfgit pushed a commit that referenced this pull request Nov 5, 2014
This creates a new module `network/yarn` that depends on `network/shuffle` recently created in #3001. This PR introduces a custom Yarn auxiliary service that runs the external shuffle service. As of the changes here this shuffle service is required for using dynamic allocation with Spark.

This is still WIP mainly because it doesn't handle security yet. I have tested this on a stable Yarn cluster.

Author: Andrew Or <andrew@databricks.com>

Closes #3082 from andrewor14/yarn-shuffle-service and squashes the following commits:

ef3ddae [Andrew Or] Merge branch 'master' of github.com:apache/spark into yarn-shuffle-service
0ee67a2 [Andrew Or] Minor wording suggestions
1c66046 [Andrew Or] Remove unused provided dependencies
0eb6233 [Andrew Or] Merge branch 'master' of github.com:apache/spark into yarn-shuffle-service
6489db5 [Andrew Or] Try catch at the right places
7b71d8f [Andrew Or] Add detailed java docs + reword a few comments
d1124e4 [Andrew Or] Add security to shuffle service (INCOMPLETE)
5f8a96f [Andrew Or] Merge branch 'master' of github.com:apache/spark into yarn-shuffle-service
9b6e058 [Andrew Or] Address various feedback
f48b20c [Andrew Or] Fix tests again
f39daa6 [Andrew Or] Do not make network-yarn an assembly module
761f58a [Andrew Or] Merge branch 'master' of github.com:apache/spark into yarn-shuffle-service
15a5b37 [Andrew Or] Fix build for Hadoop 1.x
baff916 [Andrew Or] Fix tests
5bf9b7e [Andrew Or] Address a few minor comments
5b419b8 [Andrew Or] Add missing license header
804e7ff [Andrew Or] Include the Yarn shuffle service jar in the distribution
cd076a4 [Andrew Or] Require external shuffle service for dynamic allocation
ea764e0 [Andrew Or] Connect to Yarn shuffle service only if it's enabled
1bf5109 [Andrew Or] Use the shuffle service port specified through hadoop config
b4b1f0c [Andrew Or] 4 tabs -> 2 tabs
43dcb96 [Andrew Or] First cut integration of shuffle service with Yarn aux service
b54a0c4 [Andrew Or] Initial skeleton for Yarn shuffle service

(cherry picked from commit 61a5cce)
Signed-off-by: Andrew Or <andrew@databricks.com>