
[SPARK-4236] Cleanup removed applications' files in shuffle service #3126


Closed
aarondav wants to merge 4 commits into master from aarondav:cleanup

Conversation

@aarondav
Contributor

aarondav commented Nov 6, 2014

This relies on a hook from whoever is hosting the shuffle service to invoke removeApplication() when the application is completed. Once invoked, we will clean up all the executors' shuffle directories we know about.
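To make the described flow concrete, here is a minimal, self-contained sketch of the pattern (not the actual Spark code in this PR): a host-invoked applicationRemoved(appId, cleanupLocalDirs) that forgets every registered executor for the application and, optionally, deletes its shuffle directories on a background thread. The class and field names (AppShuffleCleanup, executorDirs, cleaner) are illustrative assumptions; only the applicationRemoved(appId, cleanupLocalDirs) entry point mirrors what this PR adds.

```java
// Illustrative sketch only; names other than applicationRemoved(appId, cleanupLocalDirs)
// are hypothetical and do not correspond to Spark's actual classes.
import java.io.File;
import java.util.Iterator;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class AppShuffleCleanup {
  // Tracks each registered executor's local dirs, keyed by "appId/execId".
  private final ConcurrentHashMap<String, String[]> executorDirs = new ConcurrentHashMap<>();
  // Deleting many shuffle files can be slow, so do it off the caller's thread.
  private final ExecutorService cleaner = Executors.newSingleThreadExecutor();

  public void registerExecutor(String appId, String execId, String[] localDirs) {
    executorDirs.put(appId + "/" + execId, localDirs);
  }

  /** The host (whoever runs the shuffle service) calls this once the application completes. */
  public void applicationRemoved(String appId, boolean cleanupLocalDirs) {
    Iterator<Map.Entry<String, String[]>> it = executorDirs.entrySet().iterator();
    while (it.hasNext()) {
      Map.Entry<String, String[]> entry = it.next();
      if (entry.getKey().startsWith(appId + "/")) {
        // Always forget the executor so its blocks can no longer be served.
        it.remove();
        if (cleanupLocalDirs) {
          String[] dirs = entry.getValue();
          cleaner.execute(() -> {
            for (String dir : dirs) {
              deleteRecursively(new File(dir));
            }
          });
        }
      }
    }
  }

  private static void deleteRecursively(File file) {
    File[] children = file.listFiles();
    if (children != null) {
      for (File child : children) {
        deleteRecursively(child);
      }
    }
    file.delete();
  }
}
```

A real implementation would likely do the directory deletion off the RPC handler's thread as sketched here, so that cleaning up a large application does not block other shuffle requests.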

@@ -95,6 +111,56 @@ public ManagedBuffer getBlockData(String appId, String execId, String blockId) {
}

/**
aarondav (Contributor, Author) commented:

The following 32 lines are the only material change in this entire PR. The rest is utility functions imported from core Utils, Java cruft around lambdas and classes, and updating/adding tests.

@SparkQA

SparkQA commented Nov 6, 2014

Test build #22972 has started for PR 3126 at commit 0028241.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Nov 6, 2014

Test build #22973 has started for PR 3126 at commit 382c97d.

  • This patch merges cleanly.

@aarondav
Contributor Author

aarondav commented Nov 6, 2014

@andrewor14 @rxin PTAL. For the YARN auxiliary service, I think it is sufficient to just call ExternalShuffleBlockHandler#applicationRemoved(appId) at the appropriate time.
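As a rough illustration of that suggestion, here is a hedged sketch of a YARN auxiliary service that hosts the shuffle service and calls applicationRemoved() when YARN reports that the application has stopped. The class name ShuffleAuxService and the handler wiring are assumptions made for illustration; only ExternalShuffleBlockHandler#applicationRemoved(appId, cleanupLocalDirs) comes from this PR, and this is not the YarnShuffleService that actually ships with Spark.

```java
import java.nio.ByteBuffer;

import org.apache.hadoop.yarn.server.api.ApplicationInitializationContext;
import org.apache.hadoop.yarn.server.api.ApplicationTerminationContext;
import org.apache.hadoop.yarn.server.api.AuxiliaryService;
import org.apache.spark.network.shuffle.ExternalShuffleBlockHandler;

// Sketch of a host for the shuffle service inside a YARN NodeManager.
// Transport setup and handler construction are elided.
public class ShuffleAuxService extends AuxiliaryService {
  private ExternalShuffleBlockHandler blockHandler;

  public ShuffleAuxService() {
    super("spark_shuffle");
  }

  @Override
  public void initializeApplication(ApplicationInitializationContext context) {
    // Nothing to do here in this sketch; executors register with the shuffle
    // service over RPC when they start.
  }

  @Override
  public void stopApplication(ApplicationTerminationContext context) {
    // The hook this PR relies on: once YARN tells us the application is done,
    // ask the handler to forget its executors and delete their shuffle dirs.
    String appId = context.getApplicationId().toString();
    if (blockHandler != null) {
      blockHandler.applicationRemoved(appId, true /* cleanupLocalDirs */);
    }
  }

  @Override
  public ByteBuffer getMetaData() {
    return ByteBuffer.allocate(0);
  }
}
```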

@SparkQA

SparkQA commented Nov 6, 2014

Test build #22972 has finished for PR 3126 at commit 0028241.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class AllocatedBlocks(streamIdToAllocatedBlocks: Map[Int, Seq[ReceivedBlockInfo]])
    • class ReceiverTracker(ssc: StreamingContext, skipReceiverLaunch: Boolean = false) extends Logging

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/22972/

@SparkQA

SparkQA commented Nov 6, 2014

Test build #22973 has finished for PR 3126 at commit 382c97d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/22973/

@aarondav closed this Nov 6, 2014
@aarondav deleted the cleanup branch November 6, 2014 18:54
@andrewor14
Contributor

Hey, was this already merged? I don't see it showing up on ASF.

@aarondav
Contributor Author

aarondav commented Nov 6, 2014

Whoops, accidentally deleted it locally.

@aarondav reopened this Nov 6, 2014
@SparkQA

SparkQA commented Nov 6, 2014

Test build #23015 has started for PR 3126 at commit 382c97d.

  • This patch merges cleanly.

rpcHandler.clearRegisteredExecutors()
// Invalidate the registered executors, disallowing access to their shuffle blocks (without
// deleting the actual shuffle files, so we could access them without the shuffle service).
rpcHandler.applicationRemoved(sc.conf.getAppId, false /* cleanupLocalDirs */)
A reviewer (Contributor) commented:

Can you just do cleanupLocalDirs = false? Or can we not, because this is a Java method even though we're calling it from Scala?

aarondav (Contributor, Author) replied:

Right, not possible if the method is written in Java, unfortunately.

@SparkQA

SparkQA commented Nov 6, 2014

Test build #23015 has finished for PR 3126 at commit 382c97d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/23015/

@aarondav
Contributor Author

aarondav commented Nov 6, 2014

Comments addressed.

@SparkQA

SparkQA commented Nov 6, 2014

Test build #23026 has started for PR 3126 at commit a1f73ab.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Nov 7, 2014

Test build #23026 has finished for PR 3126 at commit a1f73ab.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/23026/

@andrewor14
Contributor

LGTM. Hey, by the way, is there anything we need to change about the way ContextCleaner cleans up shuffle files when the external shuffle service is enabled? The only functional difference here is that we no longer clean shuffle files based on scope (we currently go through the executors, but they're now dead). This could be a problem for long-running contexts. Maybe we should add some extra TTL-based cleanup and default the timeout to a week or something in SPARK-4287.

Also, I think this conflicts with your other patch, which I just merged.

@aarondav
Contributor Author

aarondav commented Nov 7, 2014

It was you -- JavaUtils changed. Updated.

@SparkQA

SparkQA commented Nov 7, 2014

Test build #23032 has started for PR 3126 at commit 33a64a9.

  • This patch merges cleanly.

@andrewor14
Contributor

Weird, we both only added code to JavaUtils. I would think that it doesn't need to conflict there.

@SparkQA

SparkQA commented Nov 7, 2014

Test build #23032 has finished for PR 3126 at commit 33a64a9.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class StandaloneWorkerShuffleService(sparkConf: SparkConf, securityManager: SecurityManager)

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/23032/

@andrewor14
Contributor

Ok merging into master and 1.2

asfgit pushed a commit that referenced this pull request Nov 7, 2014
This relies on a hook from whoever is hosting the shuffle service to invoke removeApplication() when the application is completed. Once invoked, we will clean up all the executors' shuffle directories we know about.

Author: Aaron Davidson <aaron@databricks.com>

Closes #3126 from aarondav/cleanup and squashes the following commits:

33a64a9 [Aaron Davidson] Missing brace
e6e428f [Aaron Davidson] Address comments
16a0d27 [Aaron Davidson] Cleanup
e4df3e7 [Aaron Davidson] [SPARK-4236] Cleanup removed applications' files in shuffle service

(cherry picked from commit 48a19a6)
Signed-off-by: Andrew Or <andrew@databricks.com>
@asfgit closed this in 48a19a6 Nov 7, 2014