Skip to content

[SPARK-19812] YARN shuffle service fails to relocate recovery DB acro… #17748

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 1 commit into from

Conversation

tgravescs
Copy link
Contributor

@tgravescs tgravescs commented Apr 24, 2017

…ss NFS directories

What changes were proposed in this pull request?

Change from using java Files.move to use Hadoop filesystem operations to move the directories. The java Files.move does not work when moving directories across NFS mounts and in fact also says that if the directory has entries you should do a recursive move. We are already using Hadoop filesystem here so just use the local filesystem from there as it handles this properly.

Note that the DB here is actually a directory of files and not just a single file, hence the change in the name of the local var.

How was this patch tested?

Ran YarnShuffleServiceSuite unit tests. Unfortunately couldn't easily add one here since involves NFS.
Ran manual tests to verify that the DB directories were properly moved across NFS mounted directories. Have been running this internally for weeks.

@SparkQA
Copy link

SparkQA commented Apr 24, 2017

Test build #76115 has finished for PR 17748 at commit e93f984.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Copy link
Contributor

@mridulm mridulm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Interesting, did not realize the subtlety here !
Thanks Tom

Files.move(f.toPath(), newLoc.toPath());
// The move here needs to handle moving non-empty directories across NFS mounts
FileSystem fs = FileSystem.getLocal(_conf);
fs.rename(copyFrom, newLoc);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How much more expensive is this for non nfs cases ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

didn't measure it but can't imagine there is much difference. I didn't see any noticable delays in NM startup during testing. It is a one time operation on startup to move the DB if changing to use the recovery dir. If you have a specific concern here I can go do some measurements?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The implementations looked different when I looked at LocalFileSystem - and was wondering if there was a perf implementation to the change.
If it is a one time thing used infrequently, then it might not be a concern.

@mridulm
Copy link
Contributor

mridulm commented Apr 25, 2017

LGTM, can you please merge it Tom ? Thx

asfgit pushed a commit that referenced this pull request Apr 26, 2017
…ss NFS directories

## What changes were proposed in this pull request?

Change from using java Files.move to use Hadoop filesystem operations to move the directories.  The java Files.move does not work when moving directories across NFS mounts and in fact also says that if the directory has entries you should do a recursive move. We are already using Hadoop filesystem here so just use the local filesystem from there as it handles this properly.

Note that the DB here is actually a directory of files and not just a single file, hence the change in the name of the local var.

## How was this patch tested?

Ran YarnShuffleServiceSuite unit tests.  Unfortunately couldn't easily add one here since involves NFS.
Ran manual tests to verify that the DB directories were properly moved across NFS mounted directories. Have been running this internally for weeks.

Author: Tom Graves <tgraves@apache.org>

Closes #17748 from tgravescs/SPARK-19812.

(cherry picked from commit 7fecf51)
Signed-off-by: Tom Graves <tgraves@yahoo-inc.com>
@tgravescs
Copy link
Contributor Author

Thanks for the review @mridulm merged to master and branch-2.2

@tgravescs tgravescs closed this Apr 26, 2017
@tgravescs tgravescs reopened this Apr 26, 2017
@asfgit asfgit closed this in 7fecf51 Apr 26, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants