-
Notifications
You must be signed in to change notification settings - Fork 28.6k
[SPARK-19812] YARN shuffle service fails to relocate recovery DB acro… #17748
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
…ss NFS directories
Test build #76115 has finished for PR 17748 at commit
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Interesting, did not realize the subtlety here !
Thanks Tom
Files.move(f.toPath(), newLoc.toPath()); | ||
// The move here needs to handle moving non-empty directories across NFS mounts | ||
FileSystem fs = FileSystem.getLocal(_conf); | ||
fs.rename(copyFrom, newLoc); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How much more expensive is this for non nfs cases ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
didn't measure it but can't imagine there is much difference. I didn't see any noticable delays in NM startup during testing. It is a one time operation on startup to move the DB if changing to use the recovery dir. If you have a specific concern here I can go do some measurements?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The implementations looked different when I looked at LocalFileSystem - and was wondering if there was a perf implementation to the change.
If it is a one time thing used infrequently, then it might not be a concern.
LGTM, can you please merge it Tom ? Thx |
…ss NFS directories ## What changes were proposed in this pull request? Change from using java Files.move to use Hadoop filesystem operations to move the directories. The java Files.move does not work when moving directories across NFS mounts and in fact also says that if the directory has entries you should do a recursive move. We are already using Hadoop filesystem here so just use the local filesystem from there as it handles this properly. Note that the DB here is actually a directory of files and not just a single file, hence the change in the name of the local var. ## How was this patch tested? Ran YarnShuffleServiceSuite unit tests. Unfortunately couldn't easily add one here since involves NFS. Ran manual tests to verify that the DB directories were properly moved across NFS mounted directories. Have been running this internally for weeks. Author: Tom Graves <tgraves@apache.org> Closes #17748 from tgravescs/SPARK-19812. (cherry picked from commit 7fecf51) Signed-off-by: Tom Graves <tgraves@yahoo-inc.com>
Thanks for the review @mridulm merged to master and branch-2.2 |
…ss NFS directories
What changes were proposed in this pull request?
Change from using java Files.move to use Hadoop filesystem operations to move the directories. The java Files.move does not work when moving directories across NFS mounts and in fact also says that if the directory has entries you should do a recursive move. We are already using Hadoop filesystem here so just use the local filesystem from there as it handles this properly.
Note that the DB here is actually a directory of files and not just a single file, hence the change in the name of the local var.
How was this patch tested?
Ran YarnShuffleServiceSuite unit tests. Unfortunately couldn't easily add one here since involves NFS.
Ran manual tests to verify that the DB directories were properly moved across NFS mounted directories. Have been running this internally for weeks.