Fix deletion of files in current working directory by clearFiles() #345
Conversation
This fixes an issue where Spark could delete original files in the current working directory that were added to the job using `addFile()`. There was also the potential for `addFile()` to overwrite local files, which is addressed by changing `Utils.fetchFile()` to log a warning instead of overwriting a file with new contents. This is a short-term fix; a better long-term solution would be to remove the dependence on storing files in the current working directory, since we can't change the cwd from Java.
```scala
Utils.copyStream(in, out, true)
if (targetFile.exists && !Files.equal(tempFile, targetFile)) {
  logWarning("File " + targetFile + " exists and does not match contents of " + url +
    "; using existing version")
}
```
Turn this into an error and throw an exception here; it's too risky to continue running IMO.
I agree; I pushed a commit to change this to SparkException.
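For reference, a minimal sketch of what the revised check might look like, assuming the same surrounding `fetchFile` logic as the excerpt above (the actual commit may differ):

```scala
// Sketch: fail fast instead of silently keeping a stale local copy.
if (targetFile.exists && !Files.equal(tempFile, targetFile)) {
  throw new SparkException(
    "File " + targetFile + " exists and does not match contents of " + url)
}
```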
Thanks Josh!
This addresses an issue where Spark could delete files in the current working directory that were added to the job using `addFile()`. I encountered this issue while working on PySpark's code deployment mechanism, which is based on `addFile()`.

From user code's perspective (e.g. UDFs), files added through `addFile()` are assumed to be in the current working directory. For jobs that are run locally using `DAGScheduler.runLocally()`, tasks run with the driver's current working directory. As a result, files added through `addFile()` must be copied to the driver's current working directory, since there's no mechanism to change the CWD in Java. `clearFiles()` and `clearJars()` clean up these files when the driver exits. This can be a problem if the original files that were added were in the driver's current working directory, because this will cause them to be deleted.

A long-term fix would be to hide the location of fetched files from user code by requiring it to access files through an API like `SparkFiles.get("my-file-name.txt")`. This will require changes to user code and may require changes to Shark.
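By way of illustration, here is one rough sketch of what such an API could look like; `SparkFiles` and its `rootDirectory` field are hypothetical names drawn from this description, not existing Spark code:

```scala
import java.io.File

// Hypothetical API sketched from the proposal above: user code asks the
// framework where an added file actually lives instead of assuming it
// sits in the current working directory.
object SparkFiles {
  // Assumed to be set by the framework to the directory where addFile()
  // downloads land (defaults to the CWD here, for the sketch only).
  @volatile var rootDirectory: String = "."

  /** Resolves the name of a file added through addFile() to a local path. */
  def get(filename: String): String =
    new File(rootDirectory, filename).getAbsolutePath
}

// Example use inside a UDF or task:
// val path = SparkFiles.get("my-file-name.txt")
```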
As a short-term fix, this pull request removes the code that deletes files in the current working directory and adds checks to `Utils.fetchFile()` to avoid overwriting existing local files with new data. The one downside of this change is that it may leave junk in the current working directory, but that is preferable to accidentally deleting files.
I've also added `addFile()`/`addJar()` to the Java API.

I also added synchronization to `LocalScheduler.updateDependencies` to avoid performing multiple parallel fetches for the same file.
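A minimal sketch of that synchronization, with assumed bookkeeping (`currentFiles`) and a simplified signature; the real `LocalScheduler` differs:

```scala
import scala.collection.mutable.HashMap

class LocalScheduler {
  // Assumed bookkeeping: file name -> timestamp of the version last fetched.
  private val currentFiles = new HashMap[String, Long]()

  // Marking the whole method as synchronized means two tasks arriving at
  // the same time cannot both decide the same file is missing and fetch
  // it in parallel.
  def updateDependencies(newFiles: HashMap[String, Long]): Unit = synchronized {
    for ((name, timestamp) <- newFiles
         if currentFiles.getOrElse(name, -1L) < timestamp) {
      // The fetch would happen here, e.g. Utils.fetchFile(name, cwd),
      // which now refuses to overwrite a mismatched existing file.
      currentFiles(name) = timestamp
    }
  }
}
```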