
Fix deletion of files in current working directory by clearFiles() #345


Merged 4 commits on Dec 29, 2012

Conversation

JoshRosen
Member

This addresses an issue where Spark could delete files in the current working directory that were added to the job using addFile(). I encountered this issue when working on PySpark's code deployment mechanism, which is based on addFile().

From user code's perspective (e.g. in UDFs), files added through addFile() are assumed to be in the current working directory. For jobs that are run locally using DAGScheduler.runLocally(), tasks run with the driver's current working directory, so files added through addFile() must be copied there; Java provides no mechanism for changing the CWD of a running process.
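For illustration, this is the kind of user code that bakes in that assumption; the file name, sc, and the words RDD here are hypothetical:

    // Hypothetical user code relying on the CWD assumption:
    sc.addFile("/srv/data/stopwords.txt")   // registered on the driver
    val filtered = words.filter { word =>
      // Tasks open the file by bare name, assuming addFile() placed a
      // copy in the task's current working directory:
      val stopwords = scala.io.Source.fromFile("stopwords.txt").getLines().toSet
      !stopwords.contains(word)
    }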

clearFiles() and clearJars() clean up these files when the driver exits. This becomes a problem when the original files passed to addFile() were themselves in the driver's current working directory, because the cleanup deletes them.

A long-term fix would be to hide the location of fetched files from user code by requiring it to access files through an API like SparkFiles.get("my-file-name.txt"). This will require changes to user code and may require changes to Shark.
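For concreteness, such an accessor could be as small as the sketch below; the rootDir field and the object layout are hypothetical, not a shipped design:

    object SparkFiles {
      // Directory that fetched files are downloaded into; the environment
      // would set this to the tasks' working directory or a temp directory.
      private[spark] var rootDir: String = null

      /** Get the local path of a file added through SparkContext.addFile(). */
      def get(filename: String): String =
        new java.io.File(rootDir, filename).getAbsolutePath
    }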

As a short-term fix, this pull request removes the code that deletes files in the current working directory and adds checks to Utils.fetchFile() to avoid overwriting existing local files with new data. The one downside of this change is that it may leave junk in the current working directory, but that is preferable to accidentally deleting files.

I've also added addFile()/addJar() to the Java API.
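Those Java API methods are thin delegations to the underlying SparkContext; roughly (a sketch of the shape, not the literal patch):

    class JavaSparkContext(val sc: SparkContext) {
      /** Add a file to be downloaded into the working directory of every task. */
      def addFile(path: String): Unit = sc.addFile(path)

      /** Add a JAR dependency for all tasks to be executed on this context. */
      def addJar(path: String): Unit = sc.addJar(path)
    }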

I also added synchronization to LocalScheduler.updateDependencies to avoid performing multiple parallel fetches for the same file.
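The synchronization is coarse but simple; roughly, the fetch loop is wrapped in a synchronized block so concurrent local tasks block instead of downloading the same file twice. The signature and the currentFiles timestamp bookkeeping below are a sketch, not the literal patch:

    private def updateDependencies(
        newFiles: HashMap[String, Long], newJars: HashMap[String, Long]) {
      synchronized {
        // Concurrent callers block here rather than racing to fetch
        // the same file in parallel. (Jar handling omitted from sketch.)
        for ((name, timestamp) <- newFiles
             if currentFiles.getOrElse(name, -1L) < timestamp) {
          Utils.fetchFile(name, new File("."))
          currentFiles(name) = timestamp
        }
      }
    }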

This fixes an issue where Spark could delete original files in the current working directory that were added to the job using addFile().

There was also the potential for addFile() to overwrite local files, which is addressed by changing Utils.fetchFile() to log a warning instead of overwriting a file with new contents.

This is a short-term fix; a better long-term solution would be to remove the dependence on storing files in the current working directory, since we can't change the cwd from Java.
    // Download into a temp file first, then compare against any existing copy.
    Utils.copyStream(in, out, true)
    // If the target file already exists with different contents (Guava's
    // Files.equal compares byte-for-byte), keep the existing version and
    // warn rather than overwriting it.
    if (targetFile.exists && !Files.equal(tempFile, targetFile)) {
      logWarning("File " + targetFile + " exists and does not match contents of " + url +
        "; using existing version")
    }
Member


Turn this into an error and throw an exception here; it's too risky to continue running IMO.

Member Author


I agree; I pushed a commit to change this to throw a SparkException.
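With that change, the check becomes fatal instead of a warning; roughly (the tempFile.delete() cleanup is an assumption, not confirmed by the thread):

    if (targetFile.exists && !Files.equal(tempFile, targetFile)) {
      // Refuse to clobber an existing file that differs from the download.
      tempFile.delete()   // assumed cleanup of the temporary copy
      throw new SparkException(
        "File " + targetFile + " exists and does not match contents of " + url)
    }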

mateiz added a commit that referenced this pull request Dec 29, 2012
Fix deletion of files in current working directory by clearFiles()
mateiz merged commit 3f74f72 into mesos:master on Dec 29, 2012
@mateiz
Member

mateiz commented Dec 29, 2012

Thanks Josh!

ash211 pushed a commit to ash211/mesos-spark that referenced this pull request Jan 10, 2014
support distributing extra files to workers for yarn-client mode

So that users don't need to package every dependency into one assembly jar as the Spark app jar
harveyfeng pushed a commit to harveyfeng/spark-mesos that referenced this pull request Apr 11, 2014
This is a patch to address @mateiz's comment in apache/spark#245

MLUtils#loadLibSVMData uses an anonymous function for the label parser, which Java users won't like. So I made LabelParser a trait and provided two implementations, binary and multiclass (a sketch follows the commit list below).

Author: Xiangrui Meng <meng@databricks.com>

Closes mesos#345 from mengxr/label-parser and squashes the following commits:

ac44409 [Xiangrui Meng] use singleton objects for label parsers
3b1a7c6 [Xiangrui Meng] add tests for label parsers
c2e571c [Xiangrui Meng] rename LabelParser.apply to LabelParser.parse; use extends for singleton
11c94e0 [Xiangrui Meng] add return types
7f8eb36 [Xiangrui Meng] change labelParser from anonymous function to trait
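A minimal sketch of the design those commits describe, with singleton objects and explicit return types on parse; the Serializable bound and the exact binary mapping are assumptions:

    trait LabelParser extends Serializable {
      /** Parse the label field of a LibSVM line into a Double. */
      def parse(labelString: String): Double
    }

    /** Maps positive labels to 1.0 and everything else to 0.0. */
    object BinaryLabelParser extends LabelParser {
      override def parse(labelString: String): Double =
        if (labelString.toDouble > 0) 1.0 else 0.0
    }

    /** Uses the label value as-is for multiclass problems. */
    object MulticlassLabelParser extends LabelParser {
      override def parse(labelString: String): Double = labelString.toDouble
    }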