Fix deletion of files in current working directory by clearFiles() #345
Conversation
This fixes an issue where Spark could delete original files in the current working directory that were added to the job using `addFile()`. There was also the potential for `addFile()` to overwrite local files, which is addressed by changing `Utils.fetchFile()` to log a warning instead of overwriting a file with new contents. This is a short-term fix; a better long-term solution would be to remove the dependence on storing files in the current working directory, since we can't change the cwd from Java.
```scala
Utils.copyStream(in, out, true)
if (targetFile.exists && !Files.equal(tempFile, targetFile)) {
  logWarning("File " + targetFile + " exists and does not match contents of " + url +
    "; using existing version")
}
```
Turn this into an error and throw an exception here; it's too risky to continue running IMO.
I agree; I pushed a commit to change this to SparkException.
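For reference, a minimal sketch of what the revised check might look like, assuming the same surrounding `fetchFile` logic as the excerpt above (the actual commit may differ):

```scala
// Sketch: fail fast instead of silently keeping a stale local copy.
if (targetFile.exists && !Files.equal(tempFile, targetFile)) {
  throw new SparkException(
    "File " + targetFile + " exists and does not match contents of " + url)
}
```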
Thanks Josh!
This addresses an issue where Spark could delete files in the current working directory that were added to the job using `addFile()`. I encountered this issue while working on PySpark's code deployment mechanism, which is based on `addFile()`.

From user code's perspective (e.g. UDFs), files added through `addFile()` are assumed to be in the current working directory. For jobs that are run locally using `DAGScheduler.runLocally()`, tasks run with the driver's current working directory. As a result, files added through `addFile()` must be copied to the driver's current working directory, since there's no mechanism to change the CWD in Java. `clearFiles()` and `clearJars()` clean up these files when the driver exits. This can be a problem if the original files that were added were in the driver's current working directory, because this will cause them to be deleted.

A long-term fix would be to hide the location of fetched files from user code by requiring it to access files through an API like `SparkFiles.get("my-file-name.txt")`. This will require changes to user code and may require changes to Shark.
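By way of illustration, here is one rough sketch of what such an API could look like; `SparkFiles` and its `rootDirectory` field are hypothetical names drawn from this description, not existing Spark code:

```scala
import java.io.File

// Hypothetical API sketched from the proposal above: user code asks the
// framework where an added file actually lives instead of assuming it
// sits in the current working directory.
object SparkFiles {
  // Assumed to be set by the framework to the directory where addFile()
  // downloads land (defaults to the CWD here, for the sketch only).
  @volatile var rootDirectory: String = "."

  /** Resolves the name of a file added through addFile() to a local path. */
  def get(filename: String): String =
    new File(rootDirectory, filename).getAbsolutePath
}

// Example use inside a UDF or task:
// val path = SparkFiles.get("my-file-name.txt")
```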
As a short-term fix, this pull request removes the code that deletes files in the current working directory and adds checks to `Utils.fetchFile()` to avoid overwriting existing local files with new data. The one downside of this change is that it may leave junk in the current working directory, but that is preferable to accidentally deleting files.
I've also added `addFile()`/`addJar()` to the Java API.

I also added synchronization to `LocalScheduler.updateDependencies` to avoid performing multiple parallel fetches for the same file.
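A minimal sketch of that synchronization, with assumed bookkeeping (`currentFiles`) and a simplified signature; the real `LocalScheduler` differs:

```scala
import scala.collection.mutable.HashMap

class LocalScheduler {
  // Assumed bookkeeping: file name -> timestamp of the version last fetched.
  private val currentFiles = new HashMap[String, Long]()

  // Marking the whole method as synchronized means two tasks arriving at
  // the same time cannot both decide the same file is missing and fetch
  // it in parallel.
  def updateDependencies(newFiles: HashMap[String, Long]): Unit = synchronized {
    for ((name, timestamp) <- newFiles
         if currentFiles.getOrElse(name, -1L) < timestamp) {
      // The fetch would happen here, e.g. Utils.fetchFile(name, cwd),
      // which now refuses to overwrite a mismatched existing file.
      currentFiles(name) = timestamp
    }
  }
}
```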