SPARK-4687. Add a recursive option to the addFile API #3670
Conversation
Force-pushed from c6254a2 to 57df37c.
Test build #24339 has started for PR 3670.
Hey Sandy - could you explain in a bit more detail the Hive use case? (Sorry, JIRA is down, otherwise I would have commented there.) It's a bit inconsistent that this only works for remote directories, but adding local directories would be a pain. It would just be useful to better understand what is going on and whether there is a way we can facilitate this with a one-off change in Hive rather than adding this new API in Spark.
Test build #24339 has finished for PR 3670.
Test FAILed.
 * Add a directory to be downloaded with this Spark job on every node.
 * The `path` passed must be a directory in HDFS (or other Hadoop-supported
 * filesystems). To access the directory in Spark jobs, use
 * `SparkFiles.get(directoryName)` to find its download location.
It would be good to say that the directory will be recursively copied into the root of Spark's scratch directory.
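For context, here is a minimal usage sketch of the API under discussion (the HDFS path and directory name are invented for illustration):

```scala
import org.apache.spark.{SparkConf, SparkContext, SparkFiles}

object AddFileRecursiveExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("addfile-recursive-example"))

    // Recursively copy an HDFS directory down to every node running tasks for this job.
    sc.addFile("hdfs:///data/lookup-tables", recursive = true)

    sc.parallelize(1 to 4).foreach { _ =>
      // On each executor, the directory is resolved by name from the root of
      // Spark's scratch directory.
      val localDir = new java.io.File(SparkFiles.get("lookup-tables"))
      println(localDir.listFiles().map(_.getName).mkString(", "))
    }

    sc.stop()
  }
}
```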
I think it's fine to add this - it would be good to have a more fully fleshed-out version including tests, etc. To answer your questions:

IMO we can punt on adding local directories and add it later if desired. For now, we should throw an exception if a local directory is passed.

This only answers part of your question, but we need to decide what the semantics of calling `addFile` twice on the same path should be. Popping up a bit, I wonder if the user API should still just be called `addFile`.
"recursive" sounds a little weird to me in the sense that it implies adding a directory with recursive=false could be useful. Would it maybe make more sense to have an allowDirectories flag? |
A bit of archaeology reveals that the original motivation for worrying about overwriting files came from how files added through `addFile` behave in local mode. It looks like we added the same-contents check for that reason. I guess the behavior for directories might be a bit different, since you might want to also account for deletions (e.g. if I delete a file in the directory and then re-add the directory, that file should probably be deleted on the executors as well). RE: the recursive flag, I guess the idea here is something like the recursive-copy options that other file APIs expose.
Yea, my thoughts for the `recursive` flag were along those lines.
I looked around a bit to try to find Java libraries that provide a recursive copy option, to see if there are conventions around the API. It turns out most of them don't - neither Guava nor Java 7's new file utilities have a recursive directory copy, probably because it's trickier to implement.
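To make that concrete, here is a standalone sketch (not code from this patch) of what a recursive copy looks like on top of Java 7's NIO, where a single `Files.copy` call handles only one file (or an empty directory):

```scala
import java.nio.file._
import java.nio.file.attribute.BasicFileAttributes

// Recursively copy src into dst by walking the tree by hand, since the
// standard library offers no one-call recursive directory copy.
def copyRecursively(src: Path, dst: Path): Unit = {
  Files.walkFileTree(src, new SimpleFileVisitor[Path] {
    override def preVisitDirectory(dir: Path, attrs: BasicFileAttributes): FileVisitResult = {
      // Recreate each directory under the destination root.
      Files.createDirectories(dst.resolve(src.relativize(dir)))
      FileVisitResult.CONTINUE
    }
    override def visitFile(file: Path, attrs: BasicFileAttributes): FileVisitResult = {
      // Copy each regular file to its mirrored location.
      Files.copy(file, dst.resolve(src.relativize(file)), StandardCopyOption.REPLACE_EXISTING)
      FileVisitResult.CONTINUE
    }
  })
}
```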
Uploading a further-along work-in-progress patch that switches to `addFile` with a `recursive` flag. Still to do:

I could use help on the second one, @pwendell @JoshRosen, if either of you has any ideas.
Force-pushed from 57df37c to f78a416.
Test build #25556 has started for PR 3670.
Test build #25556 has finished for PR 3670.
Test FAILed.
Force-pushed from f78a416 to 8413c50.
Test build #25562 has started for PR 3670.
Test build #25562 has finished for PR 3670.
Test FAILed.
Test build #25650 has started for PR 3670.
OK, here's a version that's ready for review. It still needs a little more doc, polish, and a test or two, but I would like to get validation on the approach.
Test build #25650 has finished for PR 3670.
Test FAILed.
Test build #25795 has started for PR 3670.
Test build #25795 has finished for PR 3670.
Test FAILed.
Force-pushed from 3d7af57 to 710cbd9.
Test build #26111 has started for PR 3670.
Test build #26111 has finished for PR 3670.
Test build #26521 has finished for PR 3670.
Test PASSed.
 * supported for Hadoop-supported filesystems.
 */
def addFile(path: String, recursive: Boolean): Unit = {
  val isLocalMode = conf.get("spark.master").startsWith("local")
I think that we already have a SparkContext.isLocal variable for this, so I'd use that instead of inspecting SparkConf here.
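For illustration, a tiny runnable check of the field Josh mentions (`IsLocalCheck` and the local master are invented just to demonstrate it):

```scala
import org.apache.spark.{SparkConf, SparkContext}

object IsLocalCheck {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setMaster("local[2]").setAppName("isLocal-check"))
    // SparkContext already tracks whether it is running in local mode,
    // so callers don't need to re-parse spark.master themselves.
    println(s"isLocal = ${sc.isLocal}") // prints: isLocal = true
    sc.stop()
  }
}
```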
Updated patch addresses @JoshRosen's comments.
Test build #26563 has started for PR 3670.
Test build #26563 has finished for PR 3670.
Test FAILed.
To briefly address the questions @sryza raised in the PR description:

1. The user-facing API doesn't need to change in order to support local directories, so we could always choose to add this functionality later if we decide that there's a use case for it.
2. There are a few different layers of caching here. In all Spark versions, individual executors cache added files, so a file will only be downloaded by the first task that requires it. In Spark 1.2.0+, we added a per-host-per-application cache that allows executors for the same application running on the same host to share downloaded files; this is useful in scenarios where you have a huge number of executors per host (see #1616 for more details, and the first sketch after this list for the general shape of the idea).
3. This seems fine, although I guess it could be a lot more expensive to perform this check for a directory than for a single file (a directory-comparison sketch also follows below).
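On the caching point, here's an illustrative sketch (not Spark's actual implementation; `fetchCached` and the file-naming scheme are invented) of how executors sharing a host could coordinate a single download through a file lock:

```scala
import java.io.{File, RandomAccessFile}

// Invented helper: the first caller holding the lock performs the download;
// later callers find the cached copy already present and reuse it.
def fetchCached(url: String, cacheDir: File)(download: File => Unit): File = {
  cacheDir.mkdirs()
  val cached = new File(cacheDir, url.hashCode.toHexString) // invented naming scheme
  val lockFile = new File(cacheDir, cached.getName + ".lock")
  val raf = new RandomAccessFile(lockFile, "rw")
  val lock = raf.getChannel.lock() // blocks until this process holds the lock
  try {
    if (!cached.exists()) {
      download(cached) // only the first caller actually downloads
    }
  } finally {
    lock.release()
    raf.close()
  }
  cached
}
```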
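And on the contents check, a hedged sketch of what "already exists with the same contents" could mean for a directory, comparing file trees recursively (illustrative only; `sameContents` is an invented name, and Guava's `Files.equal` does the byte-by-byte comparison):

```scala
import java.io.File
import com.google.common.io.Files // Guava

// Two paths match if they are files with identical bytes, or directories
// whose children match pairwise by name and contents.
def sameContents(a: File, b: File): Boolean = {
  if (a.isDirectory && b.isDirectory) {
    val as = Option(a.listFiles()).getOrElse(Array.empty[File]).sortBy(_.getName)
    val bs = Option(b.listFiles()).getOrElse(Array.empty[File]).sortBy(_.getName)
    as.length == bs.length && as.zip(bs).forall { case (x, y) =>
      x.getName == y.getName && sameContents(x, y)
    }
  } else if (a.isFile && b.isFile) {
    Files.equal(a, b)
  } else {
    false
  }
}
```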
LGTM. There are additional enhancements that we could make to this in the future, such as PySpark support, support for adding local directories, etc., but merging this PR should not block on them: the design here is sufficient for Hive on Spark's needs and doesn't preclude these future enhancements.
Jenkins, retest this please.
Test build #26826 has started for PR 3670.
Test build #26826 has finished for PR 3670.
Test PASSed.
Since this looks good to me, I'm going to merge it into `master`.
This adds a recursive option to the addFile API to satisfy Hive's needs. It only allows specifying HDFS dirs that will be copied down on every executor.

There are a couple of outstanding questions:

* Should we allow specifying local dirs as well? The best way to do this would probably be to archive them. The drawback is that it would require a fair bit of code that I don't know of any current use cases for.
* The addFiles implementation has a caching component that I don't entirely understand. What events are we caching between? AFAICT it's users calling addFile on the same file in the same app at different times. Do we want/need to add something similar for addDirectory?
* The addFiles implementation will check to see if an added file already exists and has the same contents. I imagine we want the same behavior, so I'm planning to add this unless people think otherwise.

I plan to add some tests if people are OK with the approach.

Author: Sandy Ryza <sandy@cloudera.com>

Closes #3670 from sryza/sandy-spark-4687 and squashes the following commits:

f9fc77f [Sandy Ryza] Josh's comments
70cd24d [Sandy Ryza] Add another test
13da824 [Sandy Ryza] Revert executor changes
38bf94d [Sandy Ryza] Marcelo's comments
ca83849 [Sandy Ryza] Add addFile test
1941be3 [Sandy Ryza] Fix test and avoid HTTP server in local mode
31f15a9 [Sandy Ryza] Use cache recursively and fix some compile errors
0239c3d [Sandy Ryza] Change addDirectory to addFile with recursive
46fe70a [Sandy Ryza] SPARK-4687. Add a addDirectory API

(cherry picked from commit c4b1108)
Signed-off-by: Josh Rosen <joshrosen@databricks.com>
Thanks @JoshRosen!