-
Notifications
You must be signed in to change notification settings - Fork 28.6k
[SPARK-17159][Streaming] optimise check for new files in FileInputDStream #17745
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Closed
steveloughran
wants to merge
1
commit into
apache:master
from
steveloughran:cloud/SPARK-17159-listfiles-minimal
Closed
Changes from all commits
Commits
File filter
Filter by extension
Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
There are no files selected for viewing
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In this approach, we might be fetching a very large list of files and then filtering through the directories. If the fetched, list is too large, then it can be a problem.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So, on looking at the code of glob status, it does filter at the end, so doing something like above might just be ok.
Also globStatus does a listStatus() per child directory or a getFileStatus() in case input pattern is not a glob, each call to listStatus does 3+ http calls and each call to getFileStatus does 2 http calls.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
globStatus is flawed; key limit is that it does a tree walk. It needs to be replaced with an object-store-list specific one. See HADOOP-13371.
The issue with implementing an s3a flat-list and filter is that if the wildcard is a few entries up from the child path and there are lots of children, e..g
then if there are many files under year/month/day entries, all get listed during the filter.
What I think would need to be done is to be able to config the FS to limit the depth of where it switches to bulk listing; so here I could say "depth=2", and so the year=? would be done via globbing, but the month= and day= would be better.
Or maybe just start with making the whole thing optional, and let the caller deal with it.
Anyway, options here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, having an object store specific version of glob, will be broadly helpful. In the mean time, this patch seems to be saving a lot of http requests.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Still a lot; I think we can do a new one.
Latest version of this code is here; I think its time to set up a module in bahir for this