HADOOP-13371. S3A globber to use bulk listObject call over recursive directory scan #203
OK, I see what you've done here: hidden a filter in Path and then used it later on. I like the trick. At the same time, I think it's not going to get past anyone else: it's making a fundamental change to a core class, one that gets serialized around and created a lot. It's not going to be allowed. That's OK though, for the following reason: we have the freedom to add/extend the methods in S3AFS itself, so we can add one which takes a filter as a parameter. If we do it right, we can get this into the FS spec, or at least start negotiating on that topic (needs: spec, tests, etc.), while implementing it in S3AFS without waiting.
Thanks @steveloughran. I understand your point that getting sign-off for core class changes is not easy. At the same time, #204 seems to be a big change, and I was wondering if there is a way to meet somewhere in the middle. I meant this pull request to provide a minimal strategy rather than a complete solution, because I thought it was important to give end users a way to glob things on S3; the current code easily hits OOM. Meanwhile, I will keep trying to contribute to #204, which seems to be the right long-term solution. I have also made a few other fixes related to S3A. My current employer just allowed me to spend 20% of my time contributing back to the community. I hope you don't mind if I mention your name in the pull requests that I am going to file.
Changes to Path aren't going to happen. Sorry. Changes within the s3a code base cause damage restricted to s3a://, so there is less resistance to that kind of change. This PR isn't going to get in. Sorry.
Closing in favor of #204
Author: Boris Shkolnik <boryas@apache.org> Reviewers: Navina Ramesh <navina@apache.org> Closes apache#203 from sborya/DebounceConfig
Hi @steveloughran
This pull request fixes (or at least mitigates) HADOOP-13371.
With this patch, the filter is now passed in before globbing happens.
Previously I hit OOM when globbing large S3 buckets, because the globber kept every candidate path in memory and only applied the filter at the end. This patch applies the filter first, so unnecessary paths are pruned as they are listed. I have applied this patch to our production pipelines, and things run flawlessly.
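To make the difference concrete, here is a minimal sketch of the two strategies. It is not the patch itself and does not use any Hadoop or S3A API: plain strings stand in for object keys, and the class `FilteredListingSketch`, its `Result` type, and both method names are hypothetical names chosen for illustration. The `buffered` field counts how many keys each strategy holds before it can return.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

// Hypothetical illustration only; none of these names exist in Hadoop.
class FilteredListingSketch {

    /** Matching keys plus the number of keys that had to be held in memory. */
    static final class Result {
        final List<String> matches;
        final int buffered;

        Result(List<String> matches, int buffered) {
            this.matches = matches;
            this.buffered = buffered;
        }
    }

    /** Old behaviour: collect every listed key, then filter at the end. */
    static Result filterAtEnd(Iterable<String> listing, Predicate<String> filter) {
        List<String> all = new ArrayList<>();
        for (String key : listing) {
            all.add(key);                 // every key is retained, matching or not
        }
        List<String> matches = new ArrayList<>();
        for (String key : all) {
            if (filter.test(key)) {
                matches.add(key);
            }
        }
        return new Result(matches, all.size());
    }

    /** Patched behaviour: test each key as it is listed and keep only matches. */
    static Result filterFirst(Iterable<String> listing, Predicate<String> filter) {
        List<String> matches = new ArrayList<>();
        for (String key : listing) {
            if (filter.test(key)) {       // non-matching keys are dropped immediately
                matches.add(key);
            }
        }
        return new Result(matches, matches.size());
    }
}
```

Both methods return the same matches; only the amount of retained state differs. With a listing of millions of keys and a selective filter, the first approach holds the whole listing while the second holds only the matches, which is where the OOM went away for us.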
This should be applicable to branch-2.8 as well.
Thanks in advance for reviewing this.