HDFS-13934. Multipart uploaders to be created through FileSystem/FileContext. #1679
Conversation
It's too early to worry about checkstyle failures; I'd like API reviews first. Thanks.

One thing we haven't covered here is what to do about parent directories. Although it is not needed for S3, I would like to say "the parent directory must exist"; the S3A uploader would then add a specific option to disable this check. Why so? Because for real file systems you want to specify the permissions of the parent directory, and I don't want to start adding that to the API given that mkdirs is there.

Note also that while this API would seem sufficient to reimplement the S3A committers, in HADOOP-15183 we added a … For this multipart uploader to scale we'd have to call … Important: we would need the same for copy operations, again to avoid excessive I/O. If I am copying 100 files, I don't want to make 100 * depth(file) calls to S3Guard.
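For illustration only: a sketch of how such a store-specific opt-out might look through the builder API, with a wholly invented option name; nothing here is in the patch.

```java
import java.io.IOException;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.MultipartUploader;
import org.apache.hadoop.fs.Path;

public class ParentCheckSketch {
  static MultipartUploader createUploader(FileSystem fs, Path dest)
      throws IOException {
    return fs.createMultipartUploader(dest)
        // hypothetical option name, for illustration only
        .opt("fs.s3a.multipart.uploader.skip.parent.check", true)
        .build();
  }
}
```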
Test failures are the filter and Har FSs discovering they have a new method they need to handle. I was expecting those, just didn't remember which tests they'd be. BTW, for anyone wondering: yes, you could re-implement the S3A committers through this API... It would be good to have the ability to abort pending uploads under a specific path, which is how we do job aborts; we need that because task failures can stop tasks passing on the details of their pending uploads.
Force-pushed b79900d to 7e3847f
Force-pushed 7e3847f to aa31af9
FS API changes wrong
The API needs to let you declare a file and an (offset, range) for uploads too. Why so? If the client (the AWS SDK client) can stream the data directly without having to load it, then you let it handle the load. And, equally critical for S3, the AWS SDK can retry the failure of a single POST.
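Something like the following hypothetical signature (not part of the patch) would capture that:

```java
import java.io.File;
import java.io.IOException;
import java.util.concurrent.CompletableFuture;
import org.apache.hadoop.fs.PartHandle;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.UploadHandle;

interface FileRangeUpload {
  CompletableFuture<PartHandle> putPart(
      UploadHandle uploadId,  // upload the part belongs to
      int partNumber,         // part number within the upload
      Path dest,              // destination of the whole upload
      File source,            // local file holding the part's data
      long offset,            // start offset of the part within source
      long length)            // number of bytes to upload
      throws IOException;
}
```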
ajfabbri left a comment:
Overall LGTM. Some minor comments. So the FS concat-based multipart uploader is currently synchronous and the S3A one is async, right? Just making sure I'm reading the code right.
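To make the distinction concrete, a minimal sketch (hypothetical names, not Hadoop code) of the two styles a CompletableFuture-returning contract allows:

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class UploadStyles {
  private final ExecutorService executor = Executors.newFixedThreadPool(4);

  /** Synchronous style: do the work, return an already-completed future. */
  CompletableFuture<String> putPartSync(byte[] data) {
    String etag = upload(data);  // blocks the calling thread
    return CompletableFuture.completedFuture(etag);
  }

  /** Asynchronous style: hand the work off to an executor. */
  CompletableFuture<String> putPartAsync(byte[] data) {
    return CompletableFuture.supplyAsync(() -> upload(data), executor);
  }

  private String upload(byte[] data) {
    return "etag-" + data.length;  // stand-in for the real upload
  }
}
```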
 * be vulnerable to eventually consistent listings of current uploads
 * -some may be missed.
 * @param path path to abort uploads under.
 * @return a future of the number of entries found; 1 if aborting is unsupported.
-1, no?
fixed
...oject/hadoop-common/src/main/java/org/apache/hadoop/fs/impl/FileSystemMultipartUploader.java
      Map<Integer, PartHandle> handleMap,
      UploadHandle multipartUploadId) throws IOException {
    ...
    checkPath(filePath);
The caller already did this.
...oject/hadoop-common/src/main/java/org/apache/hadoop/fs/impl/FileSystemMultipartUploader.java
  /** Header for serialized Parts: {@value}. */
  public static final String HEADER = "S3A-part01";
Why do part numbers start with 01 here?
No reason; I just wanted something I could serialize, and this is version 01 of the header.
also, we should let the builder take a …
no: better to do that slow upload in a separate thread.
Aaron, thanks for the review. I'll do the tweaks you suggested, fix checkstyle, and once the tests are happy, merge.
need to rebase
…Context.

This is my proposal for an API for multipart uploads through the FS, rather than via service loading. It builds on the work done in createFile() and openFile() for builders and futures.

* A builder API for configuring the uploader, including specifying the permissions of uploaded files.
* MultipartUploader is now an interface; org.apache.hadoop.fs.impl.AbstractMultipartUploader is the base class for implementations to extend.
* All the operations in the interface now return CompletableFuture<>; implementations can perform the operations synchronously or asynchronously.
* There is a new common path capability for the hasPathCapability() probe.

FileSystemMultipartUploader and its builder moved to fs.impl; I needed a class org.apache.hadoop.fs.InternalOperations to get at rename/3 from the new package. It really is time to make the good rename method public.

S3AMultipartUploader moved to fs.s3a.impl, alongside its builder. The uploader does not get a reference to the base FS; it gets a StoreContext and WriteOperationHelper and has to work with them. All operations are now async and executed in the executor offered by the StoreContext.

Tests are all fixed up to cope with the changed API.

The builders can be created from the file system; we can also extend file context and the filter classes... though I would not pass the operation through checksum FS, as the uploads will not be checksummed.

One thing I am unsure about is viewFS: even though we can deal with the remapping when the builder is created, the actual uploads need the full path in the store. That is, unless this and the WiP copy operation both take a path remapper in the builder, which is then used to map from viewFS paths to ones in their store. Suggestions here are welcome.

No changes to the specification yet. Let's get feedback first.

Change-Id: Ib526fe6db8e9282c634181be231d4124ea6c78fc
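As a usage illustration: a minimal sketch of the builder-plus-futures flow described above, assuming the signatures named in this commit (startUpload/putPart/complete returning futures); treat the details as illustrative rather than definitive.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.util.HashMap;
import java.util.Map;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.MultipartUploader;
import org.apache.hadoop.fs.PartHandle;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.UploadHandle;

public class UploaderSketch {
  static void uploadOnePart(FileSystem fs, Path dest, byte[] data)
      throws Exception {
    // Build an uploader for the destination; builder options would go here.
    try (MultipartUploader uploader =
             fs.createMultipartUploader(dest).build()) {
      // Every operation returns a future; get() waits for completion,
      // so a store is free to run the work sync or async underneath.
      UploadHandle upload = uploader.startUpload(dest).get();
      Map<Integer, PartHandle> parts = new HashMap<>();
      try (InputStream in = new ByteArrayInputStream(data)) {
        parts.put(1,
            uploader.putPart(upload, 1, dest, in, data.length).get());
      }
      uploader.complete(upload, dest, parts).get();
    }
  }
}
```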
excessive DDB IO.

Change-Id: Icbab3ef1e93ad3da87a3b29d11698213d1f0e26c
This is ready to go in pending review and the final builder spec.

Specification:
* update the docs for all but the builder
* builder has interface/impl split; extends existing builder implementation
* FileContext support
* add an optional abortUploadsUnderPath(); invaluable for S3A

Big rework of the S3A implementation:
* new S3AMultipartUploaderStatistics interface (implementation from Instrumentation)
* factor a WriteOperations interface out of WriteOperationsHelper and use that
* found and fixed a DDB S3Guard bug which would fail with more than one file written to the same destination. It doesn't show up in existing workflows, but upload two files to the same destination and it surfaces.

Tests: more!

Change-Id: I1b4aefef51141a53d755e96628d68cebf285f06c
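A sketch of how a job abort could use the new optional operation, assuming the future-of-integer contract discussed in the review above (a negative result when the store does not support the probe):

```java
import org.apache.hadoop.fs.MultipartUploader;
import org.apache.hadoop.fs.Path;

public class AbortSketch {
  // Abort every pending upload under the job's output directory.
  static void abortPending(MultipartUploader uploader, Path jobOutputDir)
      throws Exception {
    // Per the javadoc reviewed earlier, -1 signals "unsupported".
    int aborted = uploader.abortUploadsUnderPath(jobOutputDir).get();
    if (aborted >= 0) {
      System.out.println(
          "Aborted " + aborted + " pending uploads under " + jobOutputDir);
    }
  }
}
```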
especially patch qualification. I had to add a new operation to the S3A StoreContext class; took the opportunity to move to a builder API. Why so, even though I'd argued before that we didn't need it? Well, because the S3A DelegationTokens plugin point now takes it, so must their tests. Also tagged as @LimitedPrivate for the same reason. Thinking maybe I should separate that change, and avoid having a Guava class in the fields... won't work with shading, will it?

Change-Id: I19e8f5219d6fe51002c97694af8001453d8eb6b0
Force-pushed 0873791 to 1d37e6a
- cut disabled local FS test
- finish rebase: fix compilation issues

S3A implementation:
- byte array payload includes path and upload ID, for extra robustness: you can't accidentally mix uploads
- statistics also track bytes uploaded through the API
- S3AMultipartUploaderStatisticsImpl pulled out from Instrumentation; integrates with StorageStatistics and Instrumentation through a single callback. Obviously, MPU stats will be an IOStatisticsSource.

Change-Id: Iec06ddeca885ae174ac0e95c9dab6e28ec8113ea
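To make the robustness point concrete: a hedged sketch of a self-describing part payload carrying header, path and upload ID, so complete() can reject parts from another upload. The exact field layout of the S3A implementation may differ.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;

public class PartPayloadSketch {
  /** Illustrative layout only; not the exact S3A wire format. */
  static byte[] buildPartHandlePayload(
      String path, String uploadId, int partNumber, String etag, long length)
      throws IOException {
    ByteArrayOutputStream bytes = new ByteArrayOutputStream();
    try (DataOutputStream out = new DataOutputStream(bytes)) {
      out.writeUTF("S3A-part01");  // the HEADER constant reviewed above
      out.writeUTF(path);          // destination path of the upload
      out.writeUTF(uploadId);      // upload this part belongs to
      out.writeInt(partNumber);    // part number within the upload
      out.writeUTF(etag);          // etag S3 returned for the part
      out.writeLong(length);       // bytes uploaded in this part
    }
    return bytes.toByteArray();
  }
}
```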
Tested: S3 Ireland. Unguarded failures in ITestS3AConfiguration as expected errors didn't surface, probably because I've disabled bucket existence checks.
How could I miss these?

Change-Id: If6abde1e09c162fddeeeecb8059ad169dceaa2fc
Fixes TestFilterFileSystem to know that the multipart upload API is not passed through. This is to stop checksum FS being bypassed when files are created through the API: FilterFileSystem.hasPathCapability() always returns false for this capability, irrespective of what the inner FS says. Also updates the FileSystem class javadocs to add these tasks to the list of required changes, and improves their formatting.

Change-Id: I6334f295111241850ea63677eae0bf9288985086
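For reference, a sketch of the probe a client would use before choosing this API, assuming the common path capability added here is the FS_MULTIPART_UPLOADER constant in CommonPathCapabilities:

```java
import java.io.IOException;
import org.apache.hadoop.fs.CommonPathCapabilities;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class CapabilityProbeSketch {
  static boolean canMultipartUpload(FileSystem fs, Path dest)
      throws IOException {
    // A FilterFileSystem such as checksum FS answers false here even if
    // the wrapped FS supports the API, per the commit above.
    return fs.hasPathCapability(
        dest, CommonPathCapabilities.FS_MULTIPART_UPLOADER);
  }
}
```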
Checkstyle: line length, indentation. The store context builder complaints were line length and param names.

Change-Id: I8302097e29bf22e22739bf6a8341eabc5916be45
clearly unrelated failure
and some minor checkstyles; the javadoc one is fixed in the IOStats S3-side patch
ajfabbri left a comment:
This commit looks good
ajfabbri left a comment:
Commit looks good.
ajfabbri left a comment:
+1 Latest updated patch LGTM. Didn't run integration tests (don't have a free S3 account).
thanks!