Add AWS subproject and initial S3FileIO implementation #1573
Conversation
Will this not use multipart uploads? And what about the relatively new TransferManager, which uploads parts at specific byte lengths and then manages their commit altogether (and also ensures that readers can pull from S3 at the proper byte ranges for things like footers, etc.)?
Has there been any consideration to using the AWS Java SDK v2? I know that @jacques-n mentioned they found the Java v2 async S3 SDK buggy, but the linked code from their project is using the AWS Java SDK v2 (all of the imports start with software.amazon.awssdk). To me it seems like it would be smarter to start on the newer client version than have to do an upgrade later. My understanding is that the Java SDK v2 is much more performant for most things, as it's the one seeing most of the work. And though I don't doubt @jacques-n's performance/bug issues with the Java SDK v2 async S3 client, I would ask when that was. I've personally noticed that when new clients and new services are brought out by Amazon, they're not always production ready from the start, but many times I've found that things we performance tested 6 months prior were much more performant and resilient later on.
@kbendick For the multipart upload / transfer manager, I do think we should move toward something like that, but I'm trying to keep this initial implementation somewhat simple so it doesn't become a huge commit. We actually used to use the transfer manager, but the last time I looked, it required writing out the full file and then initiating the upload. What we ended up doing was building a progressive upload so that it would split up the stream and upload in parts as the stream was being written to. I think some of these optimizations should be done as follow ups.
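For reference, here is a minimal sketch of what such a progressive multipart upload could look like with the AWS SDK v2 (the thread settles on v2 further down). This is not the implementation described above; the class name, in-memory part buffering, and single-byte write path are illustrative only.

```java
import java.io.ByteArrayOutputStream;
import java.io.OutputStream;
import java.util.ArrayList;
import java.util.List;

import software.amazon.awssdk.core.sync.RequestBody;
import software.amazon.awssdk.services.s3.S3Client;
import software.amazon.awssdk.services.s3.model.CompleteMultipartUploadRequest;
import software.amazon.awssdk.services.s3.model.CompletedMultipartUpload;
import software.amazon.awssdk.services.s3.model.CompletedPart;
import software.amazon.awssdk.services.s3.model.CreateMultipartUploadRequest;
import software.amazon.awssdk.services.s3.model.UploadPartRequest;

// Hypothetical progressive upload: buffer bytes until a part is full, upload the
// part, then commit all parts together when the stream is closed.
public class ProgressiveS3OutputStream extends OutputStream {
  private static final int PART_SIZE = 5 * 1024 * 1024; // S3 minimum part size

  private final S3Client s3;
  private final String bucket;
  private final String key;
  private final String uploadId;
  private final List<CompletedPart> completedParts = new ArrayList<>();
  private ByteArrayOutputStream buffer = new ByteArrayOutputStream();
  private int partNumber = 1;

  public ProgressiveS3OutputStream(S3Client s3, String bucket, String key) {
    this.s3 = s3;
    this.bucket = bucket;
    this.key = key;
    this.uploadId = s3.createMultipartUpload(
        CreateMultipartUploadRequest.builder().bucket(bucket).key(key).build()).uploadId();
  }

  @Override
  public void write(int b) {
    buffer.write(b);
    if (buffer.size() >= PART_SIZE) {
      uploadPart();
    }
  }

  private void uploadPart() {
    // upload the buffered bytes as one part and remember its ETag for the commit
    String etag = s3.uploadPart(
        UploadPartRequest.builder()
            .bucket(bucket).key(key).uploadId(uploadId).partNumber(partNumber).build(),
        RequestBody.fromBytes(buffer.toByteArray())).eTag();
    completedParts.add(CompletedPart.builder().partNumber(partNumber).eTag(etag).build());
    partNumber++;
    buffer = new ByteArrayOutputStream();
  }

  @Override
  public void close() {
    if (buffer.size() > 0) {
      uploadPart(); // the final part may be smaller than the minimum size
    }
    s3.completeMultipartUpload(
        CompleteMultipartUploadRequest.builder()
            .bucket(bucket).key(key).uploadId(uploadId)
            .multipartUpload(CompletedMultipartUpload.builder().parts(completedParts).build())
            .build());
  }
}
```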
I'm on board with doing these optimizations as follow ups 👍. And admittedly I've only used the transfer manager for downloads. You're right, it appears it still requires you to write the full file out. My experience with a transfer-manager-like resource for concurrent multipart uploads from streams is limited to Python. I'm definitely good with keeping things simple to start.
@HeartSaVioR, this would fix that problem because overwrite will simply put the file, rather than checking to see what the path is first. That is the same approach that we use in our S3 file system, but we primarily use HadoopFileIO with a custom S3 file system, rather than a FileIO built for S3, because we've been using the FS for a long time.
Ah OK, you have a custom "file system" implementation for S3. Thanks for the explanation, and happy to see this.
I would say that I've also had issues when exploring the v2 SDK, but more in terms of the completeness of the implementation. For example, they don't have a transfer manager (not that we're using it here), but if we decide to go that route, we would need to go back to v1. Also, at this point most other systems (Spark, S3A, Presto, etc.) are still on v1 as well. If there are documented performance gains or other features in v2, I'd be happy to upgrade, but it seems like the community hasn't really moved in that direction yet.
The SDK v2 is intended to live together with v1 because some old packages such as S3AFileSystem might never upgrade. That is why they have completely different class paths, and you do not need to resolve any dependency conflicts. All the new features related to the client itself will only be developed in v2, so it is always recommended to use the v2 client for new projects when possible. There is a blog dedicated to the new features added in v2. For performance, there are optimizations made for users in the AWS Lambda environment based on this doc. There is no performance benchmark done for HTTP calls, but since v2 supports HTTP/2, it is supposed to be faster when the service enables HTTP/2 traffic. From a feature perspective, yes, the transfer manager is not there, but for Iceberg the most important feature should be the multipart upload, which is there, so I see many more benefits to using v2 instead of v1.
@jackye1995 and @kbendick I think you both make a good case for going forward with v2. After looking into the docs a little, the main point for me was that it provides more control over the underlying HTTP transport, which we may want to take advantage of. The usage in the initial version is pretty basic and should be trivial to switch over to v2 (I have no objection to doing that). The main issue is testing: the S3Mock library's JUnit4 support doesn't appear to cover v2, so we would need to switch to JUnit5. (Are there better testing harnesses for v2?) @rdblue do you have any objection to using JUnit5 for this module?
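As an aside, a minimal sketch of the kind of HTTP transport control the v2 SDK exposes, assuming the Apache HTTP client module is on the classpath; the pool size and timeout values are placeholders, not recommendations.

```java
import java.time.Duration;

import software.amazon.awssdk.http.apache.ApacheHttpClient;
import software.amazon.awssdk.regions.Region;
import software.amazon.awssdk.services.s3.S3Client;

// Sketch: configure the underlying HTTP client when building the S3 client.
public class S3ClientFactory {
  public static S3Client create() {
    return S3Client.builder()
        .region(Region.US_EAST_1)
        .httpClientBuilder(ApacheHttpClient.builder()
            .maxConnections(100)                       // connection pool size
            .connectionTimeout(Duration.ofSeconds(5))  // TCP connect timeout
            .socketTimeout(Duration.ofSeconds(30)))    // per-request read timeout
        .build();
  }
}
```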
I think it would be fine to use JUnit5. We've been considering it for Hive testing as well so we can improve how we parameterize the HiveRunner tests.
What do you mean by the S3Mock library not supporting v2? S3Mock runs as a server, and a client can simply override the endpoint to localhost to communicate with the server. The version of the client should not matter.

Server:

```java
S3MockApplication.start("--server.port=8001");
```

Client:

```java
s3 = S3Client.builder()
    .endpointOverride(URI.create("http://localhost:8001"))
    .region(Region.US_EAST_1) // dummy region
    .credentialsProvider(StaticCredentialsProvider.create(
        AwsBasicCredentials.create("key", "secret"))) // dummy credential
    .build();
```
@jackye1995 Sorry, my comment about S3Mock wasn't that it was not supported with v2, but rather that the S3Mock docs say integration exists with JUnit4 for v1 and with JUnit5 for v1 and v2. I didn't mean to imply that it didn't work, more that the current unit tests need to be updated to either use S3Mock like your example or switch to JUnit5 and use the extensions. I might be able to use a combination of the JUnit4 Rule with a custom client configured as well. It's not a big change overall, but I was also interested in whether the SDK v2 provides something similar to S3Mock for testing. Thanks for the example, I'll give that a shot and see if I can keep the existing JUnit4 tests.
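A hypothetical JUnit4 skeleton along those lines, reusing the S3MockApplication.start call from the example above with a v2 client. The test class and bucket name are made up, and the stop() call on the mock server is assumed to be available.

```java
import java.net.URI;

import org.junit.AfterClass;
import org.junit.BeforeClass;
import org.junit.Test;

import com.adobe.testing.s3mock.S3MockApplication;

import software.amazon.awssdk.auth.credentials.AwsBasicCredentials;
import software.amazon.awssdk.auth.credentials.StaticCredentialsProvider;
import software.amazon.awssdk.regions.Region;
import software.amazon.awssdk.services.s3.S3Client;
import software.amazon.awssdk.services.s3.model.CreateBucketRequest;

// Sketch: start the S3Mock server once per class and point a v2 client at it.
public class TestS3MockWithV2Client {
  private static S3MockApplication mockServer;
  private static S3Client s3;

  @BeforeClass
  public static void startMock() {
    mockServer = S3MockApplication.start("--server.port=8001");
    s3 = S3Client.builder()
        .endpointOverride(URI.create("http://localhost:8001"))
        .region(Region.US_EAST_1) // dummy region
        .credentialsProvider(StaticCredentialsProvider.create(
            AwsBasicCredentials.create("key", "secret"))) // dummy credentials
        .build();
  }

  @AfterClass
  public static void stopMock() {
    mockServer.stop();
  }

  @Test
  public void createsBucketAgainstMock() {
    s3.createBucket(CreateBucketRequest.builder().bucket("test-bucket").build());
  }
}
```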
@jackye1995 and @kbendick I updated to the AWS Java SDK v2 for almost everything. I do feel the new API has a few rough edges compared to SDK v1 (e.g., the range() specification is less intuitive, and URI parsing functionality like AmazonS3URI is missing), but overall it wasn't much effort. I'm still using v1 for URI parsing (if you're aware of something better in v2, please let me know). It turns out the tests were easier than I thought because S3Mock had a v2 client that just wasn't called out in the docs. Any feedback is appreciated.
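To illustrate the range() point: in the v2 SDK the request takes the raw HTTP Range header string rather than start and end offsets. A small sketch, with dummy bucket and key values:

```java
import software.amazon.awssdk.core.ResponseInputStream;
import software.amazon.awssdk.services.s3.S3Client;
import software.amazon.awssdk.services.s3.model.GetObjectRequest;
import software.amazon.awssdk.services.s3.model.GetObjectResponse;

// Sketch: a ranged read with the v2 client.
public class RangeReadExample {
  public static ResponseInputStream<GetObjectResponse> readRange(
      S3Client s3, long start, long end) {
    GetObjectRequest request = GetObjectRequest.builder()
        .bucket("bucket")
        .key("path/to/data.parquet")
        .range("bytes=" + start + "-" + end) // e.g. "bytes=0-1023" for the first 1 KiB
        .build();
    return s3.getObject(request);
  }
}
```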
Dan and I had a discussion about
I had to make some significant changes in order to make create work in what seems like a reasonable way, but there are lots of behavioral things to consider. For example, S3InputFile caches the exists call (as does S3OutputFile now). I believe we do that to optimize for a traditional exists() -> contentLength() -> open() pattern so that we don't make multiple requests to get the same information from S3. However, this may not make sense in the create path. This also exposes some strange aspects of the FileIO API in that
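A rough sketch of the caching pattern described above, where one HEAD request is shared by exists() and contentLength(); this is not the PR's actual S3InputFile, and the class and method names are illustrative.

```java
import software.amazon.awssdk.services.s3.S3Client;
import software.amazon.awssdk.services.s3.model.HeadObjectRequest;
import software.amazon.awssdk.services.s3.model.HeadObjectResponse;
import software.amazon.awssdk.services.s3.model.S3Exception;

// Sketch: cache the HEAD result so repeated exists()/contentLength() calls
// only hit S3 once. Note the staleness issue this creates for the create path.
public class CachedS3Metadata {
  private final S3Client s3;
  private final String bucket;
  private final String key;
  private HeadObjectResponse metadata; // null until the first lookup
  private boolean checked;

  public CachedS3Metadata(S3Client s3, String bucket, String key) {
    this.s3 = s3;
    this.bucket = bucket;
    this.key = key;
  }

  public boolean exists() {
    return metadata() != null;
  }

  public long contentLength() {
    HeadObjectResponse meta = metadata();
    if (meta == null) {
      throw new IllegalStateException("Object does not exist: s3://" + bucket + "/" + key);
    }
    return meta.contentLength();
  }

  private HeadObjectResponse metadata() {
    if (!checked) {
      checked = true;
      try {
        metadata = s3.headObject(HeadObjectRequest.builder().bucket(bucket).key(key).build());
      } catch (S3Exception e) {
        if (e.statusCode() != 404) {
          throw e;
        }
        metadata = null; // cached negative result; stale if the object is created later
      }
    }
    return metadata;
  }
}
```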
@jackye1995 and @rdblue I created a very basic URI parser. There are some advantages to leaving this simple because there are a number of technically valid schemes (e.g. https, s3, s3a, s3n (we still have this form)), and it may even support signed URLs (I didn't look into that). Since we're already assuming that this will only be used for S3, maybe we don't want to be too opinionated about the validation. Thoughts?
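A hypothetical sketch of a deliberately permissive parser in that spirit: it only splits a location into bucket and key and does not restrict the scheme. The class name is made up and is not the class added in this PR.

```java
// Sketch: accept any "scheme://bucket/key" location and split it into parts.
public class SimpleS3Location {
  private final String bucket;
  private final String key;

  public SimpleS3Location(String location) {
    int schemeEnd = location.indexOf("://");
    if (schemeEnd < 0) {
      throw new IllegalArgumentException("Invalid S3 location: " + location);
    }
    String path = location.substring(schemeEnd + 3); // drop "s3://", "s3a://", etc.
    int slash = path.indexOf('/');
    if (slash < 0) {
      throw new IllegalArgumentException("Missing key in S3 location: " + location);
    }
    this.bucket = path.substring(0, slash);
    this.key = path.substring(slash + 1);
  }

  public String bucket() {
    return bucket;
  }

  public String key() {
    return key;
  }
}
```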
Looks like a bug in an error message was caught by errorprone.
Interesting, I ran checks locally and it didn't hit (maybe because I was just running the build for the iceberg-aws project). Pushed a fix for that.
Looks like this PR is close to getting merged. I will wait on the Glue side and switch to using S3FileIO as the default IO. And just FYI, I also have the KMS and multipart upload features ready, and will create PRs once this one is merged.
Everything looks good to me. I'll merge this.
I resolved the conflicts with #1728 in LICENSE and versions.props. Just waiting for the checks to pass and we can merge.
Merged! Thanks for the comments and reviews, everyone!
This is a basic initial implementation of an S3-based FileIO that includes reading and writing objects using `putObject` and `getObject`.
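For context, a minimal sketch of what single-request reads and writes with putObject and getObject look like in the v2 SDK; the bucket, key, and client setup are placeholders, not the PR's actual S3FileIO classes.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

import software.amazon.awssdk.core.sync.RequestBody;
import software.amazon.awssdk.services.s3.S3Client;
import software.amazon.awssdk.services.s3.model.GetObjectRequest;
import software.amazon.awssdk.services.s3.model.PutObjectRequest;

// Sketch: write an object in one request, then stream it back.
public class PutGetExample {
  public static void main(String[] args) throws IOException {
    S3Client s3 = S3Client.create();
    String bucket = "my-bucket";
    String key = "path/to/object.txt";

    // write: upload the whole object with a single putObject call
    s3.putObject(
        PutObjectRequest.builder().bucket(bucket).key(key).build(),
        RequestBody.fromString("hello iceberg", StandardCharsets.UTF_8));

    // read: getObject returns an InputStream over the object's bytes
    try (InputStream in = s3.getObject(
        GetObjectRequest.builder().bucket(bucket).key(key).build())) {
      System.out.println(new String(in.readAllBytes(), StandardCharsets.UTF_8));
    }
  }
}
```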