
Add AWS subproject and initial S3FileIO implementation #1573

Merged (9 commits) on Nov 7, 2020

Conversation

@danielcweeks (Contributor)

This is a basic initial implementation of an S3-based FileIO that includes reading and writing objects using putObject and getObject.
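
For reference, a minimal illustrative sketch (not the code in this PR) of how a FileIO read/write might map onto getObject/putObject, shown with the AWS SDK v2 that the PR eventually adopts; the class and method names here are hypothetical:

import java.io.InputStream;
import software.amazon.awssdk.core.sync.RequestBody;
import software.amazon.awssdk.services.s3.S3Client;
import software.amazon.awssdk.services.s3.model.GetObjectRequest;
import software.amazon.awssdk.services.s3.model.PutObjectRequest;

class S3FileIOSketch {
  private final S3Client s3 = S3Client.create();

  // What an InputFile's newStream() might delegate to: a single getObject call.
  InputStream read(String bucket, String key) {
    return s3.getObject(GetObjectRequest.builder().bucket(bucket).key(key).build());
  }

  // What an OutputFile's write path might delegate to: a single putObject call (no multipart yet).
  void write(String bucket, String key, java.nio.file.Path source) {
    s3.putObject(
        PutObjectRequest.builder().bucket(bucket).key(key).build(),
        RequestBody.fromFile(source));
  }
}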

@danielcweeks requested a review from rdblue on October 9, 2020 at 23:17
@kbendick (Contributor)

> This is a basic initial implementation of an S3-based FileIO that includes reading and writing objects using putObject and getObject.

Will this not use multipart uploads? And what about the relatively new TransferManager, which uploads parts at specific byte lengths and then manages their commit altogether (and also ensures that readers can pull from S3 at the proper byte ranges for things like footers, etc.)?

@kbendick (Contributor)

Has there been any consideration of using the AWS Java SDK v2? I know that @jacques-n mentioned they found the v2 async S3 client buggy, but the linked code from their project is using the AWS Java SDK v2 (all of the imports start with software.amazon).

To me it seems smarter to start on the newer client version than to have to do an upgrade later. My understanding is that the Java SDK v2 is much more performant for most things, as it's the one seeing most of the work. And though I don't doubt @jacques-n's performance/bug issues with the v2 async S3 client, I would ask when that was. I've personally noticed that when new clients and new services are brought out by Amazon, they're not always production ready from the start, but many times I've found that things we performance tested 6 months prior were much more performant and resilient later on.

@danielcweeks (Contributor, Author)

@kbendick For the multipart upload/transfer manager, I do think we should move toward something like that, but I'm trying to keep this initial implementation somewhat simple so it doesn't become a huge commit.

We actually used to use the transfer manager, but the last time I looked, it required writing out the full file and then initiating the upload. What we ended up doing was building a progressive upload so that it would split up the stream and upload parts as the stream was being written.

I think some of these optimizations should be done as follow-ups.
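
For context, the progressive approach described above roughly maps onto the SDK v2 low-level multipart calls like this (a hedged sketch that this PR does not use; buffering thresholds, retries, and abort handling are omitted, and all names are illustrative):

import java.util.ArrayList;
import java.util.List;
import software.amazon.awssdk.core.sync.RequestBody;
import software.amazon.awssdk.services.s3.S3Client;
import software.amazon.awssdk.services.s3.model.CompleteMultipartUploadRequest;
import software.amazon.awssdk.services.s3.model.CompletedMultipartUpload;
import software.amazon.awssdk.services.s3.model.CompletedPart;
import software.amazon.awssdk.services.s3.model.CreateMultipartUploadRequest;
import software.amazon.awssdk.services.s3.model.UploadPartRequest;

class ProgressiveUploadSketch {
  void upload(S3Client s3, String bucket, String key, List<byte[]> bufferedParts) {
    // 1. Start the multipart upload before the full file exists.
    String uploadId = s3.createMultipartUpload(
        CreateMultipartUploadRequest.builder().bucket(bucket).key(key).build()).uploadId();

    // 2. Upload each buffered chunk as it fills, while the stream is still being written.
    List<CompletedPart> completed = new ArrayList<>();
    int partNumber = 1;
    for (byte[] part : bufferedParts) {
      String etag = s3.uploadPart(
          UploadPartRequest.builder()
              .bucket(bucket).key(key).uploadId(uploadId).partNumber(partNumber).build(),
          RequestBody.fromBytes(part)).eTag();
      completed.add(CompletedPart.builder().partNumber(partNumber).eTag(etag).build());
      partNumber++;
    }

    // 3. Commit all parts once the stream closes.
    s3.completeMultipartUpload(
        CompleteMultipartUploadRequest.builder()
            .bucket(bucket).key(key).uploadId(uploadId)
            .multipartUpload(CompletedMultipartUpload.builder().parts(completed).build())
            .build());
  }
}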

@kbendick (Contributor) commented Oct 10, 2020

> @kbendick For the multipart upload/transfer manager, I do think we should move toward something like that, but I'm trying to keep this initial implementation somewhat simple so it doesn't become a huge commit.
>
> We actually used to use the transfer manager, but the last time I looked, it required writing out the full file and then initiating the upload. What we ended up doing was building a progressive upload so that it would split up the stream and upload parts as the stream was being written.
>
> I think some of these optimizations should be done as follow-ups.

I'm on board with doing these optimizations as follow-ups 👍.

And admittedly I've only used the transfer manager for downloads. You're right, it appears it still requires you to write the full file out. My experience with a transfer-manager-like resource for concurrent multipart uploads from streams is limited to Python.

I'm definitely good with keeping things simple to start.

@HeartSaVioR (Contributor)

I haven't looked into the details (and I'm a novice in the AWS environment), but I just wanted to check: does this address #1398? I see @rdblue said "our S3 file system ... avoided this", and I'm curious whether this is the S3 file system implementation mentioned there, or another one.

@rdblue (Contributor) commented Oct 12, 2020

@HeartSaVioR, this would fix that problem because overwrite will simply put the file, rather than checking first to see what is at the path. That is the same approach that we use in our S3 file system, but we primarily use HadoopFileIO with a custom S3 file system, rather than a FileIO built for S3, because we've been using the FS for a long time.

@HeartSaVioR (Contributor)

Ah OK, you have a custom "file system" implementation for S3. Thanks for the explanation, and happy to see this.

@danielcweeks (Contributor, Author)

> Has there been any consideration of using the AWS Java SDK v2? I know that @jacques-n mentioned they found the v2 async S3 client buggy, but the linked code from their project is using the AWS Java SDK v2 (all of the imports start with software.amazon).
>
> To me it seems smarter to start on the newer client version than to have to do an upgrade later. My understanding is that the Java SDK v2 is much more performant for most things, as it's the one seeing most of the work. And though I don't doubt @jacques-n's performance/bug issues with the v2 async S3 client, I would ask when that was. I've personally noticed that when new clients and new services are brought out by Amazon, they're not always production ready from the start, but many times I've found that things we performance tested 6 months prior were much more performant and resilient later on.

I would say that I've also had issues when exploring the v2 SDK, but more in terms of completeness of the implementation. For example, they don't have a transfer manager (not that we're using one here), but if we decide to go that route, we would need to go back to v1. Also, at this point most other systems (Spark, S3A, Presto, etc.) are still on v1 as well. If there are documented performance or other features in v2, I'd be happy to upgrade, but it seems like the community hasn't really moved in that direction yet.

@jackye1995 (Contributor)

> Has there been any consideration of using the AWS Java SDK v2? I know that @jacques-n mentioned they found the v2 async S3 client buggy, but the linked code from their project is using the AWS Java SDK v2 (all of the imports start with software.amazon).
>
> To me it seems smarter to start on the newer client version than to have to do an upgrade later. My understanding is that the Java SDK v2 is much more performant for most things, as it's the one seeing most of the work. And though I don't doubt @jacques-n's performance/bug issues with the v2 async S3 client, I would ask when that was. I've personally noticed that when new clients and new services are brought out by Amazon, they're not always production ready from the start, but many times I've found that things we performance tested 6 months prior were much more performant and resilient later on.
>
> I would say that I've also had issues when exploring the v2 SDK, but more in terms of completeness of the implementation. For example, they don't have a transfer manager (not that we're using one here), but if we decide to go that route, we would need to go back to v1. Also, at this point most other systems (Spark, S3A, Presto, etc.) are still on v1 as well. If there are documented performance or other features in v2, I'd be happy to upgrade, but it seems like the community hasn't really moved in that direction yet.

The SDK v2 is intended to live together with v1 because some old packages such as S3AFileSystem might never upgrade. That is why they have completely different class paths, and you do not need to resolve any dependency conflicts.

All the new features related to the client itself will only be developed in v2, so it is always recommended to use the v2 client when possible for new projects. There is a blog that is dedicated to new features added to v2.

For performance, there are optimizations made for users in the AWS Lambda environment, based on this doc. There is no performance benchmark for HTTP calls, but since v2 supports HTTP/2, it is supposed to be faster when the service enables HTTP/2 traffic.

From a feature perspective, yes, the transfer manager is not there, but for Iceberg the most important feature should be multipart upload, which is there, so I see many more benefits to using v2 instead of v1.
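
As an aside on the coexistence and HTTP/2 points above, a minimal sketch (assuming the standard v2 packages; not code from this PR): the two SDKs live in separate namespaces, and HTTP/2 is requested on the v2 async client through its Netty HTTP client builder.

// v1 and v2 can coexist on one classpath because their packages differ:
//   v1: com.amazonaws.services.s3.AmazonS3
//   v2: software.amazon.awssdk.services.s3.S3Client
import software.amazon.awssdk.http.Protocol;
import software.amazon.awssdk.http.nio.netty.NettyNioAsyncHttpClient;
import software.amazon.awssdk.services.s3.S3AsyncClient;

class Http2ClientSketch {
  // Requesting HTTP/2 on the async client; whether it is used depends on the service endpoint.
  static S3AsyncClient build() {
    return S3AsyncClient.builder()
        .httpClientBuilder(NettyNioAsyncHttpClient.builder().protocol(Protocol.HTTP2))
        .build();
  }
}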

@danielcweeks (Contributor, Author)

@jackye1995 and @kbendick I think you both make a good case for going forward with v2. After looking into the docs a little, the main point for me was that it provides more control over the underlying HTTP transport, which we may want to take advantage of.

The usage in the initial version is pretty basic and should be trivial to switch over to v2 (I have no objection to doing that).

The main issue is testing support: the S3Mock library's JUnit4 integration doesn't appear to support v2, so we would need to switch to JUnit5. (Are there better testing harnesses in/for v2?) @rdblue, do you have any objection to using JUnit5 for this module?
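
To illustrate the transport control mentioned above, here is a small, hypothetical example of tuning the v2 sync client's HTTP layer (the values are placeholders, not settings from this PR):

import java.time.Duration;
import software.amazon.awssdk.http.apache.ApacheHttpClient;
import software.amazon.awssdk.services.s3.S3Client;

class TunedS3ClientSketch {
  static S3Client build() {
    return S3Client.builder()
        .httpClientBuilder(ApacheHttpClient.builder()
            .maxConnections(64)                         // placeholder connection pool size
            .connectionTimeout(Duration.ofSeconds(5))   // placeholder connect timeout
            .socketTimeout(Duration.ofSeconds(30)))     // placeholder read timeout
        .build();
  }
}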

@rdblue (Contributor) commented Oct 14, 2020

I think it would be fine to use JUnit5. We've been considering it for Hive testing as well, so we can improve how we parameterize the HiveRunner tests.

@jackye1995 (Contributor)

> @jackye1995 and @kbendick I think you both make a good case for going forward with v2. After looking into the docs a little, the main point for me was that it provides more control over the underlying HTTP transport, which we may want to take advantage of.
>
> The usage in the initial version is pretty basic and should be trivial to switch over to v2 (I have no objection to doing that).
>
> The main issue is testing support: the S3Mock library's JUnit4 integration doesn't appear to support v2, so we would need to switch to JUnit5. (Are there better testing harnesses in/for v2?) @rdblue, do you have any objection to using JUnit5 for this module?

What do you mean by the S3Mock library not supporting v2? S3Mock runs as a server, and a client can simply override the endpoint to localhost to communicate with the server. The version of the client should not matter.

server:

S3MockApplication.start("--server.port=8001");

client:

s3 = S3Client.builder()
            .endpointOverride(URI.create("http://localhost:8001"))
            .region(Region.US_EAST_1) // dummy region
            .credentialsProvider(StaticCredentialsProvider.create(
                    AwsBasicCredentials.create("key", "secret"))) // dummy credential
            .build();

@danielcweeks (Contributor, Author)

@jackye1995 Sorry, my comment about S3Mock wasn't that it isn't supported with v2, but rather that the S3Mock docs say integration exists for JUnit4 with v1 and for JUnit5 with v1 and v2. I didn't mean to imply that it didn't work, more that the current unit tests need to be updated to either use S3Mock like your example or switch to JUnit5 and use the extensions.

I might be able to use a combination of the JUnit4 Rule with a custom client configured as well. It's not a big change overall, but I was also interested in whether the SDK v2 provides something similar to S3Mock for testing.

Thanks for the example, I'll give that a shot and see if I can keep the existing JUnit4 tests.

@danielcweeks (Contributor, Author) commented Oct 15, 2020

@jackye1995 and @kbendick I updated to the AWS Java SDK v2 for almost everything. I do feel the new API has a few rough edges compared to SDK v1 (e.g. the range() specification is less intuitive, and URI parsing functionality like AmazonS3URI is missing), but overall it wasn't much effort. I'm still using v1 for URI parsing (if you're aware of something better in v2, please let me know). It turns out the tests were easier than I thought because S3Mock had a v2 client that just wasn't called out in the docs.

Any feedback is appreciated.
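
For readers following along, the range() difference mentioned above roughly looks like this (a hedged sketch; the helper and parameter names are illustrative): v1 takes numeric start/end positions, while v2 takes the raw HTTP Range header string.

import software.amazon.awssdk.services.s3.model.GetObjectRequest;

class RangedReadSketch {
  // v1 took positions: new GetObjectRequest(bucket, key).withRange(position, position + length - 1)
  // v2 takes the Range header string instead:
  static GetObjectRequest rangedRead(String bucket, String key, long position, long length) {
    return GetObjectRequest.builder()
        .bucket(bucket)
        .key(key)
        .range(String.format("bytes=%d-%d", position, position + length - 1))
        .build();
  }
}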

@rdblue (Contributor) commented Oct 29, 2020

Dan and I had a discussion about create and came to an agreement to implement it, but to make sure the docs for it don't make claims about atomicity. It is still useful to have it because some engines may use create without overwrite to write data files. And although there are no guarantees when used concurrently, at least only one writer wins. We've had issues with concurrent writes corrupting version-hint.txt files in other file systems, so a lack of guarantees isn't unique to S3.

@danielcweeks (Contributor, Author)

> Dan and I had a discussion about create and came to an agreement to implement it, but to make sure the docs for it don't make claims about atomicity. It is still useful to have it because some engines may use create without overwrite to write data files. And although there are no guarantees when used concurrently, at least only one writer wins. We've had issues with concurrent writes corrupting version-hint.txt files in other file systems, so a lack of guarantees isn't unique to S3.

I had to make some significant changes in order to make create work in what seems like a reasonable way, but there are lots of behavioral things to consider. For example, S3InputFile caches the exists call (as does the S3OutputFile now). I believe we do that to optimize for a traditional exists() -> contentLength() -> open() pattern so that we don't make multiple requests to get the same information from S3. However, this may not make sense in the create path.
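
A minimal sketch of that caching pattern, assuming the v2 headObject call and hypothetical field and method names (not the actual S3InputFile code): one HEAD request backs both exists() and contentLength(), so the exists() -> contentLength() -> open() sequence costs a single metadata request.

import software.amazon.awssdk.services.s3.S3Client;
import software.amazon.awssdk.services.s3.model.HeadObjectRequest;
import software.amazon.awssdk.services.s3.model.HeadObjectResponse;
import software.amazon.awssdk.services.s3.model.NoSuchKeyException;

class CachedMetadataSketch {
  private final S3Client s3;
  private final String bucket;
  private final String key;
  private HeadObjectResponse metadata;   // cached after the first lookup

  CachedMetadataSketch(S3Client s3, String bucket, String key) {
    this.s3 = s3;
    this.bucket = bucket;
    this.key = key;
  }

  boolean exists() {
    try {
      return metadata() != null;
    } catch (NoSuchKeyException e) {
      return false;
    }
  }

  long contentLength() {
    return metadata().contentLength();   // reuses the cached HEAD response
  }

  private HeadObjectResponse metadata() {
    if (metadata == null) {
      metadata = s3.headObject(HeadObjectRequest.builder().bucket(bucket).key(key).build());
    }
    return metadata;
  }
}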

This also exposes some strange aspects of the FileIO API, in that deleteFile is a function of FileIO while exists() is a function of InputFile (though not OutputFile).

@danielcweeks (Contributor, Author)

@jackye1995 and @rdblue I created a very basic S3URI class to replace AmazonS3URI. For now, it makes no assumptions other than that it is a valid URI, the authority equates to the bucket, and the path equates to the key.

There are some advantages to leaving this simple because there are a number of technically valid schemes (e.g. https, s3, s3a, s3n (we still have this form)), and we may even want to support signed URLs (I didn't look into that). Since we're already assuming that this will only be used for S3, maybe we don't want to be too opinionated about the validation. Thoughts?
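
For illustration, the kind of parsing described above could look roughly like this (a hypothetical sketch, not the S3URI class in this PR):

import java.net.URI;

class S3UriSketch {
  private final String bucket;
  private final String key;

  S3UriSketch(String location) {
    URI uri = URI.create(location);    // accepts s3://, s3a://, s3n://, https://, ...
    this.bucket = uri.getAuthority();  // authority -> bucket
    String path = uri.getPath();
    this.key = path.startsWith("/") ? path.substring(1) : path;  // path -> key
  }
}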

@danielcweeks requested a review from rdblue on November 6, 2020 at 18:46
@github-actions bot added the build label on Nov 6, 2020
@rdblue (Contributor) commented Nov 6, 2020

Looks like a bug in an error message caught by errorprone:

/home/runner/work/iceberg/iceberg/aws/src/main/java/org/apache/iceberg/aws/s3/S3URI.java:64: error: [FormatStringAnnotation] extra format arguments: used 0, provided 1
    ValidationException.check(VALID_SCHEMES.contains(scheme.toLowerCase()), "Invalid scheme: ", scheme);

@danielcweeks (Contributor, Author)

> Looks like a bug in an error message caught by errorprone:
>
> /home/runner/work/iceberg/iceberg/aws/src/main/java/org/apache/iceberg/aws/s3/S3URI.java:64: error: [FormatStringAnnotation] extra format arguments: used 0, provided 1
>     ValidationException.check(VALID_SCHEMES.contains(scheme.toLowerCase()), "Invalid scheme: ", scheme);

Interesting, I ran the checks locally and it didn't hit (maybe because I was only running the build for the iceberg-aws project). Pushed a fix for that.
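
Presumably the fix is just to add the missing placeholder to the message template so the argument count matches, along the lines of:

ValidationException.check(VALID_SCHEMES.contains(scheme.toLowerCase()), "Invalid scheme: %s", scheme);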

@jackye1995 (Contributor)

Looks like this PR is close to getting merged. I will wait on the Glue side and switch to using S3FileIO as the default IO. And just FYI, I also have the KMS and multipart upload features ready, and will create PRs once this one is pushed.

@rdblue (Contributor) left a comment

Everything looks good to me. I'll merge this.

@rdblue (Contributor) commented Nov 7, 2020

I resolved the conflicts with #1728 in LICENSE and versions.props. Just waiting for the checks to pass and we can merge.

@rdblue merged commit 2b9234f into apache:master on Nov 7, 2020
@rdblue (Contributor) commented Nov 7, 2020

Merged! Thanks for the comments and reviews, everyone!
