HADOOP-17198 Support S3 AccessPoint #3260
Conversation
Turns out merge with
Alright, ran the tests in
I think the ARN option MUST ONLY be set on a per-bucket basis; a global one makes no sense.
Could be done in FS.initialize by constructing the final name and querying.
InternalConstants.ARN_BUCKET_OPTION = "fs.s3a.%s.access-point.arn"
// initialize
arn = String.format(ARN_BUCKET_OPTION, getHost())
For tests, removeBaseAndBucketOverrides() will unset it for the test bucket:
removeBaseAndBucketOverrides(conf, "fs.s3a.access-point.arn")
// and to set it:
conf.set(String.format(ARN_BUCKET_OPTION, getHost()), "something")
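The key construction sketched above can be shown as runnable plain Java. `ArnOptionSketch` and its method name are illustrative helpers, not the actual S3AFileSystem code; only the `String.format` pattern comes from the comment above.

```java
public class ArnOptionSketch {
    // Pattern from the review comment above: %s is replaced by the bucket host.
    static final String ARN_BUCKET_OPTION = "fs.s3a.%s.access-point.arn";

    // Build the per-bucket property key for a given bucket name.
    static String arnOptionForBucket(String bucket) {
        return String.format(ARN_BUCKET_OPTION, bucket);
    }

    public static void main(String[] args) {
        // prints fs.s3a.mybucket.access-point.arn
        System.out.println(arnOptionForBucket("mybucket"));
    }
}
```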
hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/ArnResource.java
hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/Constants.java
hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/S3AFileSystem.java
hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/S3AFileSystem.java
hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/S3AFileSystem.java
@@ -2570,6 +2614,11 @@ protected S3ListResult continueListObjects(S3ListRequest request,
    OBJECT_CONTINUE_LIST_REQUEST,
    () -> {
      if (useListV1) {
        if (accessPoint != null) {
          // AccessPoints are not compatible with V1List
          throw new InvalidRequestException("ListV1 is not supported by AccessPoints");
do you think maybe during initialize it should just downgrade?
Yep, good idea, upgrading it is!
(side issue: reviewing snowball support. it's a v1 API only)
Actually, I think we could just fail and let whoever is editing the settings deal with it. v1 is not the default, and the only place we recommend it is for 3rd party implementations. If someone changes the list option, things fail.
but propose: including the config option in the text, e.g.
"v1 list API configured in " + LIST_VERSION + " is not supported by access points"
I upgraded list to V2 in the `initialize` method. I'm thinking I should make that logging more explicit there and completely remove this extra check + throw of `InvalidRequestException`. Should be enough, right?
If not, remove that and skip all V1 list tests (since they're using V2 anyway)
I see you've reverted this & letting the SDK fail it. worksforme
hadoop-tools/hadoop-aws/src/site/markdown/tools/hadoop-aws/index.md
hadoop-tools/hadoop-aws/src/test/java/org/apache/hadoop/fs/s3a/ITestS3AConfiguration.java
...s/hadoop-aws/src/test/java/org/apache/hadoop/fs/s3a/ITestS3AContractGetFileStatusV1List.java
hadoop-tools/hadoop-aws/src/test/java/org/apache/hadoop/fs/s3a/TestArnResource.java
Added a few comments.
hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/ArnResource.java
hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/S3AFileSystem.java
</property>
```

While keeping the global `accesspoint.arn` property set to empty `" "` which is the default.
I think we discussed it to be "" above so the docs should be consistent as well.
Yeah, my fault for mixing it up. I thought the default was `" "` not `""` for properties.
sometimes " " can be set in a conf to force an override which getTrimmed() will then downgrade to "". No need to worry about that in these docs
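The trimming behaviour described here can be shown in isolation. This is a minimal sketch using `String.trim()` as a stand-in for Hadoop's `Configuration.getTrimmed()`; the helper below is hypothetical, not the Hadoop API itself.

```java
public class TrimSketch {
    // Stand-in for Configuration.getTrimmed(name, defaultValue):
    // null means "unset" and yields the default; otherwise the value is trimmed.
    static String getTrimmed(String raw, String dflt) {
        if (raw == null) {
            return dflt;
        }
        return raw.trim();
    }

    public static void main(String[] args) {
        // A value of " " set to force an override trims down to "".
        System.out.println("[" + getTrimmed(" ", "unset") + "]");
    }
}
```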
hadoop-tools/hadoop-aws/src/site/markdown/tools/hadoop-aws/index.md
hadoop-tools/hadoop-aws/src/site/markdown/tools/hadoop-aws/index.md
...tools/hadoop-aws/src/test/java/org/apache/hadoop/fs/s3a/ITestS3AClientSideEncryptionKms.java
hadoop-tools/hadoop-aws/src/test/java/org/apache/hadoop/fs/s3a/ITestS3AConfiguration.java
hadoop-tools/hadoop-aws/src/test/java/org/apache/hadoop/fs/s3a/S3ATestUtils.java
hadoop-tools/hadoop-aws/src/test/java/org/apache/hadoop/fs/s3a/TestArnResource.java
...tools/hadoop-aws/src/test/java/org/apache/hadoop/fs/s3a/performance/AbstractS3ACostTest.java
Doing a small/quick review, because I am seeing some failures with these conditions:
- AP set, S3Guard ON.
❯ bin/hadoop fs -ls s3a://mehakmeet-singh-data/
2021-08-06 13:38:05,104 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2021-08-06 13:38:05,259 INFO s3a.S3AFileSystem: Using AccessPoint ARN "arn:aws:s3:ap-south-1:152813717728:accesspoint/mmt-ap" for bucket mehakmeet-singh-data
2021-08-06 13:38:05,337 INFO impl.MetricsConfig: Loaded properties from hadoop-metrics2.properties
2021-08-06 13:38:05,394 INFO impl.MetricsSystemImpl: Scheduled Metric snapshot period at 10 second(s).
2021-08-06 13:38:05,394 INFO impl.MetricsSystemImpl: s3a-file-system metrics system started
2021-08-06 13:38:06,593 ERROR s3guard.S3Guard: Failed to instantiate metadata store org.apache.hadoop.fs.s3a.s3guard.DynamoDBMetadataStore defined in fs.s3a.metadatastore.impl: java.nio.file.AccessDeniedException: arn:aws:s3:ap-south-1:152813717728:accesspoint/mmt-ap: com.amazonaws.services.dynamodbv2.model.AmazonDynamoDBException: User: arn:aws:iam::152813717728:user/mehakmeet.singh is not authorized to perform: dynamodb:DescribeTable on resource: arn:aws:dynamodb:ap-south-1:152813717728:table/arn:aws:s3:ap-south-1:152813717728:accesspoint/mmt-ap (Service: AmazonDynamoDBv2; Status Code: 400; Error Code: AccessDeniedException; Request ID: PRP7H8KJ3K5U71T3FDC046F30RVV4KQNSO5AEMVJF66Q9ASUAAJG; Proxy: null)
java.nio.file.AccessDeniedException: arn:aws:s3:ap-south-1:152813717728:accesspoint/mmt-ap: com.amazonaws.services.dynamodbv2.model.AmazonDynamoDBException: User: arn:aws:iam::152813717728:user/mehakmeet.singh is not authorized to perform: dynamodb:DescribeTable on resource: arn:aws:dynamodb:ap-south-1:152813717728:table/arn:aws:s3:ap-south-1:152813717728:accesspoint/mmt-ap (Service: AmazonDynamoDBv2; Status Code: 400; Error Code: AccessDeniedException; Request ID: PRP7H8KJ3K5U71T3FDC046F30RVV4KQNSO5AEMVJF66Q9ASUAAJG; Proxy: null)
at org.apache.hadoop.fs.s3a.S3AUtils.translateDynamoDBException(S3AUtils.java:460)
at org.apache.hadoop.fs.s3a.S3AUtils.translateException(S3AUtils.java:216)
at org.apache.hadoop.fs.s3a.s3guard.DynamoDBMetadataStoreTableManager.initTable(DynamoDBMetadataStoreTableManager.java:237)
at org.apache.hadoop.fs.s3a.s3guard.DynamoDBMetadataStore.initialize(DynamoDBMetadataStore.java:441)
at org.apache.hadoop.fs.s3a.s3guard.S3Guard.getMetadataStore(S3Guard.java:125)
at org.apache.hadoop.fs.s3a.S3AFileSystem.initialize(S3AFileSystem.java:553)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:3460)
at org.apache.hadoop.fs.FileSystem.access$300(FileSystem.java:172)
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:3565)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:3512)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:539)
at org.apache.hadoop.fs.Path.getFileSystem(Path.java:366)
at org.apache.hadoop.fs.shell.PathData.expandAsGlob(PathData.java:342)
at org.apache.hadoop.fs.shell.Command.expandArgument(Command.java:252)
at org.apache.hadoop.fs.shell.Command.expandArguments(Command.java:235)
at org.apache.hadoop.fs.shell.FsCommand.processRawArguments(FsCommand.java:105)
at org.apache.hadoop.fs.shell.Command.run(Command.java:179)
at org.apache.hadoop.fs.FsShell.run(FsShell.java:327)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:81)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:95)
at org.apache.hadoop.fs.FsShell.main(FsShell.java:390)
This seems more like some missing Access Point policy I need to add in the AWS console?
- AP set, endpoint (`fs.s3a.endpoint`) set.
❯ bin/hadoop fs -ls s3a://mehakmeet-singh-data/
2021-08-06 13:39:57,198 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2021-08-06 13:39:57,361 INFO s3a.S3AFileSystem: Using AccessPoint ARN "arn:aws:s3:ap-south-1:152813717728:accesspoint/mmt-ap" for bucket mehakmeet-singh-data
2021-08-06 13:39:57,446 INFO impl.MetricsConfig: Loaded properties from hadoop-metrics2.properties
2021-08-06 13:39:57,755 INFO impl.MetricsSystemImpl: Scheduled Metric snapshot period at 10 second(s).
2021-08-06 13:39:57,756 INFO impl.MetricsSystemImpl: s3a-file-system metrics system started
ls: `s3a://mehakmeet-singh-data/': listStatus on s3a://mehakmeet-singh-data/: com.amazonaws.services.s3.model.AmazonS3Exception: The specified bucket does not exist (Service: Amazon S3; Status Code: 404; Error Code: NoSuchBucket; Request ID: RQQ71TN58GMAGX0H; S3 Extended Request ID: OlJ3rlBs9LPIwIOU6IhT8dGCFEZCqVGVRX00PFOZuq6ZTvNFvB2XkNFkj7U0ovDpucdCHtNYgTA=; Proxy: null), S3 Extended Request ID: OlJ3rlBs9LPIwIOU6IhT8dGCFEZCqVGVRX00PFOZuq6ZTvNFvB2XkNFkj7U0ovDpucdCHtNYgTA=:NoSuchBucket: The specified bucket does not exist (Service: Amazon S3; Status Code: 404; Error Code: NoSuchBucket; Request ID: RQQ71TN58GMAGX0H; S3 Extended Request ID: OlJ3rlBs9LPIwIOU6IhT8dGCFEZCqVGVRX00PFOZuq6ZTvNFvB2XkNFkj7U0ovDpucdCHtNYgTA=; Proxy: null)
2021-08-06 13:39:58,942 INFO impl.MetricsSystemImpl: Stopping s3a-file-system metrics system...
2021-08-06 13:39:58,942 INFO impl.MetricsSystemImpl: s3a-file-system metrics system stopped.
2021-08-06 13:39:58,942 INFO impl.MetricsSystemImpl: s3a-file-system metrics system shutdown complete.
Had the property set as: `fs.s3a.endpoint=s3.ap-south-1.amazonaws.com`; removing this, it works. Maybe something with the overriding/setting to default of the endpoint in `S3AFileSystem.java`?
...tools/hadoop-aws/src/test/java/org/apache/hadoop/fs/s3a/performance/AbstractS3ACostTest.java
// If there's no endpoint set, then use the default for bucket or AccessPoint. Overriding is
// useful when using FIPS or DualStack S3 endpoints.
String endpoint = conf.getTrimmed(ENDPOINT, "");
if (endpoint.isEmpty()) {
did you mean to check for `!endpoint.isEmpty()`?
No, what I initially intended is to say "if you're not setting the endpoint then I'll provide a default Access Point endpoint". This is because I don't know what endpoint the user wants to point it to.
This is also why your tests are failing when you set the endpoint to `ap-south-1`. I'm open to adding another `fs.s3a.accesspoint-endpoint` configuration if it's better to provide an option to override only the access point endpoint.
So, basically, if we have an Accesspoint set, we can't have an endpoint set as well? Or we would have to change something like `s3.ap-south-1.amazonaws.com` to `s3-accesspoint.ap-south-1.amazonaws.com`?
So I thought about this a bit more and decided to change to always use the access point endpoint instead of the above logic. So right now they should work even if you have set a custom endpoint to be something different.
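The endpoint switch being discussed can be sketched as a tiny pure function: derive the access-point endpoint from the ARN's region, following the `s3-accesspoint.REGION.amazonaws.{com | com.cn}` pattern quoted in the docs later in this thread. This is an illustration, not the SDK's or S3A's actual resolution logic; in particular, keying the China-partition suffix off a `cn-` region prefix is an assumption.

```java
public class AccessPointEndpointSketch {
    // Derive the access point endpoint for a region (illustrative only).
    // Assumption: regions starting with "cn-" use the amazonaws.com.cn suffix.
    static String endpointForRegion(String region) {
        String suffix = region.startsWith("cn-") ? "amazonaws.com.cn" : "amazonaws.com";
        return String.format("s3-accesspoint.%s.%s", region, suffix);
    }

    public static void main(String[] args) {
        // prints s3-accesspoint.ap-south-1.amazonaws.com
        System.out.println(endpointForRegion("ap-south-1"));
    }
}
```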
general review.
hadoop-tools/hadoop-aws/src/test/java/org/apache/hadoop/fs/s3a/AbstractS3ATestBase.java
hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/ArnResource.java
hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/ArnResource.java
if (service.contains("s3-accesspoint") || service.contains("s3-outposts") ||
    service.contains("s3-object-lambda")) {
  // If AccessPoint then bucketName is of format `accessPoint-accountId`;
  String[] accessPointBits = hostBits[0].split("\\-");
what if the AccessPoint name has a "-" in it? For e.g.: AP name = "mmt-ap", then bucketName = "mmt-ap-ACCOUNT_ID"; with this split, we would end up with "mmt" as `accessPointName` and "ap" as `accountID`.
Good catch
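The bug described here is easy to demonstrate in isolation. The `nameByLastDash` fix below, splitting on the last `-` since the account id is the final segment, is one possible approach for illustration, not necessarily the patch's actual fix.

```java
public class AccessPointNameSplit {
    // Naive split: breaks when the access point name itself contains '-'.
    static String naiveName(String bucketName) {
        return bucketName.split("-")[0];
    }

    // Safer: the account id is the segment after the LAST '-'.
    static String nameByLastDash(String bucketName) {
        return bucketName.substring(0, bucketName.lastIndexOf('-'));
    }

    public static void main(String[] args) {
        String bucketName = "mmt-ap-152813717728";
        System.out.println(naiveName(bucketName));      // prints mmt (wrong)
        System.out.println(nameByLastDash(bucketName)); // prints mmt-ap
    }
}
```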
hadoop-tools/hadoop-aws/src/test/java/org/apache/hadoop/fs/s3a/ITestS3AConfiguration.java
...s/hadoop-aws/src/test/java/org/apache/hadoop/fs/s3a/ITestS3AContractGetFileStatusV1List.java
}

skipIfKmsKeyIdIsNotSet(c);
skipIfCSEIsNotEnabled(c);
Seems like my code had a bug in it. This should be skipping if CSE is enabled or if KMS key is not set.
Updated
...adoop-aws/src/test/java/org/apache/hadoop/fs/s3a/ITestS3AEncryptionSSEKMSUserDefinedKey.java
 */
public static void skipIfCSEIsNotEnabled(Configuration configuration) {
  String encryption = configuration.get(Constants.SERVER_SIDE_ENCRYPTION_ALGORITHM, "");
  if (!encryption.equals(S3AEncryptionMethods.CSE_KMS.getMethod())) {
change this to skip if CSE is enabled.
Sorry, I don't understand. Why would I change this method to skip if CSE is enabled? And the places where it's called (even though I added it you have the better context, so trying to understand it).
The tests that require this are SSE tests which shouldn't skip if CSE-KMS is the encryption method used. It should've been skipped if CSE-KMS is enabled, to tell the user that SSE-KMS is not the method used.
@mehakmeet APs are not supposed to work with S3 Guard. I'll update the documentation to point that out.
Shouldn't we just throw an exception during initialization if AccessPoint is set and S3Guard is enabled to tell the users that AccessPoints are incompatible with S3Guard?
+1; what we do with CSE.
Sorry for being inactive on this PR for this long, but I'm back! I went through and addressed all the comments (hopefully didn't miss any). Thanks to everyone that did the initial review, really appreciated it! Since it touches CSE as well I ran the integration tests with and without CSE enabled just to make sure things are good. There's one more small commit coming and that's about it!
🎊 +1 overall
This message was automatically generated.
Alright, all tests pass now. This PR should be ready for review!
looking good, minor details in tests, docs and error messages left to tune
```

Before using Access Points make sure you're not impacted by the following:
- `ListObjectsV1` is not supported, arguably you shouldn't use it if you can;
cut the "arguably" as it will only puzzle the reader. Best to say "this is deprecated on AWS S3 for performance reasons"
hadoop-tools/hadoop-aws/src/site/markdown/tools/hadoop-aws/index.md
hadoop-tools/hadoop-aws/src/test/java/org/apache/hadoop/fs/s3a/S3ATestUtils.java
 */
public static void maybeSkipIfS3GuardAndS3CSEIOE(PathIOException ioe)
public static void maybeSkipIfIOEContainsMessage(PathIOException ioe, String ...messages)
nit, remove the `maybe` as the `if` indicates it happens sometimes
// Skip if CSE is not configured as an algorithm
String encryption = getConfiguration().get(Constants.SERVER_SIDE_ENCRYPTION_ALGORITHM, "");
if (!encryption.equals(S3AEncryptionMethods.CSE_KMS.getMethod())) {
  skip("CSE encryption has been set");
error text is wrong
    + "SSE-KMS");

skipIfKmsKeyIdIsNotSet(c);
// FS is not available at this point so checking CSE like this
can just call `skipIfCSEIsEnabled`
hadoop-tools/hadoop-aws/src/site/markdown/tools/hadoop-aws/index.md
Thanks, updated them. Removed the extra check + throw for the V1 listing. Let me know if I should remove the upgrade in
I'm happy with this; don't see any obvious regressions. One thing (and I've suggested it to mehakmeet for the CSE work) is mentioning AP testing in the testing docs, especially qualifying SDK updates. It's going to be hard as you'll need one set up (I don't have locally...not sure if we have one on our VPN), so it should be something like: You SHOULD run tests against an S3 access point if you have the setup to do so.
had another thought. What if we had an option to require access points? You could then set that globally and it would be an error to try and connect to any bucket which didn't have an AP ARN defined, something like `fs.s3a.access.point.required`
That's an interesting idea 🤔 . It definitely sounds like a great way to improve security in a VPN from the software stack. What's the general way to approach this, add it to this PR or come back later with another PR and only then cherry pick both into release branch? I prefer smaller changes but don't mind adding it here either.
You don't need to be inside a VPN to test access points. You only need an access point created for any bucket and you're good to go. Of course, if you have the set up and want to test it there, that's great but what you're testing then is more on AWS integration than
afraid @mehakmeet's patches have broken this, sorry
Sorry about that 😅. I have refactored the CSE skip conditions in a tuning patch so you can just remove those changes from this patch and rebase once.
Force-pushed 28590ea to b0fa6e5.
@mehakmeet thanks for letting me know! (should've been in a different patch the first time, but I thought I could sneak them in this one 😓) @steveloughran rebased, reran the entire suite and everything looks to be OK now 👍
did something change with the build system?
💔 -1 overall
This message was automatically generated.
This change aims to add support for S3 accesspoints. To use S3 object level APIs for an accesspoint, one has to use the accesspoint (ap) ARN. Hence the following have been added:
- a new property to set the accesspoint ARN;
- s3a parsing and using the new property with appropriate exceptions;
- initial documentation update for accesspoints.

We're explicitly throwing now if S3Guard is enabled and you're using an Access Point, which is the right way to go anyway! Skipping all tests that use AP and S3Guard too. We're also fixing CSE encryption tests since they were checking for the wrong condition.

Adding Access Point only option:
- Adding a new config option called `fs.s3a.accesspoint.required` which requires all bucket access to be done through access points, otherwise an exception is thrown on initialize;
- Updating documentation to include this new option;
- Adding a new test for the option.
Force-pushed 7a0c3aa to 5710a88.
  s3.getBucketAcl(bucket);
} catch (AmazonServiceException ex) {
  int statusCode = ex.getStatusCode();
  if (statusCode == 404 || (statusCode == 403 && accessPoint != null)) {
Starting to think we should use constants here, just to track down where they come from. I see we don't do that elsewhere (S3AUtils.translateException(), but that doesn't mean we shouldn't start)
Could you add constants here for HTTP_RESPONSE_404 & 403 in InternalConstants & refer here. Then we could retrofit and extend elsewhere
I see there's a `SC_404` in internal constants so I'll use that and add a `SC_403`.
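The constants suggestion can be sketched as follows. `SC_403`/`SC_404` mirror the InternalConstants naming discussed above, while `probeFailed` is a hypothetical stand-in for the ACL-probe condition quoted earlier, not the actual S3AFileSystem method.

```java
public class StatusCodeSketch {
    // Named HTTP status constants, as suggested in the review:
    // SC_404 already exists in InternalConstants; SC_403 is being added.
    static final int SC_403 = 403;
    static final int SC_404 = 404;

    // Mirror of the probe logic: a 404, or a 403 when going through an
    // access point, means the bucket probe did not find a usable bucket.
    static boolean probeFailed(int statusCode, boolean accessPoint) {
        return statusCode == SC_404 || (statusCode == SC_403 && accessPoint);
    }

    public static void main(String[] args) {
        System.out.println(probeFailed(404, false)); // prints true
        System.out.println(probeFailed(403, false)); // prints false
        System.out.println(probeFailed(403, true));  // prints true
    }
}
```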
@@ -1167,7 +1216,10 @@ public String getBucketLocation(String bucketName) throws IOException {
    final String region = trackDurationAndSpan(
        STORE_EXISTS_PROBE, bucketName, null, () ->
            invoker.retry("getBucketLocation()", bucketName, true, () ->
                s3.getBucketLocation(bucketName)));
                // If accessPoint then region is known from Arn
                accessPoint != null
Should we pull this up to L1216 & we can skip the entire overhead of duration tracking, retry etc.
Currently it's overkill to wrap, but it will add it to the iostats, so maybe it's best to leave as is
I liked the iostats tracking so it doesn't look like an operation is missing / changed.
- `ListObjectsV1` is not supported, this is also deprecated on AWS S3 for performance reasons;
- The endpoint for S3 requests will automatically change from `s3.amazonaws.com` to use
`s3-accesspoint.REGION.amazonaws.{com | com.cn}` depending on the Access Point ARN. This **only**
happens if the `fs.s3a.endpoint` property isn't set. The endpoint property overwrites any changes,
this still true?
nope, removing. Good catch
this is intentional so FIPS or DualStack endpoints can be set. While considering endpoints,
if you have any custom signers that use the host endpoint property make sure to update them if
needed;
- Access Point names don't have to be globally unique, in the same way that bucket names have to.
as we support it per-bucket only, this bullet point can be cut
✅
ok, done a review through VS.code to see everything discussed has been covered. Answer is: yes
some changes on docs now the option is per-bucket only, and I'm wondering if we should declare some InternalConstants for 403 and 404, to help in future if we want to find out where these are being used
other than that: ready to go!
@steveloughran thanks for having a look again. Addressed the comments. I think VSCode is doing something funky as that comment was on a previous revision; I didn't revert it, ListV1 is upgraded to V2 if using APs (either VS Code or rebasing).
🎊 +1 overall
This message was automatically generated.
+1, merging and going to do a backport to 3.3
Add support for S3 Access Points. This provides extra security as it ensures applications are not working with buckets belonging to third parties. To bind a bucket to an access point, set the access point (ap) ARN, which must be done for each specific bucket, using the pattern fs.s3a.bucket.$BUCKET.accesspoint.arn = ARN
* The global/bucket option `fs.s3a.accesspoint.required` to mandate that buckets must declare their access point.
* This is not compatible with S3Guard. Consult the documentation for further details.
Contributed by Bogdan Stolojan
Add support for S3 Access Points. This provides extra security as it ensures applications are not working with buckets belonging to third parties. To bind a bucket to an access point, set the access point (ap) ARN, which must be done for each specific bucket, using the pattern fs.s3a.bucket.$BUCKET.accesspoint.arn = ARN
* The global/bucket option `fs.s3a.accesspoint.required` to mandate that buckets must declare their access point.
* This is not compatible with S3Guard. Consult the documentation for further details.
Contributed by Bogdan Stolojan
(this commit contains the changes to TestArnResource from HADOOP-18068, "upgrade AWS SDK to 1.12.132" so that it works with the later SDK.)
Change-Id: I3fac213e52ca6ec1c813effb8496c353964b8e1b
HADOOP-17198
This change aims to add support for S3 AccessPoints. To use S3 object level
APIs for an AccessPoint, one has to use the AccessPoint (AP) ARN.
Hence the following have been added:
What this PR enables:
- if `apname` is the name of an AccessPoint you have created for a bucket, then S3a now allows you to use paths like `s3a://apname/` IF the new `s3a.accesspoint.arn` is set to the AccessPoint ARN, e.g. `arn:aws:s3:eu-west-1:123456789101:accesspoint/apname`;

There's one thing I'm not sure about with this initial implementation, so I'm looking for feedback on if and how I should tackle it:
- An `S3a` bucket now has 2 "meanings": it can be a bucket name or an Access Point ARN. From the point of view of interacting with the SDK, they are interchangeable, and internal string parsing logic is used to create the request for the right endpoint. However, I think it would be nicer to have a clearer abstraction for bucket names or access point ARNs that S3a operations can work with. This abstraction comes with the cost of doing a refactor which I'm not sure is worth it right now. Even doing a quick search on `.getHost()`, there are quite a few places where the bucket name is deduced from the `host`.
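The string parsing mentioned above can be illustrated with a small sketch that pulls the region, account id, and access point name out of an ARN of the form `arn:aws:s3:REGION:ACCOUNT:accesspoint/NAME`. The field positions follow that example ARN; this is not the actual `ArnResource` implementation.

```java
public class ArnParseSketch {
    // Split an access point ARN into {region, account, name} (illustrative).
    static String[] parse(String arn) {
        String[] parts = arn.split(":");
        String region = parts[3];
        String account = parts[4];
        // Last field is "accesspoint/NAME"; keep the part after the slash.
        String name = parts[5].substring(parts[5].indexOf('/') + 1);
        return new String[] {region, account, name};
    }

    public static void main(String[] args) {
        String[] fields = parse("arn:aws:s3:eu-west-1:123456789101:accesspoint/apname");
        // prints eu-west-1 123456789101 apname
        System.out.println(fields[0] + " " + fields[1] + " " + fields[2]);
    }
}
```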