-
Notifications
You must be signed in to change notification settings - Fork 16.4k
Enable encryption in S3 download_files() hook. #35037
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
| # TODO inline this after debugging | ||
| new_args = __sanitize_extra_args() | ||
|
|
||
| obj.load(**new_args) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
According to documentation S3.Object.load() doesn't support any of the parameters, see comment from previous attempt to fix it: #25835 (comment). However I thought it might support this arguments by non obvious way https://github.com/boto/boto3/blob/4c34032a4dfe5371bb7ab808c65e3187398d5153/boto3/resources/factory.py#L558-L575 . So I guess you tried to use it on actual AWS environment and it works, am I right?
I think the bigger problem, that we tried to use here High-level Client (aka resource) and which behaviour is unpredictable. IMO we should try to replace where it possible s3 resource by s3 client, which is more flexible and at least predictable, sometime in the future.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am pretty sure it will work, and here's my reasoning. I created a bucket named "ferruzzi-testing-123" and uploaded an empty text file named "hello.txt" via the console as the setup. Then in a python console:
In [1]: import boto3
...: s3 = boto3.resource('s3')
...: obj = s3.Object("ferruzzi-testing-123", "hello.txt")
In [2]: obj.content_type
Out[2]: 'text/plain'
In [3]: obj.load(content_type="text/plain")It fails gloriously and you get pages of stack traces, but the useful part is:
ParamValidationError: Parameter validation failed:
Unknown parameter in input: "content_type", must be one of: Bucket, IfMatch, IfModifiedSince, IfNoneMatch, IfUnmodifiedSince, Key, Range, VersionId, SSECustomerAlgorithm, SSECustomerKey, SSECustomerKeyMD5, RequestPayer, PartNumber, ExpectedBucketOwner, ChecksumMode
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I've also test some similar stuff, but some errored stuff when parameters are invalid
import boto3
session = boto3.session.Session(profile_name=...)
resource = session.resource(service_name="s3")
obj = resource.Object('taragolis-example-bucket', 'setup.cfg')
obj.load(SSECustomerKey="aaaaa", SSECustomerAlgorithm="AES256")Traceback (most recent call last):
File "/Users/taragolis/Library/Application Support/JetBrains/PyCharm2023.2/scratches/some_stuff.py", line 7, in <module>
obj.load(SSECustomerKey="aaaaa", SSECustomerAlgorithm="AES256")
File "/Users/taragolis/.pyenv/versions/airflow-dev-env-39/lib/python3.9/site-packages/boto3/resources/factory.py", line 564, in do_action
response = action(self, *args, **kwargs)
File "/Users/taragolis/.pyenv/versions/airflow-dev-env-39/lib/python3.9/site-packages/boto3/resources/action.py", line 88, in __call__
response = getattr(parent.meta.client, operation_name)(*args, **params)
File "/Users/taragolis/.pyenv/versions/airflow-dev-env-39/lib/python3.9/site-packages/botocore/client.py", line 535, in _api_call
return self._make_api_call(operation_name, kwargs)
File "/Users/taragolis/.pyenv/versions/airflow-dev-env-39/lib/python3.9/site-packages/botocore/client.py", line 980, in _make_api_call
raise error_class(parsed_response, operation_name)
botocore.exceptions.ClientError: An error occurred (400) when calling the HeadObject operation: Bad RequestSeems like resource part it abandoned part of boto3: https://github.com/boto/boto3/tree/develop/boto3/data - last changes 6-8 years ago. So I guess no new features available thought High Level Client
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hm. When i tried using it manually in a DAG I kept getting an error that the encryption key format was invalid but didn't find any documentation on what the actual expected format was... I tried string, bytestring, raw string, hex, a handful of things I could think of, but I took that to mean it was at least getting that far and someone who knew enough to want to use the feature would know the right format.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess it might expect base64 key, at least it mentioned into the documentation to API
https://docs.aws.amazon.com/AmazonS3/latest/userguide/ServerSideEncryptionCustomerKeys.html#specifying-s3-c-encryption
However it is documentation related to AWS as service, boto3/botocore for some services might do encoding/decoding behind the scene
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah, considering how much documentation boto* has, there is a surprising amount which is not documented.
Pretty sure base64 was one I tried, but I went through a bunch and didn't write down all of them.
Either way, I guess we need to sort out a way to actually try this now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pretty sure base64 was one I tried
I checked, somewhere internal make it for you, so I grab this headers from debug over failure call. I provide aaaaa as input for key, and that is my botocore request headers
{'x-amz-server-side-encryption-customer-key': 'YWFhYWE=', 'x-amz-server-side-encryption-customer-algorithm': 'AES256', 'x-amz-server-side-encryption-customer-key-MD5': 'redacted but also base64', 'User-Agent': 'Boto3/1.28.17 md/Botocore#1.31.17 ua/2.0 os/macos#22.6.0 md/arch#arm64 lang/python#3.9.10 md/pyimpl#CPython cfg/retry-mode#legacy Botocore/1.31.17 Resource'}The main limitation, that the key should be exactly 32 bytes long, thanks SO for hint, but original sample doesn't work. So finally check it by pure boto3 and that works
import os
import boto3
bucket = 'your-bucket'
key = 'sample.txt'
sse_customer_alg = "AES256"
sse_customer_key = b"a" * 32 # Should be 32 bytes long
# sse_customer_key = "b" * 32 # This also should work
# sse_customer_key = os.urandom(32) # This also should work
# sse_customer_key = "YWFhYWFhYWFhYWFhYWFhYWFhYWFhYWFhYWFhYWFhYWE=" # This one not working
# sse_customer_key = "¥" * 32 # This one not working, in bytes it is greater than 32
session = boto3.session.Session(...) # required params
resource = session.resource(service_name="s3")
client = resource.meta.client
response = client.put_object(
Bucket=bucket,
Key=key,
Body=b"Awesome content!",
SSECustomerAlgorithm=sse_customer_alg,
SSECustomerKey=sse_customer_key,
)
obj = resource.Object(bucket, key)
obj.load(SSECustomerKey=sse_customer_key, SSECustomerAlgorithm=sse_customer_alg)
result = obj.get(
SSECustomerKey=sse_customer_key, SSECustomerAlgorithm=sse_customer_alg
)['Body'].read().decode()
print(result)
# Awesome content!Co-authored-by: Andrey.Anshin@taragol.is
Hey all. I thought I'd take this one on. I think I got the code implemented but I can't seem to get the mocking in the unit test set up right. I'd love some suggestions.
Closes: #25833
^ Add meaningful description above
Read the Pull Request Guidelines for more information.
In case of fundamental code changes, an Airflow Improvement Proposal (AIP) is needed.
In case of a new dependency, check compliance with the ASF 3rd Party License Policy.
In case of backwards incompatible changes please leave a note in a newsfragment file, named
{pr_number}.significant.rstor{issue_number}.significant.rst, in newsfragments.