Conversation

@ferruzzi (Contributor):

Hey all. I thought I'd take this one on. I think I got the code implemented but I can't seem to get the mocking in the unit test set up right. I'd love some suggestions.

Closes: #25833



boring-cyborg bot added the area:providers and provider:amazon (AWS/Amazon - related issues) labels on Oct 18, 2023.
# TODO inline this after debugging
new_args = __sanitize_extra_args()

obj.load(**new_args)
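
For context, the line under review calls obj.load() with sanitized extra args. A rough sketch of what such a sanitizer might look like, assuming it simply filters the hook's extra args down to the parameters that HeadObject accepts (the helper name and the filtering logic here are assumptions, not necessarily the PR's actual implementation):

# Parameters that S3.Object.load() / HeadObject will accept (besides Bucket and Key),
# taken from the ParamValidationError message quoted later in this thread.
ALLOWED_HEAD_OBJECT_ARGS = {
    "IfMatch", "IfModifiedSince", "IfNoneMatch", "IfUnmodifiedSince",
    "Range", "VersionId", "SSECustomerAlgorithm", "SSECustomerKey",
    "SSECustomerKeyMD5", "RequestPayer", "PartNumber",
    "ExpectedBucketOwner", "ChecksumMode",
}

def _sanitize_extra_args(extra_args: dict) -> dict:
    """Drop any keys that HeadObject does not understand."""
    return {k: v for k, v in extra_args.items() if k in ALLOWED_HEAD_OBJECT_ARGS}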
Contributor:

According to the documentation, S3.Object.load() doesn't support any of these parameters; see the comment from the previous attempt to fix this: #25835 (comment). However, I thought it might support these arguments in a non-obvious way: https://github.com/boto/boto3/blob/4c34032a4dfe5371bb7ab808c65e3187398d5153/boto3/resources/factory.py#L558-L575. So I guess you tried it in an actual AWS environment and it works, am I right?

I think the bigger problem is that we use the high-level client (aka resource) here, and its behaviour is unpredictable. IMO we should, sometime in the future, replace the s3 resource with the s3 client wherever possible, since the client is more flexible and at least predictable.
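
For what it's worth, a minimal sketch of what the client-based equivalent could look like, assuming head_object is used in place of S3.Object.load() (the bucket and key names are placeholders, and the object is assumed to have been uploaded with the same SSE-C key):

import boto3

# Low-level client equivalent of S3.Object.load(): head_object takes the
# SSE-C parameters directly instead of going through the generated resource action.
client = boto3.client("s3")
response = client.head_object(
    Bucket="example-bucket",        # placeholder
    Key="example-key.txt",          # placeholder
    SSECustomerAlgorithm="AES256",
    SSECustomerKey=b"a" * 32,       # SSE-C keys must be exactly 32 bytes
)
print(response["ContentLength"], response.get("ContentType"))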

Contributor Author (@ferruzzi):

I am pretty sure it will work, and here's my reasoning. I created a bucket named "ferruzzi-testing-123" and uploaded an empty text file named "hello.txt" via the console as the setup. Then in a python console:

In [1]: import boto3
   ...: s3 = boto3.resource('s3')
   ...: obj = s3.Object("ferruzzi-testing-123", "hello.txt")

In [2]: obj.content_type
Out[2]: 'text/plain'

In [3]: obj.load(content_type="text/plain")

It fails gloriously and you get pages of stack traces, but the useful part is below: the kwargs do get forwarded and validated against HeadObject's parameters, and SSECustomerAlgorithm, SSECustomerKey, and SSECustomerKeyMD5 are all on the allowed list, so the encryption arguments should pass:

ParamValidationError: Parameter validation failed:
Unknown parameter in input: "content_type", must be one of: Bucket, IfMatch, IfModifiedSince, IfNoneMatch, IfUnmodifiedSince, Key, Range, VersionId, SSECustomerAlgorithm, SSECustomerKey, SSECustomerKeyMD5, RequestPayer, PartNumber, ExpectedBucketOwner, ChecksumMode

Contributor:

Yeah, I've also tested some similar stuff, but some of it errors out when the parameters are invalid:

import boto3

session = boto3.session.Session(profile_name=...)
resource = session.resource(service_name="s3")

obj = resource.Object('taragolis-example-bucket', 'setup.cfg')
obj.load(SSECustomerKey="aaaaa", SSECustomerAlgorithm="AES256")
Traceback (most recent call last):
  File "/Users/taragolis/Library/Application Support/JetBrains/PyCharm2023.2/scratches/some_stuff.py", line 7, in <module>
    obj.load(SSECustomerKey="aaaaa", SSECustomerAlgorithm="AES256")
  File "/Users/taragolis/.pyenv/versions/airflow-dev-env-39/lib/python3.9/site-packages/boto3/resources/factory.py", line 564, in do_action
    response = action(self, *args, **kwargs)
  File "/Users/taragolis/.pyenv/versions/airflow-dev-env-39/lib/python3.9/site-packages/boto3/resources/action.py", line 88, in __call__
    response = getattr(parent.meta.client, operation_name)(*args, **params)
  File "/Users/taragolis/.pyenv/versions/airflow-dev-env-39/lib/python3.9/site-packages/botocore/client.py", line 535, in _api_call
    return self._make_api_call(operation_name, kwargs)
  File "/Users/taragolis/.pyenv/versions/airflow-dev-env-39/lib/python3.9/site-packages/botocore/client.py", line 980, in _make_api_call
    raise error_class(parsed_response, operation_name)
botocore.exceptions.ClientError: An error occurred (400) when calling the HeadObject operation: Bad Request

Seems like the resource part is an abandoned part of boto3: https://github.com/boto/boto3/tree/develop/boto3/data - the last changes were 6-8 years ago. So I guess no new features are available through the high-level client.

Contributor Author (@ferruzzi):

Hm. When I tried using it manually in a DAG, I kept getting an error that the encryption key format was invalid, but I didn't find any documentation on what the actual expected format was. I tried a string, a bytestring, a raw string, hex, a handful of things I could think of. I took that to mean it was at least getting that far, and that someone who knew enough to want to use the feature would know the right format.

Contributor:

I guess it might expect a base64-encoded key; at least that is mentioned in the API documentation:
https://docs.aws.amazon.com/AmazonS3/latest/userguide/ServerSideEncryptionCustomerKeys.html#specifying-s3-c-encryption

However, that documentation describes AWS as a service; for some services boto3/botocore may do the encoding/decoding behind the scenes.
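
To illustrate what that behind-the-scenes handling amounts to for SSE-C (an illustration based on the linked AWS docs, not botocore's actual code path): the request carries the base64-encoded key and the base64-encoded MD5 digest of the key, both computed from the raw key material you pass in:

import base64
import hashlib

raw_key = b"a" * 32  # raw SSE-C key material, 32 bytes

# Roughly what ends up in the request headers:
headers = {
    "x-amz-server-side-encryption-customer-algorithm": "AES256",
    "x-amz-server-side-encryption-customer-key": base64.b64encode(raw_key).decode(),
    "x-amz-server-side-encryption-customer-key-MD5": base64.b64encode(
        hashlib.md5(raw_key).digest()
    ).decode(),
}
print(headers["x-amz-server-side-encryption-customer-key"])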

Contributor Author (@ferruzzi):

Yeah, considering how much documentation boto* has, there is a surprising amount that is not documented.

Pretty sure base64 was one I tried, but I went through a bunch and didn't write down all of them.

Either way, I guess we need to sort out a way to actually try this now.

@Taragolis (Contributor), Oct 18, 2023:

"Pretty sure base64 was one I tried"

I checked; something internal does that for you, so I grabbed these headers by debugging a failed call. I provided aaaaa as the input key, and these are my botocore request headers:

{'x-amz-server-side-encryption-customer-key': 'YWFhYWE=', 'x-amz-server-side-encryption-customer-algorithm': 'AES256', 'x-amz-server-side-encryption-customer-key-MD5': 'redacted but also base64', 'User-Agent': 'Boto3/1.28.17 md/Botocore#1.31.17 ua/2.0 os/macos#22.6.0 md/arch#arm64 lang/python#3.9.10 md/pyimpl#CPython cfg/retry-mode#legacy Botocore/1.31.17 Resource'}

The main limitation is that the key should be exactly 32 bytes long (thanks SO for the hint, although the original sample doesn't work). So I finally checked it with pure boto3, and it works:

import os
import boto3

bucket = 'your-bucket'
key = 'sample.txt'
sse_customer_alg = "AES256"
sse_customer_key = b"a" * 32    # Should be 32 bytes long
# sse_customer_key = "b" * 32   # This also should work
# sse_customer_key = os.urandom(32)  # This also should work
# sse_customer_key = "YWFhYWFhYWFhYWFhYWFhYWFhYWFhYWFhYWFhYWFhYWE="   # This one not working
# sse_customer_key = "¥" * 32   # This one not working, in bytes it is greater than 32

session = boto3.session.Session(...)  # required params
resource = session.resource(service_name="s3")
client = resource.meta.client

response = client.put_object(
    Bucket=bucket,
    Key=key,
    Body=b"Awesome content!",
    SSECustomerAlgorithm=sse_customer_alg,
    SSECustomerKey=sse_customer_key,
)

obj = resource.Object(bucket, key)
obj.load(SSECustomerKey=sse_customer_key, SSECustomerAlgorithm=sse_customer_alg)

result = obj.get(
    SSECustomerKey=sse_customer_key, SSECustomerAlgorithm=sse_customer_alg
)['Body'].read().decode()

print(result)
# Awesome content!
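
And for reference, roughly how this surfaces through the hook that the linked issue is about, assuming the provider's S3Hook still accepts an extra_args dict at construction and, with this change, forwards the SSE-C parameters to the underlying HeadObject/GetObject calls (the plumbing here is an assumption based on the issue description, not a verified end-to-end test):

from airflow.providers.amazon.aws.hooks.s3 import S3Hook

sse_args = {"SSECustomerAlgorithm": "AES256", "SSECustomerKey": b"a" * 32}

# extra_args is passed along to the S3 calls the hook makes; previously the
# internal obj.load() step rejected these keys, which is what the issue reports.
hook = S3Hook(aws_conn_id="aws_default", extra_args=sse_args)
local_path = hook.download_file(key="sample.txt", bucket_name="your-bucket")
print(local_path)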

Co-authored-by: Andrey.Anshin@taragol.is
ferruzzi marked this pull request as ready for review on October 20, 2023.

Merging this pull request may close: Airflow Amazon provider S3Hook().download_file() fails when it needs encryption arguments (SSECustomerKey etc.)