Don't re-compress files before upload to S3. #451 #615
Conversation
When uploading an already-compressed file to S3, don't add a ContentEncoding header. These changes allow .gz files to be uploaded and downloaded without modification.
Thanks to @skruger for the work that got me quickly oriented on what to look for.
Just discovered that my PR is not quite complete. Django Storages uses these lines (the `mimetypes.guess_type` call in `s3boto3.py`) to set the content type.

So after my proposed PR, when I provide it with a file with a `.csv.gz` extension, I realized today that on Windows, Chrome and Firefox just use the filename provided. Firefox on MacOS works the same as on Windows, just using the filename provided. However, Chrome on MacOS looks at the ContentType of the object instead. Safari just assumes it really is a CSV and splats the gzipped file on the screen, requiring a Ctrl-Click to "Download Link As", which then works reasonably. So the shortcoming of my PR is that it does not set the ContentType to match the gzipped payload. I realize that interpreting the ContentType header is the standard, and file extensions are not supposed to be definitive. Anyway, after going down this rabbit hole, I'm realizing that I'm fighting the tool, and should probably just go along with what it expects. But I'll need to test whether Amazon Glue/Athena crawls files based on their extensions or based on the Amazon S3 ContentType and ContentEncoding metadata. Oof.
Another note. If you configure the storage not to overwrite existing files, a name collision gets a random string inserted before the final extension, so uploading `file.csv.gz` a second time yields something like `file.csv_AbC123.gz` (the suffix shown is illustrative). Since the random string is between the `.csv` and the `.gz`, mimetypes now doesn't see the `.csv`, and will set the ContentType to the fallback default.
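The effect of that collision suffix on `mimetypes.guess_type` can be seen directly with the standard library (the suffixed filename below is illustrative; the actual random string varies):

```python
import mimetypes

# With the double extension intact, the inner ".csv" gives the type
# and the outer ".gz" gives the encoding.
print(mimetypes.guess_type("file.csv.gz"))         # ('text/csv', 'gzip')

# With a random collision suffix between ".csv" and ".gz", the inner
# extension is no longer visible, so no content type is guessed.
print(mimetypes.guess_type("file.csv_AbC123.gz"))  # (None, 'gzip')
```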
@normangilmore If you can get this merged then I will happily close #453. I had initially wanted to let people manually configure the broken behavior if they were dependent on it, but I like your approach.
@skruger I think I'll have to come up with a revised PR that is opt-in and doesn't mess with the legacy behavior, which I'm sure a lot of people are blissfully unaware of because it doesn't impact their use case. I have not figured out if there is a best way, or any way, to set S3 metadata on a per-file basis using Django Storages. There is the global setting on the Storages object, but nothing per-file that I can see. I guess the temporary work-around is to get the object's S3 key after saving and just update the S3 metadata using boto3 directly.
If you explicitly set a ContentType in the storage's object parameters, it takes precedence over the guessed one.
This proposed change does the following:

- Allow `.gz` and similar files to be uploaded without re-compressing them (as in, take a gzip file and gzip it again) before upload to S3.
- Stop marking `.gz` and similar files to be automatically decompressed on download, which used to leave the downloaded file with a `.gz` extension when it is already decompressed.

These changes allow .gz files (and other compression types compatible with the Content-Encoding header) to be uploaded and downloaded without modification.
The original code in question is at
https://github.com/jschneier/django-storages/blob/master/storages/backends/s3boto3.py#L482

`_type, encoding = mimetypes.guess_type(name)`
`...`

Note that `mimetypes.guess_type("myfile.csv.gz")` is `('text/csv', 'gzip')`. So if the file is `file.csv.gz`, and `text/csv` is in `self.gzip_content_types`, then it will be compressed again in the first `if` clause. Downloading normally, this seems transparent, because the content encoding is reversed once by the browser, so you get your inner .gz file back, no surprises. But if the files are being scanned on S3 by, say, AWS Glue, then they will be totally blocked by the double compression.
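The double compression is easy to reproduce with the standard library; this is a sketch of what the first branch effectively does to an already-gzipped file:

```python
import gzip
import mimetypes

payload = b"col_a,col_b\n1,2\n"
already_compressed = gzip.compress(payload)   # what the user uploads
doubled = gzip.compress(already_compressed)   # what the first branch produces

# One gunzip (what the browser does via Content-Encoding) yields the
# inner .gz file, not the CSV -- a scanner like Glue sees gzip-in-gzip.
assert gzip.decompress(doubled) == already_compressed
assert gzip.decompress(gzip.decompress(doubled)) == payload

# And guess_type is why the branch fires for file.csv.gz:
assert mimetypes.guess_type("file.csv.gz") == ("text/csv", "gzip")
```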
Now look at the `elif encoding:` clause. If the file is `file.csv.gz`, then the content-encoding header will be added. On download, the file will be gunzipped by the browser, but the filename is wrong: the OS will see the `.gz`, and when you try to decompress it, the OS decompression utility is baffled. You have to manually remove the `.gz`.

So to meet expectations of how these files should actually be named, given their settings, the first branch of the `if` should add a `.gz` extension to show the double compression exists on S3, and the `elif` branch should remove the `.gz` extension to show that the ContentEncoding is removed. Basically, this `if` block prevents .gz files from being moved back and forth from S3 without modification. To be clear, I am not advocating that the .gz extension actually be added or removed; I'm just pointing out that this would be the logical entailment of what is being attempted.
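A paraphrased sketch of the branch logic being described; the names follow the discussion rather than the exact upstream source, and `text/csv` is placed in `gzip_content_types` only for illustration:

```python
import mimetypes

gzip_content_types = {"text/css", "application/javascript", "text/csv"}

def upload_params(name, gzip_enabled=True):
    """Return which headers the legacy logic would set for a given name."""
    content_type, encoding = mimetypes.guess_type(name)
    if gzip_enabled and content_type in gzip_content_types:
        # First branch: re-compress (even an already-gzipped file)
        # and label it for transparent decompression on download.
        return {"ContentEncoding": "gzip", "recompressed": True}
    elif encoding:
        # elif encoding: already compressed, header added anyway, so the
        # browser gunzips on download but the .gz name is kept.
        return {"ContentEncoding": encoding, "recompressed": False}
    return {"recompressed": False}

# file.csv.gz guesses ('text/csv', 'gzip'), so with text/csv in the
# set it hits the first branch and gets double-compressed.
print(upload_params("file.csv.gz"))
```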
This revised code does the following:
If a file is NOT compressed (encoding is None), AND the settings indicate its ContentType is one the user wants stored compressed and automatically decompressed, THEN compress it, set the ContentEncoding header, and upload it.
Otherwise, DO NOT set the content-encoding header, and upload the file as is.
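The revised decision can be sketched with illustrative names (`gzip_content_types`, `prepare_upload`); this is a minimal stand-in, not the PR's exact code:

```python
import gzip
import mimetypes

gzip_content_types = {"text/css", "text/csv", "application/javascript"}

def prepare_upload(name: str, data: bytes):
    """Compress only files that are not already compressed and whose
    guessed ContentType is configured for compression; otherwise pass
    the bytes through unchanged with no ContentEncoding header."""
    content_type, encoding = mimetypes.guess_type(name)
    params = {}
    if encoding is None and content_type in gzip_content_types:
        data = gzip.compress(data)
        params["ContentEncoding"] = "gzip"
    return data, params
```

With this, `file.csv` is gzipped and labeled for transparent decompression, while `file.csv.gz` passes through untouched and round-trips to and from S3 without modification.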