
Don't re-compress files before upload to S3. #451 #615

Conversation

normangilmore

This proposed change does the following:

  1. django-storages will now upload .gz and similar files without re-compressing them (i.e. taking a gzip file and gzipping it again) before upload to S3.
  2. It won't set a Content-Encoding header that causes .gz and similar files to be automatically decompressed on download, which leaves a file with a .gz extension even though it is already decompressed.

These changes allow .gz files (and other compression types compatible with the Content-Encoding header) to be uploaded and downloaded without modification.

The original code in question is at
https://github.com/jschneier/django-storages/blob/master/storages/backends/s3boto3.py#L482

_type, encoding = mimetypes.guess_type(name)
...
if self.gzip and content_type in self.gzip_content_types:
    content = self._compress_content(content)
    parameters.update({'ContentEncoding': 'gzip'})
elif encoding:
    # If the content already has a particular encoding, set it
    parameters.update({'ContentEncoding': encoding})

Noting that mimetypes.guess_type("myfile.csv.gz") is ('text/csv', 'gzip').
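This is easy to confirm in a Python shell — the .gz suffix is reported as an *encoding*, not as part of the type, so a pre-compressed CSV still matches text/csv in gzip_content_types:

```python
import mimetypes

# guess_type() splits the name into a content type and an encoding;
# .gz is folded into the encoding, leaving the type as text/csv.
content_type, encoding = mimetypes.guess_type("myfile.csv.gz")
# content_type == 'text/csv', encoding == 'gzip'
```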

So if the file is file.csv.gz, and text/csv is in self.gzip_content_types, it will be compressed again in the first if clause. Downloading normally, this seems transparent, because the content encoding is reversed once by the browser, so you get your inner .gz file back with no surprises.

But if the files are being scanned on S3 by, say, AWS Glue, the double compression blocks the scan entirely.
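A quick sketch of what happens to the bytes (using a literal payload in place of a real upload):

```python
import gzip

payload = b"a,b\n1,2\n"            # the original CSV content
inner = gzip.compress(payload)     # the .gz file handed to the backend
double = gzip.compress(inner)      # what _compress_content produces

# The browser undoes Content-Encoding: gzip exactly once, so a normal
# download hands back the inner .gz -- the double wrapping is invisible
# there, but anything reading the object straight off S3 sees gzip-in-gzip.
assert gzip.decompress(double) == inner
assert gzip.decompress(gzip.decompress(double)) == payload
```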

Now look at the elif encoding: clause.
If the file is file.csv.gz, the Content-Encoding header will be added. On download, the browser will gunzip the file, but the filename is now wrong: the OS sees the .gz extension, and when you try to decompress the already-decompressed file, the OS decompression utility is baffled. You have to remove the .gz manually.

So to meet expectations of how these files should actually be named, given these settings, the first branch of the if should add a .gz extension to signal that double compression exists on S3, and the elif branch should remove the .gz extension to signal that the Content-Encoding has been undone. In effect, this if block prevents .gz files from round-tripping to and from S3 without modification.

To be clear, I am not advocating that the extension .gz actually be added or removed. I'm just pointing out that would be the logical entailment of what is being attempted.

This revised code does the following:
If a file is NOT compressed (encoding is None), AND the settings indicate it is a ContentType the user wants stored compressed and automatically decompressed, THEN compress it, set the ContentEncoding header, and upload it.

Otherwise, DO NOT set the content-encoding header, and upload the file as is.
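A minimal sketch of that branch logic, with GZIP_CONTENT_TYPES and upload_parameters standing in for the real backend attributes (the names and the _compress flag are illustrative, not the actual django-storages API):

```python
import mimetypes

GZIP_CONTENT_TYPES = {"text/csv", "text/plain", "application/json"}

def upload_parameters(name, content_type, gzip_enabled=True):
    """Decide the ContentEncoding header for an upload (illustrative sketch)."""
    _type, encoding = mimetypes.guess_type(name)
    params = {"ContentType": content_type}
    if encoding is None and gzip_enabled and content_type in GZIP_CONTENT_TYPES:
        # Not already compressed: compress once and record the encoding.
        params["ContentEncoding"] = "gzip"
        params["_compress"] = True  # caller would run _compress_content here
    # Files that already have an encoding (e.g. .gz) are uploaded as-is,
    # with no ContentEncoding header at all.
    return params
```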

When uploading a compressed file to S3, don't add ContentEncoding header.

These changes allow .gz files to be uploaded and downloaded without modification.
@normangilmore
Author

Thanks to @skruger for the work that got me quickly oriented on what to look for.

@normangilmore
Author

normangilmore commented Jan 11, 2019

Just discovered that my PR is not quite complete.

Django Storages uses these lines to set the content type:

_type, encoding = mimetypes.guess_type(name)
content_type = getattr(content, 'content_type', None)
content_type = content_type or _type or self.default_content_type

parameters.update({'ContentType': content_type})

So after my proposed PR, when I provide it with a file with the extension .csv.gz, the ContentType metadata on the object is set to text/csv and the encoding to None.

I realized today that on Windows, Chrome and Firefox just treat .csv.gz as the file type and download correctly from S3, without touching the filename or the content encoding.

Firefox on MacOS works the same as on Windows, just using the filename provided.

However, Chrome on MacOS looks at the ContentType of text/csv and the file extension of .gz, and just decides to strip the .gz and go with the .csv extension.

Safari just assumes it really is a CSV and splats the gzipped file on the screen, requiring a Ctrl-Click to "Download Link As", which then works reasonably.

So the shortcoming of my PR is that it does not set the ContentType to application/octet-stream for files ending in .csv.gz (or other suffixes that mimetypes considers an encoding rather than a ContentType).

I realize that interpreting the ContentType header is the standard, and file extensions are not supposed to be definitive.

Anyway, after going down this rabbit hole, I'm realizing that I'm fighting the tool. I should just provide text/csv files to Django Storages, add that text/csv to the setting GZIP_CONTENT_TYPES, let it gzip and set encoding headers, let my users' browsers remove the encoding, and take up more space on their drives than necessary.

But I'll need to test if Amazon Glue/Athena crawls files based on their extensions or based on the Amazon S3 ContentType and ContentEncoding metadata.

oof.

@normangilmore
Author

Another note.

If you set AWS_S3_FILE_OVERWRITE = False, it appends a random string to the filename to avoid a name collision.

So uploading schema.csv.gz once, then attempting to upload again will upload with a filename like schema.csv_OHHdroX.gz.

Since the random string is inserted between the .csv and the .gz, mimetypes no longer sees the .csv and will set the ContentType to application/octet-stream. So I just discovered that my duplicate files, auto-renamed to avoid a name collision, all end up with my desired ContentType of application/octet-stream.

Setting AWS_IS_GZIPPED = False doesn't avoid the mimetypes guessing, so I need to examine that set of scenarios closely as well.
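The effect of the collision renaming on type guessing can be reproduced directly:

```python
import mimetypes

# Original name: both type and encoding are recognised.
t1, e1 = mimetypes.guess_type("schema.csv.gz")   # ('text/csv', 'gzip')

# After collision renaming, the random suffix sits between .csv and .gz,
# so once the .gz encoding is stripped the stem has no known extension.
t2, e2 = mimetypes.guess_type("schema.csv_OHHdroX.gz")
# t2 is None, so the backend falls back to default_content_type
# (application/octet-stream); e2 is still 'gzip'.
```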

@skruger

skruger commented Jan 31, 2019

@normangilmore If you can get this merged then I will happily close #453. I had initially wanted to let people manually configure the broken behavior if they were dependent on it, but I like your approach.

@normangilmore
Author

@skruger I think I'll have to come up with a revised PR that is opt-in and doesn't mess with the legacy behavior, which I'm sure a lot of people are blissfully unaware of because it doesn't impact their use case.

I have not figured out whether there is a best way, or any way, to set S3 metadata on a per-file basis using Django Storages. There is the global setting on the storage object, object_parameters = setting('AWS_S3_OBJECT_PARAMETERS', {}), but it would be nice if there were a way to pass per-file parameters that merge into (or override) those and reach this internal method: self._save_content(obj, content, parameters=parameters).

I guess the temporary work-around is to get the object S3 key after saving and just update the S3 metadata using boto3 directly.
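That work-around might look something like the sketch below; the bucket name and key are illustrative, and it requires live AWS credentials to run. Note that S3 object metadata is immutable, so "updating" it means copying the object onto itself:

```python
import boto3

s3 = boto3.client("s3")
bucket, key = "my-bucket", "uploads/schema.csv.gz"  # illustrative names

# Replace the stored ContentType after django-storages has saved the
# object, by self-copying with MetadataDirective=REPLACE.
s3.copy_object(
    Bucket=bucket,
    Key=key,
    CopySource={"Bucket": bucket, "Key": key},
    ContentType="application/octet-stream",
    MetadataDirective="REPLACE",
)
```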

@jschneier
Owner

If you explicitly set a ContentEncoding, then as of #828 automatic gzip will be disabled, which I think should resolve this issue — please let me know if it does not.
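For reference, based on the maintainer's description of #828, the opt-out would look something like this in Django settings (a config sketch, not verified against a specific django-storages version):

```python
# settings.py -- explicitly setting ContentEncoding is described above
# as disabling the automatic gzip behaviour for uploads.
AWS_S3_OBJECT_PARAMETERS = {
    "ContentEncoding": "gzip",
}
```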

@jschneier jschneier closed this Feb 3, 2020