
Don't re-compress files before upload to S3. #451 #615

Conversation

normangilmore

This proposed change does the following:

  1. django-storages will now upload .gz and similar files without re-compressing them (i.e. taking a gzip file and gzipping it again) before upload to S3.
  2. It won't set a Content-Encoding header that causes .gz and similar files to be automatically decompressed on download, which leaves a file with a .gz extension even though it is already decompressed.

These changes allow .gz files (and other compression types compatible with the Content-Encoding header) to be uploaded and downloaded without modification.

The original code in question is at
https://github.com/jschneier/django-storages/blob/master/storages/backends/s3boto3.py#L482

_type, encoding = mimetypes.guess_type(name)
...
if self.gzip and content_type in self.gzip_content_types:
    content = self._compress_content(content)
    parameters.update({'ContentEncoding': 'gzip'})
elif encoding:
    # If the content already has a particular encoding, set it
    parameters.update({'ContentEncoding': encoding})

Noting that mimetypes.guess_type("myfile.csv.gz") is ('text/csv', 'gzip').
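This is easy to confirm in a Python shell — the .gz suffix is reported as an *encoding*, not as part of the type, so a pre-compressed CSV still matches text/csv in gzip_content_types:

```python
import mimetypes

# guess_type() splits the name into a content type and an encoding;
# .gz is folded into the encoding, leaving the type as text/csv.
content_type, encoding = mimetypes.guess_type("myfile.csv.gz")
# content_type == 'text/csv', encoding == 'gzip'
```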

So if the file is file.csv.gz, and text/csv is in self.gzip_content_types, it will be compressed again in the first if clause. Downloading normally, this seems transparent, because the content encoding is reversed once by the browser, so you get your inner .gz file back with no surprises.

But if the files are being scanned on S3 by, say, AWS Glue, the double compression blocks the scan entirely.
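A quick sketch of what happens to the bytes (using a literal payload in place of a real upload):

```python
import gzip

payload = b"a,b\n1,2\n"            # the original CSV content
inner = gzip.compress(payload)     # the .gz file handed to the backend
double = gzip.compress(inner)      # what _compress_content produces

# The browser undoes Content-Encoding: gzip exactly once, so a normal
# download hands back the inner .gz -- the double wrapping is invisible
# there, but anything reading the object straight off S3 sees gzip-in-gzip.
assert gzip.decompress(double) == inner
assert gzip.decompress(gzip.decompress(double)) == payload
```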

Now look at the elif encoding: clause.
If the file is file.csv.gz, the Content-Encoding header will be added. On download, the browser will gunzip the file, but the filename is now wrong: the OS sees the .gz extension, and when you try to decompress the already-decompressed file, the OS decompression utility is baffled. You have to remove the .gz manually.

So to meet expectations of how these files should actually be named, given these settings, the first branch of the if should add a .gz extension to signal that double compression exists on S3, and the elif branch should remove the .gz extension to signal that the Content-Encoding has been undone. In effect, this if block prevents .gz files from round-tripping to and from S3 without modification.

To be clear, I am not advocating that the extension .gz actually be added or removed. I'm just pointing out that would be the logical entailment of what is being attempted.

This revised code does the following:
If a file is NOT compressed (encoding is None), AND the settings indicate it is a ContentType the user wants stored compressed and automatically decompressed, THEN compress it, set the ContentEncoding header, and upload it.

Otherwise, DO NOT set the content-encoding header, and upload the file as is.
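A minimal sketch of that branch logic, with GZIP_CONTENT_TYPES and upload_parameters standing in for the real backend attributes (the names and the _compress flag are illustrative, not the actual django-storages API):

```python
import mimetypes

GZIP_CONTENT_TYPES = {"text/csv", "text/plain", "application/json"}

def upload_parameters(name, content_type, gzip_enabled=True):
    """Decide the ContentEncoding header for an upload (illustrative sketch)."""
    _type, encoding = mimetypes.guess_type(name)
    params = {"ContentType": content_type}
    if encoding is None and gzip_enabled and content_type in GZIP_CONTENT_TYPES:
        # Not already compressed: compress once and record the encoding.
        params["ContentEncoding"] = "gzip"
        params["_compress"] = True  # caller would run _compress_content here
    # Files that already have an encoding (e.g. .gz) are uploaded as-is,
    # with no ContentEncoding header at all.
    return params
```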

When uploading a compressed file to S3, don't add ContentEncoding header.

These changes allow .gz files to be uploaded and downloaded without modification.
@normangilmore
Author

Thanks to @skruger for the work that got me quickly oriented on what to look for.

@normangilmore
Author

normangilmore commented Jan 11, 2019

Just discovered that my PR is not quite complete.

Django Storages uses these lines to set the content type:

_type, encoding = mimetypes.guess_type(name)
content_type = getattr(content, 'content_type', None)
content_type = content_type or _type or self.default_content_type

parameters.update({'ContentType': content_type})

So after my proposed PR, when I provide it with a file with the extension .csv.gz, the ContentType metadata on the object is set to text/csv and the encoding to None.

I realized today that on Windows, Chrome and Firefox just treat .csv.gz as the file type and download correctly from S3, without touching the filename or the content encoding.

Firefox on MacOS works the same as on Windows, just using the filename provided.

However, Chrome on MacOS looks at the ContentType of text/csv and the file extension of .gz, and just decides to strip the .gz and go with the .csv extension.

Safari just assumes it really is a CSV and splats the gzipped file on the screen, requiring a Ctrl-Click to "Download Link As", which then works reasonably.

So the shortcoming of my PR is that it does not set the ContentType to application/octet-stream for files ending in .csv.gz (or other suffixes that mimetypes considers an encoding rather than a ContentType).

I realize that interpreting the ContentType header is the standard, and file extensions are not supposed to be definitive.

Anyway, after going down this rabbit hole, I'm realizing that I'm fighting the tool. I should just provide text/csv files to Django Storages, add that text/csv to the setting GZIP_CONTENT_TYPES, let it gzip and set encoding headers, let my users' browsers remove the encoding, and take up more space on their drives than necessary.

But I'll need to test if Amazon Glue/Athena crawls files based on their extensions or based on the Amazon S3 ContentType and ContentEncoding metadata.

oof.

@normangilmore
Author

Another note.

If you set AWS_S3_FILE_OVERWRITE = False, it appends a random string to the filename to avoid a name collision.

So uploading schema.csv.gz once, then attempting to upload again will upload with a filename like schema.csv_OHHdroX.gz.

Since the random string is inserted between the .csv and the .gz, mimetypes no longer sees the .csv and will set the ContentType to application/octet-stream. So I just discovered that my duplicate files, auto-renamed to avoid a name collision, all end up with my desired ContentType of application/octet-stream.

Setting AWS_IS_GZIPPED = False doesn't avoid the mimetypes guessing, so I need to examine that set of scenarios closely as well.
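The effect of the collision renaming on type guessing can be reproduced directly:

```python
import mimetypes

# Original name: both type and encoding are recognised.
t1, e1 = mimetypes.guess_type("schema.csv.gz")   # ('text/csv', 'gzip')

# After collision renaming, the random suffix sits between .csv and .gz,
# so once the .gz encoding is stripped the stem has no known extension.
t2, e2 = mimetypes.guess_type("schema.csv_OHHdroX.gz")
# t2 is None, so the backend falls back to default_content_type
# (application/octet-stream); e2 is still 'gzip'.
```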

@skruger

skruger commented Jan 31, 2019

@normangilmore If you can get this merged then I will happily close #453. I had initially wanted to let people manually configure the broken behavior if they were dependent on it, but I like your approach.

@normangilmore
Author

@skruger I think I'll have to come up with a revised PR that is opt-in and doesn't mess with the legacy behavior, which I'm sure a lot of people are blissfully unaware of because it doesn't impact their use case.

I have not figured out whether there is a best way, or any way, to set S3 metadata on a per-file basis using Django Storages. There is the global setting on the storage object, object_parameters = setting('AWS_S3_OBJECT_PARAMETERS', {}), but it would be nice if there were a way to pass per-file parameters that merge into (or override) those and reach this internal method: self._save_content(obj, content, parameters=parameters).

I guess the temporary work-around is to get the object S3 key after saving and just update the S3 metadata using boto3 directly.
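That work-around might look something like the sketch below; the bucket name and key are illustrative, and it requires live AWS credentials to run. Note that S3 object metadata is immutable, so "updating" it means copying the object onto itself:

```python
import boto3

s3 = boto3.client("s3")
bucket, key = "my-bucket", "uploads/schema.csv.gz"  # illustrative names

# Replace the stored ContentType after django-storages has saved the
# object, by self-copying with MetadataDirective=REPLACE.
s3.copy_object(
    Bucket=bucket,
    Key=key,
    CopySource={"Bucket": bucket, "Key": key},
    ContentType="application/octet-stream",
    MetadataDirective="REPLACE",
)
```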

@jschneier
Owner

If you explicitly set a ContentEncoding, then as of #828 automatic gzip will be disabled, which I think should resolve this issue — please let me know if it does not.
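For reference, based on the maintainer's description of #828, the opt-out would look something like this in Django settings (a config sketch, not verified against a specific django-storages version):

```python
# settings.py -- explicitly setting ContentEncoding is described above
# as disabling the automatic gzip behaviour for uploads.
AWS_S3_OBJECT_PARAMETERS = {
    "ContentEncoding": "gzip",
}
```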

@jschneier jschneier closed this Feb 3, 2020