fix: wrong file length after exif strip #36676
Merged
Proposed changes (including videos or screenshots)
Updates the recorded file size after stripping EXIF data. Without this, the size still reflects the original file, which appears to cause upload failures with the S3 storage type.
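To illustrate the idea, here is a minimal Python sketch (not the Rocket.Chat code). `strip_marker` and `prepare_upload` are hypothetical names, and the scrubber is simulated by simply removing a byte sequence. The point is only that the size must be measured on the output of the strip step, not on its input.

```python
def strip_marker(data: bytes, marker: bytes) -> bytes:
    """Hypothetical stand-in for a metadata scrubber: drops every occurrence of `marker`."""
    return data.replace(marker, b"")


def prepare_upload(data: bytes, marker: bytes) -> dict:
    """Strip the marker, then record the size of what will actually be sent."""
    stale_size = len(data)           # size measured before stripping (the bug)
    body = strip_marker(data, marker)
    return {
        "body": body,
        "stale_size": stale_size,
        "size": len(body),           # size measured after stripping (the fix)
    }


# Illustrative payload: 6 marker bytes sit between a header and the content.
upload = prepare_upload(b"header" + b"Exif\x00\x00" + b"payload", b"Exif\x00\x00")
```

Here `upload["stale_size"]` is larger than `upload["size"]` by the number of removed bytes. A storage backend that verifies the declared length against the received body would be expected to reject the stale value.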
Before
After
Issue(s)
https://rocketchat.atlassian.net/browse/SUP-833
Steps to test or reproduce
This can only be reproduced with the AWS S3 storage type, because of its recently introduced integrity checks:
https://aws.amazon.com/blogs/aws/introducing-default-data-integrity-protections-for-new-objects-in-amazon-s3/
I used the following Python script in Google Colab to generate a minimal PDF file that triggers the issue:
test_exif_embed.pdf
Further comments
Initially, the bug surfaced because the Exif stripping logic in exif-be-gone’s ExifTransformer._scrubOther is applied to all "other" file types, including PDFs. However, PDF files do not contain EXIF, XMP, or FLIR markers in the same way as images, and stripping bytes from them can corrupt their structure, especially the xref table.
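A small Python sketch of why removing bytes breaks the xref table. This is a simplified model, not a valid PDF and not the exif-be-gone implementation: `build_pdf_like` and `points_at` are hypothetical helpers, and the scrubber is simulated with a plain byte removal. An xref table stores absolute byte offsets, so any bytes removed earlier in the file leave the later offsets pointing at the wrong place.

```python
def build_pdf_like(objects: list[bytes]) -> tuple[bytes, list[int]]:
    """Concatenate objects after a header and record where each one starts."""
    buf = b"%PDF-1.4\n"
    offsets = []
    for obj in objects:
        offsets.append(len(buf))  # absolute offset, as an xref entry would store it
        buf += obj
    return buf, offsets


def points_at(buf: bytes, offset: int, header: bytes) -> bool:
    """True if `header` starts exactly at `offset` in `buf`."""
    return buf[offset:offset + len(header)] == header


objects = [
    b"1 0 obj\n<< /Note (Exif\x00\x00 marker) >>\nendobj\n",
    b"2 0 obj\n<< /Type /Catalog >>\nendobj\n",
]
original, offsets = build_pdf_like(objects)

# Simulated scrub: remove the 6 marker bytes found inside object 1.
stripped = original.replace(b"Exif\x00\x00", b"")

shift = len(original) - len(stripped)
before_ok = points_at(original, offsets[1], b"2 0 obj")
after_ok = points_at(stripped, offsets[1], b"2 0 obj")
```

In this model the recorded offset for object 2 is correct in `original` but off by `shift` bytes in `stripped`, and the total length shrinks by the same amount.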
To strip Exif from images embedded in a PDF, we would need to:
This is a much more complex task and requires a dedicated PDF library to manipulate PDF internals. The current stream-based approach is not sufficient for this.
Although we could exclude PDFs from EXIF stripping, that would only hide this bug, and in practice most PDF software can still open a file with an invalid xref table.