@cardoso cardoso commented Aug 10, 2025

Proposed changes (including videos or screenshots)

Updates the file size after stripping EXIF data. The stale size recorded before stripping seems to be causing issues with the S3 storage type when uploading files.

Before: [screenshot]

After: [screenshot]

Issue(s)

https://rocketchat.atlassian.net/browse/SUP-833

Steps to test or reproduce

This can only be reproduced with the AWS S3 storage type, due to recently introduced integrity checks:
https://aws.amazon.com/blogs/aws/introducing-default-data-integrity-protections-for-new-objects-in-amazon-s3/

Additionally, S3 validates the entire file’s size and checksum when you call the CompleteMultipartUpload API.
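To make the failure mode concrete, here is a toy model of that validation. It is only a sketch: `complete_upload`, `IntegrityError`, and `strip_marker` are hypothetical names, and this is not how S3 or Rocket.Chat implement it. It only shows that a size recorded before stripping no longer matches the bytes that are actually sent.

```python
import zlib


class IntegrityError(Exception):
    """Raised when the received bytes do not match what the client declared."""


def complete_upload(parts, declared_size, declared_crc32=None):
    """Toy stand-in for a store that validates the assembled object."""
    body = b"".join(parts)
    if len(body) != declared_size:
        raise IntegrityError(f"declared {declared_size} bytes, received {len(body)}")
    if declared_crc32 is not None and zlib.crc32(body) != declared_crc32:
        raise IntegrityError("checksum mismatch")
    return body


def strip_marker(data, marker=b"Exif\x00\x00"):
    """Hypothetical scrubber: removes a marker, so the output gets shorter."""
    return data.replace(marker, b"")


original = b"%PDF-1.7 ... Exif\x00\x00 ... %%EOF"
stripped = strip_marker(original)

# Size recorded before stripping: the toy store rejects the upload.
try:
    complete_upload([stripped], declared_size=len(original))
    stale_size_accepted = True
except IntegrityError:
    stale_size_accepted = False

# Size (and checksum) recomputed after stripping: the upload is accepted.
stored = complete_upload(
    [stripped],
    declared_size=len(stripped),
    declared_crc32=zlib.crc32(stripped),
)
```

The point of the fix is the second call: whatever size is reported to the storage backend has to be measured after the transform has run.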

I used the following Python script in Google Colab to generate a minimal PDF file that triggers the issue:
test_exif_embed.pdf

# Install dependencies first (in Colab: !pip install pillow piexif pymupdf)
import fitz  # PyMuPDF
import piexif
from PIL import Image

# Step 1: Create a JPEG with EXIF (or use your own)
jpeg_path = "exif_test.jpg"  # Or point this at your own JPEG file with EXIF
img = Image.new('RGB', (400, 300), color='blue')
exif_dict = {"0th": {piexif.ImageIFD.Artist: "Test Artist"}}
exif_bytes = piexif.dump(exif_dict)
img.save(jpeg_path, exif=exif_bytes)

# Step 2: Create a PDF and embed the JPEG as an image XObject
doc = fitz.open()
page = doc.new_page(width=600, height=800)
rect = fitz.Rect(100, 100, 500, 400)

with open(jpeg_path, "rb") as f:
    img_bytes = f.read()

# Embed the JPEG as-is (preserves EXIF)
page.insert_image(rect, stream=img_bytes)

doc.save("test_exif_embed.pdf")
print("PDF with embedded EXIF image generated: test_exif_embed.pdf")
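Before uploading, it can help to confirm that the generated file really carries an Exif marker in its raw bytes. The check below is a rough sketch: `find_exif_offsets` is a hypothetical helper, and it assumes the JPEG stream was written into the PDF as-is. If the PDF writer recompresses the stream, the marker will not be visible this way.

```python
import os

EXIF_HEADER = b"Exif\x00\x00"  # identifier that opens a JPEG APP1 Exif segment


def find_exif_offsets(data: bytes) -> list[int]:
    """Return the byte offset of every Exif header found in data."""
    offsets = []
    start = data.find(EXIF_HEADER)
    while start != -1:
        offsets.append(start)
        start = data.find(EXIF_HEADER, start + 1)
    return offsets


pdf_path = "test_exif_embed.pdf"
if os.path.exists(pdf_path):
    with open(pdf_path, "rb") as f:
        print(find_exif_offsets(f.read()))
```

An empty list means the raw bytes contain no such header, in which case the file is unlikely to exercise the stripping path.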

Further comments

The bug surfaced because the EXIF stripping logic in exif-be-gone’s ExifTransformer._scrubOther is applied to all "other" file types, including PDFs. However, PDF files do not contain EXIF, XMP, or FLIR markers in the same way as images, and stripping bytes from them can corrupt their structure, especially the xref table, which records the byte offset of every object in the file.
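A toy example shows why removing bytes breaks offset-based lookups. The "PDF" built here is deliberately simplified (no real xref section or trailer), and `build_toy_pdf` and `offsets_valid` are hypothetical helpers. It only demonstrates that offsets recorded before stripping stop pointing at object headers once bytes earlier in the file disappear.

```python
def build_toy_pdf(objects):
    """Return (data, offsets): concatenated objects plus each object's byte offset."""
    data = b"%PDF-1.7\n"
    offsets = []
    for number, body in enumerate(objects, start=1):
        offsets.append(len(data))
        data += b"%d 0 obj\n" % number + body + b"\nendobj\n"
    return data, offsets


def offsets_valid(data, offsets):
    """An xref-style offset is valid if 'N 0 obj' really starts at that byte."""
    return [
        data.startswith(b"%d 0 obj" % number, offset)
        for number, offset in enumerate(offsets, start=1)
    ]


data, offsets = build_toy_pdf([
    b"<< /Type /Catalog >>",
    b"stream\nJPEGDATA Exif\x00\x00 MOREDATA\nendstream",
    b"<< /Type /Page >>",
])

before = offsets_valid(data, offsets)

# Remove six bytes from inside object 2 without updating the recorded offsets.
scrubbed = data.replace(b"Exif\x00\x00", b"")
after = offsets_valid(scrubbed, offsets)
```

Objects that start before the removed bytes keep valid offsets. Every object after them is shifted, so its recorded offset is stale.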

To strip Exif from images embedded in a PDF, we would need to:

  1. Parse the PDF structure.
  2. Extract each embedded image.
  3. Strip Exif from each image individually.
  4. Re-embed the cleaned images back into the PDF.

This is a much more complex task and requires a dedicated PDF library to manipulate PDF internals. The current stream-based approach is not sufficient for this.
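For reference, step 3 on its own is the easy part. The sketch below strips APP1 Exif segments from a standalone JPEG byte string in pure Python; `strip_exif_from_jpeg` is a hypothetical helper, not part of exif-be-gone, and it assumes a well-formed header where every segment before the scan data carries a length field. Steps 1, 2, and 4 would still need a PDF library so that stream lengths and the xref table are rewritten consistently.

```python
import struct


def strip_exif_from_jpeg(jpeg: bytes) -> bytes:
    """Drop APP1 segments that start with the Exif identifier; keep everything else."""
    if jpeg[:2] != b"\xff\xd8":
        raise ValueError("not a JPEG: missing SOI marker")
    out = bytearray(b"\xff\xd8")
    pos = 2
    while pos + 4 <= len(jpeg):
        if jpeg[pos] != 0xFF:
            raise ValueError(f"expected a marker at byte {pos}")
        marker = jpeg[pos + 1]
        if marker == 0xDA:  # SOS: scan data follows, copy the rest verbatim
            break
        # The two-byte big-endian length counts itself but not the marker bytes.
        (length,) = struct.unpack(">H", jpeg[pos + 2:pos + 4])
        segment = jpeg[pos:pos + 2 + length]
        is_exif = marker == 0xE1 and segment[4:10] == b"Exif\x00\x00"
        if not is_exif:
            out += segment
        pos += 2 + length
    out += jpeg[pos:]
    return bytes(out)


# Synthetic JPEG-like sample: SOI, APP0 (JFIF), APP1 (Exif), SOS with scan data, EOI.
app0 = b"\xff\xe0" + struct.pack(">H", 2 + 5) + b"JFIF\x00"
exif_segment = b"\xff\xe1" + struct.pack(">H", 2 + 6 + 4) + b"Exif\x00\x00" + b"TIFF"
tail = b"\xff\xda" + b"\x00\x02" + b"scan" + b"\xff\xd9"
sample = b"\xff\xd8" + app0 + exif_segment + tail

cleaned = strip_exif_from_jpeg(sample)
```

Because the cleaned image is shorter than the original, re-embedding it changes the length of the containing PDF stream, which is exactly why the surrounding PDF structure has to be regenerated rather than patched in a byte stream.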

Although we could exclude PDFs from EXIF stripping, that would only hide this bug; in practice, most PDF software still works even with an invalid xref table.

dionisio-bot bot commented Aug 10, 2025

Looks like this PR is ready to merge! 🎉
If you have any trouble, please check the PR guidelines

changeset-bot bot commented Aug 10, 2025

⚠️ No Changeset found

Latest commit: 8ae3e55

Merging this PR will not cause a version bump for any packages. If these changes should not result in a new version, you're good to go. If these changes should result in a version bump, you need to add a changeset.


github-actions bot commented Aug 10, 2025

PR Preview Action v1.6.2

🚀 View preview at
https://RocketChat.github.io/Rocket.Chat/pr-preview/pr-36676/

Built to branch gh-pages at 2025-08-10 23:13 UTC.
Preview will be ready when the GitHub Pages deployment is complete.

@cardoso cardoso added this to the 7.10.0 milestone Aug 10, 2025
@cardoso cardoso marked this pull request as ready for review August 10, 2025 23:11
@cardoso cardoso requested a review from a team as a code owner August 10, 2025 23:11
@cardoso cardoso added the stat: QA assured Means it has been tested and approved by a company insider label Aug 11, 2025
@dionisio-bot dionisio-bot bot added the stat: ready to merge PR tested and approved waiting for merge label Aug 11, 2025
@kodiakhq kodiakhq bot merged commit 61bca86 into develop Aug 11, 2025
87 of 89 checks passed
@kodiakhq kodiakhq bot deleted the fix/exif-length branch August 11, 2025 16:16