Fix TIFF processing. Add tests to prevent regression in OCR for gif, jpg, jp2, tiff, webp #587

catileptic · 2024-02-22T13:50:34Z

This PR addresses the fact that some TIFF images were not being OCR'ed correctly.

The cause was that, when a TIFF file contained JPEG-compressed data, the tiff2pdf command, as it was invoked before this PR, generated a PDF with an empty image.

To address this, we have added the -n and -j flags to the tiff2pdf command:

the -n flag results in the JPEG-compressed data actually being written to the PDF (according to the tiff2pdf man page, this flag disables passthrough of compressed data, so the image data is decompressed before being written; not 100% sure on this)
the -j flag sets the compression type to JPEG and keeps the resulting PDF from being blown up in size

However, TIFF images that do not contain JPEG-compressed data are not converted correctly to PDF with these flags: tiff2pdf errors out when the -j flag is used. This PR therefore first runs the command with the -n and -j flags and, if that fails, retries without them. The reverse order does not work because, when the TIFF contains JPEG-compressed data, tiff2pdf without these flags does not fail; it produces a PDF with an empty image instead.
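The try-then-fall-back order described above can be sketched roughly as follows. This is a minimal illustration, not the code merged in this PR: the function names `build_tiff2pdf_args` and `convert_tiff_to_pdf` and the injectable `run` parameter are assumptions made for the example, whereas the real ingestor calls `self.exec_command`.

```python
import subprocess
from typing import Callable, List


def build_tiff2pdf_args(file_path: str, pdf_path: str, jpeg: bool) -> List[str]:
    """Build the tiff2pdf argument list, optionally with the -n and -j flags."""
    args = ["tiff2pdf", file_path]
    if jpeg:
        # -n: no passthrough of compressed data, -j: compress with JPEG
        args += ["-n", "-j"]
    args += ["-x", "300", "-y", "300", "-o", pdf_path]
    return args


def convert_tiff_to_pdf(
    file_path: str,
    pdf_path: str,
    run: Callable[..., object] = subprocess.run,  # hypothetical hook for testing
) -> str:
    """Try the JPEG-aware invocation first, then retry without the flags."""
    try:
        run(build_tiff2pdf_args(file_path, pdf_path, jpeg=True), check=True)
    except subprocess.CalledProcessError:
        # Per the PR description, TIFFs without JPEG-compressed data fail
        # when -j is used, so fall back to the plain command.
        run(build_tiff2pdf_args(file_path, pdf_path, jpeg=False), check=True)
    return pdf_path
```

The order matters here: only the flagged invocation signals a mismatch through a non-zero exit code, so it has to be the first attempt.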

This PR also adds a test that specifically guards against a regression in the OCR behaviour described above. In addition, it updates the existing TIFF parsing test: a TIFF image is converted into one "Pages" entity with several "Page" entities, and only the "Pages" entity has the "mimeType" property set, so the test now looks for that entity before asserting that its properties contain the expected values.
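The lookup described for the updated test might look roughly like this. It is a self-contained sketch with stand-in types: `FakeEntity`, `find_pages_entity`, and the `"image/tiff"` value are assumptions for illustration, and the actual test works with the entities emitted by the ingestors test harness.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable, List


@dataclass
class FakeEntity:
    """Hypothetical stand-in for an emitted entity (schema name + properties)."""
    schema: str
    properties: Dict[str, List[str]] = field(default_factory=dict)


def find_pages_entity(entities: Iterable[FakeEntity]) -> FakeEntity:
    """Return the single "Pages" entity among the emitted entities."""
    for entity in entities:
        if entity.schema == "Pages":
            return entity
    raise AssertionError('no "Pages" entity was emitted')


# One "Pages" entity plus several "Page" entities, as described above.
emitted = [
    FakeEntity("Page", {"index": ["1"]}),
    FakeEntity("Pages", {"mimeType": ["image/tiff"]}),
    FakeEntity("Page", {"index": ["2"]}),
]

pages = find_pages_entity(emitted)
# Only the "Pages" entity carries mimeType, so assert on that one.
assert pages.properties["mimeType"] == ["image/tiff"]
```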

tillprochaska · 2024-02-28T12:11:36Z

ingestors/media/tiff.py

@@ -23,7 +23,8 @@ def ingest(self, file_path, entity):
        entity.schema = model.get("Pages")
        pdf_path = self.make_work_file("tiff.pdf")
        self.exec_command(
-            "tiff2pdf", file_path, "-x", "300", "-y", "300", "-o", pdf_path
+            "tiff2pdf", file_path, "-n", "-j", "-x", "300", "-y", "300", "-o", pdf_path


Could you explain why TIFF ingestion was previously failing with these two options enabled? It’s not obvious to me why they had an impact and why removing them fixes the issue.

Also, did you figure out why only some TIFF files were affected and others were not?

@tillprochaska I added a thorough explanation in the PR description. Let me know if it is informative.

Thanks for the detailed explanation, this was really helpful.

One follow-up question: my understanding of TIFF is very limited, but since a TIFF file can have multiple pages and can apparently embed image data in other image formats, is it possible for a single TIFF file to contain both JPEG-compressed image data and other data?

tillprochaska · 2024-02-28T17:36:51Z

Just for future reference, not necessarily something we need to handle right now: I found this issue in the libtiff repo, which might describe a similar problem. https://gitlab.com/libtiff/libtiff/-/issues/13

When running a JPEG-compressed TIFF file through the tiff2pdf tool (which basically just changes the TIFF wrapper for a PDF wrapper since PDF can have JPEG-compressed image data in it), the resulting PDF file is not viewable in Acrobat Reader, evince, or Ghostscript, although xpdf does handle it fine.

If I understand the referenced issue correctly, it may have been fixed in a recent libtiff version. The version we’re using is quite old (4.1.0, released in 2019).

catileptic added 4 commits February 22, 2024 12:41

Remove handwritten jpg file (text fixture). It has been replaced with a readable jpg file.

b766f4e

Add tests to prevent regression in OCR for gif, jpg, jp2, tiff, webp

0f3ca52

Fix typos

7842d61

Fix the TIFF to PDF conversion command. Add TIFF test.

d701931

tillprochaska reviewed Feb 28, 2024

Handle whether JPEG compression exists in the TIFF image

1e6ef93

catileptic changed the title ~~Add tests to prevent regression in OCR for gif, jpg, jp2, tiff, webp~~ Fix TIFF processing. Add tests to prevent regression in OCR for gif, jpg, jp2, tiff, webp Feb 28, 2024

Fix linting issues

031f830

stchris added this pull request to the merge queue Feb 29, 2024

Merged via the queue into main with commit f160f1b Feb 29, 2024
