Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix TIFF processing. Add tests to prevent regression in OCR for gif, jpg, jp2, tiff, webp #587

Merged
merged 6 commits into from
Feb 29, 2024

Conversation

catileptic
Copy link
Contributor

@catileptic catileptic commented Feb 22, 2024

This PR addresses the fact that some TIFF images were not being OCR'ed correctly.

This stemmed from the fact that, if the TIFF file contained data with JPEG compression, the tiff2pdf command, as it existed before this commit, would generate a PDF with an empty image.

To address this, we have added the -n and -j flags to the tiff2pdf command:

  • the -n command results in the JPEG-compressed data actually being written to the PDF (according to the tiff2pdf man page this flag sets "no passthrough" option? - not 100% sure on this)
  • the -j flag sets the compression type and keeps the resulting PDF from being blown up in size

Now, TIFF images that do not contain JPEG-compressed data are not converted correctly to PDF and error out when the -j flag is used. Thus, this PR attempts to first use the -n and -j flags and then tries to run the command without them, in case it fails. This does not work the other way around, because, if there is JPEG-compressed data in the TIFF, the tiff2pdf command does not fail, but instead produces an empty image inside a PDF.

This PR also adds a test that specifically checks a regression in the OCR behaviour described in this PR. It also updates the existing TIFF parsing test because TIFF images get converted into one "Pages" entity with several "Page" entities and it's only the "Pages" entity that has the "mimeType" property set, so the test specifically looks for that entity before asserting whether or not its properties contain the expected values.

@@ -23,7 +23,8 @@ def ingest(self, file_path, entity):
entity.schema = model.get("Pages")
pdf_path = self.make_work_file("tiff.pdf")
self.exec_command(
"tiff2pdf", file_path, "-x", "300", "-y", "300", "-o", pdf_path
"tiff2pdf", file_path, "-n", "-j", "-x", "300", "-y", "300", "-o", pdf_path
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you explain why TIFF ingestion was previously failing with these to options enabled? It’s not obvious to me what they had an impact and why removing them fixes the issue.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, did you figure out why only some TIFF files were affected and others not?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tillprochaska added a thorough explanation in the PR description, let me know if it is informative

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the detailed explanation, this was really helpful.

One follow up question: My understanding of TIFF is very limited, but as TIFF can have multiple pages etc. and apparently embed image data using other image formats, is it possible that a TIFF file contains both JPEG compressed image data as well as other data?

@catileptic catileptic changed the title Add tests to prevent regression in OCR for gif, jpg, jp2, tiff, webp Fix TIFF processing. Add tests to prevent regression in OCR for gif, jpg, jp2, tiff, webp Feb 28, 2024
@tillprochaska
Copy link
Contributor

Just for future reference, not necessarily something we need to handle right now. But I found this issue in the libtiff repo which might describe a similar issue: https://gitlab.com/libtiff/libtiff/-/issues/13

When running a JPEG-compressed TIFF file through the tiff2pdf tool (which basically just changes the TIFF wrapper for a PDF wrapper since PDF can have JPEG-compressed image data in it), the resulting PDF file is not viewable in Acrobat Reader, evince, or Ghostscript, although xpdf does handle it fine.

If I understand the referenced issue correctly, it might be that it has been fixed in recent libtiff version. The version we’re using is quite old (4.1.0, released in 2019)

@stchris stchris added this pull request to the merge queue Feb 29, 2024
Merged via the queue into main with commit f160f1b Feb 29, 2024
1 check passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants