-
Notifications
You must be signed in to change notification settings - Fork 28
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix TIFF processing. Add tests to prevent regression in OCR for gif, jpg, jp2, tiff, webp #587
Conversation
ingestors/media/tiff.py
Outdated
@@ -23,7 +23,8 @@ def ingest(self, file_path, entity): | |||
entity.schema = model.get("Pages") | |||
pdf_path = self.make_work_file("tiff.pdf") | |||
self.exec_command( | |||
"tiff2pdf", file_path, "-x", "300", "-y", "300", "-o", pdf_path | |||
"tiff2pdf", file_path, "-n", "-j", "-x", "300", "-y", "300", "-o", pdf_path |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you explain why TIFF ingestion was previously failing with these to options enabled? It’s not obvious to me what they had an impact and why removing them fixes the issue.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, did you figure out why only some TIFF files were affected and others not?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@tillprochaska added a thorough explanation in the PR description, let me know if it is informative
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the detailed explanation, this was really helpful.
One follow up question: My understanding of TIFF is very limited, but as TIFF can have multiple pages etc. and apparently embed image data using other image formats, is it possible that a TIFF file contains both JPEG compressed image data as well as other data?
Just for future reference, not necessarily something we need to handle right now. But I found this issue in the libtiff repo which might describe a similar issue: https://gitlab.com/libtiff/libtiff/-/issues/13
If I understand the referenced issue correctly, it might be that it has been fixed in recent libtiff version. The version we’re using is quite old (4.1.0, released in 2019) |
This PR addresses the fact that some TIFF images were not being OCR'ed correctly.
This stemmed from the fact that, if the TIFF file contained data with JPEG compression, the
tiff2pdf
command, as it existed before this commit, would generate a PDF with an empty image.To address this, we have added the
-n
and-j
flags to thetiff2pdf
command:-n
command results in the JPEG-compressed data actually being written to the PDF (according to the tiff2pdf man page this flag sets "no passthrough" option? - not 100% sure on this)-j
flag sets the compression type and keeps the resulting PDF from being blown up in sizeNow, TIFF images that do not contain JPEG-compressed data are not converted correctly to PDF and error out when the
-j
flag is used. Thus, this PR attempts to first use the-n
and-j
flags and then tries to run the command without them, in case it fails. This does not work the other way around, because, if there is JPEG-compressed data in the TIFF, thetiff2pdf
command does not fail, but instead produces an empty image inside a PDF.This PR also adds a test that specifically checks a regression in the OCR behaviour described in this PR. It also updates the existing TIFF parsing test because TIFF images get converted into one "Pages" entity with several "Page" entities and it's only the "Pages" entity that has the "mimeType" property set, so the test specifically looks for that entity before asserting whether or not its properties contain the expected values.