ROB: improve inline image extraction #2622
Codecov Report: All modified and coverable lines are covered by tests ✅

Additional details and impacted files (coverage diff, main vs. #2622):

| Metric | main | #2622 | +/- |
|---|---|---|---|
| Coverage | 95.02% | 95.07% | +0.04% |
| Files | 50 | 51 | +1 |
| Lines | 8366 | 8490 | +124 |
| Branches | 1674 | 1694 | +20 |
| Hits | 7950 | 8072 | +122 |
| Misses | 258 | 263 | +5 |
| Partials | 158 | 155 | -3 |
new test file:
use of:
@stefan6419846, @MartinThoma, @MasterOdin
It is rather unlikely that I will be able to provide further (real) files for testing. While I have access to lots of PDF files in theory, most of them are confidential. Additionally, the routines I usually use omit inline images completely, as they provide no value for my use cases. If possible, I would prefer to avoid two different implementations of the same filter algorithms. We already have sufficient coverage for the filters outside the inline image extraction, so re-using them would make more sense and avoid larger coverage issues.
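For readers unfamiliar with the filters under discussion, here is a minimal sketch of run-length decoding as the PDF specification describes the RunLengthDecode filter. It is not pypdf's implementation; the function name `run_length_decode` is hypothetical and only illustrates the kind of logic that would be duplicated if inline image extraction carried its own copy of a filter.

```python
def run_length_decode(data: bytes) -> bytes:
    """Decode RunLengthDecode data (hypothetical helper, not pypdf's API)."""
    out = bytearray()
    i = 0
    while i < len(data):
        length = data[i]
        i += 1
        if length == 128:
            # 128 is the end-of-data marker.
            break
        if length < 128:
            # 0..127: copy the next length + 1 bytes literally.
            out += data[i : i + length + 1]
            i += length + 1
        else:
            # 129..255: repeat the next byte 257 - length times.
            if i >= len(data):
                break  # truncated input, stop quietly
            out += data[i : i + 1] * (257 - length)
            i += 1
    return bytes(out)


# Three literal bytes, then "x" repeated four times, then end-of-data.
encoded = bytes([2]) + b"abc" + bytes([253, ord("x"), 128])
print(run_length_decode(encoded))
```

The end-of-data marker matters for inline images in particular, since their data carries no explicit length and the parser has to work out where the image ends.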
@stefan6419846
add test file for RL encoding:
Besides the above remarks, there are two bigger issues I would like to talk about:
Co-authored-by: Stefan <96178532+stefan6419846@users.noreply.github.com>
@stefan6419846
Yes, this is correct.
## What's new

### New Features (ENH)
- Accept ETen-B5 and UniCNS-UTF16 encodings (#2721) by @pubpub-zz
- Add decode_as_image() to ContentStreams (#2615) by @pubpub-zz
- context manager for PdfReader (#2666) by @tibor-reiss
- Add capability to set font and size in fields (#2636) by @pubpub-zz
- Allow to pass input file without named argument (#2576) by @pubpub-zz

### Bug Fixes (BUG)
- Fix deprecation for Ressources when using old constants (#2705) by @stefan6419846
- Fix images issue 4 bits encoding and LUT starting with UTF16_BOM (#2675) by @pubpub-zz
- Reading large compressed images takes huge time to process (#2644) by @snanda85
- Highlighted Text Cannot Be Printed (#2604) by @Nifury
- Fix UnboundLocalError on malformed pdf (#2619) by @farjasju

### Documentation (DOC)
- Various improvements on docstrings and examples by @j-t-1

### Robustness (ROB)
- Cope with missing Standard 14 fonts in fields (#2677) by @pubpub-zz
- Improve inline image extraction (#2622) by @pubpub-zz
- Cope with loops in Fields tree (#2656) by @pubpub-zz
- Discard /I in choice fields for compatibility with Acrobat (#2614) by @pubpub-zz
- Cope with some issues in pillow (#2595) by @pubpub-zz
- Cope with some image extraction issues (#2591) by @pubpub-zz

### Maintenance (MAINT)
- Deprecate interiour_color with replacement interior_color (#2706) by @j-t-1
- Add deprecate_with_replacement to PdfWriter.find_bookmark (#2674) by @j-t-1

### Code Style (STY)
- Change Link to be a non-markup annotation (#2714) by @j-t-1

[Full Changelog](4.2.0...4.3.0)
@pubpub-zz In the meantime, I have stumbled upon some PDF files in which I see side effects of this PR. There is no explicit inline image parsing involved, but apparently some bad interaction. See #2927 for more details.
closes #2598