ROB: improve inline image extraction #2622

pubpub-zz · 2024-05-03T21:05:50Z

codecov · 2024-05-04T13:12:11Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 95.07%. Comparing base (c8d722c) to head (7be1fd6).
Report is 64 commits behind head on main.

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #2622      +/-   ##
==========================================
+ Coverage   95.02%   95.07%   +0.04%     
==========================================
  Files          50       51       +1     
  Lines        8366     8490     +124     
  Branches     1674     1694      +20     
==========================================
+ Hits         7950     8072     +122     
- Misses        258      263       +5     
+ Partials      158      155       -3


pubpub-zz · 2024-05-04T14:36:11Z

new test file:
Pages.62.73.from.0560-22_WSP.Plan_July.2022_Version.1.pdf

pubpub-zz · 2024-05-05T19:41:10Z

Using
https://github.com/mozilla/pdf.js/blob/master/test/pdfs/bug1065245.pdf
as test input gives 8 images, all looking like:
[image not preserved]

pubpub-zz · 2024-05-07T09:25:28Z

new test:
https://github.com/mozilla/pdf.js/blob/master/test/pdfs/bug1065245.pdf

image 0: [image not preserved]

closes py-pdf#2629

pubpub-zz · 2024-05-09T12:40:09Z

@stefan6419846, @MartinThoma, @MasterOdin
In order to improve test coverage, I'm looking for PDF files with inline images. Can you provide me with some?

stefan6419846 · 2024-05-09T12:59:21Z

It is rather unlikely that I am able to provide further (real) files for testing. While I have access to lots of PDF files in theory, most of them are confidential. Additionally, the usual routines I use completely omit any inline images as they provide no value for my use cases.

If possible, I would indeed prefer to avoid two different implementations of the same filter algorithms - we already have sufficient coverage for the filters outside the inline image extraction and thus re-using them would make more sense and avoid larger coverage issues.
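The reuse argument can be illustrated with a small sketch. This is not pypdf's code: the registry, the function name `decode` and the stdlib decoders used as stand-ins are assumptions made for illustration. The only fact taken from the PDF specification is that inline images may use abbreviated filter names such as `/Fl` and `/AHx`.

```python
import zlib
from typing import Callable, Dict, List


def _ascii_hex_decode(data: bytes) -> bytes:
    # Simplified: assumes an even number of hex digits before the ">" marker.
    return bytes.fromhex(data.decode("ascii").strip().rstrip(">"))


# Hypothetical registry: one decoder per filter name, shared by every caller.
DECODERS: Dict[str, Callable[[bytes], bytes]] = {
    "/FlateDecode": zlib.decompress,
    "/ASCIIHexDecode": _ascii_hex_decode,
}

# Abbreviated names that may appear in inline image dictionaries.
ABBREVIATIONS = {"/Fl": "/FlateDecode", "/AHx": "/ASCIIHexDecode"}


def decode(data: bytes, filters: List[str]) -> bytes:
    """Apply the filters in order, resolving abbreviations to the shared decoders."""
    for name in filters:
        data = DECODERS[ABBREVIATIONS.get(name, name)](data)
    return data
```

With this layout, a stream object and an inline image both end up in the same decoder, so tests written for one path also exercise the other.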

pubpub-zz · 2024-05-09T15:15:13Z

@stefan6419846
thanks for trying.
There are not really any changes in the filter/image processing. The change only improves data extraction from contents.
Will try to find a way to generate some data.
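As an illustration of what "data extraction from contents" involves, here is a minimal sketch that generates an unfiltered inline image and reads it back. It is not the code of this PR: `build_inline_image` and `extract_inline_image` are hypothetical helpers, and only the abbreviated colour spaces `/G`, `/RGB` and `/CMYK` are handled. The data length is computed from the image settings instead of searching for `EI`, because the binary samples may themselves contain those two bytes.

```python
from typing import Dict, Tuple

# Components per pixel for the abbreviated colour space names used in inline images.
COMPONENTS = {"/G": 1, "/RGB": 3, "/CMYK": 4}


def build_inline_image(width: int, height: int, colorspace: str, samples: bytes) -> bytes:
    """Build an unfiltered inline image (8 bits per component) as content-stream bytes."""
    header = f"BI /W {width} /H {height} /CS {colorspace} /BPC 8 ID ".encode("ascii")
    return header + samples + b"\nEI\n"


def extract_inline_image(content: bytes) -> Tuple[Dict[str, str], bytes]:
    """Read the first inline image, sizing the data from its settings."""
    start = content.index(b"BI")
    id_pos = content.index(b" ID ", start)
    tokens = content[start + 2 : id_pos].decode("ascii").split()
    settings = dict(zip(tokens[0::2], tokens[1::2]))
    bits_per_row = int(settings["/W"]) * COMPONENTS[settings["/CS"]] * int(settings["/BPC"])
    # Each row is padded to a whole number of bytes.
    length = (bits_per_row + 7) // 8 * int(settings["/H"])
    data_start = id_pos + len(b" ID ")
    data = content[data_start : data_start + length]
    if not content[data_start + length :].lstrip().startswith(b"EI"):
        raise ValueError("EI operator not found after the image data")
    return settings, data
```

This sizing only works for unfiltered data. Once a filter is applied, the encoded length is no longer given by the settings, and the end of the data has to be found in a filter-specific way.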

pubpub-zz · 2024-05-11T13:29:18Z

Added a test file for RL (RunLengthDecode) encoding:
RL.pdf
image: [not preserved]
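For readers unfamiliar with the RL filter, the decoding rule from the PDF specification can be sketched as follows. This is an illustrative standalone function, not the implementation in pypdf; the name `run_length_decode` is chosen here for the example.

```python
def run_length_decode(data: bytes) -> bytes:
    """Decode RunLengthDecode data as described in the PDF specification."""
    out = bytearray()
    i = 0
    while i < len(data):
        length = data[i]
        i += 1
        if length == 128:
            # 128 marks end of data; anything after it is ignored.
            break
        if length < 128:
            # 0..127: copy the next length + 1 bytes literally.
            out += data[i : i + length + 1]
            i += length + 1
        else:
            # 129..255: repeat the next byte 257 - length times.
            out += data[i : i + 1] * (257 - length)
            i += 1
    return bytes(out)
```

Because the end-of-data marker is part of the encoding, a decoder of this kind also tells the caller where the encoded data stops, which is useful when the data sits inside a content stream.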

stefan6419846 · 2024-05-20T07:58:38Z

Besides the above remarks, there are two bigger issues I would like to talk about:

1. Importing pypdf._xobj_image_helpers without Pillow is marked as deprecated. Why? Does this mean you want to completely drop support for installations without Pillow?
2. Could we please use more verbose names for the filter methods, with all-lowercase function names as usually recommended by PEP 8?
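On the first point, one common way to keep an installation without an optional dependency usable is to import it lazily and fail with a clear message only when the feature is actually requested. The sketch below shows that pattern in general terms. `require_optional` is a hypothetical helper, not a pypdf function, and the sketch says nothing about how pypdf handles Pillow internally.

```python
import importlib
from types import ModuleType


def require_optional(module_name: str, feature: str) -> ModuleType:
    """Import an optional dependency on demand, with a helpful error if it is missing."""
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        raise ImportError(
            f"{feature} requires the optional dependency '{module_name}'."
        ) from exc
```

A caller would request the module only inside the code path that needs it, so importing the package itself never fails because the optional dependency is absent.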

pubpub-zz · 2024-05-27T07:33:33Z

@stefan6419846
Can you tell me if I'm wrong? The import should be covered by test_image_without_pillow within test_filters.py. Am I wrong?

stefan6419846 · 2024-05-27T07:34:50Z

Yes, this is correct.
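For context, a test can simulate a missing optional dependency without uninstalling it by placing `None` in `sys.modules`, which makes the import statement raise `ImportError`. The sketch below uses the stdlib module `colorsys` as a stand-in for Pillow so that it runs anywhere. Whether test_image_without_pillow uses this exact technique is not stated in this thread, and the function names are hypothetical.

```python
import sys
from unittest import mock


def image_support_available() -> bool:
    """Stand-in for code that needs an optional dependency (colorsys here, not Pillow)."""
    try:
        import colorsys  # noqa: F401
    except ImportError:
        return False
    return True


def check_without_dependency() -> bool:
    """Run the check while the dependency is made to look uninstalled."""
    # A None entry in sys.modules makes "import colorsys" raise ImportError.
    with mock.patch.dict(sys.modules, {"colorsys": None}):
        return image_support_available()
```

`mock.patch.dict` restores `sys.modules` on exit, so later tests see the real module again.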

@pubpub-zz

## What's new

### New Features (ENH)
- Accept ETen-B5 and UniCNS-UTF16 encodings (#2721) by @pubpub-zz
- Add decode_as_image() to ContentStreams (#2615) by @pubpub-zz
- context manager for PdfReader (#2666) by @tibor-reiss
- Add capability to set font and size in fields (#2636) by @pubpub-zz
- Allow to pass input file without named argument (#2576) by @pubpub-zz

### Bug Fixes (BUG)
- Fix deprecation for Ressources when using old constants (#2705) by @stefan6419846
- Fix images issue 4 bits encoding and LUT starting with UTF16_BOM (#2675) by @pubpub-zz
- Reading large compressed images takes huge time to process (#2644) by @snanda85
- Highlighted Text Cannot Be Printed (#2604) by @Nifury
- Fix UnboundLocalError on malformed pdf (#2619) by @farjasju

### Documentation (DOC)
- Various improvements on docstrings and examples by @j-t-1

### Robustness (ROB)
- Cope with missing Standard 14 fonts in fields (#2677) by @pubpub-zz
- Improve inline image extraction (#2622) by @pubpub-zz
- Cope with loops in Fields tree (#2656) by @pubpub-zz
- Discard /I in choice fields for compatibility with Acrobat (#2614) by @pubpub-zz
- Cope with some issues in pillow (#2595) by @pubpub-zz
- Cope with some image extraction issues (#2591) by @pubpub-zz

### Maintenance (MAINT)
- Deprecate interiour_color with replacement interior_color (#2706) by @j-t-1
- Add deprecate_with_replacement to PdfWriter.find_bookmark (#2674) by @j-t-1

### Code Style (STY)
- Change Link to be a non-markup annotation (#2714) by @j-t-1

[Full Changelog](4.2.0...4.3.0)

stefan6419846 · 2024-11-12T14:06:57Z

@pubpub-zz In the meantime, I have stumbled upon some PDF files where I see side effects of this PR. There is no explicit inline image parsing involved, but apparently some bad interaction. See #2927 for more details.

pubpub-zz added 2 commits May 3, 2024 23:04

ROB: improve inline image extraction

b449664

closes py-pdf#2598

fix

44b41a7

complete testing

0952fee

pubpub-zz marked this pull request as draft May 4, 2024 14:47

pubpub-zz added 3 commits May 5, 2024 22:17

complete test

0ba5ae4

tests

fdbc092

fix

fd57ef7

pubpub-zz added 9 commits May 7, 2024 12:09

fix DCT

70f9c02

Fix A85

8996a73

Merge remote-tracking branch 'origin/iss2598' into iss2598

fd6334e

blank

5b38f34

with new link

67d51ea

Merge branch 'pb_stanford' into iss2598

9fb0974

fix test

092e2a5

BUG: Incorrect number of inline images

c5d62a3

closes py-pdf#2629

Merge branch 'iss2629' into iss2598

ae93628

pubpub-zz mentioned this pull request May 8, 2024

BUG: Incorrect number of inline images #2630

Closed

add test for RL + fix

51bea2c

pubpub-zz added 4 commits May 11, 2024 16:25

remove encode as not used for the moment

bd84496

Fix + Test

770aaba

test+fix

a37b73f

test

184e141

stefan6419846 reviewed May 20, 2024

pypdf/filters.py (outdated, resolved)

stefan6419846 reviewed May 20, 2024

pypdf/generic/_data_structures.py (outdated, resolved)

pubpub-zz and others added 7 commits May 20, 2024 10:36

Update pypdf/_page.py

2874e56

Co-authored-by: Stefan <96178532+stefan6419846@users.noreply.github.com>

Update pypdf/_page.py

81e1f30

Co-authored-by: Stefan <96178532+stefan6419846@users.noreply.github.com>

Update pypdf/_page.py

90fe459

Co-authored-by: Stefan <96178532+stefan6419846@users.noreply.github.com>

Update pypdf/_page.py

54e4c1d

Co-authored-by: Stefan <96178532+stefan6419846@users.noreply.github.com>

Update pypdf/generic/_data_structures.py

d9841dd

Co-authored-by: Stefan <96178532+stefan6419846@users.noreply.github.com>

Update pypdf/generic/_data_structures.py

ecdba02

Co-authored-by: Stefan <96178532+stefan6419846@users.noreply.github.com>

update from comments

ae9fdfc

pubpub-zz requested a review from stefan6419846 May 20, 2024 10:22

Merge branch 'main' into iss2598

5347820

stefan6419846 reviewed May 20, 2024

pypdf/generic/_image_inline.py (outdated, resolved)

pubpub-zz added 3 commits May 20, 2024 18:10

Update _data_structures.py

bcabdc8

Update _image_inline.py

dc045b6

Update test_generic.py

9c03aa7

stefan6419846 reviewed May 26, 2024

pypdf/_xobj_image_helpers.py (outdated, resolved)

stefan6419846 reviewed May 26, 2024

tests/test_workflows.py (outdated, resolved)

stefan6419846 reviewed May 26, 2024

pypdf/generic/_image_inline.py (outdated, resolved)

pubpub-zz added 5 commits May 26, 2024 23:13

Update test_workflows.py

a569598

Update _image_inline.py

a52541e

Update _image_inline.py

cfe61a9

Merge branch 'main' into iss2598

54399d7

remove coverage ignore on PIL import

7be1fd6

stefan6419846 approved these changes May 27, 2024

stefan6419846 merged commit 23a81ba into py-pdf:main May 27, 2024
16 checks passed
