ROB: ignore faulty trailing newline during RLE decoding #3355

henningkoertelgmg · 2025-07-03T12:46:10Z

Found PDFs from Dalim software with multi-encoded streams: the inner stream is RLE, the outer stream is FLATE. The inner stream contains a trailing newline character that breaks RLE decoding. It seems that some Dalim versions had a systematic error: the bytes of the inner stream were taken straight from the raw PDF bytes, including the trailing newline before "endstream". This change fixes the problem by ignoring the trailing newline and raising a warning instead of an exception.
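The behaviour described above can be illustrated with a minimal, self-contained sketch. This is not the code merged in this PR: `rle_decode_lenient` is a hypothetical helper, and `warnings.warn` and `ValueError` stand in for whatever logging and exception types the library actually uses. It assumes the faulty stream carries exactly one `\n` after the EOD marker (byte 128).

```python
import warnings
import zlib


def rle_decode_lenient(data: bytes) -> bytes:
    """Decode RunLengthDecode data, tolerating one stray newline after EOD.

    Hypothetical helper for illustration only.
    """
    out = bytearray()
    index = 0
    size = len(data)
    while index < size:
        length = data[index]
        index += 1
        if length == 128:  # EOD marker
            trailing = data[index:]
            if trailing == b"\n":
                # Assumed faulty-producer case: warn and ignore the newline.
                warnings.warn("Ignoring trailing newline after RLE EOD marker")
            elif trailing:
                raise ValueError("Unexpected data after RLE EOD marker")
            return bytes(out)
        if length < 128:  # literal run: copy the next length + 1 bytes
            count = length + 1
            out += data[index:index + count]
            index += count
        else:  # repeated run: the next byte is repeated 257 - length times
            out += data[index:index + 1] * (257 - length)
            index += 1
    warnings.warn("Missing EOD marker in RLE data")
    return bytes(out)


# Inner stream: literal "ab" (length byte 1), "c" repeated 3 times
# (length byte 254), EOD (128), plus the faulty trailing newline.
inner = bytes([1]) + b"ab" + bytes([254]) + b"c" + bytes([128]) + b"\n"
outer = zlib.compress(inner)  # outer FLATE layer

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    decoded = rle_decode_lenient(zlib.decompress(outer))
```

Here the outer FLATE layer is undone first, and the lenient RLE step then yields the payload while recording a single warning; any other bytes after the EOD marker still raise an error.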

henningkoertelgmg · 2025-07-03T13:37:49Z

test_data_rle.txt

codecov · 2025-07-03T13:50:42Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 96.83%. Comparing base (6b52a0d) to head (d2a2810).
Report is 2 commits behind head on main.

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #3355      +/-   ##
==========================================
+ Coverage   96.76%   96.83%   +0.07%     
==========================================
  Files          54       54              
  Lines        9076     9094      +18     
  Branches     1676     1677       +1     
==========================================
+ Hits         8782     8806      +24     
+ Misses        176      172       -4     
+ Partials      118      116       -2

tests/test_filters.py

pypdf/filters.py

stefan6419846

Thanks.

henningkoertelgmg · 2025-07-04T09:09:20Z

Again, it was a pleasure, and I appreciate getting a more robust PDF tool ;-)

## What's new

### New Features (ENH)
- Implement flattening for writer (#3312) by @PJBrs

### Bug Fixes (BUG)
- Unterminated object when using PdfWriter with incremental=True (#3345) by @m32

### Robustness (ROB)
- Resolve some image extraction edge cases (#3371) by @stefan6419846
- Ignore faulty trailing newline during RLE decoding (#3355) by @henningkoertelgmg
- Gracefully handle odd-length strings in parse_bfchar (#3348) by @stefan6419846

### Developer Experience (DEV)
- Modernize license specifiers (#3338) by @stefan6419846

### Maintenance (MAINT)
- Reduce max-complexity of tool.ruff.lint.mccabe (#3365) by @j-t-1
- Refactor text extraction code by @MartinThoma

[Full Changelog](5.7.0...5.8.0)

TST: added URL for test file

1692c07

TST: add PR number in doc string + code cleanup

038b1e9

henningkoertelgmg marked this pull request as ready for review July 3, 2025 14:22

stefan6419846 reviewed Jul 3, 2025

tests/test_filters.py (resolved)

stefan6419846 reviewed Jul 3, 2025

tests/test_filters.py (outdated, resolved)

stefan6419846 reviewed Jul 3, 2025

pypdf/filters.py (outdated, resolved)

stefan6419846 reviewed Jul 3, 2025

pypdf/filters.py (outdated, resolved)

RLE decoding: Introduced new var for data length + new test for EOD exception

d2a2810

henningkoertelgmg requested a review from stefan6419846 July 4, 2025 08:09

stefan6419846 approved these changes Jul 4, 2025

stefan6419846 merged commit 442e8d5 into py-pdf:main Jul 4, 2025
16 checks passed

henningkoertelgmg deleted the ROB-RLE-decoding branch July 4, 2025 09:09
