ROB: ignore faulty trailing newline during RLE decoding #3355

Merged
stefan6419846 merged 4 commits into py-pdf:main from ROB-RLE-decoding
Jul 4, 2025

Contributor

@henningkoertelgmg henningkoertelgmg commented Jul 3, 2025

Found PDFs produced by Dalim software with multi-encoded streams: the inner stream is RLE, the outer stream is FLATE. The inner stream contains a trailing newline character that breaks RLE decoding. It seems that some Dalim versions had a systematic error that copied the bytes of the inner stream straight from the raw PDF bytes, including the trailing newline before "endstream". This change fixes the problem by ignoring the trailing newline and emitting a warning instead of raising an exception.
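The behavior described above can be illustrated with a minimal standalone sketch. This is not the pypdf implementation: the function name `rle_decode`, the use of `ValueError`, and the warning text are assumptions chosen for the example. It follows the usual RunLengthDecode rules (a length byte below 128 starts a literal run, above 128 a repeat run, and 128 marks end of data) and tolerates exactly one newline byte after the end-of-data marker.

```python
import warnings
import zlib

EOD = 128  # end-of-data marker in RunLengthDecode streams


def rle_decode(data: bytes) -> bytes:
    """Decode RunLengthDecode data, tolerating one stray newline after EOD.

    Hypothetical helper for illustration; not pypdf's actual code.
    """
    out = bytearray()
    i = 0
    while i < len(data):
        length = data[i]
        i += 1
        if length == EOD:
            rest = data[i:]
            if rest == b"\n":
                # Faulty trailing newline: warn and ignore it.
                warnings.warn("ignoring trailing newline after RLE EOD", stacklevel=2)
            elif rest:
                # Any other bytes after EOD are still treated as an error.
                raise ValueError("early EOD in RLE data")
            break
        if length < EOD:
            # Literal run: copy the next length + 1 bytes.
            out += data[i : i + length + 1]
            i += length + 1
        else:
            # Repeat run: repeat the next byte 257 - length times.
            out += data[i : i + 1] * (257 - length)
            i += 1
    return bytes(out)


# Inner RLE stream with the stray newline, wrapped in an outer FLATE layer.
inner = bytes([2]) + b"abc" + bytes([254]) + b"x" + bytes([EOD]) + b"\n"
outer = zlib.compress(inner)

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    decoded = rle_decode(zlib.decompress(outer))

print(decoded)      # b'abcxxx'
print(len(caught))  # 1
```

Only the single-newline case is downgraded to a warning here, so other unexpected bytes after the end-of-data marker still fail loudly.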

@henningkoertelgmg
Contributor Author

test_data_rle.txt

codecov bot commented Jul 3, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 96.83%. Comparing base (6b52a0d) to head (d2a2810).
Report is 2 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #3355      +/-   ##
==========================================
+ Coverage   96.76%   96.83%   +0.07%     
==========================================
  Files          54       54              
  Lines        9076     9094      +18     
  Branches     1676     1677       +1     
==========================================
+ Hits         8782     8806      +24     
+ Misses        176      172       -4     
+ Partials      118      116       -2     

@henningkoertelgmg henningkoertelgmg marked this pull request as ready for review July 3, 2025 14:22
Collaborator

@stefan6419846 stefan6419846 left a comment

Thanks.

@stefan6419846 stefan6419846 merged commit 442e8d5 into py-pdf:main Jul 4, 2025
16 checks passed
@henningkoertelgmg
Contributor Author

Again, it was a pleasure, and I appreciate getting a more robust PDF tool ;-)

@henningkoertelgmg henningkoertelgmg deleted the ROB-RLE-decoding branch July 4, 2025 09:09
stefan6419846 added a commit that referenced this pull request Jul 13, 2025
## What's new

### New Features (ENH)
- Implement flattening for writer (#3312) by @PJBrs

### Bug Fixes (BUG)
- Unterminated object when using PdfWriter with incremental=True (#3345) by @m32

### Robustness (ROB)
- Resolve some image extraction edge cases (#3371) by @stefan6419846
- Ignore faulty trailing newline during RLE decoding (#3355) by @henningkoertelgmg
- Gracefully handle odd-length strings in parse_bfchar (#3348) by @stefan6419846

### Developer Experience (DEV)
- Modernize license specifiers (#3338) by @stefan6419846

### Maintenance (MAINT)
- Reduce max-complexity of tool.ruff.lint.mccabe (#3365) by @j-t-1
- Refactor text extraction code by @MartinThoma

[Full Changelog](5.7.0...5.8.0)
larsga pushed a commit to larsga/pypdf that referenced this pull request Jul 21, 2025
larsga pushed a commit to larsga/pypdf that referenced this pull request Jul 21, 2025