
Add streaming PDF page loading and improve PDF rendering #76

Merged
JaredforReal merged 2 commits into zai-org:main from ShayDuane:feature/pdf on Feb 9, 2026

@ShayDuane
Collaborator

  • Add iterator-based page loading in PageLoader and use it in the async pipeline data-loading thread
  • Introduce pdf_to_images_pil_iter and make PDF rendering more robust with proper resource cleanup in image_utils

On branch feature/pdf
Changes to be committed:
    modified: glmocr/dataloader/page_loader.py
    modified: glmocr/pipeline/pipeline.py
    modified: glmocr/utils/image_utils.py

Contribution Guide

We welcome your contributions to this repository. To ensure elegant code style and better code quality, we have prepared
the following contribution guidelines.

What We Accept

  • This PR fixes a typo or improves the documentation (if this is the case, you may skip the other checks).
  • This PR fixes a specific issue — please reference the issue number in the PR description. Make sure your code strictly
    follows the coding standards below.
  • This PR introduces a new feature — please clearly explain the necessity and implementation of the feature. Make sure
    your code strictly follows the coding standards below.

Code Style Guide

Good code style is an art. We have prepared a pre-commit hook to enforce consistent code
formatting across the project. You can clean up your code following the steps below:

pre-commit run --all-files

If your code complies with the standards, you should not see any errors.

Naming Conventions

  • Please use English for naming; do not use Pinyin or other languages. All comments should also be in English.
  • Follow PEP8 naming conventions strictly, and use underscores to separate words. Avoid meaningless names such as
    a, b, c.

Contributor

Copilot AI left a comment

Pull request overview

This PR adds streaming (iterator-based) PDF page loading so the async pipeline can begin layout/recognition work before the full PDF is rendered, and it updates PDF rendering utilities to more aggressively clean up pdfium resources.

Changes:

  • Add pdf_to_images_pil_iter() generator and update pdf_to_images_pil() to explicitly close pages/documents.
  • Add PageLoader.iter_pages_with_unit_indices() streaming API (with PDF streaming support via _iter_pdf).
  • Switch the async pipeline data-loading thread to consume the new streaming API.
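The generator-plus-cleanup pattern described in the first bullet can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: `FakeDocument`, `FakePage`, and `render_pages_iter` are hypothetical stand-ins for the pdfium document and page handles and for `pdf_to_images_pil_iter`, chosen so the example runs with the standard library alone.

```python
from typing import Iterator, List


class FakePage:
    """Hypothetical stand-in for a page handle that must be closed."""

    def __init__(self, number: int, log: List[str]) -> None:
        self.number = number
        self._log = log

    def render(self) -> str:
        # A real renderer would return an image; a string is enough here.
        return f"image-{self.number}"

    def close(self) -> None:
        self._log.append(f"close page {self.number}")


class FakeDocument:
    """Hypothetical stand-in for an open PDF document."""

    def __init__(self, page_count: int, log: List[str]) -> None:
        self.page_count = page_count
        self._log = log

    def get_page(self, number: int) -> FakePage:
        return FakePage(number, self._log)

    def close(self) -> None:
        self._log.append("close document")


def render_pages_iter(doc: FakeDocument) -> Iterator[str]:
    """Yield rendered pages one at a time and release handles promptly."""
    try:
        for number in range(doc.page_count):
            page = doc.get_page(number)
            try:
                image = page.render()
            finally:
                # Close the page before yielding so no handle is held
                # while the consumer works on the image.
                page.close()
            yield image
    finally:
        # Runs on exhaustion, on an error, or when the consumer stops early
        # and the generator is closed.
        doc.close()
```

Closing each page before the `yield` means a slow consumer never holds a page handle open, and the outer `finally` closes the document even if the caller abandons the iteration partway through.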

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 6 comments.

| File | Description |
| --- | --- |
| glmocr/utils/image_utils.py | Adds generator-based PDF rendering and introduces explicit page/document cleanup. |
| glmocr/dataloader/page_loader.py | Adds streaming page iteration with unit indices; adds streaming PDF path. |
| glmocr/pipeline/pipeline.py | Uses the streaming iterator in the async loading thread to enqueue pages as they arrive. |
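To illustrate how a loading thread might enqueue pages as they arrive, here is a small sketch using only `threading` and `queue`. It is not the pipeline code from this PR: `fake_iter_pages`, `loader_thread`, and `run_pipeline` are hypothetical names, and the reading of "unit index" as the position of the input file that a page came from is an assumption.

```python
import queue
import threading
from typing import Iterable, Iterator, List, Tuple

_DONE = object()  # sentinel that marks the end of the page stream


def fake_iter_pages(units: Iterable[List[str]]) -> Iterator[Tuple[int, str]]:
    """Hypothetical stand-in for a streaming page iterator.

    Lazily yields (unit_index, page) pairs; the meaning of unit_index
    (index of the input unit a page belongs to) is an assumption.
    """
    for unit_index, pages in enumerate(units):
        for page in pages:
            yield unit_index, page


def loader_thread(units: Iterable[List[str]], page_queue: queue.Queue) -> None:
    """Push pages into the queue as soon as each one is produced."""
    try:
        for item in fake_iter_pages(units):
            page_queue.put(item)
    finally:
        # Always signal completion so the consumer cannot block forever.
        page_queue.put(_DONE)


def run_pipeline(units: Iterable[List[str]]) -> List[Tuple[int, str]]:
    """Consume pages while the loader is still producing them."""
    # A bounded queue applies backpressure to the loader.
    page_queue: queue.Queue = queue.Queue(maxsize=2)
    worker = threading.Thread(
        target=loader_thread, args=(units, page_queue), daemon=True
    )
    worker.start()

    results = []
    while True:
        item = page_queue.get()
        if item is _DONE:
            break
        unit_index, page = item
        # Placeholder for layout/recognition work on a single page.
        results.append((unit_index, page.upper()))
    worker.join()
    return results
```

With this shape, the consumer starts processing the first page while later pages are still being loaded, and the bounded queue keeps the number of pages held in memory small.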

@ShayDuane force-pushed the feature/pdf branch 2 times, most recently from c893ab1 to 34b86ce on February 9, 2026 08:46
Signed-off-by: Shay <d18833276858@gmail.com>
@JaredforReal merged commit 529a0c7 into zai-org:main on Feb 9, 2026
2 checks passed