No nodes are extracted from some PDFs #85

@faileon

Initial Checks

  • I confirm that I'm on the latest version

Description

When I split my PDF in Firefox to get a smaller PDF (e.g. the first 10 pages), openparse won't extract any nodes from the resulting file. The original PDF is extracted fine.

image

When I specify table_args, the parser does return some nodes, but all of them are identified as tables.
image

I am attaching the PDF; perhaps someone could have a look at what's wrong.
concept-vp4360-cz.pdf
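
For anyone trying to reproduce this, here is a rough sketch of how the two cases above could be compared. It is not the exact code from the report: the helper names `summarize_variants` and `parse_and_summarize` are made up for illustration, the `table_args` value is an assumption (the report does not say which arguments were used), and it assumes each returned node exposes a `variant` collection of labels such as "text" or "table".

```python
from collections import Counter


def summarize_variants(nodes):
    """Count how many nodes carry each variant label (e.g. "text", "table")."""
    counts = Counter()
    for node in nodes:
        # Assumption: each node exposes a `variant` collection of labels.
        for label in getattr(node, "variant", ()):
            counts[label] += 1
    return dict(counts)


def parse_and_summarize(pdf_path, table_args=None):
    """Parse a PDF and return (number of nodes, per-variant counts)."""
    # Imported lazily so summarize_variants works without open-parse installed.
    import openparse

    parser = openparse.DocumentParser(table_args=table_args)
    parsed = parser.parse(pdf_path)
    return len(parsed.nodes), summarize_variants(parsed.nodes)
```

Calling `parse_and_summarize("concept-vp4360-cz.pdf")` and then again with something like `table_args={"parsing_algorithm": "pymupdf"}` (an assumed value) should make the difference visible: no nodes in the first case, and only table nodes in the second, if the behaviour matches what is described above.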

Example Code

No response

Python, open-parse & OS Version

python_version: 3.12.7
operating_system: Linux
os_version: 6.11.8-arch1-2
open-parse version: 0.7.0
python version: 3.12.7 (main, Oct  1 2024, 11:15:50) [GCC 14.2.1 20240910]
platform: Linux-6.11.8-arch1-2-x86_64-with-glibc2.40
related packages: torchvision-0.20.1 tokenizers-0.20.3 torch-2.5.1 pydantic-2.9.2 PyMuPDF-1.24.13 transformers-4.46.2

Labels: bug (Something isn't working)
