Unable to consistently extract field labels from PDFs #3950

Closed
@Rutvik-Trivedi

Description of the bug

For my use case, I am trying to extract widget.field_label from PDF files. I tried this with two PDFs: the field labels are extracted successfully from one, but not from the other. If it helps, I used Master PDF Editor to add the field labels to the PDFs.

This is the PDF from which I can extract the field labels of all the widgets:
working sample.pdf

This is the PDF from which I cannot extract the field labels, even after adding the labels:
not working sample.pdf

Is this a PDF- or editor-level nuance, or a bug?

How to reproduce the bug

Reproducing the problem should be fairly simple:

import fitz
doc = fitz.Document("working sample.pdf")  # Or "not working sample.pdf"
for page in doc:
    for widget in page.widgets():
        print(widget.field_label)

PDF files:
working sample.pdf
not working sample.pdf

For working sample.pdf, I get the following output:

{{ firstName }}
{{ lastName }}
{{ address.street }}
{{ address.apt }}
{{ address.zipcode }}
{{ address.city }}
{{ spirit }}
{{ today }}
{{ evil | check }}
{{ language.french | X }}
{{ language.esperento | X }}
{{ language.latin | X }}
{{ sig | paste }}

This is correct and expected: it covers all the available field labels.

For not working sample.pdf, I get the following output:

""
None
None
None

But the expected output for not working sample.pdf should be (not necessarily in the same order):

{{ named_insured }}
{{ insurance_line }}
{{ policy_period_start_date }}
{{ policy_period_end_date }}

These are all the field labels available in the PDF.
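One way to narrow this down is to check whether the labels are stored on a parent field object instead of on the widget itself. The sketch below is a diagnostic under assumptions, not a confirmed explanation: it assumes that field_label corresponds to the PDF /TU key and that the editor may have written /TU on an ancestor reached through /Parent. The helper names inherited_key and print_inherited_labels are hypothetical. Note that string values returned by xref_get_key may still be in raw PDF encoding.

```python
def inherited_key(doc, xref, key, max_depth=32):
    """Return the first value found for `key`, starting at `xref` and
    walking up the /Parent chain; None if no ancestor has it.

    `doc` only needs an xref_get_key(xref, key) method that returns a
    (type, value) tuple, as PyMuPDF's Document does.
    """
    for _ in range(max_depth):  # guard against cyclic /Parent references
        kind, value = doc.xref_get_key(xref, key)
        if kind != "null":
            return value
        kind, value = doc.xref_get_key(xref, "Parent")
        if kind != "xref":
            return None  # reached the top of the field hierarchy
        xref = int(value.split()[0])  # "12 0 R" -> 12
    return None


def print_inherited_labels(path):
    # PyMuPDF is imported here so the helper above has no hard dependency on it.
    import fitz

    doc = fitz.open(path)
    for page in doc:
        for widget in page.widgets():
            # Compare what PyMuPDF reports with what the field hierarchy holds.
            print(
                widget.field_name,
                repr(widget.field_label),
                repr(inherited_key(doc, widget.xref, "TU")),
            )
```

Calling print_inherited_labels("not working sample.pdf") would show, per widget, the reported label next to any /TU value found higher up. If the third column contains the expected labels while the second does not, the labels are in the file but not on the widget objects themselves.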

PyMuPDF version

1.24.1

Operating system

Linux

Python version

3.10
