Description
Description of the bug
For my use case, I am trying to extract the widget.field_label field from a PDF file. I tried this with two PDFs: the field labels are extracted successfully from one, but not from the other. In case it helps, I used Master PDF Editor to add the field labels to the PDFs.
This is the PDF for which I am able to extract the field labels from all the widgets -
working sample.pdf
This is the PDF for which I am not able to extract the field labels even after adding the labels -
not working sample.pdf
Is this a PDF- or editor-level nuance, or a bug?
How to reproduce the bug
The reproduction of the problem should be fairly simple:
import fitz

doc = fitz.Document("working sample.pdf")  # Or "not working sample.pdf"
for page in doc:
    for widget in page.widgets():
        print(widget.field_label)
PDF files:
working sample.pdf
not working sample.pdf
For working sample.pdf, I get the following output:
{{ firstName }}
{{ lastName }}
{{ address.street }}
{{ address.apt }}
{{ address.zipcode }}
{{ address.city }}
{{ spirit }}
{{ today }}
{{ evil | check }}
{{ language.french | X }}
{{ language.esperento | X }}
{{ language.latin | X }}
{{ sig | paste }}
This is correct and expected. It covers all the available field labels.
For not working sample.pdf, I get the following output:
""
None
None
None
But the expected output for not working sample.pdf should be (not necessarily in the same order):
{{ named_insured }}
{{ insurance_line }}
{{ policy_period_start_date }}
{{ policy_period_end_date }}
These are all the available field labels in the PDF.
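To help narrow this down, here is a rough diagnostic sketch. It assumes, without confirmation, that the label is stored under the PDF key /TU and that in the failing file it may sit on a parent field object rather than on the widget itself. The helper names inherited_key and dump_labels are hypothetical; the only PyMuPDF calls used are Document.xref_get_key, Page.widgets, Widget.xref and Widget.field_label.

```python
def inherited_key(doc, xref, key, max_depth=32):
    """Return the value of `key` on object `xref`, or on the nearest
    ancestor reached through /Parent links; None if it is not found."""
    for _ in range(max_depth):  # depth limit guards against cyclic /Parent chains
        kind, value = doc.xref_get_key(xref, key)
        if kind != "null":
            return value
        kind, value = doc.xref_get_key(xref, "Parent")
        if kind != "xref":
            return None  # no parent left to inspect
        xref = int(value.split()[0])  # "12 0 R" -> 12
    return None


def dump_labels(path):
    """Print, per widget: its xref, field_label, and the /TU value found
    on the widget or on one of its ancestors (assumption: /TU is the label)."""
    import fitz  # PyMuPDF, imported here so inherited_key has no hard dependency

    doc = fitz.open(path)
    for page in doc:
        for widget in page.widgets():
            print(
                widget.xref,
                repr(widget.field_label),
                repr(inherited_key(doc, widget.xref, "TU")),
            )
```

Calling dump_labels("not working sample.pdf") would show whether a /TU entry exists somewhere up the field hierarchy even though widget.field_label comes back empty. If the third column is also None, the labels are probably stored somewhere else entirely.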
PyMuPDF version
1.24.1
Operating system
Linux
Python version
3.10