PdfReader method _get_outlines() can produce outline items with incorrect "/Title" #1121

Closed
mtd91429 opened this issue Jul 17, 2022 · 2 comments
Labels
is-bug: From a user's perspective, this is a bug - a violation of the expected behavior with a compliant PDF
MCVE in Tests: The MCVE was added to the PyPDF2 test suite

Comments

@mtd91429
Contributor

mtd91429 commented Jul 17, 2022

When reading outlines from a PDF whose bookmarks were copied, pasted, and then retitled, some of the returned outline items carry the wrong title.

Environment

Which environment were you using when you encountered the problem?

$ python -m platform
Windows-10-10.0.19044-SP0

$ python -c "import PyPDF2;print(PyPDF2.__version__)"
2.5.0 (commit ed5ecd9)
Python 3.10

Code + PDF

This is a minimal, complete example that shows the issue:

from PyPDF2 import PdfReader

reader = PdfReader("mistitled_outlines_example.pdf")


def show_tree(outlines, indent=0):
    # A nested list holds the children of the outline item that precedes it.
    for item in outlines:
        if isinstance(item, list):
            show_tree(item, indent + 4)
        else:
            print(f'{" " * indent}{item.title}')


show_tree(reader.outlines)

The actual output is:

Nineteenth
    Twentieth
    Twenty-first
    Twenty-second
        Twenty-third
        Twenty-fourth
    Twenty-fifth
        Twenty-sixth
        Twenty-seventh
Nineteenth
    Twentieth
    Twenty-first
    Twenty-second
    Twenty-third
Twenty-fourth
    Twenty-fifth
    Twenty-sixth
Twenty-seventh
Nineteenth
    Twentieth
    Twenty-first
    Twenty-second
    Twenty-third
    Twenty-fourth
    Twenty-fifth
    Twenty-sixth
    Twenty-seventh

The expected output for this file is:

First
    Second
    Third
    Fourth
        Fifth
        Sixth
    Seventh
        Eighth
        Ninth
Tenth
    Eleventh
    Twelfth
    Thirteenth
    Fourteenth
Fifteenth
    Sixteenth
    Seventeenth
Eighteenth
Nineteenth
    Twentieth
    Twenty-first
    Twenty-second
    Twenty-third
    Twenty-fourth
    Twenty-fifth
    Twenty-sixth
    Twenty-seventh

Here is the PDF:
mistitled_outlines_example.pdf

Here is a screenshot of the outline in Adobe Acrobat Reader:

image

@MartinThoma added the labels is-bug (from a user's perspective, this is a bug - a violation of the expected behavior with a compliant PDF) and Has MCVE (a minimal, complete and verifiable example helps a lot to debug / understand feature requests) on Jul 17, 2022
MartinThoma added a commit to py-pdf/sample-files that referenced this issue Jul 17, 2022
This is in the context of py-pdf/pypdf#1121

Co-authored-by: mtd91429 <mtd91429@users.noreply.github.com>
MartinThoma added a commit that referenced this issue Jul 17, 2022
@MartinThoma added the label MCVE in Tests (the MCVE was added to the PyPDF2 test suite) and removed the label Has MCVE on Jul 17, 2022
@MartinThoma
Member

Thank you for the good example! I've added it to the unit tests. If anybody knows how to fix this, it's now easy to test it :-)

@mtd91429
Contributor Author

I think this comes from how Python handles object references, combined with the fact that several outline entries reuse the same named destination. Specifically, the "First", "Tenth", and "Nineteenth" outline entries all point to the named destination "section.1", while "Second", "Eleventh", and "Twentieth" all point to "section.2". If each group shares a single destination object, the title assigned last would overwrite the earlier ones, which matches the actual output above.
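
To illustrate the suspected mechanism outside of PyPDF2, here is a minimal sketch that uses plain dicts in place of the library's destination objects. All names (`named_dests`, `build_aliased`, `build_copied`) are hypothetical, and the copy-based variant is only one possible remedy; it is not meant to describe what the linked code or the pull request actually does.

```python
import copy

# Hypothetical stand-in for a reader's table of named destinations:
# one object per destination name, shared by every entry that uses that name.
named_dests = {
    "section.1": {"/Page": 1},
    "section.2": {"/Page": 2},
}

# (title, destination name) pairs, mirroring the pasted-and-retitled bookmarks.
entries = [
    ("First", "section.1"),
    ("Second", "section.2"),
    ("Tenth", "section.1"),
    ("Eleventh", "section.2"),
    ("Nineteenth", "section.1"),
    ("Twentieth", "section.2"),
]


def build_aliased(entries, dests):
    """Reuse the shared destination object and write the title into it."""
    outline = []
    for title, name in entries:
        item = dests[name]        # the same object each time this name appears
        item["/Title"] = title    # overwrites the title seen by earlier items
        outline.append(item)
    return outline


def build_copied(entries, dests):
    """Copy the destination first so that each outline item owns its title."""
    outline = []
    for title, name in entries:
        item = copy.copy(dests[name])  # shallow copy with its own keys
        item["/Title"] = title
        outline.append(item)
    return outline


# Each builder gets its own copy of the table so the runs do not affect each other.
aliased_titles = [i["/Title"] for i in build_aliased(entries, copy.deepcopy(named_dests))]
copied_titles = [i["/Title"] for i in build_copied(entries, copy.deepcopy(named_dests))]
print(aliased_titles)
print(copied_titles)
```

In the aliased variant every entry that shares a destination ends up reporting the last title written to it, which is the same pattern as the repeated "Nineteenth", "Twentieth", ... titles in the report.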

I've been stepping through the code to determine where the error is introduced, and I think it occurs at https://github.com/py-pdf/PyPDF2/blob/ae0ff49058e6c57a8edcfcd3d956665ddaa8a787/PyPDF2/_reader.py#L837

I think I have a fix and have opened pull request #1128.
