Incorrect order of text lines (use_text_flow=True) #1289

Open
@samuelbradshaw

Describe the bug

On certain PDFs, page.extract_text_lines() returns lines in an unexpected order when use_text_flow is set to True.

Have you tried repairing the PDF?

Yes

Code to reproduce the problem

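import pdfplumber

pdf_path = '/path/to/file.pdf'

# repair=True asks pdfplumber to repair the PDF before parsing it
with pdfplumber.open(pdf_path, repair=True) as pdf:
    for page in pdf.pages:
        # use_text_flow=True orders text by the PDF's underlying character flow
        # rather than by position on the page
        lines = page.extract_text_lines(use_text_flow=True)
        for line in lines:
            print(line['text'])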

PDF file

how-great-the-wisdom-and-the-love_bi.pdf

Expected behavior

Lines should be returned in this order:

  1. Stap tingbaot broken bodi blong Kraes,
    Taem yumi brekem bred.
    Dring wora long kap blong yumi witnes,
    Yumi putum Kraes long fored.
  2. Plan blong Papa God hem i komplit
    Blong savem yumi long ol sin.
    Hem i tekem Jastis, Lav mo Mersi
    Blong mekem plan blong Salvesen.

Actual behavior

Lines are returned in this order:

  1. Stap tingbaot broken bodi blong Kraes,
    Taem yumi brekem bred.
  2. Plan blong Papa God hem i komplit
    Blong savem yumi long ol sin.
    Hem i tekem Jastis, Lav mo Mersi
    Blong mekem plan blong Salvesen.
    Dring wora long kap blong yumi witnes,
    Yumi putum Kraes long fored.
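
To help narrow down where the order goes wrong, here is a minimal diagnostic sketch. It assumes each line dict carries a "top" coordinate next to "text", as the dicts returned by extract_text_lines() do. The helper name find_order_breaks, its tolerance parameter, and the sample coordinates are hypothetical and only for illustration; the coordinates were not measured from the attached PDF. The helper flags every line that starts higher on the page than the line emitted just before it. In a multi-column layout such a jump can be legitimate, so treat the result as a hint rather than proof of a bug.

```python
from typing import Dict, List, Sequence, Tuple


def find_order_breaks(
    lines: Sequence[Dict], tolerance: float = 1.0
) -> List[Tuple[int, str]]:
    """Return (index, text) for each line that starts above the previous one.

    `lines` should look like the output of page.extract_text_lines():
    dicts with at least "text" and "top" (distance from the top of the page).
    `tolerance` is a hypothetical slack, in PDF points, for lines that share
    roughly the same vertical position.
    """
    breaks = []
    for i in range(1, len(lines)):
        # A smaller "top" means the line sits higher on the page.
        if lines[i]["top"] < lines[i - 1]["top"] - tolerance:
            breaks.append((i, lines[i]["text"]))
    return breaks


# Made-up coordinates, listed in the order shown under "Actual behavior".
sample = [
    {"text": "Stap tingbaot broken bodi blong Kraes,", "top": 100.0},
    {"text": "Taem yumi brekem bred.", "top": 115.0},
    {"text": "Plan blong Papa God hem i komplit", "top": 175.0},
    {"text": "Blong savem yumi long ol sin.", "top": 190.0},
    {"text": "Hem i tekem Jastis, Lav mo Mersi", "top": 205.0},
    {"text": "Blong mekem plan blong Salvesen.", "top": 220.0},
    {"text": "Dring wora long kap blong yumi witnes,", "top": 130.0},
    {"text": "Yumi putum Kraes long fored.", "top": 145.0},
]

for index, text in find_order_breaks(sample):
    print(index, text)
```

On a real page, the same helper could be called as find_order_breaks(page.extract_text_lines(use_text_flow=True)) to list the lines where the text flow jumps back up the page.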

Screenshots

[Image: screenshot attached to the issue]

Environment

pdfplumber version: 0.11.6
Python version: 3.12.8
OS: macOS 15.4 Sequoia
