Phantom "f" character appears #1078

Closed
@alexgrimmy

(Apologies if this is user error)

I'm extracting text from a series of PDF customer orders. When I call page.get_textbox(x,y,xx,yy), the result contains phantom "f" characters.

First, I look at the page's text blocks:

    blocks = page.get_text("blocks")
    for b in blocks:
        print(b)

There is this one block:
(116.70000457763672, 65.65986633300781, 201.9755859375, 74.73955535888672, 'PURCHASE ORDER for\n', 5, 0)
...

Now when I try to extract the same area using get_textbox:

    text = page.get_textbox([116, 65, 202, 75]).encode("utf8")
    print(text)

I get the following:

b'PURCHASE ORDER for\nf\nf'

Where do these phantom "f" characters come from?

This is not limited to this one block; it happens with all other blocks as well. It seems that wherever there is a \n, the function returns \nf instead.
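
One way to narrow this down might be to list every block whose bounding box overlaps the rectangle passed to get_textbox, in case the extra characters belong to neighboring text. This is only a minimal sketch: the helper name `blocks_touching` is hypothetical (not part of PyMuPDF), the second block in the sample data is made up, and the idea that the "f" comes from overlapping text is an assumption, not a confirmed cause. It relies only on the tuple layout shown in the block output above, (x0, y0, x1, y1, text, ...).

```python
def blocks_touching(blocks, rect):
    """Return the blocks whose bounding box overlaps rect = (x0, y0, x1, y1).

    `blocks` is a list of tuples shaped like the output of
    page.get_text("blocks"): (x0, y0, x1, y1, text, ...).
    """
    rx0, ry0, rx1, ry1 = rect
    hits = []
    for b in blocks:
        bx0, by0, bx1, by1 = b[:4]
        # Two axis-aligned rectangles overlap when they overlap on both axes.
        if bx0 < rx1 and bx1 > rx0 and by0 < ry1 and by1 > ry0:
            hits.append(b)
    return hits


# The block reported above, plus a made-up block elsewhere on the page
# (hypothetical values, for illustration only).
blocks = [
    (116.70000457763672, 65.65986633300781, 201.9755859375,
     74.73955535888672, "PURCHASE ORDER for\n", 5, 0),
    (300.0, 200.0, 350.0, 210.0, "elsewhere\n", 6, 0),
]

hits = blocks_touching(blocks, (116, 65, 202, 75))
for hit in hits:
    # Show the text with escapes visible, followed by its bounding box.
    print(repr(hit[4]), hit[:4])
```

On a real page the call would be `blocks_touching(page.get_text("blocks"), (116, 65, 202, 75))`. If more than one block is reported, the additional ones would be candidates for the source of the extra characters.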

I'm running:

  • Windows 10
  • Python 3.8.2 64-bit
  • PyMuPDF 1.18.14 (wheel)

Many thanks for the support. Your work is amazing.
