Phantom "f" character appears #1078

Closed
alexgrimmy opened this issue Jun 5, 2021 · 8 comments

@alexgrimmy

(Apologies if this is user error)

I'm working on extracting from a series of PDF customer orders. Somehow when I call page.get_textbox(x,y,xx,yy) I get phantom "f" characters.

First, I look at the page's text blocks:

    blocks = page.get_text("blocks")
    for b in blocks:
        print(b)

There is this one block:
(116.70000457763672, 65.65986633300781, 201.9755859375, 74.73955535888672, 'PURCHASE ORDER for\n', 5, 0)
...

Now when I try to extract using get_textbox

    text = page.get_textbox([116, 65, 202, 75]).encode("utf8")
    print(text)

I get the following:

b'PURCHASE ORDER for\nf\nf'

Essentially these phantom "f" characters appear?

This is not limited to just this block; it happens with all other blocks as well. It seems that any time there is a \n, the function returns \nf instead.

I'm running:

  • Windows 10
  • python 3.8.2 64bit
  • pymupdf 1.18.14 (wheel)

Much thanks for the support. Your work is amazing.

@JorjMcKie
Collaborator

Thanks for the compliments 😎!
Cannot reproduce this with a new "handmade" PDF - doesn't happen:

>>> b=page.get_text("blocks")[0]
>>> print(b)
(100.0, 88.17500305175781, 142.18499755859375, 103.28900146484375, 'pymupdf\n', 0, 0)
>>> r=b[:4]
>>> page.get_textbox(r)
'pymupdf'
>>> page.get_textbox(r).encode("utf8")
b'pymupdf'
>>> 

The get_textbox() method is implemented as get_text("text", clip=rect). From the result I remove the last \n if present.
So I guess I would need an example file / page to reproduce. If there are confidentiality concerns, please use my e-mail address.
BTW: any special reason why you need to encode this with UTF8?
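
The relationship described above, a clipped get_text("text", ...) call with one trailing \n removed if present, can be sketched as follows. This is only an illustration, not PyMuPDF's actual source: textbox_sketch and FakePage are hypothetical names, and the stand-in page merely mimics the get_text("text", clip=rect) call so the snippet runs without a PDF.

```python
def textbox_sketch(page, rect):
    """Approximate get_textbox(): clipped plain-text extraction, minus one trailing newline."""
    text = page.get_text("text", clip=rect)
    # Drop the final "\n" only if it is actually there.
    return text[:-1] if text.endswith("\n") else text


class FakePage:
    """Hypothetical stand-in for a PyMuPDF page, used only to make the sketch runnable."""

    def __init__(self, text):
        self._text = text

    def get_text(self, option, clip=None):
        # A real page would extract the text inside `clip`; this returns canned text.
        return self._text


print(repr(textbox_sketch(FakePage("pymupdf\n"), (100, 88, 143, 104))))
```

With a real page object, the same function should behave like the call described in the comment above, assuming the installed version accepts the clip argument.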

@alexgrimmy
Author

I'll send the file along to your e-mail address. No specific reason for utf8, I was trying various methods to no avail and that was the last state of where the code was at. Thanks!

@JorjMcKie
Collaborator

I have reproduced the problem in the meantime - very weird, let's see.

@alexgrimmy
Author

alexgrimmy commented Jun 5, 2021 via email

@JorjMcKie
Collaborator

I have found the problem:
This file has so-called "ligatures" - characters combined in a single glyph - which get decomposed into their constituents with the flag defaults currently set.
I need to change that.
To get your result, do not use page.get_textbox() until this is fixed. Instead please use the method it wraps with the following parameters:

    import fitz  # PyMuPDF

    # define a get_textbox replacement that keeps ligatures as single glyphs
    my_textbox = lambda page, rect: page.get_text("text", flags=fitz.TEXT_PRESERVE_LIGATURES, clip=rect)[:-1]
    # use it like so
    text = my_textbox(page, rect)
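
For readers unfamiliar with ligatures, the standard-library snippet below shows the idea at the Unicode level: a single ligature code point such as U+FB01 expands into its constituent letters under compatibility normalization. This is an illustration only. It makes no claim about how MuPDF itself decomposes glyphs, the sample text is made up, and find_ligatures and expand_ligatures are hypothetical helper names.

```python
import unicodedata

# Latin ligatures in the Alphabetic Presentation Forms block (U+FB00..U+FB06).
LIGATURES = {chr(cp) for cp in range(0xFB00, 0xFB07)}


def find_ligatures(text):
    """Return (index, character) pairs for each ligature code point in text."""
    return [(i, ch) for i, ch in enumerate(text) if ch in LIGATURES]


def expand_ligatures(text):
    """Replace each ligature with its constituent letters, leaving other characters alone."""
    return "".join(
        unicodedata.normalize("NFKD", ch) if ch in LIGATURES else ch
        for ch in text
    )


sample = "PURCHASE ORDER \ufb01le"  # made-up sample text containing the "fi" ligature
print(find_ligatures(sample))
print(expand_ligatures(sample))
```

Running find_ligatures on extracted text is one way to check whether a page contains ligature code points at all, which may help when deciding whether the workaround above is needed.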

@alexgrimmy
Author

alexgrimmy commented Jun 6, 2021 via email

@JorjMcKie
Collaborator

Another option:
Simply replace the file utils.py in the PyMuPDF installation folder ...\Python38\Lib\site-packages\fitz with this one:
utils.zip

@JorjMcKie
Collaborator

Fixed in new version.
