Phantom "f" character appears #1078
Comments
Thanks for the compliments 😎!
>>> b=page.get_text("blocks")[0]
>>> print(b)
(100.0, 88.17500305175781, 142.18499755859375, 103.28900146484375, 'pymupdf\n', 0, 0)
>>> r=b[:4]
>>> page.get_textbox(r)
'pymupdf'
>>> page.get_textbox(r).encode("utf8")
b'pymupdf'
>>> The |
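For readers unfamiliar with the "blocks" output, here is a small sketch that unpacks the tuple printed in the session above. The variable names are illustrative only; the field order (four coordinates, the block text, a block number, a block type) is how PyMuPDF documents its block tuples, where block type 0 denotes a text block.

```python
# Block tuple copied verbatim from the session above.
block = (100.0, 88.17500305175781, 142.18499755859375,
         103.28900146484375, "pymupdf\n", 0, 0)

# Field order: x0, y0, x1, y1, text, block number, block type.
x0, y0, x1, y1, text, block_no, block_type = block

# The first four entries form the rectangle handed to get_textbox().
rect = block[:4]

# The block text carries a trailing newline, which get_textbox() did not
# return in the session above.
clean_text = text.rstrip("\n")

print(rect)
print(repr(clean_text))
print("text block" if block_type == 0 else "image block")
```

Slicing with `b[:4]`, as in the session, yields the same rectangle as `rect` here.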
I'll send the file along to your e-mail address. No specific reason for utf8, I was trying various methods to no avail and that was the last state of where the code was at. Thanks! |
I have reproduced the problem in the meantime - very weird, let's see. |
Much thanks!
|
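I found the problem: this file has so-called "ligatures" (several characters combined in a single glyph), which get decomposed into their constituents with the flag defaults currently set. I need to change that.
To get your result, do not use page.get_textbox() until this is fixed. Instead, please use the method it wraps with the following parameters:
# define a get_textbox replacement
my_textbox = lambda page, rect: page.get_text("text", flags=fitz.TEXT_PRESERVE_LIGATURES, clip=rect)[:-1]
# use it like so
text = my_textbox(page, rect) |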
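The lambda above drops the last character unconditionally, which assumes the extracted text always ends in a newline. A slightly more defensive variant might look like the sketch below. The function name `get_textbox_keep_ligatures` and the `FakePage` class are hypothetical and exist only so the example runs without a PDF. With a real document you would pass a PyMuPDF page and `fitz.TEXT_PRESERVE_LIGATURES` as `flags`.

```python
def get_textbox_keep_ligatures(page, rect, flags):
    """Return the text inside rect, removing one trailing newline if present.

    `page` is expected to offer get_text(option, clip=..., flags=...),
    as a PyMuPDF page does.
    """
    text = page.get_text("text", clip=rect, flags=flags)
    # Strip the final newline only when there is one, so no real
    # character is lost if the text does not end with "\n".
    return text[:-1] if text.endswith("\n") else text


class FakePage:
    """Hypothetical stand-in for a PyMuPDF page (demo only)."""

    def __init__(self, text):
        self._text = text

    def get_text(self, option, clip=None, flags=None):
        # Ignores its arguments and returns canned text.
        return self._text


demo_page = FakePage("PURCHASE ORDER for\n")
# The flags value 0 is a placeholder; FakePage ignores it.
print(repr(get_textbox_keep_ligatures(demo_page, (0, 0, 1, 1), 0)))
```

Taking `flags` as a parameter keeps the sketch free of any PyMuPDF import; it is not meant as a replacement for the fix in the library itself.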
This worked beautifully, thank you!!
P.S. I'm using this technique to extract text out of tables, and it appears to work really well: https://towardsdatascience.com/a-table-detection-cell-recognition-and-text-extraction-algorithm-to-convert-tables-to-excel-files-902edcf289ec
Alex
On Saturday, June 5, 2021, 5:11:40 PM CDT, Jorj X. McKie ***@***.***> wrote:
I have the problem:
This file has so-called "ligatures" - characters combined in a single glyph - which get decomposed into their constituents with the flag defaults currently set.
I need to change that.
To get your result, do not use page.get_textbox() until this is fixed. Instead please use the method it wraps with the following parameters:
# define a get_textbox replacement
my_textbox = lambda page,rect: page.get_text("text",flags=fitz.TEXT_PRESERVE_LIGATURES,clip=rect)[:-1]
# use it like so
text = my_textbox(page, rect)
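To make the term "ligature" concrete, here is a small illustration using only Python's standard library. It is an analogy in plain Unicode terms and makes no claim about how MuPDF handles glyphs internally: a single ligature code point is expanded into its constituent letters by compatibility normalization.

```python
import unicodedata

# U+FB01 is a single code point that renders as the "fi" ligature.
ligature = "\ufb01"
print(unicodedata.name(ligature))

# Compatibility normalization (NFKC) splits it into its constituents.
decomposed = unicodedata.normalize("NFKC", ligature)
print(len(ligature), len(decomposed), decomposed)
```

One character goes in and two come out, which is the kind of decomposition the explanation above refers to.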
|
Another option: |
Fixed in new version. |
(Apologies if this is user error)
I'm working on extracting from a series of PDF customer orders. Somehow when I call page.get_textbox(x,y,xx,yy) I get phantom "f" characters.
First, I look at the page's text blocks:
There is this one block:
(116.70000457763672, 65.65986633300781, 201.9755859375, 74.73955535888672, 'PURCHASE ORDER for\n', 5, 0)
...
Now when I try to extract using get_textbox
I get the following:
b'PURCHASE ORDER for\nf\nf'
Essentially these phantom "f" characters appear?
This is not limited to just this block and happens with all other blocks as well. It seems that any time there is a \n, the function returns \nf
I'm running:
Much thanks for the support. Your work is amazing.