Closed
Description
(Apologies if this is user error)
I'm working on extracting text from a series of PDF customer orders. Somehow, when I call page.get_textbox(x,y,xx,yy), I get phantom "f" characters.
First, I look at the page's text blocks:
blocks = page.get_text("blocks")
for b in blocks:
    print(b)
There is this one block:
(116.70000457763672, 65.65986633300781, 201.9755859375, 74.73955535888672, 'PURCHASE ORDER for\n', 5, 0)
...
Now when I try to extract the same text using get_textbox:
text = page.get_textbox( [116,65,202,75] ).encode("utf8")
print(text)
I get the following:
b'PURCHASE ORDER for\nf\nf'
Where do these phantom "f" characters come from?
This is not limited to this one block; it happens with all the other blocks as well. It seems that any time there is a \n, the function returns \nf.
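In case it helps narrow this down, here is a minimal sketch of one check: whether any other block's bounding box overlaps the rectangle passed to get_textbox. It is only a sketch under assumptions. The helper name blocks_touching is made up for illustration, the second sample block is invented, and it is not confirmed that an overlapping block is the cause. The tuple layout (x0, y0, x1, y1, text, block number, block type) is the one shown in the output above.

```python
def blocks_touching(blocks, rect):
    """Return the blocks whose bounding box intersects rect = (x0, y0, x1, y1).

    Hypothetical helper, not part of PyMuPDF.
    """
    rx0, ry0, rx1, ry1 = rect
    hits = []
    for b in blocks:
        x0, y0, x1, y1 = b[:4]
        # Two rectangles intersect when they overlap on both axes.
        if x0 < rx1 and x1 > rx0 and y0 < ry1 and y1 > ry0:
            hits.append(b)
    return hits


# First tuple: the block reported above.
# Second tuple: an invented block, included only to show a non-match.
sample_blocks = [
    (116.70000457763672, 65.65986633300781, 201.9755859375,
     74.73955535888672, 'PURCHASE ORDER for\n', 5, 0),
    (300.0, 200.0, 350.0, 210.0, 'hypothetical other block\n', 6, 0),
]

for b in blocks_touching(sample_blocks, (116, 65, 202, 75)):
    # Print the raw text and the block number of every overlapping block.
    print(repr(b[4]), "block", b[5])
```

On the real page, the list returned by page.get_text("blocks") would be passed in place of sample_blocks. If only the one expected block overlaps the rectangle, the extra characters are presumably not coming from a neighbouring block.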
I'm running:
- Windows 10
- Python 3.8.2 64-bit
- PyMuPDF 1.18.14 (wheel)
Many thanks for the support. Your work is amazing.