page.getText('html') returns empty string #726

further-reading · 2020-11-19T15:14:05Z

Describe the bug (mandatory)

page.getText('html') returns an empty string for some files. page.getText('text') returns content for the same files, so it is unclear why the HTML output fails.

To Reproduce (mandatory)

Code:

import fitz  # import pymupdf by importing fitz
from io import BytesIO
import requests


# Working file
# url = 'https://miraiz.chuden.co.jp/home/electric/contract/fuelcost/unitprice/__icsFiles/afieldfile/2020/09/30/nen_price_202011.pdf'

# Broken file (swap in the URL above to compare)
url = 'https://miraiz.chuden.co.jp/home/electric/contract/fuelcost/unitprice/__icsFiles/afieldfile/2020/06/29/nen_price_202008.pdf'

res = requests.get(url)
data = BytesIO(res.content)
doc = fitz.open(stream=data, filetype="pdf")
page = doc[0]
text = page.getText('text')
html = page.getText('html')

When using the url tagged # Working file everything works fine. When using the url tagged # Broken file html is empty while text has content.

Expected behavior (optional)

I expected the page to be converted to HTML or, if there was a problem parsing the file, some sort of error message.

Your configuration (mandatory)

Ubuntu 20.04.1 LTS
Python 3.8.5
PyMuPDF version 1.18.3 installed via pip


JorjMcKie · 2020-11-20T08:52:32Z

Interesting case!
Output "xhtml" works, but not "html" / "xml". Also, mutool draw -o page.html nen_price_202008.pdf does work and produces an apparently correct file.
This tool and PyMuPDF use the same base MuPDF function fz_print_stext_page_as_html.
Investigating ...
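A quick way to see which output formats come back empty is to loop over them. This is only a sketch: empty_formats is a hypothetical helper, and FakePage is a stand-in so the snippet runs without a PDF. With a real document, pass doc[0] instead.

```python
def empty_formats(page, formats=("text", "html", "xhtml", "xml")):
    """Return the output formats for which page.getText() yields an empty string."""
    return [fmt for fmt in formats if not page.getText(fmt)]


class FakePage:
    """Stand-in for a fitz page (hypothetical), mimicking the behavior reported above."""

    def __init__(self, outputs):
        self._outputs = outputs

    def getText(self, fmt):
        return self._outputs.get(fmt, "")


# Simulated results: "text" and "xhtml" have content, "html" and "xml" are empty.
page = FakePage({"text": "abc", "xhtml": "<div/>", "html": "", "xml": ""})
result = empty_formats(page)
print(result)
```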

JorjMcKie · 2020-11-20T09:41:29Z

Found the problem:
The nen_price_202008.pdf file contains non-UTF8 characters and I was using encoding error-handling "strict". Changing this to "replace" produces output.
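The difference between the two error handlers can be reproduced in plain Python, independent of PyMuPDF. A minimal sketch, assuming a font name that contains bytes that are invalid in UTF-8; the sample bytes are made up for illustration and are not taken from the PDF:

```python
# Hypothetical HTML fragment whose font name contains bytes that are not valid UTF-8.
raw = b'<span style="font-family:\x82l\x82r">text</span>'

# "strict" raises on the first invalid byte, so no output is produced at all.
try:
    raw.decode("utf-8", errors="strict")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# "replace" substitutes U+FFFD for each invalid byte and keeps the rest of the text.
replaced = raw.decode("utf-8", errors="replace")

print(strict_ok)
print(replaced)
```

With "replace" the font name is no longer exact, but the surrounding markup and text survive.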

further-reading · 2020-11-20T09:50:02Z

Nice one, thanks for the quick response!

JorjMcKie · 2020-11-20T09:58:13Z

More specifically, the non-UTF8 characters only occur in the fontnames. You can have a look into this by comparing doc.getPageFontList(0) of the two PDFs.
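One way to do that comparison is to pull out the font names and flag the ones containing characters outside ASCII. This is a sketch under assumptions: non_ascii_font_names is a hypothetical helper, each entry is assumed to have the layout (xref, ext, type, basefont, name, encoding), and the sample entries are invented rather than taken from the two PDFs.

```python
def non_ascii_font_names(font_list):
    """Return basefont names that contain characters outside ASCII.

    Assumes each entry looks like (xref, ext, type, basefont, name, encoding).
    """
    return [entry[3] for entry in font_list if not entry[3].isascii()]


# Invented sample data shaped like the assumed font list entries.
working = [(12, "ttf", "TrueType", "ABCDEF+Arial", "F1", "WinAnsiEncoding")]
broken = [(15, "ttf", "TrueType", "\ufffdl\ufffdr Gothic", "F2", "WinAnsiEncoding")]

print(non_ascii_font_names(working))
print(non_ascii_font_names(broken))
```

With the real files, the argument would be doc.getPageFontList(0) for each document.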

JorjMcKie · 2020-11-20T11:01:48Z

You can download a pre-release wheel from here.
The OSX wheel is already built; the Linux build is still waiting in the queue.

further-reading · 2020-11-20T11:50:21Z

Thanks for the quick fix! Just ran it locally on linux and it is working fine now.

JorjMcKie · 2020-11-20T13:39:56Z

I think I will make that version public over the weekend.

JorjMcKie · 2020-11-20T17:10:27Z

New version available on PyPI:

further-reading added the bug label Nov 19, 2020

further-reading assigned JorjMcKie Nov 19, 2020

JorjMcKie closed this as completed Nov 20, 2020
