page.getText('html') returns empty string #726

Closed
further-reading opened this issue Nov 19, 2020 · 8 comments

@further-reading

Describe the bug (mandatory)

page.getText('html') is returning an empty string for some files. Interestingly, page.getText('text') returns content so it is unclear why it is failing.

To Reproduce (mandatory)

Code:

import fitz  # import pymupdf by importing fitz
from io import BytesIO
import requests


# Working file: both 'text' and 'html' return content
# url = 'https://miraiz.chuden.co.jp/home/electric/contract/fuelcost/unitprice/__icsFiles/afieldfile/2020/09/30/nen_price_202011.pdf'

# Broken file: 'html' comes back empty (swap in the URL above to compare)
url = 'https://miraiz.chuden.co.jp/home/electric/contract/fuelcost/unitprice/__icsFiles/afieldfile/2020/06/29/nen_price_202008.pdf'

res = requests.get(url)
data = BytesIO(res.content)
doc = fitz.open(stream=data, filetype="pdf")
page = doc[0]
text = page.getText('text')
html = page.getText('html')

When using the url tagged # Working file everything works fine. When using the url tagged # Broken file html is empty while text has content.

Expected behavior (optional)

I expected the page to be converted to HTML or, if there was a parsing problem, to get some sort of error message.

Your configuration (mandatory)

  • Ubuntu 20.04.1 LTS
  • Python 3.8.5
  • PyMuPDF version 1.18.3 installed via pip
@JorjMcKie
Collaborator

Interesting case!
Output "xhtml" works, but not "html" / "xml". Also, mutool draw -o page.html nen_price_202008.pdf does work and produces an apparently correct file.
This tool and PyMuPDF use the same base MuPDF function fz_print_stext_page_as_html.
Investigating ...
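
Until a fix is available, the observation above suggests a possible stopgap: ask for "xhtml" when "html" comes back empty although the page has text. This is only a sketch under assumptions. `get_html_with_fallback` is a hypothetical helper name, it uses the `getText` method name from PyMuPDF 1.18.x as in the report, and the "xhtml" output is not laid out the same way as "html", so callers need to tolerate the difference.

```python
def get_html_with_fallback(page):
    """Return page.getText('html'), or 'xhtml' output if 'html' is empty.

    `page` is expected to be a fitz.Page (PyMuPDF 1.18.x naming), but any
    object with a compatible getText(kind) method works.
    """
    html = page.getText("html")
    if html:
        return html
    # 'html' is empty: only fall back if the page really has text,
    # otherwise an empty result is the honest answer.
    if page.getText("text").strip():
        return page.getText("xhtml")
    return html
```

With the reproduction script from the report, this would be called as `html = get_html_with_fallback(doc[0])`.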

@JorjMcKie
Collaborator

Found the problem:
The nen_price_202008.pdf file contains non-UTF8 characters and I was using encoding error-handling "strict". Changing this to "replace" produces output.
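
To illustrate the difference between the two error handlers, here is a small standalone sketch using plain Python decoding. It does not reproduce PyMuPDF's internal code, and the byte string is made up for illustration; it only shows that "strict" raises on bytes that are not valid UTF-8 while "replace" substitutes U+FFFD and keeps going.

```python
# Made-up sample: 0x82 is a UTF-8 continuation byte with no lead byte,
# so this sequence is not valid UTF-8.
RAW = b"<p>" + b"\x82\x6c\x82\x72" + b"</p>"


def decode(raw: bytes, errors: str) -> str:
    """Decode bytes as UTF-8 with the given error handler."""
    return raw.decode("utf-8", errors=errors)


try:
    decode(RAW, "strict")
    strict_failed = False
except UnicodeDecodeError:
    # "strict" refuses the whole input as soon as one byte is invalid
    strict_failed = True

# "replace" swaps each undecodable byte for U+FFFD and keeps the rest
replaced = decode(RAW, "replace")
```

How the strict failure ended up as an empty string rather than an exception is not stated in this thread.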

@further-reading
Author

Nice one, thanks for the quick response!

@JorjMcKie
Collaborator

More specifically, the non-UTF8 characters only occur in the fontnames. You can have a look into this by comparing doc.getPageFontList(0) of the two PDFs.
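
A rough way to do that comparison is sketched below. It assumes each entry returned by `doc.getPageFontList(0)` has the form `(xref, ext, type, basefont, name, encoding)`, and it uses "contains non-ASCII characters" as a heuristic for spotting the odd font names. Both the helper name `non_ascii_font_names` and that heuristic are assumptions made for illustration; how the affected names actually show up in Python is not stated in this thread.

```python
def non_ascii_font_names(font_list):
    """Return the basefont names that contain non-ASCII characters.

    `font_list` is expected to look like the result of
    doc.getPageFontList(0): a list of entries whose 4th item (index 3)
    is the basefont name. Non-ASCII is only a heuristic here.
    """
    suspicious = []
    for entry in font_list:
        basefont = entry[3]
        if not basefont.isascii():
            suspicious.append(basefont)
    return suspicious
```

Running this on the font lists of both PDFs and printing each result with `repr()` should make any difference between the two files visible.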

@JorjMcKie
Collaborator

You can download a pre-release wheel from here.
The OSX build is already done; the Linux build is still waiting in the queue.

@further-reading
Author

Thanks for the quick fix! Just ran it locally on linux and it is working fine now.

@JorjMcKie
Collaborator

I think I will make that version public over the weekend.

@JorjMcKie
Collaborator

New version available on PyPI:
