Pdf pages with transparent images return black image in html #657

Closed
rianspeed opened this issue Sep 14, 2020 · 12 comments
Labels
upstream bug bug outside this package wontfix no intention to resolve

Comments

@rianspeed

Description:
The PDF document's pages contain a transparent PNG image. When I convert the document to HTML, the transparency is lost and the transparent part turns black. This makes text on top of it unreadable (if the font is black).
The reason is that the transparent PNG is being converted to JPEG and therefore loses its alpha channel.

Code to reproduce:
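`import fitz  # PyMuPDF

doc = fitz.open(document)  # document: path of the PDF file
with open("sample_html.html", "w", encoding="utf-8") as outfile:
    for page in doc:
        html = page.getText("html")  # page content as an HTML string
        outfile.write(html)
# the with block closes outfile, no explicit close() needed
doc.close()`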

Configuration:

  • Operating system: Windows 10
  • Python version 3.7.7
  • PyMuPDF 1.17.2: Python bindings for the MuPDF 1.17.0 library.
    Version date: 2020-06-20 07:00:22.
    Built for Python 3.7 on win32 (64-bit).

Attaching sample pdf and html output.
Test_doc_for_pdf_transparent_image.pdf
sample_html.zip
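
To check the claim that the images end up as JPEG, one could count the image types in the generated HTML. This is only a sketch: it assumes the images are embedded as base64 `data:` URIs inside `src="..."` attributes, and `embedded_image_types` is a hypothetical helper name, not part of PyMuPDF.

```python
import re
from collections import Counter

# Assumption: embedded images look like src="data:image/<type>;base64,...".
DATA_URI_MIME = re.compile(r'src="data:(image/[\w.+-]+);base64,')


def embedded_image_types(html: str) -> Counter:
    """Count the MIME types of all data-URI images found in an HTML string."""
    return Counter(DATA_URI_MIME.findall(html))
```

Running it on the contents of sample_html.html would show whether entries such as `image/jpeg` appear where a PNG with transparency was expected.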

@JorjMcKie
Collaborator

JorjMcKie commented Sep 14, 2020

Please try this conversion using the CLI tool of MuPDF, mutool draw or mutool convert.
If the issue persists, it is an upstream bug and you should report it to https://bugs.ghostscript.com/enter_bug.cgi.

@rianspeed
Author

Ok, I will try this and update here.

@JorjMcKie JorjMcKie added upstream bug bug outside this package and removed bug labels Sep 14, 2020
@JorjMcKie
Collaborator

I tried it already and can confirm that it is an upstream bug.
Using PyMuPDF, the following snippet recovers the original image:

>>> doc.getPageImageList(0)
[(5, 6, 1265, 1303, 8, 'DeviceRGB', '', 'Image5', 'FlateDecode')]  # an image (5) with a mask(6)
>>> img1=doc.extractImage(5)  # read the base image
>>> img1["ext"]
'png'
>>> img2=doc.extractImage(6)  # read the mask image
>>> pix=fitz.Pixmap(img1["image"])  # make pixmap of base image
>>> pix  # already has a transparency channel:
Pixmap(DeviceRGB, IRect(0, 0, 1265, 1303), 1)
>>> pix.writeImage("not-transparent.png")  # but the alpha values are all opaque
>>> mask = fitz.Pixmap(img2["image"])  # pixmap of mask image
>>> mask
Pixmap(DeviceGray, IRect(0, 0, 1265, 1303), 0)
>>> pix.setAlpha(mask.samples)  # take its samples as alpha
>>> pix.writeImage("but-now.png")  # this is the original image!
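
For readers wondering what "take its samples as alpha" amounts to, here is a conceptual sketch in plain Python. It is not PyMuPDF's implementation. It assumes the base pixmap stores 4 bytes per pixel (R, G, B, alpha) and the mask stores 1 gray byte per pixel, as the two `Pixmap` outputs above suggest. `apply_mask_as_alpha` is a hypothetical name.

```python
def apply_mask_as_alpha(rgba: bytes, mask: bytes) -> bytes:
    """Return a copy of RGBA samples whose alpha bytes are taken from mask."""
    if len(rgba) != 4 * len(mask):
        raise ValueError("mask must hold exactly one byte per RGBA pixel")
    out = bytearray(rgba)
    # Every 4th byte, starting at index 3, is a pixel's alpha value.
    out[3::4] = mask
    return bytes(out)


# Two opaque pixels (red, blue) and a mask making them transparent / half-transparent.
pixels = bytes([255, 0, 0, 255, 0, 0, 255, 255])
print(list(apply_mask_as_alpha(pixels, bytes([0, 128]))))
# prints [255, 0, 0, 0, 0, 0, 255, 128]
```

The color bytes stay untouched; only the alpha channel changes, which is why the second PNG shows the original transparency.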

@rianspeed
Author

rianspeed commented Sep 15, 2020

@JorjMcKie Thank you! I was not able to install mutool due to system restrictions. Could you please report this to Ghostscript?
I also found that the transparency is maintained when we convert the page to an image using the following code:

`pixmap = page.getPixmap(alpha=False)
pixmap.writePNG("page-%i.png" % page.number)  # the page image shows the transparent areas correctly`

Also, as a workaround, I am planning to write a method to generate the HTML myself. Is there a better quick fix?

Thanks,
Sandeep

[Attached image: page-0]

@JorjMcKie
Collaborator

Your page image is as it should be, and as a PDF viewer shows it, too.
But you generated the page image itself as opaque (alpha=False). So no news so far.

But I thought you absolutely wanted a copy of the page that shows correctly in a browser? And only that is the problem.

If you know how to integrate an SVG image in your HTML code, you can use the SVG image of the PDF page like so:

svg = page.getSVGimage()
with open("page-%i.svg" % page.number, "w") as out:
    out.write(svg)

This image is correctly rendered.

@rianspeed
Author

I am not working with pdf pages as images, rather getting the pages as html is my goal. So yes I want the page shown correctly in the browser as html. So I believe converting/integrating svg image of the page is not an option.

Thanks!

@JorjMcKie
Collaborator

But you can show an SVG in a browser!
If you know HTML, you can wrap and display an SVG in an HTML page.

Waiting for MuPDF to fix the bug will not get you to your goal within the next few months.

@JorjMcKie
Collaborator

JorjMcKie commented Sep 15, 2020

If you

  1. create a page-n.svg per page as shown above, and
  2. create the following mini-html page-n.html per page,
<!DOCTYPE html>
<html>
<body>
<div id="page-n" style="position:relative;width:595pt;height:841pt;background-color:white">
<img style="position:absolute;top:0pt;left:0pt;width:595pt;height:841pt" src="page-n.svg">
</div>
</body>
</html>

... then the page will hopefully show correctly in a browser.
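
Writing these wrappers by hand gets tedious for many pages, so here is a minimal sketch that generates the same mini-HTML per page. `page_wrapper_html` and its parameters are hypothetical names. The defaults 595 x 841 pt are just the values from the example above; in practice they would presumably come from the page dimensions (for instance `page.rect.width` and `page.rect.height` in PyMuPDF).

```python
from string import Template

# Same markup as the hand-written wrapper, with page number and size as placeholders.
PAGE_TEMPLATE = Template("""<!DOCTYPE html>
<html>
<body>
<div id="page-$n" style="position:relative;width:${w}pt;height:${h}pt;background-color:white">
<img style="position:absolute;top:0pt;left:0pt;width:${w}pt;height:${h}pt" src="page-$n.svg">
</div>
</body>
</html>
""")


def page_wrapper_html(n: int, width_pt: float = 595, height_pt: float = 841) -> str:
    """Return the wrapper HTML for page n, referencing page-n.svg."""
    return PAGE_TEMPLATE.substitute(n=n, w=f"{width_pt:g}", h=f"{height_pt:g}")
```

The result of `page_wrapper_html(0)` could then be written to page-0.html next to page-0.svg.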

@JorjMcKie
Collaborator

This HTML code is also sufficient:

<!DOCTYPE html>
<html>

<body>
    <div id="page-n" style="background-color:white">
        <img src="page-n.svg">
    </div>
</body>

</html>

Browsers generally also support the compressed SVGZ format (this needs some additional specification in the HTML code). So you may want to consider writing your SVG images GZIP-compressed, which will reduce their file size by more than 50%.
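
A minimal sketch of the compression step, using only Python's standard `gzip` module. `svg_to_svgz` and `write_svgz` are hypothetical helper names, and the input is assumed to be the SVG text of one page. How the browser is told about the compression is not covered here; as far as I know, a web server typically has to deliver such files with a `Content-Encoding: gzip` header.

```python
import gzip


def svg_to_svgz(svg_text: str) -> bytes:
    """Return the GZIP-compressed bytes of an SVG string (the SVGZ payload)."""
    return gzip.compress(svg_text.encode("utf-8"))


def write_svgz(svg_text: str, path: str) -> None:
    """Write an SVG string to path as a GZIP-compressed .svgz file."""
    with open(path, "wb") as out:
        out.write(svg_to_svgz(svg_text))
```

For example, `write_svgz(svg, "page-%i.svgz" % page.number)` would replace the plain-text write of the SVG file.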

@rianspeed
Author

@JorjMcKie Thanks much for the update! I will have to look into this. The PDF undergoes some manipulations, like creating annotations and highlighting text, before it is converted to HTML. I need to check whether these are retained as well. Also the text needs to be selectable. I will verify that, too.

@JorjMcKie
Collaborator

JorjMcKie commented Sep 15, 2020

> Also the text needs to be selectable.

This will not work with the SVG image approach. The rest will.

@JorjMcKie
Collaborator

> Ok, I will try this and update here.

@rianspeed - any reaction from MuPDF yet?

@JorjMcKie JorjMcKie added the wontfix no intention to resolve label Apr 14, 2021