Pdf pages with transparent images return black image in html #657

rianspeed · 2020-09-14T16:06:09Z

Description:
The PDF document's pages contain a transparent PNG image. When converting this document to HTML, the transparency is lost and the transparent part becomes black. This makes any text on top of it unreadable (if the font is black).
The reason is that the transparent PNG is converted to JPEG, which loses the alpha channel.

Code to reproduce:
`import fitz  # PyMuPDF

doc = fitz.open(document)
# the "with" block closes the output file automatically
with open("sample_html.html", "w", encoding="utf-8") as outfile:
    for page in doc:
        html = page.getText("html")  # HTML for one page
        outfile.write(html)

doc.close()`
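
To check the claim that the image ends up as JPEG in the output, one could scan the generated HTML for embedded images. This is a minimal sketch that assumes the images are embedded as base64 data URIs of the form `src="data:image/<type>;base64,..."`; the helper name `embedded_image_types` is made up for illustration and is not part of PyMuPDF.

```python
import re
from collections import Counter

# Assumption: images appear as <img ... src="data:image/<type>;base64,...">
_DATA_URI = re.compile(r'src="data:image/([a-zA-Z0-9.+-]+);base64,')


def embedded_image_types(html: str) -> Counter:
    """Count the MIME subtypes of images embedded as data URIs in an HTML string."""
    return Counter(subtype.lower() for subtype in _DATA_URI.findall(html))
```

Reading `sample_html.html` into a string and passing it to `embedded_image_types` would then show whether `jpeg` entries appear where a `png` was expected.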

Configuration :

Operating system: Windows 10
Python version 3.7.7
PyMuPDF 1.17.2: Python bindings for the MuPDF 1.17.0 library.
Version date: 2020-06-20 07:00:22.
Built for Python 3.7 on win32 (64-bit).

Attaching sample pdf and html output.
Test_doc_for_pdf_transparent_image.pdf
sample_html.zip


JorjMcKie · 2020-09-14T16:28:18Z

Please try this conversion using the CLI tool of MuPDF, mutool draw or mutool convert.
If the issue persists, it is an upstream bug and you should report it to https://bugs.ghostscript.com/enter_bug.cgi.

rianspeed · 2020-09-14T17:15:57Z

Ok, I will try this and update here.

JorjMcKie · 2020-09-14T17:57:33Z

I tried it already. I can confirm that it is an upstream bug.
Using PyMuPDF, the following snippet recovers the original image:

>>> doc.getPageImageList(0)
[(5, 6, 1265, 1303, 8, 'DeviceRGB', '', 'Image5', 'FlateDecode')]  # an image (5) with a mask(6)
>>> img1=doc.extractImage(5)  # read the base image
>>> img1["ext"]
'png'
>>> img2=doc.extractImage(6)  # read the mask image
>>> pix=fitz.Pixmap(img1["image"])  # make pixmap of base image
>>> pix  # already has a transparency channel:
Pixmap(DeviceRGB, IRect(0, 0, 1265, 1303), 1)
>>> pix.writeImage("not-transparent.png")  # but the alpha values are all opaque
>>> mask = fitz.Pixmap(img2["image"])  # pixmap of mask image
>>> mask
Pixmap(DeviceGray, IRect(0, 0, 1265, 1303), 0)
>>> pix.setAlpha(mask.samples)  # take its samples as alpha
>>> pix.writeImage("but-now.png")  # this is the original image!
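
For readers wondering what "take its samples as alpha" amounts to, here is a dependency-free sketch of the idea on raw bytes. It assumes 8-bit RGBA samples for the base image and one 8-bit gray sample per pixel for the mask; `replace_alpha` is a hypothetical helper for illustration, not a PyMuPDF function, and it does not claim to mirror how the library implements `setAlpha` internally.

```python
def replace_alpha(rgba: bytes, mask: bytes) -> bytes:
    """Return a copy of RGBA samples whose alpha bytes are taken from a gray mask.

    rgba: 4 bytes per pixel (R, G, B, A), mask: 1 byte per pixel.
    """
    if len(rgba) != 4 * len(mask):
        raise ValueError("mask must contain exactly one sample per RGBA pixel")
    out = bytearray(rgba)
    out[3::4] = mask  # every 4th byte, starting at index 3, is the alpha value
    return bytes(out)
```

With two fully opaque pixels and a mask of `0` and `255`, the first pixel becomes fully transparent while the second stays opaque.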

rianspeed · 2020-09-15T05:03:21Z

@JorjMcKie Thank you! I was not able to install mutool due to system restrictions. Could you please report this to Ghostscript?
I also found that the transparency is maintained when we convert the page to an image using the following code:

`pixmap = page.getPixmap(alpha=False)
pixmap.writePNG("page-%i.png" % page.number)  # returns transparent image of the page`

Also, as a workaround I am planning to write my own method to generate the HTML. Is there a better quick fix?

Thanks,
Sandeep

JorjMcKie · 2020-09-15T08:48:14Z

Your page image is as it should be, and as a PDF viewer shows it, too.
But you generated the page image itself as non-transparent (alpha=False). So no news so far.

But I thought you absolutely want a copy of the page that shows correctly in a browser? Only that is the problem.

If you know how to integrate an SVG image in your HTML code, you can use the SVG image of the PDF page like so:

svg = page.getSVGimage()
with open("page-%i.svg" % page.number, "w") as out:
    out.write(svg)

This image is correctly rendered.

rianspeed · 2020-09-15T09:07:28Z

I am not working with PDF pages as images; my goal is to get the pages as HTML. So yes, I want the page shown correctly in the browser as HTML, and I believe converting to or integrating an SVG image of the page is not an option.

Thanks!

JorjMcKie · 2020-09-15T09:21:10Z

But you can show an SVG in a browser!
If you know HTML, you can wrap and display an SVG in an HTML script.

Waiting for MuPDF to fix the bug will not get you to your goal any time within the next few months.

JorjMcKie · 2020-09-15T10:19:36Z

If you

- create a page-n.svg per page as shown above, and
- create the following mini-HTML file page-n.html per page,

<!DOCTYPE html>
<html>
<body>
<div id="page-n" style="position:relative;width:595pt;height:841pt;background-color:white">
<img style="position:absolute;top:0pt;left:0pt;width:595pt;height:841pt" src="page-n.svg">
</div>
</body>
</html>

... then the page will hopefully show correctly in a browser.
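
Writing one such wrapper file per page could be automated along these lines. This is only a sketch: `svg_wrapper_html` is a hypothetical helper, and the default size of 595 x 841 pt is simply the one used in the HTML above, so other page sizes would have to be passed in explicitly.

```python
def svg_wrapper_html(page_no: int, width_pt: float = 595, height_pt: float = 841) -> str:
    """Build a minimal HTML wrapper that shows page-<n>.svg at the given size."""
    name = f"page-{page_no}"
    size = f"width:{width_pt:g}pt;height:{height_pt:g}pt"
    return (
        "<!DOCTYPE html>\n"
        "<html>\n"
        "<body>\n"
        f'<div id="{name}" style="position:relative;{size};background-color:white">\n'
        f'<img style="position:absolute;top:0pt;left:0pt;{size}" src="{name}.svg">\n'
        "</div>\n"
        "</body>\n"
        "</html>\n"
    )
```

The returned string can be written to `page-n.html` next to the matching `page-n.svg`.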

JorjMcKie · 2020-09-15T11:21:50Z

This HTML code is also sufficient:

<!DOCTYPE html>
<html>

<body>
    <div id="page-n" style="background-color:white">
        <img src="page-n.svg">
    </div>
</body>

</html>

Browsers in general also support the compressed SVGZ format (some more specification in the HTML code needed). So you may want to consider writing your SVG images GZIP compressed. This will reduce their file size by more than 50%.
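
Producing the compressed data needs nothing beyond Python's standard `gzip` module. A minimal sketch follows; `svg_to_svgz_bytes` is a hypothetical helper name, and how the browser is told that the file is compressed is outside its scope.

```python
import gzip


def svg_to_svgz_bytes(svg_text: str) -> bytes:
    """GZIP-compress SVG source text; the result can be written to a .svgz file."""
    return gzip.compress(svg_text.encode("utf-8"))
```

The bytes would be written in binary mode, for example `open("page-0.svgz", "wb").write(svg_to_svgz_bytes(svg))`, where `svg` is the string returned by `page.getSVGimage()`.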

rianspeed · 2020-09-15T11:33:14Z

@JorjMcKie Thanks much for the update! I will have to look into this. The PDF is manipulated before conversion to HTML, for example by creating annotations and highlighting text, so I need to check whether these are retained as well. The text also needs to be selectable. I will verify that, too.

JorjMcKie · 2020-09-15T13:24:16Z

> Also the text needs to be selectable.

This will not work with the SVG approach. The rest will.

JorjMcKie · 2021-02-07T11:55:51Z

> Ok, I will try this and update here.

@rianspeed - any reaction from MuPDF yet?

rianspeed added the bug label Sep 14, 2020

rianspeed assigned JorjMcKie Sep 14, 2020

JorjMcKie added the "upstream bug" label (bug outside this package) and removed the "bug" label Sep 14, 2020

dothinking mentioned this issue Oct 4, 2020

Question / Comment: extract transparent image with EMPTY color-space and soft-mask #677

Closed

JorjMcKie added the "wontfix" label (no intention to resolve) Apr 14, 2021

JorjMcKie closed this as completed Apr 14, 2021
