Vector Image Drawings are not replicated correctly #1518

Closed
ayusonkj opened this issue Jan 6, 2022 · 15 comments

@ayusonkj

ayusonkj commented Jan 6, 2022

Please provide all mandatory information!

Describe the bug (mandatory)

The object "TCA" in the pdf is not being replicated correctly.

To Reproduce (mandatory)

Using the example from the PyMuPDF documentation, I attempted to replicate the vector drawings:

import fitz
doc = fitz.open("pdf_test_page_1.pdf")
page = doc[0]
paths = page.get_drawings()  # extract existing drawings
# this is a list of "paths", which can directly be drawn again using Shape
# -------------------------------------------------------------------------
#
# define some output page with the same dimensions
outpdf = fitz.open()
outpage = outpdf.new_page(width=page.rect.width, height=page.rect.height)
shape = outpage.new_shape()  # make a drawing canvas for the output page
# --------------------------------------
# loop through the paths and draw them
# --------------------------------------
for path in paths:
    # ------------------------------------
    # draw each entry of the 'items' list
    # ------------------------------------
    for item in path["items"]:  # these are the draw commands
        if item[0] == "l":  # line
            shape.draw_line(item[1], item[2])
        elif item[0] == "re":  # rectangle
            shape.draw_rect(item[1])
        elif item[0] == "qu":  # quad
            shape.draw_quad(item[1])
        elif item[0] == "c":  # curve
            shape.draw_bezier(item[1], item[2], item[3], item[4])
        else:
            raise ValueError("unhandled drawing", item)
    # ------------------------------------------------------
    # all items are drawn, now apply the common properties
    # to finish the path
    # ------------------------------------------------------
    shape.finish(
        fill=path["fill"],  # fill color
        color=path["color"],  # line color
        dashes=path["dashes"],  # line dashing
        even_odd=path.get("even_odd", True),  # control color of overlaps
        closePath=path["closePath"],  # whether to connect last and first point
        lineJoin=path["lineJoin"],  # how line joins should look like
        lineCap=max(path["lineCap"]),  # how line ends should look like
        width=path["width"],  # line width
        stroke_opacity=path.get("stroke_opacity", 1),  # same value for both
        fill_opacity=path.get("fill_opacity", 1),  # opacity parameters
        )
# all paths processed - commit the shape to its page
shape.commit()
outpdf.save("drawings-page-0.pdf")

Expected behavior (optional)

All Vector drawings are replicated

Screenshots (optional)

Sample Pdf Input:
image

Sample Pdf Output:
image

pdf files used for testing:
sample_pdf.zip

Your configuration (mandatory)

VERSION="20.04.3 LTS (Focal Fossa)"
--ID=ubuntu
--VERSION_ID="20.04"
Python 3.8.10
PyMuPDF 1.18.17
Installed via wheel

@ayusonkj
Author

ayusonkj commented Jan 6, 2022

Here is another test with a different pdf page
sample pdf:
tc_main_2_page_3.pdf

Sample Screenshot input:
image

Sample Screenshot output:
image

@JorjMcKie
Collaborator

Some general comments to start with:

  • Images and text are always ignored if you recreate the drawings
  • As noted in the documentation, drawings extraction is not perfect (and therefore reproduction cannot be either). Several features are ignored, like clipping or shading, and this is an incomplete list.

All Vector drawings are replicated

So you can never expect a perfect reproduction, not even for the drawings alone. I must therefore turn down this part of your expectation.

Nevertheless, there is a bug in reproducing characters correctly, like
image
instead of
image
I have located and corrected this one, so your first example should look OK with the fix.
For the second PDF, there simply is no hope of ever getting this right: it is full of exploitations of PDF features that PyMuPDF will never support.

@JorjMcKie
Collaborator

There is a new pre-release wheel here if you would like to test the changes.

@ayusonkj
Copy link
Author

ayusonkj commented Jan 7, 2022

@JorjMcKie Thanks for the prompt response. Is there any way we could ignore these exploitations, like the shadings, so that the result would be a bit more presentable than the sample screenshot above from my second example?

@ayusonkj
Author

ayusonkj commented Jan 7, 2022

Btw, I just finished testing the pre-release version 1.19.5 of PyMuPDF. It is working as expected. Thanks a lot :)

@JorjMcKie
Collaborator

is there anyway we could ignore these exploitations like the shadings?

Not really, because they are interwoven with the rest.
It also depends what you ultimately need. If it is just making a page without the images and text, things are a lot easier: simply remove them by using redaction annotations - this will leave just the drawings intact, and you could use that stripped-down page somewhere else.
You could also make an SVG image from it that can be used in browsers, etc.

@JorjMcKie
Collaborator

The other example you sent me (A500...) is misleading: if you remove text and images, the page looks exactly as produced by redrawing the paths.
So no problem.

@ayusonkj
Author

ayusonkj commented Jan 7, 2022

@JorjMcKie the end goal is actually to create an svg image(s) (based on the targeted Rect) without the raster images.

I will try to use redaction to remove these images. Is there any way we can identify whether there are any "unsupported" exploitations? That way, if unsupported exploitations are detected, I can opt to do the redactions instead of re-drawing the vector drawings.

@ayusonkj
Author

ayusonkj commented Jan 7, 2022

The other example you sent me (A500...) is misleading: if you remove text and images, the page looks exactly as produced by redrawing the paths. So no problem.

Yes, I actually realized this when I checked for images and text. What I thought was a background fill was actually a raster image. Apologies for that.

@JorjMcKie
Collaborator

apologies for that.

Bah, no problem at all.

I will try to use redaction to remove these images. Is there any way we can identify whether there are any "unsupported" exploitations? That way, if unsupported exploitations are detected, I can opt to do the redactions instead of re-drawing the vector drawings.

Ok, then step 1 of your approach would always be removing text and raster images:

>>> page.add_redact_annot(page.rect)
'Redact' annotation on page 0 of A500.pdf
>>> page.apply_redactions(images=fitz.PDF_REDACT_IMAGE_REMOVE)
True
>>> page.clean_contents()

Then identify the rectangle you need and set the page's CropBox to it.
Then make an SVG image from that curated page.
Here is a snippet doing all this.

  • Page after previous step - without raster images and text:

image

  • Then:
# identify rectangle of interest
>>> r=page.cropbox
>>> r += (300, 300, -300, -300)  # in our case just go away from borders somewhat
>>> out=open("reduced.svg", "w")  # save the SVG file here
>>> page.set_cropbox(r)  # reduce page to rectangle of interest
>>> page.cropbox  # looks like this
Rect(300.0, 300.0, 1218.0, 714.0)
>>> out.write(page.get_svg_image())  # save SVG image
1018254
>>> out.close()
>>> 
  • This is how the SVG looks:

image

@JorjMcKie
Collaborator

In contrast to pixmaps, SVG images do not support a clip parameter.
But you can use the above trick with modifying the CropBox (= visible part of the page).
As long as you do not accidentally save the PDF, all this is just temporary. And you can of course revert the change, or set it to yet another rectangle ...
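
The temporary CropBox change described above can be wrapped so that the original box is always restored. This is a minimal sketch, assuming `page` is a PyMuPDF page; it only relies on `cropbox`, `set_cropbox` and `get_svg_image`, which are all used earlier in this thread. The helper name `svg_of_region` is hypothetical.

```python
def svg_of_region(page, rect):
    """Return the SVG for `rect` only, restoring the page's CropBox afterwards."""
    original = page.cropbox          # remember the currently visible area
    page.set_cropbox(rect)           # temporarily reduce the page to the rectangle of interest
    try:
        # SVG output has no clip parameter, so the CropBox does the clipping
        return page.get_svg_image()
    finally:
        page.set_cropbox(original)   # revert the change, even if SVG creation fails
```

Nothing is written to the PDF here; as noted above, the change only becomes permanent if the document is saved while the reduced CropBox is set.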

@ayusonkj
Author

Thanks for the update, I tried redacting the images and it looks a lot better than the previous approach.

Is there any way PyMuPDF can detect whether a page uses features that it ignores, like the clipping and shading you mentioned?
I am thinking along the lines of:

if feature_not_supported:
   <redact images>
else:
   <re_draw drawings>

@JorjMcKie
Collaborator

This is difficult to answer in a definite way. There are various methods that extract related information:

  • page.get_image_info() lists all images actually displayed. If empty, no images exist.
  • page.get_text() is the well-known text extraction. If empty, no text is there.
  • page.get_bboxlog() lists the rectangles covered by the various object types on the page.

The last one may be interesting for you. For the example tc_main_2_page_3.pdf you would get

>>> from pprint import pprint
>>> bboxlog = page.get_bboxlog()
>>> pprint(bboxlog[:10])
[('fill-path',
  (759.0924072265625, 621.6221313476562, 1421.7763671875, 936.5750732421875)),
 ('fill-path',
  (893.4117431640625,
   541.5783081054688,
   1436.5706787109375,
   936.5743408203125)),
 ('fill-path',
  (759.0921020507812,
   77.68048095703125,
   1436.571044921875,
   461.53448486328125)),
... 
>>> box_types = set([b[0] for b in bboxlog])
>>> box_types
{'fill-image', 'stroke-path', 'fill-path', 'fill-shade'}
>>>

Each item is a tuple (box_type, rect-like).
If text is present, you would have a bbox type of "fill-text" in addition.
So you can determine whether images, shades or text are present.
"stroke-path" and "fill-path" represent drawings.
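
The checks described above can be collected in one small function. This is a minimal sketch that works on the list returned by `page.get_bboxlog()`; the function name `summarize_bboxlog` and the keys of the returned dictionary are hypothetical, and only the box types named in this thread are considered (other types may exist).

```python
def summarize_bboxlog(bboxlog):
    """Report which kinds of objects appear in a bboxlog list."""
    # each entry starts with its box type, e.g. ('fill-path', (x0, y0, x1, y1))
    box_types = {entry[0] for entry in bboxlog}
    return {
        "images": "fill-image" in box_types,
        "shades": "fill-shade" in box_types,
        "text": "fill-text" in box_types,
        "drawings": bool(box_types & {"stroke-path", "fill-path"}),
    }
```

For the page discussed here, one would expect `images`, `shades` and `drawings` to be true and `text` to be false, matching the `box_types` set shown above.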

@ayusonkj
Author

@JorjMcKie thanks for the help.

In the end, I went with something like this:

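    import fitz

    doc = fitz.open("20201209_TCAMAIN_2_page_3.pdf")
    page = doc[0]
    paths = page.get_drawings()
    bbox_log = dict(page.get_bboxlog())  # keys are the box types present on the page
    if 'fill-shade' in bbox_log:
        # shadings are ignored when re-drawing: strip text and raster images instead
        page.add_redact_annot(page.rect)
        page.apply_redactions(images=fitz.PDF_REDACT_IMAGE_REMOVE)
        page.clean_contents()
        doc.save("20201209_TCAMAIN_2_page_3_test_out.pdf")
    else:
        pass  # re-draw the vector drawings from `paths` here, as in the loop from the original report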

After comparing some of the PDF samples I had, I noticed that the only ones not reproducing correctly are the ones with the "fill-shade" box type, so I used that as the criterion for whether to re-draw the vectors or to redact the images.
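
This rule can be isolated into a small function so that it can be checked without a PDF at hand. This is a minimal sketch, assuming each bboxlog entry starts with its box type as shown earlier in the thread. The name `choose_strategy` and its return values are hypothetical, and using "fill-shade" as the only trigger is the empirical heuristic from the samples mentioned here, not a guarantee.

```python
def choose_strategy(bboxlog):
    """Return 'redact' if the page contains shadings, else 'redraw'."""
    box_types = {entry[0] for entry in bboxlog}  # e.g. {'fill-path', 'fill-shade'}
    # heuristic from the thread: only pages with shadings failed to reproduce
    return "redact" if "fill-shade" in box_types else "redraw"
```

A caller would pass `page.get_bboxlog()` and then either apply the redactions or run the re-drawing loop, depending on the returned string.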

@JorjMcKie
Collaborator

Fixed in v1.19.5 currently being uploaded.
