Limited Support for ClipPaths #2011

Closed
griai opened this issue Nov 1, 2022 · 4 comments
Comments

@griai

griai commented Nov 1, 2022

Is your feature request related to a problem? Please describe.
I have to give a bit of context before writing about the actual feature request. In the team that I am part of, we are developing a library that on the one hand performs geometric tasks on graphical (vector) objects from a pdf. Such tasks involve affine transformations, rasterization of Bézier curves, or detection and merging of objects sharing similar properties. On the other hand, we also want to support different extraction tasks concerning text content (at arbitrary positions) in pdf files. The input documents in our case are always pdf documents. In many cases, such pdf files contain clip paths that typically alter the appearance of said objects to quite a large extent. We would, therefore, wish for some kind of limited support for clip paths (I will detail what exactly we envision further down below) in PyMuPDF.

Describe the solution you'd like
PyMuPDF currently provides the extremely useful functions Page.get_drawings(), which returns pdf path objects, and Page.get_textpage(), which returns text objects with their precise positions (and rotation and everything else). These two functions are at the heart of our input stage for reading a pdf file, and they make it completely unnecessary for us to parse the input files ourselves, which makes our lives considerably easier. At a later stage, we then delegate many geometric operations (like intersections, unions, ...) to another library such as Shapely.
It has been said several times already in answers to other issues (e.g. #1518) that support for clipping in Page.get_drawings() is currently not on the roadmap for PyMuPDF. We have full understanding for this because in order to calculate clipping, PyMuPDF would have to do a lot more geometry by itself or would have to rely on other packages like Shapely to do the task, which would introduce dependencies that one would rather try to avoid.
However, we need to do similar geometry tasks anyway and we would greatly benefit from some limited support for clip paths. What do we mean by this? In order to perform the necessary clipping on our side, we would need at least the clip paths defined in a pdf file as well as the mapping of which clip paths are used by which other paths. Having this "raw" information would enable us to actually calculate the clipping ourselves (we believe). Considering the parsing of a pdf file, it seems that clip paths could probably be extracted just like other paths (StrokePaths and FillPaths). The more complicated part could be the mapping from the other paths to the clip paths they use, which would require additional attributes on Path objects or a new data structure of its own.
The question for us is whether such things look reasonable and feasible from a PyMuPDF point of view. We (obviously) have not worked out any details yet, but nevertheless wanted to ask for your opinions already. If such a raw mode were at least possible, we could also think about dedicating a part of our development time to PyMuPDF and trying to help with the implementation. But I have to add that we are currently all developers working in Python and none of us has experience with SWIG. So, in any case, we could benefit from some guidance.
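To make the "raw mode" idea more concrete, here is a minimal sketch of what such a data structure could look like. Everything in it is hypothetical: RawPath, clip_index and visible_bbox are illustrative names and not part of PyMuPDF. It also only intersects bounding boxes, which is a rough stand-in for real clipping; the exact geometry would be delegated to a library such as Shapely.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

Rect = Tuple[float, float, float, float]  # (x0, y0, x1, y1)


@dataclass
class RawPath:
    """Hypothetical record for one path; NOT an existing PyMuPDF type."""
    kind: str                          # "fill", "stroke" or "clip"
    bbox: Rect                         # bounding box of the path
    clip_index: Optional[int] = None   # index of the governing clip path, if any


def intersect(a: Rect, b: Rect) -> Optional[Rect]:
    """Return the overlap of two rectangles, or None if they do not overlap."""
    x0, y0 = max(a[0], b[0]), max(a[1], b[1])
    x1, y1 = min(a[2], b[2]), min(a[3], b[3])
    if x0 >= x1 or y0 >= y1:
        return None
    return (x0, y0, x1, y1)


def visible_bbox(paths: List[RawPath], index: int) -> Optional[Rect]:
    """Follow the chain of clip references and intersect all bounding boxes."""
    box: Optional[Rect] = paths[index].bbox
    clip = paths[index].clip_index
    while clip is not None and box is not None:
        box = intersect(box, paths[clip].bbox)
        clip = paths[clip].clip_index  # clips may themselves be nested
    return box


# Made-up example data: two nested clips and two ordinary paths.
paths = [
    RawPath("clip", (0, 0, 100, 100)),
    RawPath("clip", (50, 50, 200, 200), clip_index=0),
    RawPath("fill", (40, 60, 120, 90), clip_index=1),
    RawPath("stroke", (300, 300, 400, 400), clip_index=0),
]

print(visible_bbox(paths, 2))  # fill path, limited by both clips
print(visible_bbox(paths, 3))  # stroke path lying entirely outside its clip
```

The point of the sketch is only that a per-path reference to its clip path (plus the clip paths themselves) would be enough for a downstream library to do the actual clipping.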

Describe alternatives you've considered
Unfortunately, there are not many alternatives. We are bound by the requirement to read pdf files (with vector data), and PyMuPDF in our experience provides by far the best Python bindings to work with pdf files. We also thought about converting to other formats, such as svg. In fact, we used an svg workflow with svgelements before, but PyMuPDF is way easier to work with (and proved to have incredibly helpful and very fast maintainers), and we always had some issues with the different converters. For example, some do not keep text as text, others form complicated nested structures of objects (with different transformations), and others produce a completely flat structure and only support graphical primitives. Another huge disadvantage was that the svg files are typically much larger than (compressed) pdf files and the processing times were way slower compared to PyMuPDF leveraging the mupdf binaries. (Other libraries we looked at include svgpathtools and svglib, but none of them met our requirements when we tested them.)

We are grateful for any input on the topic and would be happy to see (or help with) such a "raw mode" for clip paths in PyMuPDF.

Greetings from Dresden, Germany ;-)

@JorjMcKie
Collaborator

Greetings from Dresden, Germany ;-)

Thank you very much indeed for this very kind and rewarding feedback! We currently are very busy with the new version 1.21.0 and its exciting new features.
However, the maintenance team has taken note and will discuss your request in due time. We will keep you posted.

I was the one who created this part some time ago. You may be aware of a similar MuPDF output in XML format of the very same information (actually even a superset, because text is also reported). This function can be invoked by mutool trace file.pdf <pages> (where pages is the usual MuPDF page down-selector: comma-separated 1-based page numbers and ranges, e.g. 1,3-6,N = pages 1, 3-6 and the last page).
The trace tool reports the clip paths, and the nesting of paths in general, in more detail. If you - as a first step - want to consider using an XML tool like lxml, it may give you a start.
Remembering my design considerations, a major driver was the reproducibility of get_drawings() by the Shape.draw_*() methods. And there you will not find support for clippings nor nesting of paths.
Just passing through what is being found in the appearance source of the page may be easier if not impeded by reproduction concerns ... We'll see.
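As a rough illustration of the XML route, the sketch below walks a trace-like document with the standard library's xml.etree.ElementTree (lxml offers a compatible API) and records which clips are active for each path. The sample XML is hand-written, and the element names (clip_path, fill_path, stroke_path, pop_clip) are assumptions about what mutool trace emits; please check them against the real output of your MuPDF version before relying on this.

```python
import xml.etree.ElementTree as ET

# Hand-written sample, NOT real mutool output; element names are assumed.
SAMPLE = """<page>
  <clip_path winding="nonzero">
    <moveto x="0" y="0"/><lineto x="100" y="0"/>
    <lineto x="100" y="100"/><closepath/>
  </clip_path>
  <fill_path winding="nonzero">
    <moveto x="10" y="10"/><lineto x="50" y="10"/><closepath/>
  </fill_path>
  <clip_path winding="evenodd">
    <moveto x="20" y="20"/><lineto x="80" y="20"/><closepath/>
  </clip_path>
  <stroke_path>
    <moveto x="30" y="30"/><lineto x="60" y="60"/>
  </stroke_path>
  <pop_clip/>
  <pop_clip/>
  <fill_path winding="nonzero">
    <moveto x="0" y="0"/><lineto x="5" y="5"/><closepath/>
  </fill_path>
</page>"""


def map_paths_to_clips(xml_text):
    """Return (path_tag, active_clip_ids) pairs in document order.

    Clip ids are simply running numbers assigned in order of appearance.
    """
    root = ET.fromstring(xml_text)
    clip_stack = []   # ids of the clips currently in effect
    next_clip_id = 0
    result = []
    for elem in root.iter():  # depth-first, i.e. document order
        if elem.tag.startswith("clip_"):
            # Any clip element pushes onto the stack (assumed convention).
            clip_stack.append(next_clip_id)
            next_clip_id += 1
        elif elem.tag == "pop_clip":
            if clip_stack:
                clip_stack.pop()
        elif elem.tag in ("fill_path", "stroke_path"):
            result.append((elem.tag, tuple(clip_stack)))
    return result


for tag, clips in map_paths_to_clips(SAMPLE):
    print(tag, "clipped by", clips)
```

The stack mirrors the nesting: each path is associated with every clip that was opened and not yet popped when the path was encountered.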

Best regards back to the former "Valley of the Clueless" (Tal der Ahnungslosen)! I come from Saxony too, though rather from the area near Delitzsch / Bitterfeld. I have been to Dresden only once, after the Wende (German reunification); a beautiful city!
With such great feedback, I am really interested in knowing who I am having this pleasant conversation with. Would you mind? My e-mail: jorj.x.mckie@outlook.de.

@JorjMcKie
Collaborator

Were you able to use the mutool trace output? Do you need help with it, or would you like to discuss other alternatives?
Just wanted to make sure you do not hesitate asking.
I am also more than willing to have a discussion via my personal e-mail, jorj.x.mckie@outlook.de. There also is a dedicated channel on Discord where you can chat with the PyMuPDF developer team.

@griai
Author

griai commented Nov 4, 2022

I was just about to write you. Sorry for not having answered earlier. The past days were very busy, but today is my free day, so I have more time for e-mails. ;-)
Yes, we know about mutool trace and, in fact, consider it as a kind of fallback. However, that would take away from us all the niceties that get_drawings() offers.
I will also send you a personal e-mail in the next minutes ...

@JorjMcKie JorjMcKie mentioned this issue Jan 22, 2023
JorjMcKie added a commit that referenced this issue Mar 6, 2023
Implements #2011.

Adds support for Pixmap JPEG output.

Fixes #2248, #2210, which are duplicates of #2108 - fixed here.

Adds support for drawing rectangles with rounded corners.
julian-smith-artifex-com pushed a commit to ArtifexSoftware/PyMuPDF-julian that referenced this issue Mar 7, 2023
Implements pymupdf#2011.

Adds support for Pixmap JPEG output.

Fixes pymupdf#2248, pymupdf#2210, which are duplicates of pymupdf#2108 - fixed here.

Adds support for drawing rectangles with rounded corners.
julian-smith-artifex-com pushed a commit to ArtifexSoftware/PyMuPDF-julian that referenced this issue Mar 13, 2023
Implements pymupdf#2011.

Adds support for Pixmap JPEG output.

Fixes pymupdf#2248, pymupdf#2210, which are duplicates of pymupdf#2108 - fixed here.

Adds support for drawing rectangles with rounded corners.
julian-smith-artifex-com pushed a commit that referenced this issue Mar 14, 2023
Implements #2011.

Adds support for Pixmap JPEG output.

Fixes #2248, #2210, which are duplicates of #2108 - fixed here.

Adds support for drawing rectangles with rounded corners.
@JorjMcKie
Collaborator

Fixed in v1.22.3.
