Question / Comment: Support for layers/optional content #709

Closed
Evidlo opened this issue Oct 29, 2020 · 60 comments
Labels
question resolved fixed / implemented / answered

Comments

@Evidlo

Evidlo commented Oct 29, 2020

I've browsed through the examples and the source, but I haven't seen any mention of optional content groups. mupdf has this functionality in pdf-layer.c

Is there some lower level way to access these functions or will support need to be added explicitly?

@JorjMcKie
Collaborator

Is there some lower level way to access these functions or will support need to be added explicitly?

There is "free" access to low level information via manipulation of object definitions and of stream content. Both in the end mean basically dealing with strings and bytes objects.
I cannot tell at the moment though, whether this may achieve your intentions.

@JorjMcKie
Collaborator

I am willing to deal with this topic, although I cannot promise a timeframe for implementation.
For speeding up and simplifying the process:
Can you help and spare me some research by providing the following input:

  • what is your goal of getting access to OCGs?
  • what API features would you need to achieve your goals?
    • information on available and/or currently active OCGs?
    • switching to a certain OC?
    • ...?
      Ideally provide pseudo-code descriptions / design of the methods you would like to see ...

@Evidlo
Author

Evidlo commented Oct 29, 2020

What is your goal of getting access to OCGs?

I'm working on an application which allows marking up PDFs using the Ink annotation. In order to differentiate these annotations from those created by external tools, I want to put my annotations in an OCG (e.g. 'mycustomapp annots').

In order to achieve this, the minimum needed functionality is

  • ability to create OCGs
  • ability to set OCG on objects

Although I guess the average user would also want to be able to modify and delete existing OCGs and set their visibility properties.

Examples of other OCG APIs:

@JorjMcKie
Collaborator

I have developed a first cut for this functionality. Don't want to publish it yet, but I am certainly interested in you testing it.
Please tell me your configuration (Py version, bitness, OS), so I can provide a pre-release wheel.

Here is a rough synopsis of the features and implemented API

  • create OCG: doc.addOCG(name, config=-1, on=1, intent=None). Creates an optional content group - if required also makes the minimum PDF catalog changes for this kind of support.

    • name: arbitrary name string
    • config: -1 default configuration, else a configuration must have been previously set up.
    • on: 1/True = visible, 0/False = hidden
    • intent: string or list of strings, default "View"
  • list OCGs: doc.getOCGs(), a tuple of dictionaries of OCGs

>>> doc.addOCG("hide", on=False)  # default status hidden
>>> for item in doc.getOCGs(): print(item)

{'xref': 132, 'name': 'hide', 'intent': ['View']}
>>> # the on-status is not part of the OCG, so it doesn't show here
  • list the modifiable ON / OFF configurations: doc.layerUIConfigs().
  • set / unset visibility: doc.setLayerConfigUI(number, action=0).
    • number: number shown in previous list
    • action: 0=set visible, 1=toggle visibility, 2=set invisible
>>> for item in doc.layerUIConfigs(): print(item)

{'number': 0, 'text': 'hide', 'depth': 0, 'type': 'checkbox', 'selected': False, 'locked': False}
>>> # refers to OCG with name "hide", which has default status invisible ==> selected=False.
>>> # let's make it visible:
>>> doc.setLayerConfigUI(0, action=0)
>>> for item in doc.layerUIConfigs(): print(item)

{'number': 0, 'text': 'hide', 'depth': 0, 'type': 'checkbox', 'selected': True, 'locked': False}
>>> # selected=True now!
  • images, form XObjects and annotations now support optional content spec:
    • page.insertImage( ..., oc=xref, ...) and page.showPDFpage(..., oc=xref, ...). xref is the xref of an OCG created before (will be type-checked).
    • annot.setOC(xref): similar to previous, but setOC(0) will remove any optional content reference.

There is more functionality already, but the above should be enough for your immediate needs.
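
The three action values listed for setLayerConfigUI() can be pictured with a small stand-alone model. This is a sketch in plain Python, not PyMuPDF code: the function name `apply_action` and the constants are hypothetical, and only the meaning of 0, 1 and 2 is taken from the synopsis above.

```python
# Hypothetical model of the 'action' argument described above (not PyMuPDF code).
SET_VISIBLE, TOGGLE, SET_INVISIBLE = 0, 1, 2


def apply_action(selected: bool, action: int = SET_VISIBLE) -> bool:
    """Return the new 'selected' flag of a UI entry after applying an action."""
    if action == SET_VISIBLE:
        return True
    if action == TOGGLE:
        return not selected
    if action == SET_INVISIBLE:
        return False
    raise ValueError(f"unknown action: {action}")


# Entry shaped like one item of the layerUIConfigs() listing shown above.
entry = {"number": 0, "text": "hide", "selected": False}
entry["selected"] = apply_action(entry["selected"], SET_VISIBLE)
```

After the last line the entry is selected, which mirrors the transcript where `selected` flips from False to True.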

@JorjMcKie
Collaborator

Please find your wheel either here (Linux) or here (OSX).
Notify me if you have Windows, will need to drop that wheel here.

@Evidlo
Author

Evidlo commented Nov 6, 2020

Thanks for the quick turnaround.

images, form XObjects and annotations now support optional content spec:

Can't any element technically be marked as optional content? I'm not familiar with MuPDF so I don't know the implications of trying to implement this.

I should have mentioned that I need to find which annotations belong in a particular content group, maybe with annot.getOC()/annot.oc.

Also, are you implementing OCG support yourself or mostly wrapping functions in pdf-layer.c? I may be writing a C application which uses MuPDF in the long term and I'm interested in seeing what you're doing.

@JorjMcKie
Collaborator

JorjMcKie commented Nov 6, 2020

Can't any element technically be marked as optional content?

No, only those 3 object types can simply be marked this way. This comes from the PDF specification and is not a MuPDF restriction. The specification also offers a way to make text and drawings OC-dependent, but supporting that is an even greater effort, which I haven't invested yet.

Support of images, form xobjects and annotations was not that big a deal.

Also, are you implementing OCG support yourself or mostly wrapping functions in pdf-layer.c?

In MuPDF, only informational code (not bug-free!) exists, and I am wrapping it.
Creating OC stuff is my doing - and that was not really trivial - mostly though because I first had to understand the whole concept.

Okay, annot.getOC() will be it. It will return the OCG xref. Information on that OCG can then be extracted from doc.getOCGs(), which I will change to be a Python dict with xref as key (not a list, which it is now).
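
To find which annotations belong to a particular content group, the xref returned per annotation can be looked up in the xref-keyed dictionary. Below is a minimal sketch with made-up sample data in plain Python. The helper `group_by_ocg`, the annotation labels, and the use of 0 for "no optional content" are assumptions for illustration, not confirmed library behaviour.

```python
from collections import defaultdict

# Made-up data shaped like the xref-keyed dictionary described above.
ocgs = {
    13: {"name": "Circle", "intent": ["View", "Design"]},
    15: {"name": "Square", "intent": ["View"]},
}

# (annotation label, OCG xref) pairs; 0 is assumed to mean "no OCG assigned".
annot_oc = [("ink-1", 15), ("ink-2", 15), ("note-1", 13), ("stamp-1", 0)]


def group_by_ocg(pairs, ocgs):
    """Group annotation labels by the name of the OCG they depend on."""
    groups = defaultdict(list)
    for label, xref in pairs:
        name = ocgs[xref]["name"] if xref in ocgs else None
        groups[name].append(label)
    return dict(groups)


by_layer = group_by_ocg(annot_oc, ocgs)
```

Annotations without an OCG end up under the key `None`, so an application could treat everything outside its own layer as "created by external tools".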

@JorjMcKie
Collaborator

There are now new available wheels reflecting my previous post.

@JorjMcKie
Collaborator

The whole thing works like this:

>>> import fitz
>>> from pprint import pprint
>>> doc=fitz.open("ocg-test.pdf")
>>> page=doc[0]
>>> annot = page.addFreetextAnnot((100,300,300,400), "rect now visible")
>>> pprint(doc.getOCGs())
{13: {'hidden': False,
      'intent': ['View', 'Design'],
      'name': 'Circle',
      'usage': 'Artwork'},
 14: {'hidden': False,
      'intent': ['View', 'Design'],
      'name': 'Square',
      'usage': 'Artwork'},
 15: {'hidden': True, 'intent': ['View'], 'name': 'Square', 'usage': 'Artwork'}}
>>> annot.setOC(15)  # show annot if "Square" is set to on
>>> doc.save(...)
>>> # inquiry of an annot's OC:
>>> xref = annot.getOC()
>>> doc.getOCGs()[xref]
{'name': 'Square', 'intent': ['View'], 'hidden': True, 'usage': 'Artwork'}
>>> 

Next, I will streamline all those status terms "selected", "hidden", "on" into just one: "on".

@canedha

canedha commented Nov 7, 2020

Dear @JorjMcKie ,

with great interest I have read and followed this topic as I am searching for the possibility of working with layered (OCG) pdfs as well.
I'm working on Windows 64bit with Python 3.8 and would be happy to help with testing and improving the functionalities. If you could also release the wheel for windows, that would be awesome!

my usecase or needed functionality would simply be:

  • create OCG layers and fill them

you mentioned that only "images, form XObjects and annotations" can be marked as OCG. If I e.g. want to crop a page of an existing PDF and insert it into an OCG layer of a new PDF, would that be possible? I have no clue what type of object this would be. Right now I'm using the ".showPDFpage" command from PyMuPDF and simply define

(rect, docsrc, pno = 0, keep_proportion = True, overlay = True, reuse_xref = 0, clip = None)

rect = area in new pdf, where cropped input is placed
docsrc, pno = define source pdf & page
clip = define area in source pdf to be cropped into rect area

If I read your text and instructions correctly I could simply use your wheel with the new page.showPDFpage(..., oc=xref, ...) command, where oc= xxx defines in which (previously created) OCG layer the content is placed?

Hope to hear back from you,
Toby

@JorjMcKie
Collaborator

Hi Toby,

overall: yes, that should work like this. Just a few comments to make sure we are in sync:

  • OCG must be an optional content group, which you either must create yourself via xref = doc.addOCG("name", on=True/False, ...) or reuse an existing one. In the latter case do pprint(doc.getOCGs()) and pick its xref from that dictionary.
  • Argument reuse_xref in showPDFpage() is deprecated and can / should be omitted.
  • It sounds like you want to show the copied-over source page under some condition only in the new file. Fine. If you turn the respective OCG to OFF, then the rectangle on the page on which the shown source page lives will become empty. The page does not disappear as a page from the new PDF.

@JorjMcKie
Collaborator

One more thing:
You are aware, that multiple items in a PDF (annots, images, form xobjects) can be dependent on the same OCG?
So you can switch ON / OFF multiple objects by changing the state of just one OCG.

What I haven't looked at is another potentially useful feature, so-called radio button OCG groups. Each such group (array) switches off all other of its OCGs , if one OCG is switched on - a behaviour similar to grouped radio buttons.

Might be interesting in your case: instead of simply showing an empty page if the OCG is switched to OFF, some alternative content might be switched on automatically ...
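
The radio-button behaviour described above can be modelled in a few lines. This is a plain-Python sketch under the assumption that each group is simply a list of OCG xrefs. The function name `switch_on` and the sample numbers are hypothetical.

```python
def switch_on(states, rb_groups, xref):
    """Return new ON/OFF states after turning `xref` ON, honoring radio-button groups."""
    new_states = dict(states)
    new_states[xref] = True
    for group in rb_groups:
        if xref in group:
            # Every other member of the same group is switched OFF.
            for other in group:
                if other != xref:
                    new_states[other] = False
    return new_states


states = {3: True, 4: False, 5: False, 6: True}
rb_groups = [[3, 4, 5]]  # OCG 6 belongs to no group and is left untouched
after = switch_on(states, rb_groups, 4)
```

Here turning on OCG 4 switches off OCG 3, while OCG 6 keeps its state.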

@canedha

canedha commented Nov 7, 2020

Hey @JorjMcKie,

thx for your clarification! yes, we are absolutely in sync.
In my case I have a pdf page with content that is always shown and have additional content that I want to be optional (which would be put in the ocg layers).
I understand those layers like overhead projectors we had back in the day. you have a source paper and some transparent foils with additional content. even if you remove all transparent foils, the core page still is present :)
the option to have something switched on by switching off sth else is not sth that I would need, but for sure I'm happy to test it for you!

can you create the 64bit Windows version for me? How long would this take? Don't get me wrong on this, I don't want to hurry/pressure you as I'm really thankful that you are doing it in the first place.
I simply ask as I have small children (therefore rarely free time) and tonight I would have a couple of minutes 😅

Thanks in advance,
Toby

@JorjMcKie
Collaborator

What is your exact config: Py version, bitness
Can respond within 5 minutes with a wheel

@canedha

canedha commented Nov 7, 2020

Awesome!

I'm running python 3.8 with Anaconda 64bit on Win 10 (also 64bit).
Do you need any other information? :)

Thanks in advance,
Toby

@JorjMcKie
Collaborator

PyMuPDF-1.18.3-cp38-cp38-win_amd64.zip
Rename the extension back to whl again. This is a restriction of Github.

@JorjMcKie
Collaborator

... as I have small children ...

Thanks to who knows who: that's decades behind me! 😎

@canedha

canedha commented Nov 7, 2020

Hi @JorjMcKie ,

I just tried your updated command and in this first test it worked like a charm!
Thx so much!

  • adding OCG layer (.addOCG)
  • assign object to the OCG's xref (.showPDFpage(..., oc=XXX, ...))

easy as that :)

I will play around more with the function and give feedback. As soon as you have some more functionality implemented and an update of the wheel let me know. I will happily test it for you :)

Have a nice evening,
Toby

@Evidlo
Author

Evidlo commented Nov 8, 2020

In MuPDF, only informational code (not bug-free!) exists, and I am wrapping it.
Creating OC stuff is my doing - and that was not really trivial - mostly though because I first had to understand the whole concept.

Would you ever consider merging some of these features into MuPDF itself? I'm sure there are others who would like to take advantage of layers (like me!)

@JorjMcKie
Collaborator

JorjMcKie commented Nov 8, 2020

@Evidlo - most if not all of that mentioned code indeed is written in C - maybe about one hundred lines or so.
But I am using CPython functions there, so a port of these to native C would have to happen in any case. Certainly doable.

That Artifex / MuPDF would ever consider doing or re-integrating this, can be safely forgotten. They understandably have their own plans - technically and release planning-wise.

If you want to write something in C, you best directly adopt my C-code, replacing the CPython pieces.

Concerning the MuPDF bugs - these are around inconsistent handling of OC configuration layers:

  • OC is defined in the PDF catalog underneath the key /OCProperties. The default configuration is defined inside the sub dictionary keyed as /D. Additional optional layers, if any, are defined in array /Configs. Each item of this array is a dictionary with a structure like /D.
  • The error is around counting / accessing these config layers:
    • One of the MuPDF functions returns the number of layers. It returns 1 if /Configs does not exist. If it exists, it returns the length of that array - which is wrong: it should be 1 more because of /D.
    • The code does not cover the case where /Configs exists, but is an empty array - a legal, although rare situation.
    • After PDF open, the first configuration of /Configs is activated. This is wrong - it should be /D. Again, an empty /Configs is not properly handled here.
    • When programmatically selecting a config layer, the above issues make it impossible to ever access the /D default layer. Very unpleasant. The only way to resolve this currently is to merge and dissolve the /Configs with a MuPDF function - which removes /Configs and replaces /D with the amalgamated information.
  • So my recommendation today is to not use additional layers and stick with the default layer.
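
The counting rule stated in the list above can be written down directly. This is a plain-Python sketch over a dictionary standing in for /OCProperties. Both function names are hypothetical, and the first function only follows a literal reading of the description; it is not MuPDF's actual code.

```python
def count_configs_as_described(oc_properties):
    """Literal reading of the reported behaviour: /D is counted only if /Configs is absent."""
    configs = oc_properties.get("Configs")
    if configs is None:
        return 1
    return len(configs)


def count_configs_expected(oc_properties):
    """/D always counts as one layer; each /Configs entry adds one more."""
    return 1 + len(oc_properties.get("Configs") or [])


no_configs = {"D": {}}
empty_configs = {"D": {}, "Configs": []}
one_config = {"D": {}, "Configs": [{"Name": "layer1"}]}
```

The two functions agree only when /Configs is missing. For an empty or populated array the expected count is one higher, which is the off-by-one described above.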

I have submitted a bug report to MuPDF - just yesterday. I will post any news in this channel.

@canedha

canedha commented Nov 8, 2020

Good morning @JorjMcKie,
do I understand you that these bugs only occur when you try to programmatically access the layers?
that means, if I create a multi-ocg-layered pdf using your function and then only use it as read-only in viewers everything should be fine?
or would you expect any bugs there?

cheers,
Toby

@JorjMcKie
Collaborator

@canedha - not quite so. If you do this:

>>> doc=fitz.open()
>>> doc.addOCG("ocg1")
3
>>> doc.addOCG("ocg2")
4
>>> doc.addOCG("ocg3")
5
>>> doc.PDFCatalog()
1
>>> print(doc.xrefObject(1))
<<
  /Type /Catalog
  /Pages 2 0 R
  /OCProperties <<
    /Configs [ ]
    /OCGs [ 3 0 R 4 0 R 5 0 R ]
    /D <<
      /AS [ ]
      /ON [ 3 0 R 4 0 R 5 0 R ]
      /OFF [ ]
      /Order [ 3 0 R 4 0 R 5 0 R ]
      /RBGroups [ ]
    >>
  >>
>>
>>> 

There is no problem at all. Add as many new OCGs as you like this way.
In this case I have allowed all of the OCGs to default to ON. For now, this state can only be changed temporarily, in a PDF viewer.

But what if you made an error and actually want to set OCG 3 permanently OFF? There is - currently - no way to directly do this. However, it can be done: You have to use layers (in two steps) to achieve it:

>>> doc.addLayerConfig("layer1", on=[4, 5])
>>> # this new layer, when activated, has all OCG set to OFF except the ones in the 'on' list.
>>> print(doc.xrefObject(1))  # see what this looks like
<<
  /Type /Catalog
  /Pages 2 0 R
  /OCProperties <<
    /Configs [ <<
          /Name (layer1)
          /BaseState /OFF
          /ON [ 4 0 R 5 0 R ]
        >> ]
    /OCGs [ 3 0 R 4 0 R 5 0 R ]
    /D <<
      /AS [ ]
      /ON [ 3 0 R 4 0 R 5 0 R ]
      /OFF [ ]
      /Order [ 3 0 R 4 0 R 5 0 R ]
      /RBGroups [ ]
    >>
  >>
>>
>>> 

Because of the bugs, this situation should best not be saved to disk, but instead do this:

>>> doc.setLayerConfig(0, as_default=True)  # make the layer the default, i.e. ==> /D
>>> print(doc.xrefObject(1))
<<
  /Type /Catalog
  /Pages 2 0 R
  /OCProperties <<
    /D <<
      /Intent /View
      /ON [ 4 0 R 5 0 R ]
      /BaseState /OFF
      /Order [ 3 0 R 4 0 R 5 0 R ]
    >>
    /OCGs [ 3 0 R 4 0 R 5 0 R ]
  >>
>>
>>> 

Now you are safe again: 4 and 5 are ON, the rest (only 3 in this case) are OFF.
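
How the /BaseState, /ON and /OFF entries combine into per-OCG states can be sketched as follows. This is a plain-Python model with dictionaries standing in for the PDF objects. The function name `effective_states` is hypothetical, and the /BaseState value /Unchanged is deliberately not modelled.

```python
def effective_states(ocgs, config):
    """Map each OCG xref to True (ON) or False (OFF) for one configuration."""
    # A missing /BaseState is treated as ON, the default.
    base_on = config.get("BaseState", "ON") == "ON"
    states = {xref: base_on for xref in ocgs}
    for xref in config.get("ON", []):
        states[xref] = True
    for xref in config.get("OFF", []):
        states[xref] = False
    return states


ocgs = [3, 4, 5]
original_d = {"ON": [3, 4, 5], "OFF": []}      # first catalog dump above
merged_d = {"BaseState": "OFF", "ON": [4, 5]}  # /D after setLayerConfig(0, as_default=True)

before = effective_states(ocgs, original_d)
after = effective_states(ocgs, merged_d)
```

With the merged /D, OCG 3 is not listed in /ON and therefore falls back to the base state OFF.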

@JorjMcKie
Collaborator

The setLayerConfig() method (wrapping a MuPDF function) deletes all members of /Configs and makes the selected one the default /D if you specify True in the argument.
Its peculiarity is to always declare the base (= default) state to be OFF: /BaseState /OFF. That's why we have to go this way.

Another option might of course be to directly toggle an OCG between the ON and OFF arrays. Maybe this is a next step.
It is a bit more complicated than it seems though, because not only do both arrays have to be inspected, but any existing /BaseState entry as well ... but let's see.
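
One way to sidestep the /BaseState question when moving an OCG between the two arrays is to always list it explicitly in exactly one of them. The sketch below is plain Python on a dictionary standing in for /D. `set_ocg_state` is a hypothetical helper and not how PyMuPDF implements it.

```python
def set_ocg_state(config, xref, on):
    """Return a copy of `config` in which `xref` is listed explicitly as ON or OFF."""
    # Remove the xref from both arrays first, so it never appears twice.
    on_list = [x for x in config.get("ON", []) if x != xref]
    off_list = [x for x in config.get("OFF", []) if x != xref]
    (on_list if on else off_list).append(xref)
    new_config = dict(config)
    new_config["ON"] = on_list
    new_config["OFF"] = off_list
    return new_config


default = {"ON": [3, 4, 5], "OFF": []}
switched_off = set_ocg_state(default, 3, on=False)

base_off = {"BaseState": "OFF", "ON": [4, 5]}
switched_on = set_ocg_state(base_off, 3, on=True)
```

Because the OCG is named explicitly in one array, the result does not depend on whether /BaseState is ON, OFF or absent.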

@JorjMcKie
Collaborator

If you use a PDF viewer which can make permanent changes, then the only precaution is to not create new configuration layers - or clean the situation up as mentioned before.

@canedha

canedha commented Nov 8, 2020

@JorjMcKie
Ah, got it. Thx for clarification! In my case I have a read-only pdf at the end, therefore this is not an issue for me :)

even at the risk of going a bit off-topic (still OCG related though)
Is there a way you know with PyMuPDF (or somehow else) to export layers of an svg to an ocg-layered pdf?
or would I have to use a workaround of creating a svg for each layer, export to pdf and then put the single-layered pdf into a multi-layered one?

@JorjMcKie
Collaborator

Is there a way you know with PyMuPDF (or somehow else) to export layers of an svg to an ocg-layered pdf?

Interesting question. You can convert SVG to PDF using (Py-) MuPDF. Haven't seen an example with a layered SVG yet. Chances are that it works!

@JorjMcKie
Collaborator

JorjMcKie commented Nov 8, 2020

Snippet:

>>> import fitz
>>> svg = fitz.open("silicon2018n.svg")
>>> pdfbytes = svg.convertToPDF()
>>> doc=fitz.open("pdf", pdfbytes)
>>> doc.save("silicon2018n.pdf")
>>> 

You could use doc.getOCGs() (instead of the save): if empty then it doesn't work ...

@canedha

canedha commented Nov 8, 2020

ok, will try that later! even if it doesnt work for layers, this already helps me a lot as I didn't know before that PyMuPDF can read svg :)

@JorjMcKie
Collaborator

New version with these changes being uploaded.

@JorjMcKie
Collaborator

The just published version has two new document methods getOCStates()/ setOCStates() which inform about the ON/OFF states of optional content or, respectively, offer permanent mass changes - as opposed to temporary ones via the UI of PDF viewers.
So there is no more need to use additional configuration layers for achieving this.

@canedha

canedha commented Nov 9, 2020

Hi @JorjMcKie,

thanks for all the work and time you put into this issue! it's really appreciated!

one last question: where to find your latest version? could you again make a windows 64bit, python 3.8 wheel and provide the link?

thanks in advance,
Toby

@JorjMcKie
Collaborator

@canedha - the official version is on PyPI as always and can be installed via pip install -U pymupdf for your system.
This is true for Windows, Linux and Mac OSX, Python 3.6 thru 3.9 64bit, for Windows also 32bit.
The compatible documentation is updated as well on Read the Docs as linked on the home page.

@JorjMcKie
Collaborator

There now also is support of radio-button groups - see here for an example script, that shows 4 images on a page.
Viewing one of them will switch off the other three ...

@JorjMcKie JorjMcKie added the resolved fixed / implemented / answered label Nov 11, 2020
@ulfaslak

I learned A LOT from following this issue. Thanks to everyone.

I have a closely related problem and would appreciate some help. It can be solved either by rearranging the objects in OCGs of an existing PDF or by parsing two SVGs to PDFs (like @canedha I also have control over the SVG input) and combining them in separate layers of a new doc. The latter seemed the most straightforward and this is what I'm doing:

import fitz
from svglib.svglib import svg2rlg
from reportlab.graphics.renderPDF import drawToString

def svg_to_doc(path):
    """Using this function rather than `fitz`'s `convertToPDF` because the latter
    fills every shape with black for some reason.
    """
    drawing = svg2rlg(path)
    pdfbytes = drawToString(drawing)
    return fitz.open("pdf", pdfbytes)

# Create a new blank document
doc = fitz.open()
page = doc.new_page()

# Create "Graphics" and "ThroughCut" OCGs and get their `xref`s
xref_gr = doc.add_ocg('Graphics', on=True, intent=['View', 'Design'], usage='Artwork')
xref_tc = doc.add_ocg('ThroughCut', on=True, intent=['View', 'Design'], usage='Artwork')

# Load "graphics" and "cut lines" svgs and convert to pdf `doc`s
doc_gr = svg_to_doc("my_graphics_layer.svg")
doc_tc = svg_to_doc("my_throughcut_layer.svg")

# Set the `doc` dimensions
bb = doc_gr[0].rect
page.setMediaBox(bb)

# Put the docs in their respective OCGs
page.show_pdf_page(bb, doc_tc, 0, oc=xref_tc)
page.show_pdf_page(bb, doc_gr, 0, oc=xref_gr)

# Save
doc.save("output.pdf")

My problem is that these OCGs are not parsed as layers in Illustrator or other software that I have at hand. If I reload output.pdf in fitz, sure enough it has the right OCGs, but in Illustrator all shapes and objects are in one layer.

If I load output.pdf in a text editor I can search for, and find my "ThroughCut" and "Graphics" OCGs. But I don't understand the PDF syntax well enough to figure out why they are not showing in Illustrator.

my_graphics_layer.svg
my_throughcut_layer.svg

HUGE THANKS if you can spot an error I'm making, or point me towards a proper solution.

@canedha

canedha commented Aug 18, 2021 via email

@JorjMcKie
Collaborator

are the layers shown correctly in any other program like Adobe Reader?

Yes, they are.

@canedha

canedha commented Aug 18, 2021 via email

@ulfaslak

I should obviously have tried Acrobat, and you are right, it does work, exactly as expected.

A little painful for me, because that means my problem makes even less sense now 😅.

@canedha

canedha commented Aug 18, 2021 via email

@ulfaslak

Sorry if this gets off topic, but this is a very important problem for me.

Another approach I might try is to rearrange content in OCGs of an existing PDF.

Is there a way to select curves (on some identifier, like stroke-width, color, or other) and move them to a new layer?

@JorjMcKie
Collaborator

Is there a way to select curves (on some identifier, like stroke-width, color, or other) and move them to a new layer?

You can freely rearrange existing OCGs and OCMDs to other or new layers. This then affects everything to which those have been assigned. Specifically for OCMDs, you can change their behaviour in a plethora of ways, because they represent logical expressions about the ON/OFF state of OCGs.
And you can assign new or existing OCGs/OCMDs to existing images and xobjects.

You can assign OCGs/OCMDs to images, xobjects, text or drawings as part of their creation process.

But for text and drawings you can not assign OCGs once they have been created without OC relevance.
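
To make "logical expressions about the ON/OFF state of OCGs" concrete, here is a plain-Python sketch of the four visibility policies (AllOn, AnyOn, AnyOff, AllOff) that the PDF specification defines for an OCMD's /P entry. The function name `ocmd_visible` is hypothetical. The member list is assumed to be non-empty, and the more general /VE visibility expressions are not modelled.

```python
def ocmd_visible(ocg_states, members, policy="AnyOn"):
    """Evaluate an OCMD visibility policy over the ON/OFF states of its member OCGs."""
    flags = [ocg_states[xref] for xref in members]  # assumed non-empty
    if policy == "AllOn":
        return all(flags)
    if policy == "AnyOn":
        return any(flags)
    if policy == "AnyOff":
        return not all(flags)
    if policy == "AllOff":
        return not any(flags)
    raise ValueError(f"unknown policy: {policy}")


ocg_states = {3: True, 4: False}
results = {
    policy: ocmd_visible(ocg_states, [3, 4], policy)
    for policy in ("AllOn", "AnyOn", "AnyOff", "AllOff")
}
```

With one member ON and one OFF, only the two "Any" policies evaluate to visible.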

@ulfaslak

Strong hints there. I'll give it a try. Thank you for the swift responses.

@canedha

canedha commented Aug 19, 2021 via email

@ulfaslak

Hi again! Not to beat a dead horse, but this problem is really turning into a hot potato for me.

Specifically, it's a real pain that OCGs are not visible as layers in Illustrator (though they are in Acrobat). Meanwhile I found this. It appears that if somehow the PDF document had "Preserve Illustrator Editing Capabilities" turned on it might work. Would this be an easy thing to implement within the package? I could imagine it being a useful feature for other people than just me.

@JorjMcKie
Collaborator

I have read that thread meanwhile. It still does not give enough technical detail:
In a PDF there is a thing called the catalog, documented here starting on page 71. It consists of several dozen key-value pairs setting overall PDF properties - among them the item relevant for Optional Content ("OCProperties").
The latter is then described in detail starting on page 228 of that document.
Nowhere in that (authoritative) PDF specification is Adobe Illustrator mentioned.
If someone could tell me what exactly has to be specified to "Preserve Illustrator Editing Capabilities" in a PDF document, I am sure there will be ways to achieve this.
Maybe a PDF example where this was switched on can be found ... could be a start.

@ulfaslak

Absolutely. Using this code I generate a PDF file (I have posted this previously, don't mind the silly filenames).

import fitz
from svglib.svglib import svg2rlg
from reportlab.graphics.renderPDF import drawToString

def svg_to_doc(path):
    drawing = svg2rlg(path)
    pdfbytes = drawToString(drawing)
    return fitz.open("pdf", pdfbytes)

# Create a new blank document
doc = fitz.open()
page = doc.new_page()

# Create "Graphics" and "ThroughCut" OCGs and get their `xref`s
xref_gr = doc.add_ocg('Graphics', on=True, intent=['View', 'Design'], usage='Artwork')
xref_tc = doc.add_ocg('ThroughCut', on=True, intent=['View', 'Design'], usage='Artwork')

# Load "graphics" and "cut lines" svgs and convert to pdf `doc`s
doc_gr = svg_to_doc("freebrit4_graphics.svg")
doc_tc = svg_to_doc("freebrit4_throughcut.svg")

# Set the `doc` dimensions
bb = doc_gr[0].rect
page.setMediaBox(bb)

# Put the docs in their respective OCGs
page.show_pdf_page(bb, doc_tc, 0, oc=xref_tc)
page.show_pdf_page(bb, doc_gr, 0, oc=xref_gr)

# Save
doc.save("pdfkit_export.pdf")

Here is that file: pdfkit_export.pdf

If I open it in Illustrator, here is the parsed layer structure:

Screenshot 2021-08-27 at 12 04 33

Now, I can edit this file in Illustrator so it has the desired layers and save it as a new file
illustrator_export.pdf with the layer structure:

Screenshot 2021-08-27 at 11 59 45

So being able to generate something like illustrator_export.pdf directly from PyMuPDF would be fantastic.

PS: Many thanks for taking the time to consider this problem. I maintain projects myself and know how time consuming and ungrateful it can sometimes be. Please do not think I expect ASAP responses or full problem fixes at all.

@JorjMcKie
Collaborator

Nice of you to say that!

But I am confused what I am actually seeing here:

  • Input and output look exactly equal initially when opened in any viewer.
  • The input PDF lets me switch on and off the ThroughCut and Graphics separately.
  • The output does show the same OC structure. Also on the low PDF level. But nothing visible happens when the two layers are switched. Is this the desired / intended illustrator effect? I observed that Illustrator did a lot of things transforming the former XObjects (your ex-SVGs) into something else, but otherwise ...?

@JorjMcKie
Collaborator

But maybe we talk past each other:
Do you really want to import SVG drawings in another way than show_pdf_page()? To not have multiple levels of stuff as shown in the first image?
If that is the case, there are solutions, too, let me know. It basically works by extracting the SVG content (drawings and maybe text or images) directly and re-inserting this stuff on a PDF page (potentially assigning OCGs).

@ulfaslak

Sorry about the confusion. I should say that the reason I need to arrive at this very specific format is that the PDF file must get loaded in a special type of printer that can also cut. The machine knows where to cut by looking inside the "ThroughCut" layer and taking the curves to be the cut lines. The machine manufacturer disclosed that they use the proprietary Adobe PDF Library for parsing PDF. Their (very unsatisfying) recommendation is that I put PDFs together so they get parsed correctly in Illustrator, then it is guaranteed to work in their software.

  • Input and output look exactly equal initially when opened in any viewer.
  • The input PDF lets me switch on and off the ThroughCut and Graphics separately.

Indeed they look the same in Acrobat. Probably also other viewers. But not in editors, like Illustrator.

  • The output does show the same OC structure. Also on the low PDF level. But nothing visible happens when the two layers are switched. Is this the desired / intended illustrator effect? I observed that Illustrator did a lot of things transforming the former XObjects (your ex-SVGs) into something else, but otherwise ...?

There should be nothing visibly different between pdfkit_export.pdf and illustrator_export.pdf. The ThroughCut layer should be invisible and only gets used for parsing cut-lines. So Illustrator is not really a part of the equation other than being a reference for whether the PDF can get parsed correctly by the machine.

But maybe we talk past each other:
Do you really want to import SVG drawings in another way than show_pdf_page()? To not have multiple levels of stuff as shown in the first image?
If that is the case, there are solutions, too, let me know. It basically works by extracting the SVG content (drawings and maybe text or images) directly and re-inserting this stuff on a PDF page (potentially assigning OCGs).

Well, this is probably secondary to the problem described above, but what would be really neat was if I could input just a single SVG that looked something like

<svg>
    <g id='ThroughCut'>
        ...
    </g>
    <g id='Graphics'>
        ...
    </g>
</svg>

And convert that into a PDF with each g as a separate layer.
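
A first step towards that would be finding out which top-level groups an SVG contains, since their ids would become the layer names. The sketch below uses only Python's standard library and an inline sample SVG. `top_level_group_ids` is a hypothetical helper and does not create any PDF layers itself.

```python
import xml.etree.ElementTree as ET

# Inline sample mirroring the structure above; real files would be read from disk.
SVG_TEXT = """<svg xmlns="http://www.w3.org/2000/svg">
    <g id="ThroughCut"><path d="M0 0 L10 10"/></g>
    <g id="Graphics"><rect width="5" height="5"/></g>
</svg>"""


def top_level_group_ids(svg_text):
    """Return the ids of all <g> elements that are direct children of <svg>."""
    root = ET.fromstring(svg_text)
    ids = []
    for child in root:
        tag = child.tag.rsplit("}", 1)[-1]  # drop the XML namespace prefix, if any
        if tag == "g" and child.get("id"):
            ids.append(child.get("id"))
    return ids


layer_names = top_level_group_ids(SVG_TEXT)
```

Each returned id could then be used as the name of one OCG, with the group's content placed under that OCG.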

@JorjMcKie
Collaborator

Ah, ok, I understand better now - I think.
Your import of SVG files into a PDF via show_pdf_page() then was just the only way at hand to do that kind of thing at all.

But there does exist another way: page.get_drawings(). This method is available for all document types - including SVG.
It extracts drawing objects from the page converting each into a Python dict, that is compatible with PyMuPDF's Shape class:
Each such dict - which I call a "path", following PDF terminology - contains a list of elementary draw commands like lines, curves, rectangles ... together with common properties like color, line dashing and what not.

So you could extract the drawings of an SVG page and re-draw them on a PDF page using the Shape methods. Each single path in Shape can also be given an OCG (or OCMD) number. As long as we know that we are processing ThroughCut data, we can easily assign an OCG / OCMD with OFF status ...
This approach would avoid the extra hierarchy level introduced by show_pdf_page().
If you want to consider this, let me have SVG file examples and I will experiment a bit with them and let you have the scripts.
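
Deciding which extracted paths belong to the cut layer could look like the sketch below. It works on hand-written dictionaries that only imitate the "path" dicts described above. The keys "items", "color" and "width", the red stroke color as the marker for cut lines, and the helper `split_paths` are all assumptions for illustration.

```python
def split_paths(paths, is_cut_line):
    """Partition path dicts into (cut_lines, graphics) using a predicate."""
    cut_lines, graphics = [], []
    for path in paths:
        (cut_lines if is_cut_line(path) else graphics).append(path)
    return cut_lines, graphics


RED = (1.0, 0.0, 0.0)

# Stand-ins for extracted paths; plain tuples replace the library's point objects.
sample_paths = [
    {"items": [("l", (0, 0), (10, 0))], "color": RED, "width": 0.25},
    {"items": [("re", (0, 0, 5, 5))], "color": (0.0, 0.0, 0.0), "width": 1.0},
    {"items": [("l", (10, 0), (10, 10))], "color": RED, "width": 0.25},
]

cut_lines, graphics = split_paths(sample_paths, lambda p: p.get("color") == RED)
```

The two resulting lists could then be re-drawn separately, each with its own OCG, and the predicate could just as well test the stroke width.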

@canedha

canedha commented Aug 28, 2021 via email

@JorjMcKie
Collaborator

What would be the shape drawing command?

See the script below, which is also contained in the documentation.
The "some.file" can be PDF, XPS, EPUB, SVG, ... just any file type supported by (Py-) MuPDF. Together with page.get_text(), which also works in all these cases, this is one of MuPDF's differentiators compared to other document processing libraries (which usually only digest PDFs).

So my suggestion from above amounts to:
Instead of converting an SVG to PDF and then importing its (single) page into a new PDF, one could directly copy over the draw commands.

Of course that means copying the draw commands only - not any other content types. However, as mentioned above, re-creating text and images is possible with the output of page.get_text("dict").

    import fitz
    doc = fitz.open("some.file")
    page = doc[0]
    paths = page.get_drawings()  # extract existing drawings
    # this is a list of "paths", which can directly be drawn again using Shape
    # -------------------------------------------------------------------------
    #
    # define some output page with the same dimensions
    outpdf = fitz.open()
    outpage = outpdf.new_page(width=page.rect.width, height=page.rect.height)
    shape = outpage.new_shape()  # make a drawing canvas for the output page
    # --------------------------------------
    # loop through the paths and draw them
    # --------------------------------------
    for path in paths:
        # ------------------------------------
        # draw each entry of the 'items' list
        # ------------------------------------
        for item in path["items"]:  # these are the draw commands
            if item[0] == "l":  # line
                shape.draw_line(item[1], item[2])
            elif item[0] == "re":  # rectangle
                shape.draw_rect(item[1])
            elif item[0] == "qu":  # quad
                shape.draw_quad(item[1])
            elif item[0] == "c":  # curve
                shape.draw_bezier(item[1], item[2], item[3], item[4])
            else:
                raise ValueError("unhandled drawing", item)
        # ------------------------------------------------------
        # all items are drawn, now apply the common properties
        # to finish the path
        # ------------------------------------------------------
        shape.finish(
            fill=path["fill"],  # fill color
            color=path["color"],  # line color
            dashes=path["dashes"],  # line dashing
            even_odd=path.get("even_odd", True),  # control color of overlaps
            closePath=path["closePath"],  # whether to connect last and first point
            lineJoin=path["lineJoin"],  # how line joins should look like
            lineCap=max(path["lineCap"]),  # how line ends should look like
            width=path["width"],  # line width
            stroke_opacity=path.get("stroke_opacity", 1),  # same value for both
            fill_opacity=path.get("fill_opacity", 1),  # opacity parameters
            )
    # all paths processed - commit the shape to its page
    shape.commit()
    outpdf.save("drawings-page-0.pdf")
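
To put the re-drawn paths on layers, as suggested earlier in the thread, the same loop can pass an `oc` value to `finish()`. The following is a minimal sketch under assumptions: `redraw_on_layers` and its `oc_for` callback are hypothetical names, and how a path is recognized as belonging to a layer (for example by its stroke color) is left entirely to the caller.

```python
def redraw_on_layers(paths, shape, oc_for):
    """Replay get_drawings()-style paths on `shape`, one layer per path.

    `oc_for(path)` is a caller-supplied (hypothetical) rule that returns
    the xref of an OCG / OCMD for the path, or 0 for "no layer".
    """
    for path in paths:
        for item in path["items"]:  # these are the draw commands
            kind = item[0]
            if kind == "l":  # line
                shape.draw_line(item[1], item[2])
            elif kind == "re":  # rectangle
                shape.draw_rect(item[1])
            elif kind == "qu":  # quad
                shape.draw_quad(item[1])
            elif kind == "c":  # curve
                shape.draw_bezier(item[1], item[2], item[3], item[4])
            else:
                raise ValueError("unhandled drawing", item)
        width = path.get("width")
        if width is None:  # fill-only paths may carry no line width
            width = 1
        shape.finish(
            fill=path.get("fill"),  # fill color
            color=path.get("color"),  # line color
            width=width,  # line width
            oc=oc_for(path),  # OCG / OCMD xref controlling visibility
        )
```

With PyMuPDF this might be called as `redraw_on_layers(paths, shape, lambda p: cut_xref if p.get("color") == (1, 0, 0) else 0)`, where `cut_xref = outpdf.add_ocg("ThroughCut", on=False)` and the red-stroke rule is only an example. `shape.commit()` is still needed afterwards.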

extract from pdf and insert in svg also work?

Yes, in principle. You are largely on your own when it comes to converting only the drawings of a PDF page, but there are contributions from someone in these discussion threads.
If you want to convert a complete page to an SVG, then this is no problem at all: just use page.get_svg_image(). Again, all document types are supported as input.

@canedha

canedha commented Sep 7, 2021

@JorjMcKie
I would like to understand the structure of PDF better. I googled quite a bit, but couldn't find anything good. Do you have any recommendations for a comprehensive overview or quick intro to the structure? I know there's a PDF spec, but 740 pages is a bit out of proportion for just understanding the basic structure...

@JorjMcKie
Collaborator

@canedha - I have no simple PDF overview description at hand. I always used the full stuff 😉.
I would recommend really starting with the manual's first chapters. I would use the older version https://www.adobe.com/content/dam/acom/en/devnet/acrobat/pdfs/pdf_reference_1-7.pdf.
Chapter 3, Syntax, starts on p. 47. Within it, see the sections on file structure (p. 90) and document structure (p. 137).
That should give you a good start.
Also do study and understand the many example PDFs in that manual. This really does help a lot.

@canedha

canedha commented Sep 7, 2021

Thanks Jorj, I'll give it a look!
Maybe one more question:
Where, and at which level of the PDF, is the OCG information that defines the layers stored? As far as I understood, it's stored within each page?

@JorjMcKie
Collaborator

There is an optional entry in the central PDF catalog (page-independent): OCProperties.
It contains a list of all OCGs of the file. It also always contains one default layer configuration (under key /D), which lists the ON/OFF states of those OCGs.
There may be more layer configurations, for activation in the UI of a supporting PDF viewer, which then temporarily take over the role of /D.
All this is independent of whether, and where, this information is actually used elsewhere in the file.
Each image, form xobject, drawing path and text object can then be associated with an OCG (or, equivalently, an OCMD), and will be displayed or not depending on the state of that OCG / OCMD.
A page as such cannot be associated with an OCG/OCMD.
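
To make the structure concrete, here is a small model of it as plain Python dicts. It is a simplified sketch, not a parser: in a real file the OCGs are indirect objects carrying a /Name, which is reduced here to the bare name, and `ocg_states` as well as the sample names are made up for illustration. The keys `OCGs`, `D`, `Configs`, `BaseState`, `ON` and `OFF` are the ones used by the PDF specification.

```python
def ocg_states(oc_properties, config=None):
    """Return {OCG name: visible?} for /D or for an alternate configuration."""
    cfg = oc_properties["D"] if config is None else config
    # /BaseState is the initial state of every OCG; it defaults to /ON.
    # (The third legal value, /Unchanged, is not modelled in this sketch.)
    base_on = cfg.get("BaseState", "ON") == "ON"
    states = {name: base_on for name in oc_properties["OCGs"]}
    for name in cfg.get("ON", []):  # explicitly switched on
        states[name] = True
    for name in cfg.get("OFF", []):  # explicitly switched off
        states[name] = False
    return states


# Catalog entry /OCProperties, reduced to the parts discussed above.
oc_properties = {
    "OCGs": ["ThroughCut", "Graphics"],  # all OCGs of the file
    "D": {"OFF": ["ThroughCut"]},  # default configuration
    "Configs": [  # optional alternate configurations
        {"Name": "cutting view", "BaseState": "OFF", "ON": ["ThroughCut"]},
    ],
}

default_view = ocg_states(oc_properties)
cutting_view = ocg_states(oc_properties, oc_properties["Configs"][0])
print(default_view)
print(cutting_view)
```

Whether a given image, path or text object is shown is then a lookup of its OCG in the active configuration. Nothing in this structure refers to pages.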
