This issue was moved to a discussion.
Fully embedded font is extracted only partially if it occupies more than one object #2110
Comments
This seems to be a duplicate of #2109 - however I am not really sure.
Please delete #2109 - I tried to correct it while it was on its way to the server and didn't realize it had already been created. Sorry about that. I had also written down my expectation, but it somehow got deleted while I was writing. My expectation is that PyMuPDF assembles the three pieces into one. If that's too difficult (or not possible due to limitations of mupdf/mutool), at the very minimum use suffixes to write the various pieces to their own files, e.g. Times-Roman-1.cff, Times-Roman-2.cff, Times-Roman-3.cff. I cannot describe the result any more clearly than I already have.
Then it tells the user that "all 16" fonts have been extracted, but the user sees only, say, 11 - because some .cff files were overwritten during extraction. I will try to send the PDF to your Outlook account, as I cannot post it publicly here.
BTW, Times-Roman objects 1/2/3 are not fully embedded Times-Roman fonts. They are parts of the font that together form a fully embedded font. So overwriting Times-Roman.cff each time with the next object that happens to say "I keep data for 'Times-Roman' font" destroys parts of the font that we want to extract.
This is either not possible or clearly beyond the intended scope of PyMuPDF. Features like this one should be looked for in dedicated font packages like fontTools. What I suspect is really your problem instead: you extract font names without their subset identifier. For the time being, I will convert this post from an issue to a "Discussions" item.
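For context on the subset identifier mentioned above: in a PDF, the name of a subsetted font starts with a tag of six uppercase letters followed by `+` (for example `ABCDEF+Times-Roman`). A minimal sketch of separating the tag from the name follows; the helper name `split_subset_tag` is hypothetical and not part of PyMuPDF.

```python
import re

# Six uppercase letters, a plus sign, then the actual font name.
_SUBSET_TAG = re.compile(r"^([A-Z]{6})\+(.+)$")


def split_subset_tag(basefont):
    """Return (tag, name); tag is None when the name carries no subset tag.

    Hypothetical helper for illustration only.
    """
    match = _SUBSET_TAG.match(basefont)
    if match:
        return match.group(1), match.group(2)
    return None, basefont
```

Two subsets of the same font get different tags, so keeping the tag in the output file name is one way to tell them apart. It does not help when, as reported here, the colliding entries carry no tag at all.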
Fix #2110 (Discussion item #2111): File `__main__.py` - also include the font's xref in the generated file name.
Fix #2094: File `helper-device.i` - also ensure equality of x coordinates of relevant corners before assuming a rectangle.
Fix #2087: File `fitz.i` - if JPX image format is already known, make sure to read the decoded image stream, instead of raw stream in the other cases.
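The idea behind the #2110 fix, putting the font's xref into the file name, can be sketched as follows. This is an illustration under assumptions, not the code from `__main__.py`: the helper names `font_filename` and `extract_fonts` and the `name-xref.ext` pattern are hypothetical, while `Document.get_page_fonts()` and `Document.extract_font()` are existing PyMuPDF methods.

```python
from pathlib import Path


def font_filename(basename, xref, ext):
    """Build an output name that is unique per font object.

    Hypothetical naming pattern; the one PyMuPDF uses may differ.
    """
    safe = basename.replace(" ", "-")
    return f"{safe}-{xref}.{ext}"


def extract_fonts(pdf_path, out_dir):
    """Write each embedded font object of a PDF to its own file."""
    import fitz  # PyMuPDF; imported lazily so font_filename works without it

    written = []
    seen = set()
    with fitz.open(pdf_path) as doc:
        for pno in range(doc.page_count):
            # Each entry starts with the xref of the font object.
            for xref, *_ in doc.get_page_fonts(pno):
                if xref in seen:
                    continue
                seen.add(xref)
                name, ext, _type, content = doc.extract_font(xref)
                if ext == "n/a" or not content:
                    continue  # font is not embedded or cannot be extracted
                target = Path(out_dir) / font_filename(name, xref, ext)
                target.write_bytes(content)
                written.append(target.name)
    return written
```

With such a pattern, the three Times-Roman objects from the report (567, 569 and 306) would land in three separate files instead of overwriting one another.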
Description
I have a PDF in which, according to my PDF reader, some fonts are fully embedded, so I decided to test my freshly installed PyMuPDF 1.21.0-rc2 on it. It reported that it "saved 16 fonts to" my working directory, but I found only 11 files there. The font list from pdffonts made the reason clear: some of the "fully" embedded fonts use more than one object, each storing a piece of the font. A check with mupdf confirmed that PyMuPDF extracted only the last object to font-name.cff, probably because the output file (font-name.cff) is the same for every piece/object of the font named font-name.
How To Reproduce
You must have one of those PDFs that embed a font fully by storing pieces of it in multiple objects. Let's say x.pdf is one of them. pdffonts lists the fonts of x.pdf as follows:
You can see that, for example, Times-Roman has 'no' in the "sub" (subsetted) column, meaning it is NOT subsetted - therefore it is fully embedded. You can also see that it occupied objects with numbers 567, 569 and 306.
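To spot such cases quickly, one can group the listed fonts by name and keep the names that map to more than one object. A minimal sketch, assuming the listing has already been reduced to (object number, font name) pairs; the function name is hypothetical, and the Helvetica entry in the sample is made up for contrast, while the Times-Roman object numbers are the ones quoted above.

```python
from collections import defaultdict


def names_with_multiple_objects(fonts):
    """Map each font name to its object numbers, keeping only shared names.

    `fonts` is an iterable of (object_number, font_name) pairs.
    """
    by_name = defaultdict(list)
    for xref, name in fonts:
        if xref not in by_name[name]:
            by_name[name].append(xref)
    return {name: xrefs for name, xrefs in by_name.items() if len(xrefs) > 1}


sample = [
    (567, "Times-Roman"),
    (569, "Times-Roman"),
    (306, "Times-Roman"),
    (12, "Helvetica"),  # made-up entry, not from x.pdf
]
print(names_with_multiple_objects(sample))
```

Every name in the result is a candidate for being overwritten when the output file name is derived from the font name alone.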
mutool shows practically the same with its 'info' command:
We could use mutool extract x.pdf, but that is not user-friendly: a) it extracts all fonts and all images together, and b) it names the fonts font-XXXX.cff (or, possibly, font-XXXX.ttf), where XXXX has no relation to the font object number (contrary to what its documentation claims). So you practically don't know which file is which font unless you open each one of them, or at least read its metadata somehow.
Enter PyMuPDF, which promises to a) extract all fonts and b) give the extracted files sensible names. Alas, trying it on our x.pdf results in just 11 fonts, contrary to the claimed 16:
(first column is file size)
What has happened? Comparing this to the output of mutool extract
and looking carefully at the file sizes (first column), we see that Times-Roman.cff (the Times-Roman font as extracted by PyMuPDF) is exactly font-0599.cff, a font extracted by mutool. Its object number is NOT 599; there is no font object with that number in x.pdf. But this is only one of the three pieces (objects) that store parts of Times-Roman!
Your configuration (mandatory)
More precisely: