ENH: CID font resource from font file to encode more characters#3652
PJBrs wants to merge 11 commits into py-pdf:main
Codecov Report
❌ Patch coverage is
Additional details and impacted files

@@            Coverage Diff             @@
##             main    #3652      +/-   ##
==========================================
- Coverage   97.35%   97.28%   -0.07%
==========================================
  Files          55       55
  Lines        9916    10029     +113
  Branches     1814     1835      +21
==========================================
+ Hits         9654     9757     +103
- Misses        152      157       +5
- Partials      110      115       +5
This pull request is now ready for review. Some tests failed, but since they passed earlier, I'm going to assume that this is a fluke. Codecov shows that a fair amount of the new code is not covered by tests. This is mostly because I tried to parse all sources for applicable font flags in the font descriptor, and the file that I tested has only one font. To really test this code, we should read multiple real TrueType fonts from file to see if they parse correctly. That, however, seems to me to be beyond the scope of this PR. Conversely, it would be a shame not to parse these flags. How should I continue? One final thing: I can still improve this, if wanted.
This enables generating a new unicode font resource in case of text widget values that cannot be encoded with existing font resources.
This patch adds a method to produce a pdf font descriptor resource. For now, we assume that an embedded font file will be a TrueType font.
    font_descriptor_kwargs["font_file"] = font_file_data

    font_descriptor = FontDescriptor(**font_descriptor_kwargs)
    character_map = {chr(key): value for key, value in tt_font_object.getBestCmap().items()}
I'm pretty sure that this is not correct. It accidentally works.
This is what Claude AI says:
The Problem with /Identity CIDToGIDMap
The issue is on line 467 of the PR:

    cid_font[NameObject("/CIDToGIDMap")] = NameObject("/Identity")

Why this causes garbled text:
When you set /CIDToGIDMap to just /Identity, the PDF reader assumes:
- CID (Character ID) = Unicode codepoint (from your character_map)
- GID (Glyph ID) = the same value
However, in a TrueType font file, the glyph IDs don't necessarily match Unicode codepoints. Looking at your code:

    character_map = {chr(key): value for key, value in tt_font_object.getBestCmap().items()}

The character_map maps:
- Keys: Unicode characters (from the font's cmap table)
- Values: Glyph IDs in that font file
But when you later encode text using this map and then tell the PDF reader "use /Identity mapping," the reader will try to use the Unicode codepoint as the GID directly—not the glyph ID stored in your character_map. This causes mismatches where the wrong glyphs get rendered.
The Correct Solution
You need to create an explicit CIDToGIDMap stream that maps:
- Input: Character ID (Unicode codepoint)
- Output: Glyph ID (from the TrueType font)
Here's the approach:

    # Build the CIDToGIDMap stream
    cid_to_gid_map_bytes = bytearray()
    for unicode_codepoint in sorted(character_map.keys(), key=ord):
        glyph_id = character_map[unicode_codepoint]
        # Encode as 2-byte big-endian (PDF standard for CIDToGIDMap)
        cid_to_gid_map_bytes.extend(glyph_id.to_bytes(2, byteorder='big'))
    cid_to_gid_map_stream = StreamObject()
    cid_to_gid_map_stream.set_data(bytes(cid_to_gid_map_bytes))
    cid_font[NameObject("/CIDToGIDMap")] = cid_to_gid_map_stream

This ensures every character in your character_map has a corresponding, correct glyph ID lookup in the PDF.
This is probably not correct either.
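For reference, the PDF specification describes an embedded /CIDToGIDMap stream as a table indexed by CID: the glyph index for CID c is stored in bytes 2c and 2c + 1, high-order byte first. A table that only lists the mapped characters back to back would therefore not line up with the CIDs. Below is a minimal sketch of a dense table. It assumes that CIDs equal Unicode code points in the Basic Multilingual Plane and that a code point to glyph ID mapping is already available, however it was obtained. The name build_cid_to_gid_map is hypothetical and not part of pypdf.

```python
from __future__ import annotations


def build_cid_to_gid_map(codepoint_to_gid: dict[int, int]) -> bytes:
    """Return a dense CIDToGIDMap table, assuming CID == Unicode code point.

    Hypothetical helper for illustration only; not pypdf API.
    """
    if not codepoint_to_gid:
        return b""
    max_cid = max(codepoint_to_gid)
    if max_cid > 0xFFFF:
        raise ValueError("this sketch only handles 2-byte CIDs (BMP)")
    # One 2-byte slot per CID from 0 to max_cid; unmapped CIDs stay at
    # glyph 0, which is .notdef in a TrueType font.
    table = bytearray(2 * (max_cid + 1))
    for cid, gid in codepoint_to_gid.items():
        # Big-endian glyph index at byte offset 2 * cid.
        table[2 * cid : 2 * cid + 2] = gid.to_bytes(2, byteorder="big")
    return bytes(table)


# Made-up glyph IDs for "A" and "B", purely for demonstration.
example = build_cid_to_gid_map({0x41: 36, 0x42: 37})
```

The trade-off of this layout is size: a font covering the whole BMP needs up to 128 KiB for the table, which is one reason to consider using glyph IDs as CIDs together with /Identity instead.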
I clearly need to learn more about fonts to get this PR into sufficient shape. I've learnt the following now: Reflection by Google Gemini:
I am clearly no font expert either, thus having someone who really likes to dig into the oddities here is highly appreciated. From a pypdf point of view, having this would be great, but for now, we (luckily) do not expose this functionality as public API, thus preventing us from having to consider backwards-compatibility as well.
This PR adds a new method to _font.py, from_truetype_font_file, which initialises a Font instance from an embedded font file. I'm assuming that this might also work with a real file. Furthermore, it adds a lot of information to as_font_resource, so that it can produce a CID TrueType font resource, which can encode more characters than a simple TrueType font resource.
This fixes #3361.
Contributes to fixing #3514.
Might be related to #3318. EDIT, it is not.
Includes all work from #3602.
EDIT.
How it works:
We detect if a text value for a text widget annotation can be encoded using an existing font resource. If not, and we have an embedded TrueType font, we assume that we are expected to create a new font resource. We use the embedded font file to initialise a new Font instance, and then produce a new font resource from this instance. After having done so, we make the associated font descriptor an indirect object later on, as per the PDF specification.
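The decision described above can be sketched roughly as follows. This is a hedged illustration rather than the code in this PR: can_encode and pick_font_strategy are hypothetical helpers, and character_map is assumed to be a dict keyed by single characters, as in the snippet discussed in the review.

```python
from __future__ import annotations


def can_encode(text: str, character_map: dict[str, object]) -> bool:
    """Return True if every character of the text has an entry in the map."""
    return all(char in character_map for char in text)


def pick_font_strategy(
    text: str,
    existing_character_map: dict[str, object],
    has_embedded_truetype: bool,
) -> str:
    """Choose how to render a text widget value (hypothetical labels)."""
    if can_encode(text, existing_character_map):
        # The existing font resource covers the whole value.
        return "reuse-existing-resource"
    if has_embedded_truetype:
        # Build a new CID font resource from the embedded font file.
        return "new-cid-font-resource"
    # Nothing better is available; the caller has to degrade gracefully.
    return "fallback"


latin_only = {"a": 1, "b": 2}
strategy = pick_font_strategy("aβ", latin_only, has_embedded_truetype=True)
```

Keeping the check separate from the resource construction means the more expensive step, parsing the embedded font file, only runs when the existing resource really cannot encode the value.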
Some notes:
I think that the more elegant way would be to produce a small embedded font resource containing only the characters in the text value. Also, it should have been possible to reuse the original font descriptor, but I can't seem to make that work.