ENH: CID font resource from font file to encode more characters #3652

Draft
PJBrs wants to merge 11 commits into py-pdf:main from PJBrs:fontwork

PJBrs (Contributor) commented Feb 19, 2026

This PR adds a new method to _font.py, from_truetype_font_file, which initialises a Font instance from an embedded font file. I'm assuming that this might also work with a real file. Furthermore, it adds a lot of information to as_font_resource, so that it can produce a CID TrueType font resource, which can encode more characters than a TrueType font resource.

This fixes #3361.

Contributes to fixing #3514.

Might be related to #3318. EDIT: it is not.

Includes all work from #3602.

EDIT.

How it works:
We detect whether a text value for a text widget annotation can be encoded using an existing font resource. If not, and we have an embedded TrueType font, we assume that we are expected to create a new font resource. We use the embedded font file to initialise a new Font instance, and then produce a new font resource from this instance. Later on, we make the associated font descriptor an indirect object, as per the PDF specification.
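
The decision described above could be sketched roughly as follows. This is only an illustration, not the code in this PR: needs_new_font_resource, character_map and has_embedded_truetype are hypothetical names, and the sketch assumes that the existing font resource exposes a mapping whose keys are the characters it can encode.

```python
def needs_new_font_resource(
    text: str,
    character_map: dict[str, object],
    has_embedded_truetype: bool,
) -> bool:
    """Return True if a new font resource should be created for this text.

    Assumption: a character is encodable with the existing font resource
    exactly when it is a key of character_map.
    """
    unencodable = [char for char in text if char not in character_map]
    # Only create a new resource if something is missing AND we have an
    # embedded TrueType font to build it from.
    return bool(unencodable) and has_embedded_truetype


existing_map = {"a": 1, "b": 2, "c": 3}
fits = needs_new_font_resource("abc", existing_map, has_embedded_truetype=True)
missing = needs_new_font_resource("abç", existing_map, has_embedded_truetype=True)
```

With these made-up values, only the second call asks for a new font resource, because "ç" is not covered by the existing map.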

Some notes:
I think that the more elegant way would be to produce a short embedded font resource with only the characters in the text value. Also, it should have been possible to reuse the original font descriptor, but I can't seem to make that work.

@PJBrs PJBrs marked this pull request as draft February 19, 2026 16:43
@PJBrs PJBrs force-pushed the fontwork branch 2 times, most recently from 175a542 to e43c57d Compare February 21, 2026 13:45
@PJBrs PJBrs marked this pull request as ready for review February 21, 2026 14:54
codecov bot commented Feb 21, 2026

Codecov Report

❌ Patch coverage is 91.45299% with 10 lines in your changes missing coverage. Please review.
✅ Project coverage is 97.28%. Comparing base (4670513) to head (cbc9ee4).
⚠️ Report is 4 commits behind head on main.

Files with missing lines | Patch % | Lines
pypdf/_font.py | 89.24% | 5 Missing and 5 partials ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #3652      +/-   ##
==========================================
- Coverage   97.35%   97.28%   -0.07%     
==========================================
  Files          55       55              
  Lines        9916    10029     +113     
  Branches     1814     1835      +21     
==========================================
+ Hits         9654     9757     +103     
- Misses        152      157       +5     
- Partials      110      115       +5     

PJBrs (Contributor, Author) commented Feb 21, 2026

This pull request is now ready for review. It seems to have failed some tests, but since it passed these earlier, I'm going to assume that that's a fluke.

Codecov shows that quite a lot of the new code is not covered by tests. This is mostly because I tried to parse all sources for applicable font flags in the font descriptor, and the file that I tested has only one font. To really test this code, we should read multiple real TrueType fonts from file to see if they parse correctly. That, however, seems to me to be beyond the scope of this PR. Conversely, it would seem a shame not to parse these flags. How should I continue?

One final thing:

NameObject("/Registry"): TextStringObject("Adobe"),  # Should be something read from font file

I can also still improve this, if wanted.

@PJBrs PJBrs marked this pull request as draft February 22, 2026 10:34
@PJBrs PJBrs marked this pull request as ready for review February 22, 2026 11:40
@PJBrs PJBrs marked this pull request as draft February 22, 2026 19:07
font_descriptor_kwargs["font_file"] = font_file_data

font_descriptor = FontDescriptor(**font_descriptor_kwargs)
character_map = {chr(key): value for key, value in tt_font_object.getBestCmap().items()}
Contributor Author

I'm pretty sure that this is not correct. It accidentally works.
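
One possible explanation, assuming that tt_font_object is a fontTools TTFont: getBestCmap() maps Unicode code points to glyph names, not to glyph IDs, and a glyph ID is the position of that name in the font's glyph order. The sketch below uses plain stand-in data instead of a real font so that it runs without fontTools. build_gid_map is a hypothetical helper and not part of this PR.

```python
def build_gid_map(best_cmap: dict[int, str], glyph_order: list[str]) -> dict[str, int]:
    """Map each character to its glyph ID.

    best_cmap: code point -> glyph name (shaped like TTFont.getBestCmap()).
    glyph_order: glyph names, where the list index is the glyph ID
                 (shaped like TTFont.getGlyphOrder()).
    """
    name_to_gid = {name: gid for gid, name in enumerate(glyph_order)}
    return {chr(codepoint): name_to_gid[name] for codepoint, name in best_cmap.items()}


# Made-up stand-in data, not taken from a real font file.
best_cmap = {0x41: "A", 0x42: "B", 0xE9: "eacute"}
glyph_order = [".notdef", "A", "B", "eacute"]

gid_map = build_gid_map(best_cmap, glyph_order)
```

Here the values of gid_map are integers, whereas the values of the dictionary built directly from getBestCmap() would be glyph-name strings.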

Contributor Author

This is what Claude AI says:

The Problem with /Identity CIDToGIDMap

The issue is on line 467 of the PR:

cid_font[NameObject("/CIDToGIDMap")] = NameObject("/Identity")

Why this causes garbled text:

When you set /CIDToGIDMap to just /Identity, the PDF reader assumes:

  • CID (Character ID) = Unicode codepoint (from your character_map)
  • GID (Glyph ID) = the same value

However, in a TrueType font file, the glyph IDs don't necessarily match Unicode codepoints. Looking at your code:

character_map = {chr(key): value for key, value in tt_font_object.getBestCmap().items()}

The character_map maps:

  • Keys: Unicode characters (from the font's cmap table)
  • Values: Glyph IDs in that font file

But when you later encode text using this map and then tell the PDF reader "use /Identity mapping," the reader will try to use the Unicode codepoint as the GID directly—not the glyph ID stored in your character_map. This causes mismatches where the wrong glyphs get rendered.

The Correct Solution

You need to create an explicit CIDToGIDMap stream that maps:

  • Input: Character ID (Unicode codepoint)
  • Output: Glyph ID (from the TrueType font)

Here's the approach:

# Build the CIDToGIDMap stream
cid_to_gid_map_bytes = bytearray()
for unicode_codepoint in sorted(character_map.keys(), key=ord):
    glyph_id = character_map[unicode_codepoint]
    # Encode as 2-byte big-endian (PDF standard for CIDToGIDMap)
    cid_to_gid_map_bytes.extend(glyph_id.to_bytes(2, byteorder='big'))

cid_to_gid_map_stream = StreamObject()
cid_to_gid_map_stream.set_data(bytes(cid_to_gid_map_bytes))
cid_font[NameObject("/CIDToGIDMap")] = cid_to_gid_map_stream

This ensures every character in your character_map has a corresponding, correct glyph ID lookup in the PDF.
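
For comparison, the PDF specification describes a /CIDToGIDMap stream as a table indexed by CID: the glyph index for CID c is stored as a 2-byte big-endian value in bytes 2 × c and 2 × c + 1. A table that only appends entries for the mapped code points, as in the snippet above, would therefore not line up with the CIDs. A minimal sketch of a dense table follows. It assumes that CIDs are Unicode code points within the Basic Multilingual Plane and that the values of character_map are integer glyph IDs. Both are assumptions, and build_cid_to_gid_table is a hypothetical name.

```python
def build_cid_to_gid_table(character_map: dict[str, int]) -> bytes:
    """Return a dense, CID-indexed table of 2-byte big-endian glyph IDs.

    CIDs without an entry in character_map stay at glyph 0 (.notdef).
    """
    if not character_map:
        return b""
    max_cid = max(ord(char) for char in character_map)
    table = bytearray(2 * (max_cid + 1))  # zero-filled by default
    for char, glyph_id in character_map.items():
        cid = ord(char)
        table[2 * cid : 2 * cid + 2] = glyph_id.to_bytes(2, byteorder="big")
    return bytes(table)


# Made-up glyph IDs for illustration only.
table = build_cid_to_gid_table({"A": 36, "é": 112})
```

The resulting bytes could then be used as the data of the stream object, in place of the sparse byte string.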

Contributor Author

This is probably not correct either.

PJBrs (Contributor, Author) commented Feb 24, 2026

@stefan6419846

I clearly need to learn more about fonts to get this PR into shape. Here is what I've learnt so far:
In CID fonts, one Unicode code point may refer to several different glyphs, especially in Arabic. The Font class, however, just maps widths to one Unicode code point, which means that it can only store the width of one character variant. So, ideally, a Font should map character codes to GIDs, where one character code might refer to multiple GIDs. For non-CID fonts, we just map Unicode code points as a fallback. In both cases, character_widths should be keyed by the values of character_map. And this ought to be sufficient for both text extraction and producing appearance streams.

Reflection by Google Gemini:

  1. The Ideal Mapping Structure
    To be "complete," your Font object needs a hierarchy that recognizes that a character is an idea, but a glyph is ink.

Logical Level: character_map should be dict[str, list[int]]. It maps one Unicode character (the "idea") to one or more GIDs (the "ink").

Physical Level: character_widths should be keyed by GID (int), not by the character string.

  2. Why this solves both problems

    Goal | How your "Ideal" logic handles it
    --- | ---
    Appearance Streams | You use a shaper (or logic) to pick the correct GID from the list. Since your widths are keyed by GID, you get the exact pixel-perfect spacing for that specific shape.
    Text Extraction | You use the /ToUnicode map (which you build by reversing your GID map). When the viewer sees a GID, it looks up which "Idea" (Character) it belongs to.

  3. Non-CID Fonts (The Fallback)
    For "Simple" fonts (Type 1 or standard TrueType with WinAnsi), the character_map is effectively a 1-to-1 dictionary. In this case, your list[int] just happens to have a length of 1. By treating everything as a list of potential GIDs, your code becomes universal.

  4. The "Key" Realization
    You said: "character_widths should be keyed by the values of character_map."

Bingo. If character_map values are GIDs (integers), and character_widths keys are GIDs (integers), you have a bulletproof system.

If the user enters Arabic Meem, your logic knows that for this specific instance, you are using GID 502.

You look up character_widths[502].

The appearance stream stays perfect.

The /ToUnicode map still tells the PDF viewer that GID 502 = "Meem", so copy-paste still works.

The "pypdf" Disconnect
The reason the pypdf code you shared looks "incorrect" to you is that it tries to force the GID back into a Character key.

It does character_widths[ord_map[cid]] = width.

If ord_map[501], ord_map[502], and ord_map[503] all point to the same character, they keep overwriting each other in that dictionary.

Result: pypdf loses the unique widths of the contextual shapes.
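
The overwriting described here can be reproduced with plain dictionaries. The glyph IDs, the widths and the contents of ord_map below are made-up illustration values, not data from a real font or from pypdf.

```python
# Three contextual glyphs (hypothetical GIDs) that all represent the same character.
ord_map = {501: "م", 502: "م", 503: "م"}
glyph_widths = {501: 610, 502: 540, 503: 480}

# Keyed by character: each assignment overwrites the previous one,
# so only the width of the last glyph remains.
widths_by_character: dict[str, int] = {}
for cid, width in glyph_widths.items():
    widths_by_character[ord_map[cid]] = width

# Keyed by glyph ID: every width survives.
widths_by_gid: dict[int, int] = dict(glyph_widths)
```

After the loop, widths_by_character holds a single entry, while widths_by_gid still holds all three widths.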

Your Path Forward
To implement this "Slowly but Completely," you should modify your Font class to store:

unicode_to_gids: dict[str, list[int]]

widths: dict[int, int] (Keyed by GID)
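
A minimal sketch of such a structure, under the assumption that glyph selection (shaping) happens elsewhere and hands over a list of GIDs. FontSketch, to_unicode and text_width are hypothetical names and do not reflect pypdf's actual Font class. The data is made up.

```python
from dataclasses import dataclass, field


@dataclass
class FontSketch:
    """Hypothetical container for the two mappings suggested above."""

    unicode_to_gids: dict[str, list[int]] = field(default_factory=dict)
    widths: dict[int, int] = field(default_factory=dict)  # keyed by GID

    def to_unicode(self) -> dict[int, str]:
        """Reverse the mapping (GID -> character), as needed for /ToUnicode."""
        return {
            gid: char
            for char, gids in self.unicode_to_gids.items()
            for gid in gids
        }

    def text_width(self, gids: list[int], default: int = 0) -> int:
        """Sum the widths of glyphs that have already been selected."""
        return sum(self.widths.get(gid, default) for gid in gids)


font = FontSketch(
    unicode_to_gids={"م": [501, 502, 503], "A": [36]},
    widths={501: 610, 502: 540, 503: 480, 36: 667},
)
reverse_map = font.to_unicode()
width = font.text_width([36, 502])
```

Because widths are keyed by GID, the width of the specific contextual form is used, and the reverse map still resolves every form back to the same character for text extraction.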

stefan6419846 (Collaborator)

I am clearly no font expert either, so having someone who really likes to dig into the oddities here is highly appreciated. From a pypdf point of view, having this would be great. For now, we (luckily) do not expose this functionality as public API, so we do not have to consider backwards compatibility as well.

Development

Successfully merging this pull request may close these issues.

Corrupted unicode characters in form field

2 participants