ENH: CID font resource from font file to encode more characters by PJBrs · Pull Request #3652 · py-pdf/pypdf

PJBrs · 2026-02-19T16:42:34Z

This PR adds a new method to _font.py, from_truetype_font_file, which initialises a Font instance from an embedded font file. I assume this might also work with a real (non-embedded) font file. It also extends as_font_resource considerably, so that it can produce a CID TrueType font resource, which can encode more characters than a simple TrueType font resource.

This fixes #3361.

Contributes to fixing #3514.

Might be related to #3318. EDIT, it is not.

Includes all work from #3602.

EDIT.

How it works:
We detect if a text value for a text widget annotation can be encoded using an existing font resource. If not, and we have an embedded TrueType font, we assume that we are expected to create a new font resource. We use the embedded font file to initialise a new Font instance, and then produce a new font resource from this instance. After having done so, we make the associated font descriptor an indirect object later on, as per the PDF specification.
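The decision described above can be sketched roughly as follows. This is only an illustration under assumptions: `can_encode` and `choose_font_strategy` are hypothetical names, not pypdf API, and the character map is assumed to be a plain mapping keyed by single characters.

```python
from typing import Mapping


def can_encode(text: str, character_map: Mapping[str, object]) -> bool:
    """Return True if every character of `text` has an entry in the font's map."""
    return all(char in character_map for char in text)


def choose_font_strategy(
    text: str,
    existing_map: Mapping[str, object],
    has_embedded_truetype: bool,
) -> str:
    """Hypothetical helper mirroring the three cases described in the PR text."""
    if can_encode(text, existing_map):
        # The existing font resource is sufficient.
        return "reuse-existing-resource"
    if has_embedded_truetype:
        # Build a new CID font resource from the embedded font file.
        return "build-cid-resource"
    # What happens in this case is not described here, so it is left open.
    return "no-new-resource"


latin_only = {"a": 1, "b": 2, " ": 3}
strategy_plain = choose_font_strategy("ab ba", latin_only, True)
strategy_unicode = choose_font_strategy("ab \u0645", latin_only, True)
strategy_none = choose_font_strategy("ab \u0645", latin_only, False)
```

With this sketch, a value containing only mapped characters keeps the existing resource, while a value with an unmapped character (here Arabic Meem) triggers the new resource only when an embedded TrueType font is available.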

Some notes:
I think the more elegant way would be to produce a short embedded font resource containing only the characters in the text value. Also, it should have been possible to reuse the original font descriptor, but I can't seem to make that work.

codecov · 2026-02-21T15:05:42Z

Codecov Report

❌ Patch coverage is 91.45299% with 10 lines in your changes missing coverage. Please review.
✅ Project coverage is 97.28%. Comparing base (4670513) to head (cbc9ee4).
⚠️ Report is 4 commits behind head on main.

Files with missing lines	Patch %	Lines
pypdf/_font.py	89.24%	5 Missing and 5 partials ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #3652      +/-   ##
==========================================
- Coverage   97.35%   97.28%   -0.07%     
==========================================
  Files          55       55              
  Lines        9916    10029     +113     
  Branches     1814     1835      +21     
==========================================
+ Hits         9654     9757     +103     
- Misses        152      157       +5     
- Partials      110      115       +5


PJBrs · 2026-02-21T15:17:11Z

This pull request is now ready for review. It seems to have failed some tests, but since it passed these earlier, I'm going to assume that that's a fluke.

Codecov shows that a fair amount of the new code is not covered by tests. This is mostly because I tried to parse all sources for applicable font flags in the font descriptor, and the file that I tested has only one font. To really test this code, we should read multiple real TrueType fonts from file to see if they parse correctly. That, however, seems to me to be beyond the scope of this PR. On the other hand, it would be a shame not to parse these flags. How should I continue?

One final thing:

NameObject("/Registry"): TextStringObject("Adobe"),  # Should be something read from font file

I can also still improve this, if wanted.

PJBrs · 2026-02-22T19:08:11Z

pypdf/_font.py

+            font_descriptor_kwargs["font_file"] = font_file_data
+
+            font_descriptor = FontDescriptor(**font_descriptor_kwargs)
+            character_map = {chr(key): value for key, value in tt_font_object.getBestCmap().items()}


I'm pretty sure that this is not correct. It accidentally works.
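One thing that may be worth checking here: as far as I am aware, fontTools' `TTFont.getBestCmap()` maps code points to glyph names (strings), and `TTFont.getGlyphID()` turns a glyph name into a numeric glyph ID. If that holds, the dictionary comprehension above would store names rather than GIDs. The sketch below shows the extra lookup. `StubFont` is a made-up stand-in so that the example runs without a font file, and `build_character_map` is a hypothetical helper, not part of this PR.

```python
from typing import Dict, List


class StubFont:
    """Stand-in exposing the two TTFont methods used below (illustration only)."""

    def __init__(self, glyph_order: List[str], cmap: Dict[int, str]) -> None:
        self._glyph_order = glyph_order
        self._cmap = cmap

    def getBestCmap(self) -> Dict[int, str]:
        # Code point -> glyph name, which is how fontTools is assumed to behave.
        return dict(self._cmap)

    def getGlyphID(self, glyph_name: str) -> int:
        # Glyph name -> position in the glyph order, i.e. the glyph ID.
        return self._glyph_order.index(glyph_name)


def build_character_map(tt_font) -> Dict[str, int]:
    """Map each Unicode character to a numeric glyph ID."""
    cmap = tt_font.getBestCmap() or {}  # guard against a missing Unicode cmap
    return {
        chr(code_point): tt_font.getGlyphID(glyph_name)
        for code_point, glyph_name in cmap.items()
    }


font = StubFont(
    glyph_order=[".notdef", "space", "A", "uni0645"],
    cmap={0x20: "space", 0x41: "A", 0x645: "uni0645"},
)
character_map = build_character_map(font)
```

With a real font, the same function should accept a `fontTools.ttLib.TTFont` instance in place of the stub, assuming the two methods behave as described.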

This is what Claude AI says:

The Problem with /Identity CIDToGIDMap

The issue is on line 467 of the PR:

cid_font[NameObject("/CIDToGIDMap")] = NameObject("/Identity")

Why this causes garbled text:

When you set /CIDToGIDMap to just /Identity, the PDF reader assumes:

CID (Character ID) = Unicode codepoint (from your character_map)

GID (Glyph ID) = the same value

However, in a TrueType font file, the glyph IDs don't necessarily match Unicode codepoints. Looking at your code:

character_map = {chr(key): value for key, value in tt_font_object.getBestCmap().items()}

The character_map maps:

Keys: Unicode characters (from the font's cmap table)

Values: Glyph IDs in that font file

But when you later encode text using this map and then tell the PDF reader "use /Identity mapping," the reader will try to use the Unicode codepoint as the GID directly—not the glyph ID stored in your character_map. This causes mismatches where the wrong glyphs get rendered.

The Correct Solution

You need to create an explicit CIDToGIDMap stream that maps:

Input: Character ID (Unicode codepoint)

Output: Glyph ID (from the TrueType font)

Here's the approach:

# Build the CIDToGIDMap stream
cid_to_gid_map_bytes = bytearray()
for unicode_codepoint in sorted(character_map.keys(), key=ord):
    glyph_id = character_map[unicode_codepoint]
    # Encode as 2-byte big-endian (PDF standard for CIDToGIDMap)
    cid_to_gid_map_bytes.extend(glyph_id.to_bytes(2, byteorder='big'))
cid_to_gid_map_stream = StreamObject()
cid_to_gid_map_stream.set_data(bytes(cid_to_gid_map_bytes))
cid_font[NameObject("/CIDToGIDMap")] = cid_to_gid_map_stream

This ensures every character in your character_map has a corresponding, correct glyph ID lookup in the PDF.

https://stackoverflow.com/questions/75576696/understanding-pdf-cidfonts-cmaps-and-gids-best-practices

https://ken-lunde.medium.com/to-cid-or-not-to-cid-e8e623dcde92

This is probably not correct either.
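A possible reason the quoted snippet is off: as I read the PDF specification, a /CIDToGIDMap stream is indexed by CID, with the two-byte big-endian glyph index for CID n stored at byte offset 2n. Appending GIDs in sorted order without padding the gaps would therefore shift every entry. Below is a minimal sketch of a CID-indexed table, assuming that CIDs are chosen to equal code points in the Basic Multilingual Plane (an assumption for illustration, not something this PR does). `build_cid_to_gid_table` is a hypothetical name.

```python
from typing import Mapping


def build_cid_to_gid_table(cid_to_gid: Mapping[int, int]) -> bytes:
    """Return stream data for /CIDToGIDMap: 2 bytes per CID, indexed by CID."""
    if not cid_to_gid:
        return b""
    max_cid = max(cid_to_gid)
    if min(cid_to_gid) < 0 or max_cid > 0xFFFF:
        raise ValueError("CIDs must fit into two bytes")
    # Zero-filled, so every CID without an entry maps to GID 0 (.notdef).
    table = bytearray(2 * (max_cid + 1))
    for cid, gid in cid_to_gid.items():
        if not 0 <= gid <= 0xFFFF:
            raise ValueError(f"GID {gid} does not fit into two bytes")
        table[2 * cid : 2 * cid + 2] = gid.to_bytes(2, byteorder="big")
    return bytes(table)


# Made-up glyph IDs for "A" (U+0041) and "C" (U+0043); "B" is left unmapped.
table = build_cid_to_gid_table({0x41: 36, 0x43: 38})
```

The alternative that keeps /Identity is to treat the GID itself as the CID and write GIDs into the content stream, which may be why the current code works.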

PJBrs · 2026-02-24T20:21:01Z

@stefan6419846

I clearly need to learn more about fonts to get this PR into shape. Here is what I have learnt so far:
In CID fonts, one Unicode code point may refer to several different glyphs, especially in Arabic. The Font class, however, just maps widths to one Unicode code point, which means that it can only store the width for one character variant. So, ideally, a Font should map character codes to GIDs, where one character code might refer to multiple GIDs. For non-CID fonts, we just map Unicode code points as a fallback. In both cases, character_widths should be keyed by the values of character_map. This ought to be sufficient for both text extraction and producing appearance streams.

Reflection by Google Gemini:

The Ideal Mapping Structure
To be "complete," your Font object needs a hierarchy that recognizes that a character is an idea, but a glyph is ink.

Logical Level: character_map should be dict[str, list[int]]. It maps one Unicode character (the "idea") to one or more GIDs (the "ink").

Physical Level: character_widths should be keyed by GID (int), not by the character string.

Why this solves both problems

| Goal | How your "Ideal" logic handles it |
| --- | --- |
| Appearance Streams | You use a shaper (or logic) to pick the correct GID from the list. Since your widths are keyed by GID, you get the exact pixel-perfect spacing for that specific shape. |
| Text Extraction | You use the /ToUnicode map (which you build by reversing your GID map). When the viewer sees a GID, it looks up which "Idea" (Character) it belongs to. |

Non-CID Fonts (The Fallback)
For "Simple" fonts (Type 1 or standard TrueType with WinAnsi), the character_map is effectively a 1-to-1 dictionary. In this case, your list[int] just happens to have a length of 1. By treating everything as a list of potential GIDs, your code becomes universal.

The "Key" Realization
You said: "character_widths should be keyed by the values of character_map."

Bingo. If character_map values are GIDs (integers), and character_widths keys are GIDs (integers), you have a bulletproof system.

If the user enters Arabic Meem, your logic knows that for this specific instance, you are using GID 502.

You look up character_widths[502].

The appearance stream stays perfect.

The /ToUnicode map still tells the PDF viewer that GID 502 = "Meem", so copy-paste still works.

The "pypdf" Disconnect
The reason the pypdf code you shared looks "incorrect" to you is that it tries to force the GID back into a Character key.

It does character_widths[ord_map[cid]] = width.

If ord_map[501], 502, and 503 all point to the same character, they keep overwriting each other in that dictionary.

Result: pypdf loses the unique widths of the contextual shapes.
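The overwriting described above is easy to reproduce in isolation. The snippet below is a toy illustration: the widths are made-up numbers, and `ord_map` only borrows the name used in the text.

```python
# Three contextual shapes of Arabic Meem, all mapping back to one character.
ord_map = {501: "\u0645", 502: "\u0645", 503: "\u0645"}
# Made-up advance widths per glyph, for illustration only.
cid_widths = {501: 520, 502: 610, 503: 480}

# Keyed by character: every assignment replaces the previous one.
widths_by_char = {}
for cid, width in cid_widths.items():
    widths_by_char[ord_map[cid]] = width

# Keyed by glyph: all three widths survive.
widths_by_gid = dict(cid_widths)
```

After the loop, `widths_by_char` holds a single entry with the last width written, while `widths_by_gid` keeps one entry per shape.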

Your Path Forward
To implement this "Slowly but Completely," you should modify your Font class to store:

unicode_to_gids: dict[str, list[int]]

widths: dict[int, int] (Keyed by GID)
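A minimal sketch of what these two tables could look like together, including the reversed map used for /ToUnicode. This is not the Font class from pypdf: `GlyphTables` and its methods are hypothetical names, and the glyph IDs and widths are invented.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable, List


@dataclass
class GlyphTables:
    """Hypothetical container: characters map to GIDs, widths are keyed by GID."""

    unicode_to_gids: Dict[str, List[int]] = field(default_factory=dict)
    widths: Dict[int, int] = field(default_factory=dict)

    def add_glyph(self, char: str, gid: int, width: int) -> None:
        gids = self.unicode_to_gids.setdefault(char, [])
        if gid not in gids:
            gids.append(gid)
        self.widths[gid] = width

    def to_unicode(self) -> Dict[int, str]:
        """Reverse the map: each GID points back to its character."""
        return {
            gid: char
            for char, gids in self.unicode_to_gids.items()
            for gid in gids
        }

    def text_width(self, gids: Iterable[int]) -> int:
        """Sum the widths of already-shaped glyphs; unknown GIDs count as 0."""
        return sum(self.widths.get(gid, 0) for gid in gids)


tables = GlyphTables()
tables.add_glyph("\u0645", 501, 520)  # Meem, one contextual shape
tables.add_glyph("\u0645", 502, 610)  # Meem, another contextual shape
tables.add_glyph("A", 36, 600)
reverse_map = tables.to_unicode()
total_width = tables.text_width([502, 36])
```

Choosing which GID to use for a character in context (shaping) is deliberately left out; the sketch only shows that widths and the reverse lookup stay unambiguous once everything is keyed by GID.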

stefan6419846 · 2026-02-25T10:33:01Z

I am clearly no font expert either, thus having someone who really likes to dig into the oddities here is highly appreciated. From a pypdf point of view, having this would be great, but for now, we (luckily) do not expose this functionality as public API, thus preventing us from having to consider backwards-compatibility as well.

PJBrs marked this pull request as draft February 19, 2026 16:43

PJBrs force-pushed the fontwork branch 2 times, most recently from 175a542 to e43c57d Compare February 21, 2026 13:45

PJBrs marked this pull request as ready for review February 21, 2026 14:54

PJBrs force-pushed the fontwork branch from 5c88abc to 82ebd03 Compare February 21, 2026 14:58

PJBrs force-pushed the fontwork branch from 5147bb6 to 160d8d5 Compare February 21, 2026 15:37

Extract the /FontFile and store it in the new FileDescriptor object

a04e579

PJBrs force-pushed the fontwork branch from 160d8d5 to cf9b10e Compare February 21, 2026 15:50

PJBrs added 4 commits February 22, 2026 11:18

ENH: Font: Enable initialisation from TrueType font file

5ef6898

MAINT: Font: Refactor space width calculation

f4bdfcc

ENH: Font: Enable generating a CID font resource

d67f393

ENH: AppearanceStream: Generate new font resource for unicode

b151e8f

This enables generating a new unicode font resource in case of text widget values that cannot be encoded with existing font resources.

PJBrs marked this pull request as draft February 22, 2026 10:34

PJBrs force-pushed the fontwork branch from 22d3b84 to 6654750 Compare February 22, 2026 11:13

PJBrs added 5 commits February 22, 2026 12:16

ENH: FontDescriptor: Add method to produce PDF resource

c813d10

This patch adds a method to produce a pdf font descriptor resource. For now, we assume that an embedded font file will be a TrueType font.

ENH: Font: Add our own font descriptor resource

714bfa9

ENH: PdfWriter: Make font descriptors indirect when filling forms

1172dc8

ENH: PdfWriter: Test adding unicode font resource for form filling

f1dbe29

ENH: Test writer: Test for unavailable unicode characters

663e7b3

PJBrs force-pushed the fontwork branch from 6654750 to 5b3cd93 Compare February 22, 2026 11:17

ENH: Test font: Simple check for font descriptor as resource

cbc9ee4

PJBrs force-pushed the fontwork branch from 5b3cd93 to cbc9ee4 Compare February 22, 2026 11:26

PJBrs marked this pull request as ready for review February 22, 2026 11:40

PJBrs marked this pull request as draft February 22, 2026 19:07

PJBrs wants to merge 11 commits into py-pdf:main from PJBrs:fontwork
