Skip to content
This repository was archived by the owner on Jun 18, 2025. It is now read-only.
This repository was archived by the owner on Jun 18, 2025. It is now read-only.

cmap encoding selection: unicodeEncoding vs. microsoftUCS4Encoding  #44

Open
@xianpingge

Description

@xianpingge

I'm dumping the glyphs from HanaMinB.ttf
( available at

https://osdn.net/frs/redir.php?m=pumath&f=%2Fhanazono-font%2F64385%2Fhanazono-20160201.zip

), where most of the characters are > U+FFFF.

Enclosed please find the output of
ttfdump -t cmap HanaMinB.ttf

According to the ttfdump output, this ttf file contains 4 cmap
subtables, covering the 4 encodings defined in truetype.go:

unicodeEncoding = 0x00000003 // PID = 0 (Unicode), PSID = 3 (Unicode 2.0)
microsoftSymbolEncoding = 0x00030000 // PID = 3 (Microsoft), PSID = 0 (Symbol)
microsoftUCS2Encoding = 0x00030001 // PID = 3 (Microsoft), PSID = 1 (UCS-2)
microsoftUCS4Encoding = 0x0003000a // PID = 3 (Microsoft), PSID = 10 (UCS-4)

And the current code selects the first one (unicodeEncoding):

pidPsid := u32(table, offset)
// We prefer the Unicode cmap encoding. Failing to find that, we fall
// back onto the Microsoft cmap encoding.
if pidPsid == unicodeEncoding {
bestOffset, bestPID, ok = offset, pidPsid>>16, true
break
} else if pidPsid == microsoftSymbolEncoding ||
pidPsid == microsoftUCS2Encoding ||
pidPsid == microsoftUCS4Encoding {
bestOffset, bestPID, ok = offset, pidPsid>>16, true
// We don't break out of the for loop, so that Unicode can override Microsoft.
}

and none of the >U+FFFF characters are available.

Should we prefer microsoftUCS4Encoding to the
16-bit-only unicodeEncoding ?

HanaMinB.ttf-dump-cmap.txt

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions