36 changes: 23 additions & 13 deletions AGENTS.md
Original file line number Diff line number Diff line change
@@ -1,21 +1,31 @@
The goals and overview of this package can be found in the README.md file,
start by reading that.

When troubleshooting, write Go tests instead of executable
Go debugging scripts. The tests can return whatever logs or output
you need. If those tests are only for temporary troubleshooting,
clean them up after the debugging is done.
The goal of this package is to determine the display (column) width of a
string, UTF-8 bytes, or runes, as would happen in a monospace font, especially
in a terminal.

Separate executable debugging scripts are messy, tend to have conflicting
dependencies and are hard to clean up.
When troubleshooting, write Go unit tests instead of executing debug scripts.
The tests can return whatever logs or output you need. If those tests are
only for temporary troubleshooting, clean up the tests after the debugging is
done.

If you make changes to the trie generation code, it can be invoked by running
`go generate` from the top package directory.
(Separate executable debugging scripts are messy, tend to have conflicting
dependencies, and are hard to clean up.)
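
As a rough illustration of the advice above, a throwaway troubleshooting test might look like the sketch below. The names `describeRunes` and `TestDebugRunes` are hypothetical and not part of this package; in practice the file would live in the package under test (for example as `debug_test.go`) and use that package's name instead of `main`.

```go
package main

import (
	"fmt"
	"testing"
	"unicode"
)

// describeRunes is a hypothetical helper: for each rune it reports the code
// point and whether it is a nonspacing mark (Mn) or a format character (Cf).
func describeRunes(s string) []string {
	var out []string
	for _, r := range s {
		out = append(out, fmt.Sprintf("U+%04X Mn=%t Cf=%t",
			r, unicode.Is(unicode.Mn, r), unicode.Is(unicode.Cf, r)))
	}
	return out
}

// TestDebugRunes is a temporary test. It only logs; delete it once the
// investigation is done.
func TestDebugRunes(t *testing.T) {
	for _, line := range describeRunes("e\u0301\u200d") {
		t.Log(line)
	}
}
```

Running `go test -run TestDebugRunes -v` prints the logged lines without adding a separate executable or extra dependencies.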

We have hard-coded some exceptions to achieve compatibility with go-runewidth.
We consider them technical debt. One example is isExceptionalCombiningMark.
Ideally, we would not have these exceptional cases. Our current theory in the
case of isExceptionalCombiningMark is that go-runewidth is incorrect, but we
don't know for sure.
If you change the trie generation code in internal/gen, regenerate the trie
by running `go generate` from the top package directory.

## Pull Requests

For PRs, you can use the gh CLI tool to retrieve or post comments.

## Comparisons to go-runewidth

We originally attempted to make this package compatible with go-runewidth.
However, we found that there were too many differences in the handling of
certain characters and properties.

We believe, preliminarily, that our choices are more correct and complete
because they rely on full Unicode general categories, such as Cf (Format) for
zero-width characters and Mn (Nonspacing_Mark) for combining marks.
58 changes: 36 additions & 22 deletions README.md
@@ -71,28 +71,42 @@ goos: darwin
goarch: arm64
pkg: github.com/clipperhouse/displaywidth
cpu: Apple M2
BenchmarkStringDefault/displaywidth-8 96490 10552 ns/op 159.88 MB/s 0 B/op 0 allocs/op
BenchmarkStringDefault/go-runewidth-8 83907 14369 ns/op 117.41 MB/s 0 B/op 0 allocs/op
BenchmarkString_EAW/displaywidth-8 112807 10646 ns/op 158.46 MB/s 0 B/op 0 allocs/op
BenchmarkString_EAW/go-runewidth-8 50692 23977 ns/op 70.36 MB/s 0 B/op 0 allocs/op
BenchmarkString_StrictEmoji/displaywidth-8 113710 10601 ns/op 159.14 MB/s 0 B/op 0 allocs/op
BenchmarkString_StrictEmoji/go-runewidth-8 83220 14403 ns/op 117.13 MB/s 0 B/op 0 allocs/op
BenchmarkString_ASCII/displaywidth-8 1000000 1077 ns/op 118.83 MB/s 0 B/op 0 allocs/op
BenchmarkString_ASCII/go-runewidth-8 1000000 1173 ns/op 109.13 MB/s 0 B/op 0 allocs/op
BenchmarkString_Unicode/displaywidth-8 1367460 881.1 ns/op 150.94 MB/s 0 B/op 0 allocs/op
BenchmarkString_Unicode/go-runewidth-8 840982 1437 ns/op 92.57 MB/s 0 B/op 0 allocs/op
BenchmarkStringWidth_Emoji/displaywidth-8 403082 3022 ns/op 239.56 MB/s 0 B/op 0 allocs/op
BenchmarkStringWidth_Emoji/go-runewidth-8 247605 4821 ns/op 150.18 MB/s 0 B/op 0 allocs/op
BenchmarkString_Mixed/displaywidth-8 303606 3956 ns/op 128.17 MB/s 0 B/op 0 allocs/op
BenchmarkString_Mixed/go-runewidth-8 256921 4639 ns/op 109.30 MB/s 0 B/op 0 allocs/op
BenchmarkString_ControlChars/displaywidth-8 3795948 315.2 ns/op 104.70 MB/s 0 B/op 0 allocs/op
BenchmarkString_ControlChars/go-runewidth-8 3273128 364.7 ns/op 90.48 MB/s 0 B/op 0 allocs/op
BenchmarkRuneDefault/displaywidth-8 3772311 318.1 ns/op 433.82 MB/s 0 B/op 0 allocs/op
BenchmarkRuneDefault/go-runewidth-8 1753222 684.4 ns/op 201.63 MB/s 0 B/op 0 allocs/op
BenchmarkRuneWidth_EAW/displaywidth-8 8469133 142.6 ns/op 385.75 MB/s 0 B/op 0 allocs/op
BenchmarkRuneWidth_EAW/go-runewidth-8 2383420 502.9 ns/op 109.37 MB/s 0 B/op 0 allocs/op
BenchmarkRuneWidth_ASCII/displaywidth-8 19660138 62.01 ns/op 467.63 MB/s 0 B/op 0 allocs/op
BenchmarkRuneWidth_ASCII/go-runewidth-8 17664040 67.34 ns/op 430.68 MB/s 0 B/op 0 allocs/op
BenchmarkStringDefault/displaywidth-8 10537 ns/op 160.10 MB/s 0 B/op 0 allocs/op
BenchmarkStringDefault/go-runewidth-8 14162 ns/op 119.12 MB/s 0 B/op 0 allocs/op
BenchmarkString_EAW/displaywidth-8 10776 ns/op 156.55 MB/s 0 B/op 0 allocs/op
BenchmarkString_EAW/go-runewidth-8 23987 ns/op 70.33 MB/s 0 B/op 0 allocs/op
BenchmarkString_StrictEmoji/displaywidth-8 10892 ns/op 154.88 MB/s 0 B/op 0 allocs/op
BenchmarkString_StrictEmoji/go-runewidth-8 14552 ns/op 115.93 MB/s 0 B/op 0 allocs/op
BenchmarkString_ASCII/displaywidth-8 1116 ns/op 114.72 MB/s 0 B/op 0 allocs/op
BenchmarkString_ASCII/go-runewidth-8 1178 ns/op 108.67 MB/s 0 B/op 0 allocs/op
BenchmarkString_Unicode/displaywidth-8 896.9 ns/op 148.29 MB/s 0 B/op 0 allocs/op
BenchmarkString_Unicode/go-runewidth-8 1434 ns/op 92.72 MB/s 0 B/op 0 allocs/op
BenchmarkStringWidth_Emoji/displaywidth-8 3033 ns/op 238.74 MB/s 0 B/op 0 allocs/op
BenchmarkStringWidth_Emoji/go-runewidth-8 4841 ns/op 149.56 MB/s 0 B/op 0 allocs/op
BenchmarkString_Mixed/displaywidth-8 4064 ns/op 124.74 MB/s 0 B/op 0 allocs/op
BenchmarkString_Mixed/go-runewidth-8 4696 ns/op 107.97 MB/s 0 B/op 0 allocs/op
BenchmarkString_ControlChars/displaywidth-8 320.6 ns/op 102.93 MB/s 0 B/op 0 allocs/op
BenchmarkString_ControlChars/go-runewidth-8 373.8 ns/op 88.28 MB/s 0 B/op 0 allocs/op
BenchmarkRuneDefault/displaywidth-8 335.5 ns/op 411.35 MB/s 0 B/op 0 allocs/op
BenchmarkRuneDefault/go-runewidth-8 681.2 ns/op 202.58 MB/s 0 B/op 0 allocs/op
BenchmarkRuneWidth_EAW/displaywidth-8 146.7 ns/op 374.80 MB/s 0 B/op 0 allocs/op
BenchmarkRuneWidth_EAW/go-runewidth-8 495.6 ns/op 110.98 MB/s 0 B/op 0 allocs/op
BenchmarkRuneWidth_ASCII/displaywidth-8 63.00 ns/op 460.33 MB/s 0 B/op 0 allocs/op
BenchmarkRuneWidth_ASCII/go-runewidth-8 68.90 ns/op 420.91 MB/s 0 B/op 0 allocs/op
```

I use a similar technique in [this grapheme cluster library](https://github.com/clipperhouse/uax29).

## Compatibility

`displaywidth` will mostly give the same outputs as `go-runewidth`, but there are some differences:

- Unicode category Mn (Nonspacing Mark): `displaywidth` will return width 0, `go-runewidth` may return width 1 for some runes.
- Unicode category Cf (Format): `displaywidth` will return width 0, `go-runewidth` may return width 1 for some runes.
- Unicode category Mc (Spacing Mark): `displaywidth` will return width 1, `go-runewidth` may return width 0 for some runes.
- Unicode category Cs (Surrogate): `displaywidth` will return width 0, `go-runewidth` may return width 1 for some runes. Surrogates are not valid UTF-8; some packages may turn them into the replacement character (U+FFFD).
- Unicode category Zl (Line separator): `displaywidth` will return width 0, `go-runewidth` may return width 1.
- Unicode category Zp (Paragraph separator): `displaywidth` will return width 0, `go-runewidth` may return width 1.
- Unicode Noncharacters (U+FFFE and U+FFFF): `displaywidth` will return width 0, `go-runewidth` may return width 1.

See `TestCompatibility` for more details.
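
The category rules listed above can be sketched with the standard `unicode` package. This is only an illustration of the list, not the `displaywidth` implementation (which uses a generated trie); the function name `sketchWidth` is hypothetical.

```go
package main

import "unicode"

// sketchWidth is a hypothetical illustration of the category rules in the
// Compatibility list. The second result is false when none of the listed
// categories applies, in which case the width is decided by other rules.
func sketchWidth(r rune) (width int, ok bool) {
	switch {
	case r == 0xFFFE || r == 0xFFFF:
		// Noncharacters named in the list.
		return 0, true
	case unicode.In(r, unicode.Mn, unicode.Cf, unicode.Cs, unicode.Zl, unicode.Zp):
		// Nonspacing marks, format characters, surrogates, and the
		// line and paragraph separators are treated as zero width.
		return 0, true
	case unicode.Is(unicode.Mc, r):
		// Spacing marks occupy a column.
		return 1, true
	}
	return 0, false
}
```

For example, `sketchWidth('\u0301')` (combining acute accent, Mn) yields width 0, while `sketchWidth('\u0903')` (Devanagari sign visarga, Mc) yields width 1.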
43 changes: 28 additions & 15 deletions internal/gen/trie.go
@@ -18,7 +18,7 @@ func GenerateTrie(data *UnicodeData) (*triegen.Trie, error) {

// Insert all characters with non-default properties
inserted := 0
for r := rune(0); r <= 0x10FFFF; r++ {
for r := rune(0); r <= unicode.MaxRune; r++ {
// Skip surrogate characters (U+D800-U+DFFF) and other invalid characters
if r >= 0xD800 && r <= 0xDFFF {
continue
@@ -51,10 +51,10 @@ func WriteTrieGo(trie *triegen.Trie, outputPath string) error {
fmt.Fprintf(buf, "package displaywidth\n\n")
fmt.Fprintf(buf, "import \"github.com/clipperhouse/displaywidth/internal/stringish\"\n\n")

// Write character properties definitions
writeCharProperties(buf)
// Write property definitions
writeProperties(buf)

// Generate the trie using triegen
// Generate the trie using triegen (it will use uint8/uint16/etc directly)
size, err := trie.Gen(buf)
if err != nil {
return fmt.Errorf("failed to generate trie: %v", err)
@@ -79,6 +79,15 @@ func WriteTrieGo(trie *triegen.Trie, outputPath string) error {
genericLookupCallSig := `lookupValue(`
b = bytes.ReplaceAll(b, []byte(lookupCallSig), []byte(genericLookupCallSig))

// Replace uint8 return type in lookup with property and add necessary casts
b = bytes.ReplaceAll(b, []byte(") (v uint8, sz int)"), []byte(") (v property, sz int)"))
b = bytes.ReplaceAll(b, []byte(") uint8 {"), []byte(") property {"))
b = bytes.ReplaceAll(b, []byte("func lookupValue(n uint32, b byte) uint8"), []byte("func lookupValue(n uint32, b byte) property"))
// Cast return values from Values array (uint8) to property
b = bytes.ReplaceAll(b, []byte("return stringWidthValues["), []byte("return property(stringWidthValues["))
b = bytes.ReplaceAll(b, []byte("], 1"), []byte("]), 1"))
b = bytes.ReplaceAll(b, []byte("return uint8(stringWidthValues["), []byte("return property(stringWidthValues["))

formatted, err := format.Source(b)
if err != nil {
return err
@@ -99,19 +108,23 @@ func WriteTrieGo(trie *triegen.Trie, outputPath string) error {
return nil
}
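
The `bytes.ReplaceAll` calls in `WriteTrieGo` post-process generated source as plain text before handing it to `format.Source`. A minimal sketch of that pattern, assuming a much simpler input than the real generated trie, is shown below; `retypeLookup` and the `values` table are hypothetical names used only for illustration.

```go
package main

import (
	"bytes"
	"go/format"
)

// retypeLookup is a hypothetical, simplified version of the post-processing
// step: it rewrites a uint8 return type to the named type `property`, wraps
// the returned table value in a conversion, and gofmt-formats the result.
func retypeLookup(src []byte) ([]byte, error) {
	src = bytes.ReplaceAll(src, []byte(") uint8 {"), []byte(") property {"))
	src = bytes.ReplaceAll(src, []byte("return values[i]"), []byte("return property(values[i])"))
	// format.Source also parses the source, so a replacement that produces
	// syntactically invalid Go is reported here as an error.
	return format.Source(src)
}
```

Textual replacement of this kind depends on the exact strings the generator emits, so a change in the generator's output can silently turn a replacement into a no-op. Checking the result with a test over the generated file is a reasonable safeguard.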

// writeCharProperties writes the character properties definitions
func writeCharProperties(w io.Writer) {
// writeProperties writes the character properties definitions to the buffer.
// It uses PropertyDefinitions from unicode.go as the single source of truth.
func writeProperties(w io.Writer) {
fmt.Fprintf(w, "// property represents the properties of a character as bit flags\n")
fmt.Fprintf(w, "// The underlying type is uint8 since we only use %d bits for flags.\n", len(PropertyDefinitions))
fmt.Fprintf(w, "type property uint8\n\n")
fmt.Fprintf(w, "const (\n")
fmt.Fprintf(w, "\t// East Asian Width properties\n")
fmt.Fprintf(w, "\t_EAW_Fullwidth property = 1 << iota // F\n")
fmt.Fprintf(w, "\t_EAW_Wide // W\n")
fmt.Fprintf(w, "\t_EAW_Ambiguous // A\n\n")
fmt.Fprintf(w, "\t// General categories\n")
fmt.Fprintf(w, "\t_CombiningMark // Mn, Me (Mc excluded for proper width)\n")
fmt.Fprintf(w, "\t_ControlChar // C0, C1, DEL\n")
fmt.Fprintf(w, "\t_ZeroWidth // ZWSP, ZWJ, ZWNJ, etc.\n")
fmt.Fprintf(w, "\t_Emoji // Emoji base characters\n")

for i, prop := range PropertyDefinitions {
constName := "_" + prop.Name

if i == 0 {
fmt.Fprintf(w, "\t%s property = 1 << iota // %s\n", constName, prop.Comment)
} else {
fmt.Fprintf(w, "\t%s // %s\n", constName, prop.Comment)
}
}

fmt.Fprintf(w, ")\n\n")
}
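
To make the loop in `writeProperties` concrete, here is a small sketch that renders a const block from a list of definitions. The `propertyDef` type and the sample entries in the usage note are assumptions; only the `Name` and `Comment` fields are known from the code above.

```go
package main

import (
	"fmt"
	"strings"
)

// propertyDef is a hypothetical stand-in for an entry in PropertyDefinitions.
type propertyDef struct {
	Name    string
	Comment string
}

// renderConsts emits a const block in which the first entry starts the
// `1 << iota` sequence and the remaining entries repeat it implicitly.
func renderConsts(defs []propertyDef) string {
	var sb strings.Builder
	sb.WriteString("const (\n")
	for i, d := range defs {
		if i == 0 {
			fmt.Fprintf(&sb, "\t_%s property = 1 << iota // %s\n", d.Name, d.Comment)
		} else {
			fmt.Fprintf(&sb, "\t_%s // %s\n", d.Name, d.Comment)
		}
	}
	sb.WriteString(")\n")
	return sb.String()
}
```

Calling `renderConsts` with entries named `EAW_Fullwidth` and `EAW_Wide` produces a block whose constants have the values 1 and 2. Because `property` is a `uint8`, a ninth entry would evaluate to 256 and fail to compile, which acts as a built-in guard on the number of flags.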
33 changes: 17 additions & 16 deletions internal/gen/triegen/triegen.go
@@ -34,23 +34,24 @@
// triegen generates both tables and code. The code is optimized to use the
// automatically chosen data types. The following code is generated for a Trie
// or multiple Tries named "foo":
// - type fooTrie
// The trie type.
//
// - func newFooTrie(x int) *fooTrie
// Trie constructor, where x is the index of the trie passed to Gen.
// - type fooTrie
// The trie type.
//
// - func (t *fooTrie) lookup(s []byte) (v uintX, sz int)
// The lookup method, where uintX is automatically chosen.
// - func newFooTrie(x int) *fooTrie
// Trie constructor, where x is the index of the trie passed to Gen.
//
// - func lookupString, lookupUnsafe and lookupStringUnsafe
// Variants of the above.
// - func (t *fooTrie) lookup(s []byte) (v uintX, sz int)
// The lookup method, where uintX is automatically chosen.
//
// - var fooValues and fooIndex and any tables generated by Compacters.
// The core trie data.
// - func lookupString, lookupUnsafe and lookupStringUnsafe
// Variants of the above.
//
// - var fooTrieHandles
// Indexes of starter blocks in case of multiple trie roots.
// - var fooValues and fooIndex and any tables generated by Compacters.
// The core trie data.
//
// - var fooTrieHandles
// Indexes of starter blocks in case of multiple trie roots.
//
// It is recommended that users test the generated trie by checking the returned
// value for every rune. Such exhaustive tests are possible as the number of
@@ -377,13 +378,13 @@ func maxValue(n *node, max uint64) uint64 {
func getIntType(v uint64) (string, int) {
switch {
case v < 1<<8:
return "property", 1
return "uint8", 1
case v < 1<<16:
return "property", 2
return "uint16", 2
case v < 1<<32:
return "property", 4
return "uint32", 4
}
return "property", 8
return "uint64", 8
}
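
For reference, the updated `getIntType` is reproduced below as a standalone snippet so its thresholds are easy to check in isolation.

```go
package main

// getIntType mirrors the updated function above: it returns the name of the
// smallest unsigned integer type that can hold v, and that type's size in
// bytes.
func getIntType(v uint64) (string, int) {
	switch {
	case v < 1<<8:
		return "uint8", 1
	case v < 1<<16:
		return "uint16", 2
	case v < 1<<32:
		return "uint32", 4
	}
	return "uint64", 8
}
```

With this change, `getIntType(255)` returns `"uint8", 1` and `getIntType(256)` returns `"uint16", 2`. As long as every stored value stays below 256, the generated tables use `uint8`, and `WriteTrieGo` then rewrites the lookup signatures to the named `property` type.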

const (