wrong width for U+00AD #8
Comments
In case it is helpful, the draft table of character widths that we are currently planning to use can be found in this CharWidths.txt gist (each line of which is |
Interesting, which terminal on OSX are you using to test the "character cells consumed when printed"? I too am using OSX, and on iTerm2 it displays as "a-b", consuming 3 cells, so wcwidth would be correct here. I would need to see evidence of it not advancing the cursor by a cell when printed on at least some terminal emulators, and file bugs for the others.
Just to be very clear, the purpose of wcwidth is "printable width on a terminal", not Firefox or anything else (where such a character is hidden).
Also, I don't necessarily trust the OS-provided 'wcwidth'; they are typically based on very old (5-10 years old) Unicode specifications. I have a program I've tested on OSX and Linux. The two are wildly different, and in each case my version was correct: https://github.com/jquast/wcwidth/blob/master/bin/wcwidth-libc-comparator.py
The combining and wide character tables are programmatically updated by "python setup.py update", which is similar to your https://github.com/JuliaLang/utf8proc/pull/27/files#diff-3832b9cfe2fc10d35ac5c63d9b7b8133R20
There are no Unicode specification reference tables for 0-width characters that I know of, so it's just hardcoded here: https://github.com/jquast/wcwidth/blob/master/wcwidth/wcwidth.py#L161-171 |
Using the 'Cf' category listings on iTerm2, it appears the following all consume 1 character cell, some with symbols, some simply with blanks:
And the following consume 0 cells:
which may indeed need to be supported by wcwidth once I test a few more terminals. |
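For readers who want to reproduce a 'Cf' listing like the one referred to above, here is a minimal sketch using only Python's standard `unicodedata` module. The helper name `format_characters` is made up for this example, and the result depends on the Unicode version bundled with the interpreter, so it may not match the listing used in the comment above.

```python
import unicodedata


def format_characters(limit=0x10000):
    """Yield (code point, name) for every category-Cf character below `limit`."""
    for cp in range(limit):
        ch = chr(cp)
        if unicodedata.category(ch) == "Cf":
            # Some code points have no name in the database; use a placeholder.
            yield cp, unicodedata.name(ch, "<unnamed>")


# Print the Basic Multilingual Plane format characters, one per line.
for cp, name in format_characters():
    print(f"U+{cp:04X} {name}")
```

Printing each of these between two visible characters in a terminal is one way to see which ones occupy a cell there.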
(We don't trust the system-provided
There is an interesting article on the soft hyphen, which apparently has had a controversial history, and is rendered in different ways depending on the font and the rendering system. I'm not sure what the right answer is here, but the Unicode standard seems to somewhat favor the viewpoint that it should be invisible, although it leaves it up to the implementation. However, the article mentions that the Unicode FAQ does say "In a terminal emulation environment, particularly in ISO-8859-1 contexts, one could display the soft hyphen as a hyphen in all circumstances", and maybe that is what is done in practice. cc: @jiahao and @StefanKarpinski. |
Note that the Arabic characters U+0601 etc. are defined by the Unicode standard as exceptions to the usual rule that Cf characters are invisible:
In contrast, e.g. U+200E is a left-to-right mark, and in my understanding is defined as an invisible formatting character that controls the direction of the text. Some terminals may give it a nonzero width (although the MacOS Terminal with the default font gives it zero width on my machine), but that seems like a bug in the terminal (or the font); it seems like it is better to return what the Unicode standard says rather than propagating a particular buggy implementation. |
I remember that article about the soft hyphen. Under "Modern Unicode semantics" it references UAX 14 for Unicode 7.0.0, §5.4, which says:
The description in the following paragraphs suggests that the rendering of a soft hyphen is accomplished not by printing the soft hyphen itself, but rather by inserting an additional, printable hyphen glyph:
Based on this description it would seem that the character U+00AD by itself is nonprintable and should have a width of 0 or -1. |
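To illustrate the reading above (the soft hyphen itself is never drawn; a visible hyphen is inserted only where a line is actually broken), here is a small sketch. It is not taken from UAX 14 or from any library in this thread: the function name `break_at_soft_hyphens` is hypothetical, and it assumes every other character occupies exactly one cell.

```python
SHY = "\u00ad"  # U+00AD SOFT HYPHEN


def break_at_soft_hyphens(word, width):
    """Split `word` into lines of at most `width` cells, breaking only at SHY.

    The SHY is dropped everywhere; a visible "-" is added only at a break.
    Assumes one cell per character and that each fragment fits in `width`.
    """
    parts = word.split(SHY)
    lines = []
    current = parts[0]
    for i, part in enumerate(parts[1:], start=1):
        is_last = i == len(parts) - 1
        # Keep one cell free for a hyphen unless nothing can follow this part.
        reserve = 0 if is_last else 1
        if len(current) + len(part) + reserve <= width:
            current += part  # no break here: the SHY simply vanishes
        else:
            lines.append(current + "-")  # break here: emit a real hyphen
            current = part
    lines.append(current)
    return lines


print(break_at_soft_hyphens("xxx\u00adyyy", 10))
print(break_at_soft_hyphens("xxx\u00adyyy", 4))
```

Under this model the character contributes 0 cells mid-line and 1 cell at a break, which is why a single fixed width is hard to assign.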
Interestingly, the Unicode FAQ entry that the SHY article quoted seems to no longer exist. From that passage in the Unicode 7.0.0 standard that @jiahao quoted, it seems like the Unicode consortium decided to put its foot down and declare that the soft hyphen is definitely invisible, ISO 8859-1 be damned. |
I really appreciate all of the research, @stevengj and @jiahao. My decision is to use the common denominator across the most popular terminal emulators. I've made a checklist:
Then, test the following and report:
I'm not sure how to gauge the "popular terminal emulators", this is just from memory. sidenote: More importantly, how to factor their weight in wcwidth for any given Finally, we can make a PR and release any update. |
@jquast thanks for your detailed consideration. As you had stated above, iTerm seems to have different needs from us at this point. |
However, I don't think it is possible to provide consistency across terminal environments without also considering the interactions with the choice of users' fonts. Many fonts simply have wrong advance widths for some code points. Here is a simple rendering test for the fixed width fonts on my system. Consider
should render with the hat combining character on the omega.
should render with a hat to the left of omega. |
You are correct, but terminal emulators don't typically care, they're the ones who handle the width of "printable cells" -- What is your system, is it a terminal emulator? |
The screenshots I pasted were taken from an IPython notebook rendering test HTML using those fonts. I can see the same spacing issues if I manually change the font in OSX Terminal and generate these characters in the Julia console REPL. |
Version wcwidth 0.1.5, which includes better combining character width determination by PR #11, is available on PyPI. A terminal sequence may be emitted to elicit the terminal emulator to respond with its cursor position. This can be used to manually display all questionable characters across different popular font face profiles and terminal emulators, and to programmatically determine whether they consider such characters 0 width, making a report of the most common discrepancies, weighing on the side of "most correct", and resolving any. |
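The cursor-position technique described above could look roughly like the following sketch. It uses the standard Device Status Report request (`ESC [ 6 n`), to which terminals reply with `ESC [ row ; col R`. The function names are invented for this example, it is Unix-only (`termios`), and it is not the code used by wcwidth or any tool mentioned in this thread.

```python
import os
import re
import sys
import termios
import tty

# Reply format of a Cursor Position Report: ESC [ <row> ; <col> R
_CPR = re.compile(rb"\x1b\[(\d+);(\d+)R")


def parse_cpr(reply):
    """Return (row, col) from a raw Cursor Position Report byte string."""
    match = _CPR.search(reply)
    if match is None:
        raise ValueError(f"not a cursor position report: {reply!r}")
    return int(match.group(1)), int(match.group(2))


def cursor_column():
    """Ask the terminal where the cursor is; requires an interactive tty."""
    fd = sys.stdin.fileno()
    saved = termios.tcgetattr(fd)
    try:
        tty.setcbreak(fd)  # read the reply without waiting for Enter
        sys.stdout.write("\x1b[6n")
        sys.stdout.flush()
        reply = b""
        while not reply.endswith(b"R"):
            chunk = os.read(fd, 1)
            if not chunk:  # EOF: give up and let parse_cpr report it
                break
            reply += chunk
    finally:
        termios.tcsetattr(fd, termios.TCSADRAIN, saved)
    return parse_cpr(reply)[1]


def measured_width(text):
    """Cells the terminal advanced while printing `text` from column 1."""
    sys.stdout.write("\r")
    sys.stdout.flush()
    start = cursor_column()
    sys.stdout.write(text)
    sys.stdout.flush()
    end = cursor_column()
    sys.stdout.write("\r\x1b[K")  # erase the probe text
    sys.stdout.flush()
    return end - start
```

Run interactively, `measured_width("a\u00adb")` would return 3 on a terminal that gives the soft hyphen a cell and 2 on one that does not. The sketch ignores line wrapping, so the probe text should be short.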
Major
-----
Bugfix zero-width characters, closes #57, #47, #45, #39, #26, #25, #24, #22, #8, wow!

This is mostly achieved by replacing `ZERO_WIDTH_CF` with dynamic parsing by Category codes in bin/update-tables.py and putting those in the zero-wide tables.

Tests
-----
- `verify-table-integrity.py` exercises a "bug" of duplicated tables that has no effect, because wcswidth() first checks for zero-width, and that is preferred in cases of conflict. This PR also resolves that error of duplication.
- new automatic tests for balinese, kr jamo, zero-width emoji, devanagari, tamil, kannada.
- added pytest-benchmark plugin, example use:

      # baseline
      tox -epy312 -- --verbose --benchmark-save=original
      # compare
      tox -epy312 -- --verbose --benchmark-compare=.benchmarks/Linux-CPython-3.12-64bit/0001_original.json
This is closed by #91
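As a rough picture of what "dynamic parsing by Category codes" into a zero-width table can mean, here is a sketch built on the standard `unicodedata` module. It is not the code from bin/update-tables.py: the helper names are hypothetical, and the category set (`Mn`, `Me`, `Cf`) is an assumption chosen for illustration.

```python
import sys
import unicodedata


def category_ranges(categories):
    """Collapse all code points in `categories` into sorted (first, last) ranges."""
    ranges = []
    start = prev = None
    for cp in range(sys.maxunicode + 1):
        if unicodedata.category(chr(cp)) in categories:
            if start is None:
                start = cp
            prev = cp
        elif start is not None:
            ranges.append((start, prev))
            start = None
    if start is not None:
        ranges.append((start, prev))
    return ranges


def in_table(cp, table):
    """Binary-search a sorted, non-overlapping range table for `cp`."""
    lo, hi = 0, len(table) - 1
    while lo <= hi:
        mid = (lo + hi) // 2
        if cp > table[mid][1]:
            lo = mid + 1
        elif cp < table[mid][0]:
            hi = mid - 1
        else:
            return True
    return False


# Illustrative category set only; the real table generator may differ.
ZERO_WIDTH = category_ranges({"Mn", "Me", "Cf"})
print(in_table(0x00AD, ZERO_WIDTH), in_table(ord("a"), ZERO_WIDTH))
```

Note that a purely category-driven table puts U+00AD (category Cf) in the zero-width set, which is exactly the case the rest of this thread debates.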
|
For reference, in glibc Judging by this discussion: https://sourceware.org/bugzilla/show_bug.cgi?id=22073 which concluded that it should be 1. That discussion took place in 2017 - after the main discussion in this issue, but before the last #8 (comment) here. Also, in musl-libc, |
It's a bit ambiguous isn't it? From https://codepoints.net/U+00AD,
I will add a test to ucs-detect and whichever measured width (0 or 1) that is used among the most popular and compliant terminals will be used in this library. |
For what it's worth, the musl-libc maintainer, @richfelker said on IRC that he thinks it should be 1 because historically it was 1 (in most/all implementations?), and, quoting, "(dalias) unless there's widespread agreement between terminals and wcwidth implementations, all you get by changing it is screen corruption". Additionally, it was not discussed on the musl mailing lists, possibly because that was acceptable (or no one noticed or cared?). Additionally, he noted that if anything, it should have probably been -1 and not 0, because if applied, then it affects formatting, not unlike carriage-return or newline or form-feed etc. And finally, he mentions that "it's widely unused anyway", which is probably true, hence probably not too important overall, though agreement between wcwidth implementations would still be nice. |
Thanks for relaying @richfelker's thoughts. I'm in full agreement with all of them, especially on -1, as this kind of character is meant to be managed by the terminal emulator and its width is indeterminate (like \n, \t, etc.). But if the most popular terminal emulators measure it as a width of 1, then I'd like to match …
Jeff Quast (by email, in reply to avih's comment of Wed, Mar 13, 2024)
|
Right. I would guess that terminals measure its width according to the And so ultimately, I would think the goal should be agreement between wcwidth implementations, rather than between this implementation and the behavior of popular terminal emulators? |
Ultimately, the utf8proc library decided to also report a width of |
Well, that was not a good argument, and I would agree that if this was the only or main wcwidth implementation, then it should try to match the common terminal emulators' behavior. But because this is one of several wcwidth implementations, its goal should be to agree with other wcwidth implementations rather than the terminals. That being said, it would still be nice to know how terminals handle it. In which case, the test should be dual:
I would guess that most terminals don't handle it dually like the Unicode semantics suggests (and would imply a -1 wcwidth value), hence they probably treat it as always 1 or always 0, though that's a guess. |
So, I tested it in the following terminals on Alpine Linux 3.19.1, and all the tested terminal emulators treat it as either a hard 0 or a hard 1. I.e. no terminal handles it dually, as 0 in the middle of the line and as hyphen+wordbreak in a word which spills over the end of the line.
Specifically, I tested using this script, and observed the result on-screen (not automated).
the SHY byte is always at this word
EDITED: THIS SCRIPT IS BROKEN AND THE RESULTS ARE INVALID. See fixed script at the next post.
test-shy.sh (broken)
#!/bin/sh
dots() {
R=
while [ ${#R} -lt $1 ]; do R=$R.; done
echo "$R"
}
has() { command -v "$1" >/dev/null; }
nth() { shift $1; printf %s\\n "$1"; }
cols() {
if [ "${COLUMNS-}" ]; then echo $COLUMNS
elif has stty; then nth 2 $(stty size)
elif has ttysize; then nth 1 $(ttysize)
else echo 80; fi
}
cols=$(cols)
printf "$(dots $cols)\n\n"
printf "SHY mid line: aaa xxx\255yyy bbb\n\n"
printf "no SHY: $(dots $((cols - 16))) aaa xxxyyy bbb\n\n"
printf "SHY before last column: $(dots $((cols - 34))) aaa xxx\255yyy bbb\n\n"
printf "SHY at the last column: $(dots $((cols - 33))) aaa xxx\255yyy bbb\n\n"
All the terminals were invoked with a UTF-8 locale, e.g.: LC_ALL=en_US.UTF-8 xterm
Results:
urxvt: similar to xterm etc. above, but always displays it as a hyphen, as if wcwidth(0xad) == 1.
alacritty 0.12.3 and kitty 0.31.0: seem to ignore it at the input, as if wcwidth(0xad) == 0.
So while 1 is common, I don't think it's black and white. So I would think the goal should be to match other wcwidth implementations, where the value appears to be 1 at least in glibc, musl, and utf8proc. |
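A side note on the script marked as broken above: the thread as captured here does not say what was wrong with it. One candidate, offered purely as an assumption, is that `\255` emits the single byte 0xAD, which is the soft hyphen in ISO 8859-1 but not a valid sequence on its own in a UTF-8 locale. The encodings can be checked with a few lines of Python:

```python
SHY = "\u00ad"  # U+00AD SOFT HYPHEN

# In UTF-8 the soft hyphen is two bytes, 0xC2 0xAD.
utf8_bytes = SHY.encode("utf-8")
# The same bytes written as printf-style octal escapes.
octal_escapes = "".join(f"\\{byte:03o}" for byte in utf8_bytes)

# A lone 0xAD byte (octal 255) is Latin-1 for SHY, but invalid as UTF-8.
try:
    b"\xad".decode("utf-8")
    lone_byte_is_valid_utf8 = True
except UnicodeDecodeError:
    lone_byte_is_valid_utf8 = False

print(utf8_bytes, octal_escapes, lone_byte_is_valid_utf8)
```

If that is indeed the issue, a UTF-8 test would need `\302\255` where the script writes `\255`.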
Here's a summary of the U+00AD SOFT HYPHEN behavior:
Therefore I think it should be added/restored as an overriding exception - return 1 for
Original comment by Markus Kuhn from the linked file:
/* The following two functions define the column width of an ISO 10646
* character as follows:
*
* - The null character (U+0000) has a column width of 0.
*
* - Other C0/C1 control characters and DEL will lead to a return
* value of -1.
*
* - Non-spacing and enclosing combining characters (general
* category code Mn or Me in the Unicode database) have a
* column width of 0.
*
* - SOFT HYPHEN (U+00AD) has a column width of 1.
*
* - Other format characters (general category code Cf in the Unicode
* database) and ZERO WIDTH SPACE (U+200B) have a column width of 0.
*
* - Hangul Jamo medial vowels and final consonants (U+1160-U+11FF)
* have a column width of 0.
*
* - Spacing characters in the East Asian Wide (W) or East Asian
* Full-width (F) category as defined in Unicode Technical
* Report #11 have a column width of 2.
*
* - All remaining characters (including all printable
* ISO 8859-1 and WGL4 characters, Unicode control characters,
* etc.) have a column width of 1.
... |
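The rules in Markus Kuhn's comment above translate fairly directly into code. The following is only a sketch of those rules using Python's standard `unicodedata` module; it is not the implementation from his file, from wcwidth, or from utf8proc. The name `kuhn_width` is made up, and results depend on the Unicode version shipped with the interpreter.

```python
import unicodedata


def kuhn_width(ch):
    """Column width of one character, following the rules quoted above."""
    cp = ord(ch)
    if cp == 0:
        return 0  # null character
    if cp < 0x20 or 0x7F <= cp < 0xA0:
        return -1  # other C0/C1 controls and DEL
    if cp == 0x00AD:
        return 1  # SOFT HYPHEN: the explicit exception among Cf characters
    if unicodedata.category(ch) in ("Mn", "Me", "Cf") or cp == 0x200B:
        return 0  # combining marks, format characters, ZERO WIDTH SPACE
    if 0x1160 <= cp <= 0x11FF:
        return 0  # Hangul Jamo medial vowels and final consonants
    if unicodedata.east_asian_width(ch) in ("W", "F"):
        return 2  # East Asian Wide / Fullwidth
    return 1  # everything else


for sample in ("a", "\u00ad", "\u200e", "\u0301", "\u4e2d", "\n"):
    print(f"U+{ord(sample):04X} -> {kuhn_width(sample)}")
```

The order of the checks matters: the U+00AD test has to come before the general Cf test, otherwise the soft hyphen would fall through to width 0.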
utf8proc now returns 1 as well (JuliaStrings/utf8proc#135). |
Hi, I was looking at your `wcwidth` library for comparison, since in the utf8proc library we are also implementing a similar feature (see JuliaStrings/utf8proc#2). The first disagreement that I came across between your implementation and ours was for U+00AD (soft hyphen), where you seem to give `1` and we give zero (a soft hyphen is used for line breaking, but is ordinarily not printed). In general, we return 0 for most characters in category Cf (formatting control characters). The `wcwidth` function on MacOS 10.10.2 also returns `-1` (not printable) for this code point.
Am I calling your implementation incorrectly? This is for git `master` of wcwidth.