-
Notifications
You must be signed in to change notification settings - Fork 139
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
WIP: charwidth function #27
Conversation
elseif state==:readdata #Encoding: 65538 -1 2, Width: 1024 | ||
contains(line, "Encoding:") && (codepoint = int(split(line)[3])) | ||
contains(line, "Width:") && (width = int(split(line)[2])) | ||
if codepoint!=nothing && width!=nothing && codepoint >= 0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jiahao, I added the codepoint>=0 check here since the sfd file seems to have some codepoint -1 entries in the beginning.
Maybe return |
@nalimilan, currently I'm returning |
I just noticed the Python-based wcwidth repo on github; it would be good to compare to their behavior. |
@jiahao, I noticed that your algorithm seems to assign a nonzero width to category Mn (non-spacing combining characters). e.g. U+0300 (combining grave accent) is assigned a width of 1. That seems weird to me...shouldn't we assign these a width of zero, since in practice their width is subsumed by the width of the character they are combined with? |
@staticfloat, Travis seems to be failing occasionally because the Unifont downloads are stalling. Is there some problem with large (few MB) downloads in Travis? If so, how do the julia Travis builds work? |
We've got some caching infrastructure that we use to make sure we get consistent results with as high uptime as possible. The basic formula is to change your request URLs to something like |
@staticfloat, it would be great if you added the unifont |
Alright, it should be live now. Try changing the URLs and let me know. |
Thanks @staticfloat, |
(Another weird thing: why does the Travis badge for the |
Ah, it looks like I need to pass |
I think the markdown for the badge link is currently set to "most recent build for project," it needs some extra arguments to be specifically for master |
@jiahao, it looks like the problem is the UAX11 code. That assigns a width "1" to category "A" (ambiguous), which includes the Mn non-spacing combining marks in EastAsianWidth.txt. Like I mentioned above, I'm not sure how your UAX11 code worked at all in your notebook, because you divided the widths by 512 after the UAX11 code, which would have set all of those widths to zero. Maybe I just shouldn't check UAX11 at all, or should check that before Unifont (so that Unifont can override UAX11)? |
Okay, just pushed another update. Now Unifont takes precedence over UAX11, and I added a check. Unifont includes PUA glyphs from the ConScript registry (e.g. Klingon, Elvish, etcetera), but I'm skeptical that we should return nonzero widths for those as they aren't part of the Unicode standard, so I changed the algorithm to return zero for these (category Co); we can always add them back later. It now agrees with the MacOS 10.10.2 |
@staticfloat, I'm still getting Travis timeouts on the downloads. Am I using the caching server incorrectly somehow? |
#0x002003 ' ' category: Zs name: EM SPACE/ | ||
CharWidths[0x2001]=2 | ||
CharWidths[0x2003]=2 | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jiahao, in a monospaced font, like in a terminal, aren't em and en the same?
@stevengj thanks for taking the lead on actually making this happen. Dividing by 512: probably a silly bug. I may have tinkered with the notebook since the version I posted. Category Mn having width 1: I think you're right, they should have width 0. The oversight did occur to me at some point but I don't think I updated the gist since I realized the mistake. The 'ambiguous' category was difficult for me to figure out. Some characters like ❶ (U+02776 DINGBAT NEGATIVE CIRCLED DIGIT ONE) I have only ever seen in printed Chinese literature, where it is typeset as fullwidth. Others like ¸ (U+00B8 CEDILLA, a spacing character not to be confused with ̧ U+0327 COMBINING CEDILLA) seem silly to treat as ambiguous. I guess if ❶ were to ever show up in non-East Asian text the rule is to assign it as narrow? UAX 11 §4.2 Ambiguous Characters is really, well, ambiguous. Or one can simply adopt the recommendation of §5 which suggests either a) always take Unifont's size (as the canonical monospaced font) as correct, or b) pretending that they are always narrow, are both acceptable:
em vs en: The 'em' is defined to be equal to the current point size and the 'en' is defined to be half an 'em'. Given that the bounding box for Unifont was 1024 x 512 for most of the ordinary narrow characters in Fontforge, I assigned anything defined to be 'en'-sized as width 1 and anything 'em'-sized as width 2. |
@stevengj I think I forgot to enable a certain fix on the caching server, it's been updated now, and I've re-run the jobs which seemed to work better. |
Thanks! @jiahao, |
@jiahao, Unifont gives width=1 for four "W" codepoints U+2329, U+232a, U+fe33, U+fe34 and one "F" codepoint U+ff3d according to (Unifont gives width 1 for all codepoints listed as narrow or halfwidth, so there is no conflict there.) I think I'll let Unifont "win" for the ambiguous and neutral cases. |
Yes I recall seeing some discrepancies in the Unifont W characters having width 1. I don't remember the details but I think the Unifont bounding boxes were quite clearly erroneous. (Some of them are clearly fullwidth Chinese characters.) |
Letting Unifont decide neutral and ambiguous seems better to me too, especially since anyone who actually wants monospaced Unicode will have little choice other than Unifont in practice. |
This PR fixes #2 by following the algorithm suggested by @jiahao: it uses character widths from GNU unifont, plus a few exceptions from UAX 11 and some control characters.
One difference from @jiahao's script is that I divide unifont advance widths by 512 before incorporating the UAX 11 data, whereas @jiahao's notebook did the division afterwards, which seemed like a bug(??).
Also, I'm not quite sure what we should do about non-printable characters. Should we return -1 for non-printable characters like wcwidth? If so, what algorithm for "printable" should we use? @jiahao's notebook had one algorithm, but Julia's isprint function subsequently adopted a different algorithm. Right now, I am returning 0; my feeling is that if the caller wants to know some more specific notion of "printable," they can always inspect the unicode category code provided by utf8proc.