Devanagari's zero-width characters are not accounted for properly #47
Comments
This is probably (read: definitely) caused by conjunct consonants (https://en.wikipedia.org/wiki/Conjunct_consonant) in Devanagari, where a sequence of characters forms a "ligature"; in this case ["'क'", "'्'", "'ष'"] becomes क्ष. I don't see anything in the code that handles this, and it looks super tedious to implement for each possible ligature in Unicode (because I guess you'd have to do it by hand). Also note that there's an annoying trap with the zero-width joiner (https://en.wikipedia.org/wiki/Zero-width_joiner): U+200D does different things depending on the script (Devanagari vs. Sinhala in the article). Sometimes it creates a combined character (if those characters don't combine by default), and sometimes it prevents one (if those characters do combine by default).
AFAIK, in a layout engine this is done in a text shaping step (usually by HarfBuzz). I guess the algorithm must be documented somewhere on unicode.org, but unfortunately I haven't researched it. Any information about it would be helpful.
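To make the conjunct concrete, here is a minimal sketch that uses only Python's standard `unicodedata` module to list the code points behind क्ष. The helper name `describe` is made up for illustration and is not part of wcwidth.

```python
import unicodedata

def describe(text):
    """Return (code point, Unicode name, general category) for each character."""
    return [
        (f"U+{ord(ch):04X}", unicodedata.name(ch), unicodedata.category(ch))
        for ch in text
    ]

# KA + VIRAMA + SSA: three code points that a shaping engine usually draws as one conjunct.
conjunct = "\u0915\u094d\u0937"
for row in describe(conjunct):
    print(row)

# The zero-width joiner is a format character (category Cf), not a combining mark.
print(describe("\u200d"))
```

The virama in the middle is a nonspacing mark (category `Mn`), while the two consonants are ordinary letters (`Lo`), so nothing in the per-character properties says that the three are drawn as one glyph.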
If I understand this correctly, this sounds like a case of ambiguous-width Unicode characters. As the name implies, the width of an ambiguous character is ambiguous, and it can be treated as either halfwidth or fullwidth. The Unicode standard suggests that:
Obviously, terminal emulators and TUI toolkits don't have such context, so they always treat ambiguous characters as narrow characters. So the question is: how do popular terminals and TUI toolkits treat Devanagari and Sinhala? This really matters for table alignment, so we should investigate it first. BTW, as discussed in #39, emoji are also affected by zero-width joiners. For the emoji
Major
-----

Bugfix zero-width characters, closes #57, #47, #45, #39, #26, #25, #24, #22, #8, wow!

This is mostly achieved by replacing `ZERO_WIDTH_CF` with dynamic parsing by Category codes in bin/update-tables.py and putting those in the zero-wide tables.

Tests
-----

- `verify-table-integrity.py` exercises a "bug" of duplicated tables that has no effect, because wcswidth() first checks for zero-width, and that is preferred in cases of conflict. This PR also resolves that error of duplication.
- new automatic tests for balinese, kr jamo, zero-width emoji, devanagari, tamil, kannada.
- added pytest-benchmark plugin, example use:

      # baseline
      tox -epy312 -- --verbose --benchmark-save=original
      # compare
      tox -epy312 -- --verbose --benchmark-compare=.benchmarks/Linux-CPython-3.12-64bit/0001_original.json
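As a rough illustration of what "dynamic parsing by Category codes" could look like, here is a small sketch built on the standard `unicodedata` module. It is not the code in bin/update-tables.py: the set of categories (`Mn`, `Me`, `Cf`) and the names `ZERO_WIDTH_CATEGORIES`, `is_zero_width`, and `zero_width_ranges` are assumptions made for this example.

```python
import unicodedata

# Assumption for this sketch: these general categories are treated as zero-width.
ZERO_WIDTH_CATEGORIES = {"Mn", "Me", "Cf"}

def is_zero_width(ch):
    """True if the character's general category is in the assumed zero-width set."""
    return unicodedata.category(ch) in ZERO_WIDTH_CATEGORIES

def zero_width_ranges(start, end):
    """Collapse zero-width code points in [start, end] into inclusive (lo, hi) ranges."""
    ranges = []
    for cp in range(start, end + 1):
        if not is_zero_width(chr(cp)):
            continue
        if ranges and ranges[-1][1] == cp - 1:
            # Extend the previous range when the code points are adjacent.
            ranges[-1] = (ranges[-1][0], cp)
        else:
            ranges.append((cp, cp))
    return ranges

# Combining Diacritical Marks block, printed as one contiguous range.
print(zero_width_ranges(0x0300, 0x036F))
```

Deriving the ranges from category codes means the Devanagari virama (U+094D) and the zero-width joiner (U+200D) are picked up without listing them by hand.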
Some zero-width characters are now correctly accounted for in today's release by #91, but I think Devanagari support needs much more testing and review. It does appear that Devanagari combining characters can modify the width in advanced ways that wcwidth isn't equipped for. We currently categorize characters into tables for widths 0 and 2. I think Devanagari is more complex than that. Making matters more difficult, I get different results from different terminals, both in the width and in how the text is displayed, and I think sometimes the result is not readable or not correctly combined, with characters squished together in unreadable ways.
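A minimal sketch of that two-table approach, assuming tiny hand-written table excerpts (the real tables are generated and far larger; `ZERO_WIDTH`, `WIDE`, `char_width`, and `string_width` are illustrative names, not the library's internals):

```python
# Hypothetical excerpts, for illustration only.
ZERO_WIDTH = [(0x0300, 0x036F), (0x094D, 0x094D), (0x200B, 0x200F)]
WIDE = [(0x4E00, 0x9FFF)]

def in_table(cp, table):
    """True if the code point falls inside any inclusive (lo, hi) range."""
    return any(lo <= cp <= hi for lo, hi in table)

def char_width(ch):
    cp = ord(ch)
    if in_table(cp, ZERO_WIDTH):  # zero-width is checked first and wins on conflict
        return 0
    if in_table(cp, WIDE):
        return 2
    return 1  # everything else is assumed to occupy one cell

def string_width(text):
    """Sum of per-character widths; no knowledge of shaping or conjuncts."""
    return sum(char_width(ch) for ch in text)

# KA (1) + VIRAMA (0) + SSA (1)
print(string_width("\u0915\u094d\u0937"))
```

Because the result is a plain sum over code points, it cannot express a conjunct whose rendered width differs from the sum of its parts, which is the limitation described above.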
Thank you very much for your efforts!
While the software may allow infinite concatenation of consonants (like क्ख्ग्घ्च्छ्ज्झ्ट्ठ्ड्ढ...), one is extremely unlikely to encounter it in practice. Almost 100% of the time, there is nothing more complex than a conjunct containing a half consonant and a consonant with its vowel diacritic (full consonant). For example, त्पि. I have yet to see any word that does not follow this pattern, say प्क्ष (pkSa, which contains 3 back-to-back consonants with lesser width); this is impossible to pronounce, and naturally a word close to it has the spelling पक्ष (pakSa).
A terminal may break up the word or conjunct into its constituent Unicode characters (and maybe increase the width as a result, due to monospace formatting??); that could be ignored.
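The common case described here (a half consonant joined by a virama to a full consonant with its vowel sign) can be sketched as a naive grouping pass. This is only an illustration under simplifying assumptions: it is not the Unicode grapheme cluster algorithm, it is not what wcwidth does, and `split_clusters` is a made-up name.

```python
import unicodedata

VIRAMA = "\u094d"  # DEVANAGARI SIGN VIRAMA
ZWJ = "\u200d"     # ZERO WIDTH JOINER

def split_clusters(text):
    """Naively group characters so that each conjunct stays in one cluster."""
    clusters = []
    for ch in text:
        joins_previous = bool(clusters) and (
            unicodedata.category(ch).startswith("M")  # vowel signs, virama, other marks
            or ch == ZWJ
            or clusters[-1].endswith(VIRAMA)          # consonant right after a virama
        )
        if joins_previous:
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters

# TA + VIRAMA + PA + VOWEL SIGN I: one conjunct with a vowel sign.
print(len(split_clusters("\u0924\u094d\u092a\u093f")))
# PA + KA + VIRAMA + SSA: a plain consonant followed by one conjunct.
print(len(split_clusters("\u092a\u0915\u094d\u0937")))
```

How many terminal cells each cluster should occupy is still an open question in this thread; the sketch only shows where the cluster boundaries would fall in the two example words.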
Thanks for the extra discussion. If there is a UTF-8 text file of Devanagari that is generally representative of how the language might be used in a terminal, please do suggest one! Maybe something like the first chapter of an out-of-copyright work from gutenberg.org, for example. For automatic testing of language support in https://github.com/jquast/ucs-detect, text documents for the world's languages from https://unicode.org/udhr/ were used, but Devanagari is not included there! I will be happy to add Devanagari to this tool to help fix wcwidth given any example text file. I spent some time on a specific phrase and noted my experience in testing, with some references, in lines 349 to 395 in 2059ee1.
Also about
Hindi Wikipedia? The license is permissive IIRC. Here's an article on Devanagari in Hindi: https://hi.wikipedia.org/wiki/देवनागरी/. It also has a table of conjuncts, as well as a small list (though the section title seems to be machine translated). This article on Devanagari conjuncts (English) might help: https://en.wikipedia.org/wiki/Devanagari_conjuncts Or if you want Sanskrit text: https://ambuda.org/texts/mahabharatam/1.1/. Clicking on the words, sometimes you may see how a conjunct might have formed. Same text but with parallel transliteration: https://sacred-texts.com/hin/mbs/mbs01001.htm (spotted a typo but doesn't really matter for our use case).
Thank you!
After some time with your Wikipedia links I understand that Devanagari is a script (and a Unicode block), and that Sanskrit and Hindi are languages that make use of it. Because both of those are included in the UDHR texts used by ucs-detect, I can now say for certain that, for supporting terminal emulators ("mlterm" and "kitty"), the previous release of wcwidth supports both Hindi and Sanskrit (Grantha) correctly. Thanks for your time and attention @siddhpant, you've been a great help. Best wishes!
Thank you very much! BTW, Grantha is the name of another script. I suppose "Sanskrit (Grantha)" probably means Sanskrit written in the Grantha script, and not in Devanagari.
I am trying to tabulate entries containing Devanagari characters using python-tabulate. The library uses wcwidth to calculate the visible length of a string, apparently on line 768 here.
I had opened an issue in astanin/python-tabulate#68 a while ago. The dev directed me to also open an issue here, so here I am. I will quote myself directly from the issue:
This is how it renders
versus
How it should render: