Devanagari's zero-width characters are not accounted for properly #47

siddhpant · 2021-02-26T09:24:23Z

I am trying to tabulate entries containing Devanagari characters using python-tabulate. The library uses wcwidth to calculate the visible length of a string, apparently on line 768 here.

I had opened an issue in astanin/python-tabulate#68 a while ago. The dev directed me to also open an issue here, so here I am. I will quote myself directly from the issue:

This is how it renders

Name            Score
------------  -------
राष्ट्र परीक्षण    19.25
Test             0

versus

Name               Score
---------------  -------
Devanagari here    19.25
Test                0

How it should render:

Name            Score
------------  -------
राष्ट्र परीक्षण         19.25
Test             0
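
To make the mismatch concrete, here is a minimal sketch using only the standard library. It is not python-tabulate's or wcwidth's actual code, and the variable names are illustrative. It tallies the code points of the Devanagari cell by Unicode general category, which shows how many of them are combining marks rather than letters.

```python
import unicodedata
from collections import Counter

# The Devanagari cell from the first table, spelled out as escapes
# so the exact code points are unambiguous.
sample = (
    "\u0930\u093e\u0937\u094d\u091f\u094d\u0930"   # राष्ट्र
    " "
    "\u092a\u0930\u0940\u0915\u094d\u0937\u0923"   # परीक्षण
)

# Tally general categories: Lo = letter, Mn = nonspacing mark,
# Mc = spacing combining mark, Zs = space separator.
categories = Counter(unicodedata.category(ch) for ch in sample)

print(len(sample))       # number of code points, which is what str.ljust() pads by
print(dict(categories))  # how many of those are marks rather than letters
```

The cell has 15 code points, the same count as "Devanagari here", but 3 of them are nonspacing marks (Mn) and 2 are spacing combining marks (Mc). Padding by code point count therefore cannot line the columns up, and the right answer depends on how many cells the marks really occupy once the text is shaped.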


Naeddyr · 2023-02-04T10:26:26Z

This is probably (=definitely) caused by https://en.wikipedia.org/wiki/Conjunct_consonant in Devanagari where you get a "ligature", in this case when ["'क'", "'्'", "'ष'"] > क्ष. I don't see anything that handles this in the code, and it looks super-tedious to implement for each possible ligature in Unicode (because I guess you'd have to do it by hand).

Also note that there's an annoying trap in https://en.wikipedia.org/wiki/Zero-width_joiner, where U+200D does different things depending on the script (Devanagari vs. Sinhala from the article); sometimes it will create a combined character (if those characters don't combine by default), sometimes it will prevent a combined character (if those characters do combine by default).
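
For reference, the conjunct and the joiner discussed above can be inspected with the standard library. This is only a sketch for looking at the code points involved; it says nothing about how a terminal or font will draw them.

```python
import unicodedata

# क + ् + ष, which fonts usually draw as the single conjunct glyph क्ष.
conjunct = "\u0915\u094d\u0937"
ZWJ = "\u200d"  # ZERO WIDTH JOINER

# Print code point, general category and name for each character.
for ch in conjunct + ZWJ:
    print(f"U+{ord(ch):04X}", unicodedata.category(ch), unicodedata.name(ch))

names = [unicodedata.name(ch) for ch in conjunct]
```

The conjunct is still three separate code points in the string, so any width calculation that works one code point at a time has to decide what each of them contributes on its own.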

GalaxySnail · 2023-02-04T12:10:40Z

This is probably (=definitely) caused by https://en.wikipedia.org/wiki/Conjunct_consonant in Devanagari where you get a "ligature", in this case when ["'क'", "'्'", "'ष'"] > क्ष. I don't see anything that handles this in the code, and it looks super-tedious to implement for each possible ligature in Unicode (because I guess you'd have to do it by hand).

AFAIK in a layout engine, it is done in a text shaping step (usually by HarfBuzz). I guess the algorithm must be documented somewhere on unicode.org, but unfortunately I didn't research it. Any information about it will be helpful.

Also note that there's an annoying trap in https://en.wikipedia.org/wiki/Zero-width_joiner, where U+200D does different things depending on the script (Devanagari vs. Sinhala from the article); sometimes it will create a combined character (if those characters don't combine by default), sometimes it will prevent a combined character (if those characters do combine by default).

If I understand this correctly, it sounds like ambiguous Unicode characters. As its name implies, the width of an ambiguous character is ambiguous, and can be treated as either halfwidth or fullwidth. The unicode standard suggests that:

Ambiguous characters behave like wide or narrow characters depending on the context (language tag, script identification, associated font, source of data, or explicit markup; all can provide the context). If the context cannot be established reliably, they should be treated as narrow characters by default.

Obviously, terminal emulators and TUI toolkits don't have such context, so they always treat ambiguous characters as narrow characters.
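
Whether the Devanagari characters in question actually fall into the ambiguous class can be checked with the standard library. The helper name below is illustrative, and the result depends on the Unicode data shipped with the Python version in use.

```python
import unicodedata

def east_asian_class(ch: str) -> str:
    """Return the East_Asian_Width property value, e.g. 'N', 'A' or 'W'."""
    return unicodedata.east_asian_width(ch)

ka = "\u0915"             # DEVANAGARI LETTER KA
virama = "\u094d"         # DEVANAGARI SIGN VIRAMA
inverted_bang = "\u00a1"  # INVERTED EXCLAMATION MARK, an ambiguous-width character

for ch in (ka, virama, inverted_bang):
    print(f"U+{ord(ch):04X}", east_asian_class(ch))
```

The Devanagari code points report `N` (neutral) rather than `A` (ambiguous), which suggests that the ambiguous-width rule quoted above is probably not what decides their width.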

So the question is, how do popular terminals and TUI toolkits treat Devanagari and Sinhala? It is really important for table alignment. We should investigate it first. BTW, as talked about in #39, emoji are also affected by zero-width joiners. For the emoji 👨‍👧‍👦, some terminals render a single emoji while some other terminals render three emoji [1].

[1] https://news.ycombinator.com/item?id=30113521
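
The family emoji mentioned above is a ZWJ sequence of three emoji. A rough sketch of why per-code-point summing disagrees with terminals that fuse the sequence follows; `naive_cells` is a hypothetical helper, not wcwidth's implementation.

```python
import unicodedata

# MAN + ZWJ + GIRL + ZWJ + BOY, i.e. the family emoji 👨‍👧‍👦
family = "\U0001F468\u200d\U0001F467\u200d\U0001F466"

def naive_cells(text: str) -> int:
    """Sum per-code-point widths: 0 for format characters (Cf),
    2 for East Asian Wide/Fullwidth, otherwise 1."""
    total = 0
    for ch in text:
        if unicodedata.category(ch) == "Cf":
            continue  # ZWJ and other format characters take no cell
        total += 2 if unicodedata.east_asian_width(ch) in ("W", "F") else 1
    return total

print(len(family), naive_cells(family))
```

The naive sum counts each of the three emoji as two cells. That matches a terminal that draws three separate emoji, but not one that fuses the sequence into a single glyph.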

Major
-----

Bugfix zero-width characters, closes #57, #47, #45, #39, #26, #25, #24, #22, #8, wow!

This is mostly achieved by replacing `ZERO_WIDTH_CF` with dynamic parsing by Category codes in bin/update-tables.py and putting those in the zero-wide tables.

Tests
-----

- `verify-table-integrity.py` exercises a "bug" of duplicated tables that has no effect, because wcswidth() first checks for zero-width, and that is preferred in cases of conflict. This PR also resolves that error of duplication.
- new automatic tests for balinese, kr jamo, zero-width emoji, devanagari, tamil, kannada.
- added pytest-benchmark plugin, example use:

      # baseline
      tox -epy312 -- --verbose --benchmark-save=original
      # compare
      tox -epy312 -- --verbose --benchmark-compare=.benchmarks/Linux-CPython-3.12-64bit/0001_original.json
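
The idea of deriving zero-width tables from category codes can be sketched with the standard library. This is not the code in bin/update-tables.py. The function name, the category set, and the restriction to the Devanagari block are all assumptions made for illustration.

```python
import unicodedata

def ranges_for_categories(categories, start, stop):
    """Collapse the code points in [start, stop) whose general category
    is in `categories` into inclusive (first, last) ranges."""
    ranges = []
    for cp in range(start, stop):
        if unicodedata.category(chr(cp)) not in categories:
            continue
        if ranges and ranges[-1][1] == cp - 1:
            # Extend the previous range by one code point.
            ranges[-1] = (ranges[-1][0], cp)
        else:
            ranges.append((cp, cp))
    return ranges

# Assumed category set, scanned over the Devanagari block only
# (U+0900..U+097F) to keep the output short.
devanagari_zero = ranges_for_categories({"Mn", "Mc", "Me", "Cf"}, 0x0900, 0x0980)

for first, last in devanagari_zero:
    print(f"U+{first:04X}..U+{last:04X}")
```

Generating the ranges from category data, rather than keeping a hand-written list, means new combining marks are picked up whenever the Unicode data is updated.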

jquast · 2023-10-30T20:04:49Z

Some zero-width characters are now correctly accounted for in today's release by #91, but I think Devanagari support needs much more testing and review.

It does appear that Devanagari combining characters can modify the width in advanced ways that wcwidth isn't equipped for. We currently categorize characters into tables for widths 0 and 2. I think Devanagari is more complex than that. Making matters more difficult, I get different results from terminals, both in their width and how it is displayed, and I think sometimes the result is not readable or correctly combined, with characters squished in unreadable ways.

siddhpant · 2023-10-30T20:42:54Z

Some zero-width characters are now correctly accounted for in today's release by #91, but I think Devanagari support needs much more testing and review.

Thank you very much for your efforts!

Devanagari combining characters can modify the width in advanced ways

I think sometimes the result is not readable or correctly combined, with characters squished in unreadable ways.

While the software may allow infinite concatenation of consonants (like क्ख्ग्घ्च्छ्ज्झ्ट्ठ्ड्ढ...), one is extremely unlikely to encounter it in practice.

Almost 100% of the time, one doesn't have anything more complex than a conjunct containing a half consonant and a consonant with its vowel diacritic (full consonant). For example, त्पि (tpi in HK or ITRANS): consonant (t) + { consonant (p) + diacritic (i) }. So maybe you can generate them with a script to see (not all will be valid anyway).

I have yet to see any word not following the above, say प्क्ष (pkSa, this contains 3 back to back consonants with lesser width) — this is impossible to pronounce, and naturally a word close to it has the spelling पक्ष (pakSa).
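
The suggestion above to generate such conjuncts with a script could look roughly like this. It is a sketch under the assumption that the pattern of interest is consonant + virama + consonant + vowel sign; the names and code point ranges are chosen for illustration, and many of the generated strings will not be valid words or even valid conjuncts.

```python
from itertools import product

VIRAMA = "\u094d"
# DEVANAGARI LETTER KA .. HA
CONSONANTS = [chr(cp) for cp in range(0x0915, 0x093A)]
# DEVANAGARI VOWEL SIGN AA .. AU
VOWEL_SIGNS = [chr(cp) for cp in range(0x093E, 0x094D)]

def simple_conjuncts():
    """Yield half consonant + full consonant + vowel sign, e.g. त्पि."""
    for first, second, sign in product(CONSONANTS, CONSONANTS, VOWEL_SIGNS):
        yield first + VIRAMA + second + sign

candidates = list(simple_conjuncts())
print(len(candidates), candidates[:5])
```

Each candidate could then be printed in a terminal next to a ruler line to compare the measured width against what is actually displayed.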

I get different results from terminals, both in their width and how it is displayed

A terminal may break up the word or conjunct into the constituent Unicode characters (and maybe as a result increase width due to monospace formatting??), that could be ignored.

jquast · 2023-10-30T21:20:13Z

Thanks for the extra discussion. If there is a utf-8 text file of Devanagari that is generally representative of how the language might be used in a terminal, please do suggest one! Maybe like the first chapter of an out-of-copyright work from gutenberg.org, for example.

For automatic testing of language support in https://github.com/jquast/ucs-detect, text documents for the world's languages from https://unicode.org/udhr/ were used, but Devanagari is not included there! I will be happy to add Devanagari to this tool to help fix wcwidth given any example text file.

I spent some time on a specific phrase and noted my experience in testing with some references,

wcwidth/tests/test_core.py

Lines 349 to 395 in 2059ee1

    
           def test_devanagari_script(): 
        
               """ 
        
               Attempt to test the measurement width of Devanagari script. 
        
               I believe this 'phrase' should be length 3. 
        
               This is a difficult problem, and this library does not yet get it right, 
        
               because we interpret the unicode data files programmatically, but they do 
        
               not correctly describe how their terminal width is measured. 
        
               There are very few Terminals that do! 
        
               As of 2023, 
        
               - iTerm2: correct length but individual characters are out of order and 
        
                         horizaontally misplaced as to be unreadable in its language when 
        
                         using 'Noto Sans' font. 
        
               - mlterm: mixed results, it offers several options in the configuration 
        
                         dialog, "Xft", "Cario", and "Variable Column Width" have some 
        
                         effect, but with neither 'Noto Sans' or 'unifont', it is not 
        
                         recognizable as the Devanagari script it is meant to display. 
        
               Previous testing with Devanagari documented at address https://benizi.com/vim/devanagari/ 
        
               See also, https://askubuntu.com/questions/8437/is-there-a-good-mono-spaced-font-for-devanagari-script-in-the-terminal 
        
               """ 
        
               # This test adapted from https://www.unicode.org/L2/L2023/23107-terminal-suppt.pdf 
        
               # please note that document correctly points out that the final width cannot be determined 
        
               # as a sum of each individual width, as this library currently performs with exception of 
        
               # ZWJ, but I think it incorrectly gestures what a stateless call to wcwidth.wcwidth of 
        
               # each codepoint *should* return. 
        
               phrase = (u"\u0915"    # Akhand, Category 'Lo', East Asian Width property 'N' -- DEVANAGARI LETTER KA 
        
                         u"\u094D"    # Joiner, Category 'Mn', East Asian Width property 'N' -- DEVANAGARI SIGN VIRAMA 
        
                         u"\u0937"    # Fused, Category 'Lo', East Asian Width property 'N' -- DEVANAGARI LETTER SSA 
        
                         u"\u093F")   # MatraL, Category 'Mc', East Asian Width property 'N' -- DEVANAGARI VOWEL SIGN I 
        
               # 23107-terminal-suppt.pdf suggests wcwidth.wcwidth should return (2, 0, 0, 1) 
        
               expect_length_each = (1, 0, 1, 0) 
        
               # I believe the final width *should* be 3. 
        
               expect_length_phrase = 2 
        
               # exercise, 
        
               length_each = tuple(map(wcwidth.wcwidth, phrase)) 
        
               length_phrase = wcwidth.wcswidth(phrase) 
        
               # verify. 
        
               assert length_each == expect_length_each 
        
               assert length_phrase == expect_length_phrase

jquast · 2023-10-30T21:26:36Z

Also about 'त्पि', the previous release of wcwidth measured it as 3 cells, but today's release of wcwidth more correctly measures it as 2 cells.

>>> import unicodedata, wcwidth
>>> l='त्पि'
>>> len(l)
4
>>> [unicodedata.category(x) for x in l]
['Lo', 'Mn', 'Lo', 'Mc']
>>> [unicodedata.name(x) for x in l]
['DEVANAGARI LETTER TA', 'DEVANAGARI SIGN VIRAMA', 'DEVANAGARI LETTER PA', 'DEVANAGARI VOWEL SIGN I']
>>> [wcwidth.wcwidth(x) for x in l]
[1, 0, 1, 0]
>>> print('त्पि|\n12|')
त्पि|
12|
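
The per-code-point result above can be approximated without wcwidth installed. The sketch below is not wcwidth's implementation: the category set and function names are assumptions for illustration, and control characters are not handled.

```python
import unicodedata

# Assumed set of categories treated as zero-width, for illustration only.
ZERO_WIDTH_CATEGORIES = {"Mn", "Mc", "Me", "Cf"}

def approx_wcwidth(ch: str) -> int:
    """Approximate cell width of one code point: 0 for combining marks and
    format characters, 2 for East Asian Wide/Fullwidth, otherwise 1."""
    if unicodedata.category(ch) in ZERO_WIDTH_CATEGORIES:
        return 0
    if unicodedata.east_asian_width(ch) in ("W", "F"):
        return 2
    return 1

def approx_wcswidth(text: str) -> int:
    """Sum the approximate widths of every code point in `text`."""
    return sum(approx_wcwidth(ch) for ch in text)

word = "\u0924\u094d\u092a\u093f"  # त्पि: TA, VIRAMA, PA, VOWEL SIGN I
print([approx_wcwidth(ch) for ch in word], approx_wcswidth(word))
```

Like the real library, this sums stateless per-code-point widths, so it cannot express cases where shaping changes the total width of a cluster.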

siddhpant · 2023-11-01T08:23:24Z

Maybe like the first chapter of an out-of-copyright work from gutenberg.org, for example.

Hindi Wikipedia? The license is permissive IIRC. Here's an article on Devanagari in Hindi: https://hi.wikipedia.org/wiki/देवनागरी/. It also has a table of conjuncts, as well as a small list (though the section title seems to be machine translated).

This article on Devanagari conjuncts (English) might help: https://en.wikipedia.org/wiki/Devanagari_conjuncts

Or if you want Sanskrit text: https://ambuda.org/texts/mahabharatam/1.1/. Clicking on the words, sometimes you may see how a conjunct might have formed.

Same text but with parallel transliteration: https://sacred-texts.com/hin/mbs/mbs01001.htm (spotted a typo but doesn't really matter for our use case).

Also about 'त्पि', the previous release of wcwidth measured it as 3 cells, but today's release of wcwidth more correctly measures it as 2 cells.

Thank you!

jquast · 2023-11-08T17:11:02Z

After some time with your Wikipedia links I understand that Devanagari is a script (and a Unicode block), and that Sanskrit and Hindi are languages that make use of it. Because both of those are included in the UDHR texts used by ucs-detect, I can now say for certain that, for supporting terminal emulators ("mlterm" and "kitty"), the previous release of wcwidth supports both Hindi and Sanskrit (Grantha) correctly.

Thanks for your time and attention @siddhpant you've been a great help, best wishes

siddhpant · 2023-11-08T17:14:55Z

Thank you very much!

BTW Grantha is the name of another script. I suppose "Sanskrit (Grantha)" probably means Sanskrit written in Grantha script, and not in Devanagari.

jquast added the bug label Aug 5, 2021

jquast mentioned this issue Oct 19, 2023

Bugfixes for zero-width characters #91

Merged

jquast mentioned this issue Oct 31, 2023

wcswidth incorrect for heart emoji, ❤️ ("\u2764\ufe0f") #96

Closed

jquast closed this as completed Nov 8, 2023
