Devanagari's zero-width characters are not accounted for properly #47

Closed
siddhpant opened this issue Feb 26, 2021 · 9 comments

@siddhpant

I am trying to tabulate entries containing Devanagari characters using python-tabulate. The library uses wcwidth to calculate the visible length of a string, apparently on line 768 here.

I had opened an issue in astanin/python-tabulate#68 a while ago. The dev directed me to also open an issue here, so here I am. I will quote myself directly from the issue:


This is how it renders

Name            Score
------------  -------
राष्ट्र परीक्षण    19.25
Test             0

versus

Name               Score
---------------  -------
Devanagari here    19.25
Test                0

How it should render:

Name            Score
------------  -------
राष्ट्र परीक्षण         19.25
Test             0
@jquast jquast added the bug label Aug 5, 2021
@Naeddyr

Naeddyr commented Feb 4, 2023

This is probably (=definitely) caused by https://en.wikipedia.org/wiki/Conjunct_consonant in Devanagari where you get a "ligature", in this case when ["'क'", "'्'", "'ष'"] > क्ष. I don't see anything that handles this in the code, and it looks super-tedious to implement for each possible ligature in Unicode (because I guess you'd have to do it by hand).

Also note that there's an annoying trap in https://en.wikipedia.org/wiki/Zero-width_joiner, where U+200D does different things depending on the script (Devanagari vs. Sinhala from the article); sometimes it will create a combined character (if those characters don't combine by default), sometimes it will prevent a combined character (if those characters do combine by default).
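
To see what such a conjunct looks like at the codepoint level, here is a minimal sketch using only Python's standard `unicodedata` module. It lists the three codepoints behind क्ष; the helper name `describe` is made up for illustration.

```python
import unicodedata

# KA + VIRAMA + SSA; a text shaping engine typically draws these as one conjunct.
conjunct = "\u0915\u094D\u0937"


def describe(text):
    """Return (codepoint, general category, name) for each character in text."""
    return [
        (f"U+{ord(ch):04X}", unicodedata.category(ch), unicodedata.name(ch))
        for ch in text
    ]


for row in describe(conjunct):
    print(*row)
```

The string has three codepoints even though a shaping-aware renderer may show a single glyph, which is why a per-codepoint width lookup has no direct way to know about the ligature.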

@GalaxySnail
Collaborator

> This is probably (=definitely) caused by https://en.wikipedia.org/wiki/Conjunct_consonant in Devanagari where you get a "ligature", in this case when ["'क'", "'्'", "'ष'"] > क्ष. I don't see anything that handles this in the code, and it looks super-tedious to implement for each possible ligature in Unicode (because I guess you'd have to do it by hand).

AFAIK, in a layout engine this is done in a text shaping step (usually by HarfBuzz). I guess the algorithm must be documented somewhere on unicode.org, but unfortunately I haven't researched it. Any information about it would be helpful.

> Also note that there's an annoying trap in https://en.wikipedia.org/wiki/Zero-width_joiner, where U+200D does different things depending on the script (Devanagari vs. Sinhala from the article); sometimes it will create a combined character (if those characters don't combine by default), sometimes it will prevent a combined character (if those characters do combine by default).

If I understand this correctly, it sounds like ambiguous Unicode characters. As the name implies, the width of an ambiguous character is ambiguous, and it can be treated as either halfwidth or fullwidth. The Unicode standard suggests that:

> Ambiguous characters behave like wide or narrow characters depending on the context (language tag, script identification, associated font, source of data, or explicit markup; all can provide the context). If the context cannot be established reliably, they should be treated as narrow characters by default.

Obviously, terminal emulators and TUI toolkits don't have such context, so they always treat ambiguous characters as narrow characters.
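
As a rough illustration of the narrow-by-default rule quoted above, the sketch below maps the East Asian Width property reported by Python's `unicodedata` to a cell count. The function name `cells_from_east_asian_width` and its `ambiguous_is_wide` flag are hypothetical, and zero-width characters are deliberately ignored here.

```python
import unicodedata


def cells_from_east_asian_width(ch, ambiguous_is_wide=False):
    """Map the East Asian Width property of one character to a cell count.

    Only the handling of ambiguous ('A') characters is illustrated;
    combining marks and other zero-width characters are not considered.
    """
    eaw = unicodedata.east_asian_width(ch)
    if eaw in ("W", "F"):
        return 2
    if eaw == "A":
        # Without reliable context, fall back to narrow.
        return 2 if ambiguous_is_wide else 1
    return 1


# PLUS-MINUS SIGN (ambiguous), a CJK ideograph (wide), DEVANAGARI LETTER KA (neutral)
for ch in ("\u00B1", "\u4E2D", "\u0915"):
    print(f"U+{ord(ch):04X}", unicodedata.east_asian_width(ch),
          cells_from_east_asian_width(ch))
```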

So the question is, how do popular terminals and TUI toolkits treat Devanagari and Sinhala? It is really important for table alignment. We should investigate it first. BTW, as talked about in #39, emoji are also affected by zero-width joiners. For the emoji 👨‍👧‍👦, some terminals render a single emoji while some other terminals render three emoji [1].

[1] https://news.ycombinator.com/item?id=30113521
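
For reference, the family emoji mentioned above is itself a zero-width-joiner sequence. A small sketch, using only the standard library, that splits it into its parts:

```python
import unicodedata

ZWJ = "\u200D"  # ZERO WIDTH JOINER

# MAN + ZWJ + GIRL + ZWJ + BOY
family = "\U0001F468\u200D\U0001F467\u200D\U0001F466"

parts = family.split(ZWJ)
print(len(family), "codepoints,", len(parts), "emoji joined by ZWJ")
for part in parts:
    print(f"U+{ord(part):04X}", unicodedata.name(part))
```

A terminal that does not apply the joiner draws the three parts separately, which is one way the same string can end up with different widths on different terminals.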

jquast added a commit that referenced this issue Oct 30, 2023
Major
-----

Bugfix zero-width characters, closes #57, #47, #45, #39, #26, #25, #24, #22, #8, wow !

This is mostly achieved by replacing `ZERO_WIDTH_CF` with dynamic parsing by Category codes in bin/update-tables.py and putting those in the zero-wide tables.

Tests
-----

- `verify-table-integrity.py` exercises a "bug" of duplicated tables that has no effect, because wcswidth() first checks for zero-width, and that is preferred in cases of conflict. This PR also resolves that error of duplication.
- new automatic tests for balinese, kr jamo, zero-width emoji, devanagari, tamil, kannada.  
- added pytest-benchmark plugin, example use:

        # baseline
        tox -epy312 -- --verbose --benchmark-save=original
        # compare
        tox -epy312 -- --verbose --benchmark-compare=.benchmarks/Linux-CPython-3.12-64bit/0001_original.json
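
The category-based approach described in the commit message above could be sketched roughly as follows. This is not the code in bin/update-tables.py: the set `ZERO_WIDTH_CATEGORIES` is an assumption, chosen so that the sketch agrees with the per-codepoint widths quoted later in this thread, and `naive_width` is a hypothetical helper built on the standard `unicodedata` module.

```python
import unicodedata

# Assumption: General_Category values treated as zero-width in this sketch.
ZERO_WIDTH_CATEGORIES = {"Mn", "Mc", "Me", "Cf"}


def naive_width(ch):
    """Per-codepoint width: zero-width is checked first, then wide, else 1."""
    if unicodedata.category(ch) in ZERO_WIDTH_CATEGORIES:
        return 0
    if unicodedata.east_asian_width(ch) in ("W", "F"):
        return 2
    return 1


# KA + VIRAMA + SSA + VOWEL SIGN I
phrase = "\u0915\u094D\u0937\u093F"
widths = tuple(naive_width(ch) for ch in phrase)
print(widths, sum(widths))
```

Summing per-codepoint results like this is stateless, so it cannot express a conjunct whose rendered width differs from the sum of its parts.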
@jquast
Owner

jquast commented Oct 30, 2023

Some zero-width characters are now correctly accounted for in today's release by #91, but I think Devanagari support needs much more testing and review.

It does appear that Devanagari combining characters can modify the width in advanced ways that wcwidth isn't equipped for. We currently categorize characters into tables for widths 0 and 2. I think Devanagari is more complex than that. Making matters more difficult, I get different results from terminals, both in their width and how it is displayed, and I think sometimes the result is not readable or correctly combined, with characters squished in unreadable ways.
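
A table-based lookup of the kind described above can be sketched like this. The two tables are tiny hand-picked samples, not wcwidth's real tables, and `in_table` and `table_width` are hypothetical names. Zero-width is checked first, matching the precedence mentioned in the commit message.

```python
from bisect import bisect_right

# Sorted, non-overlapping (first, last) codepoint ranges. Samples only.
ZERO_WIDTH = [
    (0x0300, 0x036F),  # combining diacritical marks
    (0x093F, 0x093F),  # DEVANAGARI VOWEL SIGN I
    (0x094D, 0x094D),  # DEVANAGARI SIGN VIRAMA
    (0x200B, 0x200F),  # zero width space/joiners and direction marks
]
WIDE = [
    (0x1100, 0x115F),  # Hangul Jamo initial consonants
    (0x4E00, 0x9FFF),  # CJK Unified Ideographs
]


def in_table(cp, table):
    """Binary-search a sorted range table for codepoint cp."""
    i = bisect_right(table, (cp, 0x10FFFF)) - 1
    return i >= 0 and table[i][0] <= cp <= table[i][1]


def table_width(cp):
    """Return 0, 2, or 1 depending on which table (if any) contains cp."""
    if in_table(cp, ZERO_WIDTH):
        return 0
    if in_table(cp, WIDE):
        return 2
    return 1


print([table_width(ord(ch)) for ch in "\u0915\u094D\u0937\u093F"])
```

Each codepoint is looked up independently, so there is no place in this design to record that a consonant, virama, and consonant together occupy a different number of cells.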

@siddhpant
Author

siddhpant commented Oct 30, 2023

> Some zero-width characters are now correctly accounted for in today's release by #91, but I think Devanagari support needs much more testing and review.

Thank you very much for your efforts!

> Devanagari combining characters can modify the width in advanced ways

> I think sometimes the result is not readable or correctly combined, with characters squished in unreadable ways.

While the software may allow infinite concatenation of consonants (like क्ख्ग्घ्च्छ्ज्झ्ट्ठ्ड्ढ...), one is extremely unlikely to encounter it in practice.

Almost 100% of the time, one doesn't encounter anything more complex than a conjunct containing a half consonant followed by a consonant with its vowel diacritic (a full consonant). For example, त्पि (tpi in HK or ITRANS) is consonant (t) + { consonant (p) + diacritic (i) }. So maybe you can generate them with a script to see (not all will be valid anyway).

I have yet to see any word that does not follow the above, say प्क्ष (pkSa, which contains 3 back-to-back consonants with lesser width). This is impossible to pronounce, and naturally a word close to it has the spelling पक्ष (pakSa).

> I get different results from terminals, both in their width and how it is displayed

A terminal may break up the word or conjunct into its constituent Unicode characters (and maybe, as a result, increase the width due to monospace formatting?). That case could be ignored.
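
The suggestion above to generate such conjuncts with a script could look something like the following sketch. It enumerates every consonant + virama + consonant + vowel sign string from the basic Devanagari ranges; the names `CONSONANTS`, `VOWEL_SIGNS`, and `simple_conjuncts` are made up here, and, as noted, many of the generated strings will not be valid in any language.

```python
import itertools

VIRAMA = "\u094D"
CONSONANTS = [chr(cp) for cp in range(0x0915, 0x093A)]   # LETTER KA .. LETTER HA
VOWEL_SIGNS = [chr(cp) for cp in range(0x093E, 0x094D)]  # VOWEL SIGN AA .. AU


def simple_conjuncts():
    """Yield half consonant + full consonant strings (C + virama + C + vowel sign)."""
    for c1, c2, sign in itertools.product(CONSONANTS, CONSONANTS, VOWEL_SIGNS):
        yield c1 + VIRAMA + c2 + sign


# Show the first few candidates with their codepoints.
for text in itertools.islice(simple_conjuncts(), 5):
    print(text, [f"U+{ord(ch):04X}" for ch in text])
```

Each generated string could then be measured and compared against what a given terminal actually draws.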

@jquast
Owner

jquast commented Oct 30, 2023

Thanks for the extra discussion. If there is a utf-8 text file of Devanagari that is generally representative of how the language might be used in a terminal, please do suggest one! Maybe like the first chapter of an out-of-copyright work from gutenberg.org, for example.

For automatic testing of language support in https://github.com/jquast/ucs-detect, text documents for the world's languages from https://unicode.org/udhr/ were used, but Devanagari is not included there! I will be happy to add Devanagari to this tool to help fix wcwidth, given any example text file.

I spent some time on a specific phrase and noted my experience in testing with some references,

wcwidth/tests/test_core.py

Lines 349 to 395 in 2059ee1

def test_devanagari_script():
    """
    Attempt to test the measurement width of Devanagari script.

    I believe this 'phrase' should be length 3.

    This is a difficult problem, and this library does not yet get it right,
    because we interpret the unicode data files programmatically, but they do
    not correctly describe how their terminal width is measured.

    There are very few Terminals that do!

    As of 2023,

    - iTerm2: correct length but individual characters are out of order and
              horizontally misplaced as to be unreadable in its language when
              using 'Noto Sans' font.
    - mlterm: mixed results, it offers several options in the configuration
              dialog, "Xft", "Cairo", and "Variable Column Width" have some
              effect, but with neither 'Noto Sans' nor 'unifont', it is not
              recognizable as the Devanagari script it is meant to display.

    Previous testing with Devanagari documented at address https://benizi.com/vim/devanagari/

    See also, https://askubuntu.com/questions/8437/is-there-a-good-mono-spaced-font-for-devanagari-script-in-the-terminal
    """
    # This test adapted from https://www.unicode.org/L2/L2023/23107-terminal-suppt.pdf
    # please note that document correctly points out that the final width cannot be determined
    # as a sum of each individual width, as this library currently performs with exception of
    # ZWJ, but I think it incorrectly gestures what a stateless call to wcwidth.wcwidth of
    # each codepoint *should* return.
    phrase = (u"\u0915"  # Akhand, Category 'Lo', East Asian Width property 'N' -- DEVANAGARI LETTER KA
              u"\u094D"  # Joiner, Category 'Mn', East Asian Width property 'N' -- DEVANAGARI SIGN VIRAMA
              u"\u0937"  # Fused, Category 'Lo', East Asian Width property 'N' -- DEVANAGARI LETTER SSA
              u"\u093F")  # MatraL, Category 'Mc', East Asian Width property 'N' -- DEVANAGARI VOWEL SIGN I
    # 23107-terminal-suppt.pdf suggests wcwidth.wcwidth should return (2, 0, 0, 1)
    expect_length_each = (1, 0, 1, 0)
    # I believe the final width *should* be 3.
    expect_length_phrase = 2

    # exercise,
    length_each = tuple(map(wcwidth.wcwidth, phrase))
    length_phrase = wcwidth.wcswidth(phrase)

    # verify.
    assert length_each == expect_length_each
    assert length_phrase == expect_length_phrase

@jquast
Owner

jquast commented Oct 30, 2023

Also about 'त्पि', the previous release of wcwidth measured it as 3 cells, but today's release of wcwidth more correctly measures it as 2 cells.

>>> import unicodedata, wcwidth
>>> l='त्पि'
>>> len(l)
4
>>> [unicodedata.category(x) for x in l]
['Lo', 'Mn', 'Lo', 'Mc']
>>> [unicodedata.name(x) for x in l]
['DEVANAGARI LETTER TA', 'DEVANAGARI SIGN VIRAMA', 'DEVANAGARI LETTER PA', 'DEVANAGARI VOWEL SIGN I']
>>> [wcwidth.wcwidth(x) for x in l]
[1, 0, 1, 0]
>>> print('त्पि|\n12|')
त्पि|
12|

@siddhpant
Author

siddhpant commented Nov 1, 2023

> Maybe like the first chapter of an out-of-copyright work from gutenberg.org, for example.

Hindi Wikipedia? The license is permissive IIRC. Here's an article on Devanagari in Hindi: https://hi.wikipedia.org/wiki/देवनागरी/. It also has a table of conjuncts, as well as a small list (though the section title seems to be machine translated).

This article on Devanagari conjuncts (English) might help: https://en.wikipedia.org/wiki/Devanagari_conjuncts

Or if you want Sanskrit text: https://ambuda.org/texts/mahabharatam/1.1/. Clicking on the words, sometimes you may see how a conjunct might have formed.

Same text but with parallel transliteration: https://sacred-texts.com/hin/mbs/mbs01001.htm (spotted a typo but doesn't really matter for our use case).

> Also about 'त्पि', the previous release of wcwidth measured it as 3 cells, but today's release of wcwidth more correctly measures it as 2 cells.

Thank you!

@jquast
Owner

jquast commented Nov 8, 2023

After some time with your Wikipedia links, I understand that Devanagari is a script (and a Unicode block), and that Sanskrit and Hindi are languages that make use of it. Because both of those are included in the UDHR texts used by ucs-detect, I can now say for certain that, for supporting terminal emulators ("mlterm" and "kitty"), the previous release of wcwidth supports both Hindi and Sanskrit (Grantha) correctly.

Thanks for your time and attention @siddhpant you've been a great help, best wishes

@jquast jquast closed this as completed Nov 8, 2023
@siddhpant
Author

siddhpant commented Nov 8, 2023

Thank you very much!


BTW Grantha is a name of another script. I suppose "Sanskrit (Grantha)" probably means Sanskrit written in Grantha script, and not in Devanagari.
