Description
The slugify(allow_unicode=True) call in the tagging/models.py CustomTag() class is expected to pass any tag written in Unicode characters through mostly untouched, except for:
- removal of leading and trailing spaces
- replacing spaces between words with the - character
- lower-casing the words
- making sure the resulting tag is "unique" (this is actually the job of the taggit manager code)
However, when tags are written in a Brahmic / abugida writing system (examples include Hindi, Telugu, Thai, Malayalam, Tamil, Kannada, and more), this code mangles the result by removing the diacritical marks and vowel signs. I am going off Google translations here, as I am not a speaker of any of the example languages.
The slug of "हिंदी में जानकारी" ("Information in Hindi") is being returned as "-जनकर" which isn't a word. Attempting to then slugify "हिंदी-में-जानकारी" ("information-in-Hindi"), I get back "हद-म-जनकर" ("half-dead").
A similar thing seems to be happening with the Telugu language - "స్వయంచాలక" ("automated") becomes "సవయచలక" -- which isn't a word.
Additional examples:
Kannada: "ಡೇಟಾಬೇಸ್ ನಿರ್ವಹಣಾ ವ್ಯವಸ್ಥೆ" becomes "ಡಟಬಸ-ನರವಹಣ-ವಯವಸಥ"
Malayalam: "ഡാറ്റാബേസ് മാനേജുമെന്റ് സിസ്റ്റം" becomes "ഡററബസ-മനജമനറ-സസററ"
Thai: "ฐานข้อมูล" becomes "ฐานขอมล"
Burmese: "ဒေတာဘေ့စစီမံခန့်ခွဲမှုစနစ်" becomes "ဒတဘစစမခနခမစနစ"
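The mangling above can be reproduced without Django. The sketch below assumes that slugify(allow_unicode=True) boils down to NFKC normalization followed by a `[^\w\s-]` regex filter; that is my reading of django.utils.text and may differ between versions. `slugify_like` is a hypothetical stand-in, not the real function. Python's `\w` matches letters, digits and underscore but not combining marks (Unicode categories Mn and Mc), and the vowel signs, viramas and tone marks in these scripts are combining marks.

```python
import re
import unicodedata


def slugify_like(value):
    """Hypothetical stand-in for slugify(allow_unicode=True); an assumption, not Django's code."""
    value = unicodedata.normalize("NFKC", str(value))
    # \w matches letters, digits and "_" but not combining marks (Mn/Mc),
    # so vowel signs and viramas are deleted by this substitution.
    value = re.sub(r"[^\w\s-]", "", value.lower())
    # Collapse runs of whitespace and hyphens into a single hyphen.
    return re.sub(r"[-\s]+", "-", value).strip("-_")


print(slugify_like("స్వయంచాలక"))
print(slugify_like("מערכת ניהול מסדי נתונים"))
```

If the assumption holds, the Telugu input should come back without its marks, as reported above, while the Hebrew input should only gain hyphens.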
The real kicker here is that none of these languages really has a "lower case" vs "upper case" distinction.
However, slugifying the Hebrew "מערכת ניהול מסדי נתונים" ("database management system") results in the expected "מערכת-ניהול-מסדי-נתונים".
And slugifying the Arabic "قاعدة البيانات" ("Database") results in "قاعدة-البيانات" ("Database").
Tests with traditional & simplified Chinese characters, Korean, and multiple Japanese variants are also fine, as is Persian.
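One way to see why these scripts survive while the abugidas do not is to look at the Unicode general category of each code point. The helper below (`dropped_marks` is a hypothetical name) lists the combining marks in a string. Unpointed Hebrew and Arabic, as well as Chinese, Korean and Japanese text, typically contain only letters, so there is nothing for a letters-only filter to strip. By the same logic, pointed Hebrew or vocalized Arabic would likely lose its points too; that is an assumption, not something tested here.

```python
import unicodedata


def dropped_marks(text):
    """Return (code point, category, name) for every combining mark in text."""
    return [
        (f"U+{ord(ch):04X}", unicodedata.category(ch), unicodedata.name(ch, "UNKNOWN"))
        for ch in text
        if unicodedata.category(ch).startswith("M")
    ]


# Thai sample from above, then the Arabic sample that slugifies cleanly.
for sample in ("ฐานข้อมูล", "قاعدة البيانات"):
    print(sample, dropped_marks(sample))
```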
We may need to move to a combination of slugify with unicode and transliteration, bypass slugify for certain languages, or scrap slugification altogether. Very open to suggestions or discussions on this.
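As a starting point for the "bypass slugify" option, here is a minimal sketch of a slug function that keeps combining marks. `slugify_keep_marks` is a hypothetical name, and this is only an illustration of the idea, not a tested replacement; uniqueness is still left to the taggit manager.

```python
import re
import unicodedata


def slugify_keep_marks(value):
    """Hypothetical slugifier that keeps letters, marks and digits from any script."""
    value = unicodedata.normalize("NFKC", str(value)).lower()
    kept = []
    for ch in value:
        category = unicodedata.category(ch)
        if category[0] in "LMN" or ch in "-_":
            # Letters (L*), combining marks (M*) and numbers (N*) pass through.
            kept.append(ch)
        elif ch.isspace():
            kept.append(" ")
        # Everything else (punctuation, symbols, format characters) is dropped.
    # Collapse runs of whitespace and hyphens into a single hyphen.
    return re.sub(r"[-\s]+", "-", "".join(kept)).strip("-_")


print(slugify_keep_marks("हिंदी में जानकारी"))
```

One caveat with this sketch: it drops format characters such as the zero-width joiner and non-joiner (category Cf), which some scripts use to control shaping, so whether to keep those would need input from people who read the affected languages.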