[Tagging] Slugify Mangling Tags Written in Brahmic scripts / abugida writing systems (Vowels are being stripped)

The `slugify(allow_unicode=True)` call in the **`CustomTag()`** class in `tagging/models.py` is expected to pass any tag written in Unicode characters through mostly untouched, except for:
1.  removal of leading and trailing spaces
2.  replacing spaces between words with the **-** character
3.  lower-casing the words
4. ~making sure the resulting tag is "unique"~ 
   (_this is actually the job of the taggit manager code_)

_________________________

However, when tags are written in a Brahmic / abugida writing system (examples include Hindi, Telugu, Thai, Malayalam, Tamil, Kannada, and more), this code mangles the result by removing the diacritical marks and vowel signs. The translations below come from Google Translate, as I am not a speaker of any of the example languages.

The slug of **"हिंदी में जानकारी"** ("Information in Hindi") is being returned as **"-जनकर"**, which isn't a word. When I then slugify **"हिंदी-में-जानकारी"** ("information-in-Hindi"), I get back **"हद-म-जनकर"** ("half-dead").

A similar thing seems to be happening with the Telugu language - **"స్వయంచాలక"** ("automated") becomes **"సవయచలక"** -- which isn't a word. 

Additional examples:

Kannada:    **"ಡೇಟಾಬೇಸ್ ನಿರ್ವಹಣಾ ವ್ಯವಸ್ಥೆ"** becomes  **"ಡಟಬಸ-ನರವಹಣ-ವಯವಸಥ"**
Malayalam: **"ഡാറ്റാബേസ് മാനേജുമെന്റ് സിസ്റ്റം"** becomes  **"ഡററബസ-മനജമനറ-സസററ"**
Thai:            **"ฐานข้อมูล"** becomes **"ฐานขอมล"**
Burmese:    **"ဒေတာဘေ့စစီမံခန့်ခွဲမှုစနစ်"** becomes **"ဒတဘစစမခနခမစနစ"**

The real kicker here is that none of these writing systems has a "lower case" vs. "upper case" distinction.

However, slugifying  the Hebrew "מערכת ניהול מסדי נתונים" ("database management system") results in the expected "מערכת-ניהול-מסדי-נתונים". 

Slugifying the Arabic "قاعدة البيانات" ("Database") likewise results in the expected "قاعدة-البيانات".

Tests with traditional & simplified Chinese characters, Korean, and multiple Japanese variants are also fine, as is Persian.
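A possible explanation for the split between working and broken scripts, sketched under the assumption that the `slugify` in question is Django's `django.utils.text.slugify`, which (as far as I know) filters its input with the pattern `[^\w\s-]`. In Python's `re` module, `\w` matches letters and digits but not combining marks (Unicode categories `Mn` and `Mc`), which is where Brahmic vowel signs and viramas live. The helper names below are hypothetical and exist only for this illustration.

```python
import re
import unicodedata

# Assumption: this mirrors the character filter used inside slugify().
WORD_FILTER = re.compile(r"[^\w\s-]")


def filter_like_slugify(value: str) -> str:
    """Drop every character that is not a word character, whitespace, or hyphen."""
    return WORD_FILTER.sub("", value)


def dropped_characters(value: str):
    """List (code point, Unicode category, name) for each character the filter removes."""
    return [
        (f"U+{ord(ch):04X}", unicodedata.category(ch), unicodedata.name(ch, "?"))
        for ch in value
        if WORD_FILTER.fullmatch(ch)
    ]


telugu = "స్వయంచాలక"
hebrew = "מערכת ניהול מסדי נתונים"

print(filter_like_slugify(telugu))
for row in dropped_characters(telugu):
    print(row)

# Unpointed Hebrew consists of letters only, so nothing should be listed here.
print(dropped_characters(hebrew))
```

If the assumption holds, the first line printed should match the mangled Telugu slug reported above, and every dropped character should have a category starting with `M`.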

We may need to combine slugify with Unicode and transliteration, bypass slugify for certain languages, or scrap slugification altogether. Very open to suggestions or discussions on this.
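As one starting point for that discussion, here is a minimal sketch of a replacement that keeps combining marks. `slugify_keep_marks` is a hypothetical name, not an existing Django or taggit function, and the choice to keep Unicode categories `L`, `M`, and `N` is an assumption about what "mostly untouched" should mean.

```python
import re
import unicodedata


def slugify_keep_marks(value: str) -> str:
    """Hypothetical slugify variant that preserves combining marks.

    Keeps letters (L*), marks (M*), numbers (N*), hyphens, and underscores;
    turns whitespace into hyphens; drops everything else.
    """
    value = unicodedata.normalize("NFKC", str(value)).lower()
    kept = []
    for ch in value:
        if ch in "-_" or unicodedata.category(ch)[0] in ("L", "M", "N"):
            kept.append(ch)
        elif ch.isspace():
            kept.append(" ")
        # Any other character (punctuation, symbols, format controls) is dropped.
    # Collapse runs of whitespace and hyphens, then trim the ends.
    return re.sub(r"[-\s]+", "-", "".join(kept)).strip("-_")


print(slugify_keep_marks("हिंदी में जानकारी"))
print(slugify_keep_marks("  Database Management System "))
```

One caveat: this sketch also drops zero-width joiners and non-joiners (category `Cf`), which some Brahmic scripts use to control rendering, so it would need review by someone who reads those scripts.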
