
[Tagging] Slugify Mangling Tags Written in Brahmic scripts / abugida writing systems (Vowels are being stripped) #123

Closed
@BethanyG


The slugify(allow_unicode=True) call in the CustomTag() class in tagging/models.py is expected to pass any tag written in Unicode characters through mostly untouched, except for:

  1. removal of leading and trailing spaces
  2. replacing spaces between words with the - character
  3. lower-casing the words
  4. making sure the resulting tag is "unique"
    (this is actually the job of the taggit manager code)
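A minimal sketch of the expected behaviour listed above, using only the standard library. The name `expected_slug` is hypothetical and this is not the Django or taggit implementation; step 4 (uniqueness) is left out because it belongs to the taggit manager.

```python
import re


def expected_slug(tag: str) -> str:
    """Hypothetical helper: the transformation expected for a Unicode tag."""
    tag = tag.strip()                # 1. remove leading and trailing spaces
    tag = re.sub(r"\s+", "-", tag)   # 2. replace spaces between words with "-"
    return tag.lower()               # 3. lower-case (no effect on caseless scripts)
```

Under this sketch every character other than whitespace survives, so combining vowel signs stay attached to their consonants.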

However, when tags are written in a Brahmic / abugida writing system (examples include Hindi, Telugu, Thai, Malayalam, Tamil, Kannada, and more), this code mangles the result by removing the diacritical marks and vowel signs. I am going off Google translations here, as I am not a speaker of any of the example languages.

The slug of "हिंदी में जानकारी" ("Information in Hindi") is being returned as "-जनकर" which isn't a word. Attempting to then slugify "हिंदी-में-जानकारी" ("information-in-Hindi"), I get back "हद-म-जनकर" ("half-dead").

A similar thing seems to be happening with Telugu: "స్వయంచాలక" ("automated") becomes "సవయచలక", which isn't a word.

Additional examples:

Kannada: "ಡೇಟಾಬೇಸ್ ನಿರ್ವಹಣಾ ವ್ಯವಸ್ಥೆ" becomes "ಡಟಬಸ-ನರವಹಣ-ವಯವಸಥ"
Malayalam: "ഡാറ്റാബേസ് മാനേജുമെന്റ് സിസ്റ്റം" becomes "ഡററബസ-മനജമനറ-സസററ"
Thai: "ฐานข้อมูล" becomes "ฐานขอมล"
Burmese: "ဒေတာဘေ့စစီမံခန့်ခွဲမှုစနစ်" becomes "ဒတဘစစမခနခမစနစ"
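A likely explanation for the examples above, sketched with the standard library only. The assumption (not verified against the installed Django version) is that slugify filters characters with a regex along the lines of `[^\w\s-]`. In Python's `re` module, `\w` matches alphanumeric characters and the underscore, which does not include combining marks (Unicode categories Mn and Mc), so dependent vowel signs and viramas would be dropped. The function names here are hypothetical.

```python
import re
import unicodedata


def strip_like_slugify(text: str) -> str:
    """Approximate the assumed filter: keep word characters, whitespace, hyphens."""
    return re.sub(r"[^\w\s-]", "", text)


def dropped_characters(text: str) -> list:
    """Return (code point, Unicode category) for each character the filter removes."""
    return [
        (f"U+{ord(ch):04X}", unicodedata.category(ch))
        for ch in text
        if not re.match(r"[\w\s-]", ch)
    ]
```

Calling `dropped_characters("हिंदी")` lists the vowel signs and the anusvara, which are exactly the characters missing from the mangled slug.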

The real kicker here is that none of these languages have a "lower case" vs "upper case" distinction.

However, slugifying the Hebrew "מערכת ניהול מסדי נתונים" ("database management system") results in the expected "מערכת-ניהול-מסדי-נתונים".

And slugifying the Arabic "قاعدة البيانات" ("Database") results in "قاعدة-البيانات" ("Database").

Tests with traditional & simplified Chinese characters, Korean, and multiple Japanese variants are also fine, as is Persian.
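One possible reason these scripts come through intact is that the test strings contain no combining marks at all. The check below is a small sketch for confirming that; `has_combining_marks` is a hypothetical name. Note that pointed Hebrew or vocalised Arabic does use combining marks, so such text would presumably be affected too.

```python
import unicodedata


def has_combining_marks(text: str) -> bool:
    """True if any character is in a Unicode mark category (Mn, Mc or Me)."""
    return any(unicodedata.category(ch).startswith("M") for ch in text)
```

Running this over each test tag would separate the scripts at risk (Hindi, Telugu, Thai, and so on) from the ones that slugify cleanly.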

We may need to either combine slugify with Unicode handling and transliteration, bypass slugify for certain languages, or scrap slugification altogether. Very open to suggestions or discussion on this.
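One way to work around slugify for the affected scripts might be a replacement that keeps combining marks. This is only a sketch under stated assumptions: `slugify_keep_marks` is a hypothetical name, it has not been tested against taggit, and it filters by Unicode category (letters, numbers, marks) rather than by `\w`. It still drops format characters such as ZWJ/ZWNJ, which some scripts rely on, so it would need review by speakers of the languages involved.

```python
import re
import unicodedata


def slugify_keep_marks(value: str) -> str:
    """Hypothetical slugify variant that preserves combining marks (Mn, Mc, Me)."""
    value = unicodedata.normalize("NFKC", str(value)).strip().lower()
    kept = []
    for ch in value:
        major = unicodedata.category(ch)[0]  # "L" letter, "N" number, "M" mark, ...
        if major in ("L", "N", "M") or ch in "-_" or ch.isspace():
            kept.append(ch)
    # Collapse runs of whitespace and hyphens into one hyphen, then trim the ends.
    return re.sub(r"[-\s]+", "-", "".join(kept)).strip("-_")
```

It could be wired in wherever CustomTag currently calls slugify, leaving the uniqueness handling in taggit untouched.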

Labels: bug, help wanted, needs discussion