Diacritic-independent preset search #8242
Comments
I forgot to mention that this has come up before in #3979. At the time the solution was just to add search terms, but now we can solve the general case by taking advantage of modern JavaScript. Thanks to @willemarcel for bringing up this issue.
In my implementation, I made the search ignore diacritics in names altogether. I'm afraid this could cause issues if two preset names are identical except for different combinations of diacritics. @1ec5 do you have any idea whether this would become a problem? I also left search terms alone, but I should probably make them work the same way as the preset name.
This is a concern specifically in Vietnamese, which abounds in minimal pairs after stripping diacritics. There are lots of collisions when stripping diacritics from both the search terms and the preset names, especially when allowing for some edit distance.

I’ve been manually hard-coding diacritic-stripped variants at the end of every list of terms in the Vietnamese localization: #3979 (comment), #3159 (comment). That way, the user gets the behavior you’ve implemented automatically, so that “dau” will match both “dấu” and “đậu”, but they can easily turn it into a search for “đậu” to narrow down the results. It isn’t perfect, because Vietnamese has two levels of diacritics: “lô” should match “lỗ” but not “lò”. The weighting was also quite imperfect, because some lists of terms are naturally longer than others. But it was the best I could do with what iD supported at the time.

With this change, iD does yield better results in some cases when omitting diacritics: “sân chơi” returns “sân chơi” first instead of “sân giải trí” (which has “sân chơi” as a synonym). However, if I type in “dau” expecting “đậu”, there’s no way to filter out “dấu” except by typing exactly “đậu”. So the results will be better sometimes but worse in other cases.

It would actually be quite easy to write a small amount of Vietnamese-specific logic to avoid the bloat from hard-coded diacritic stripping while improving results in every case. However, there didn’t seem to be any appetite for language-specific special cases in the past. On native platforms, the CLDR and ICU libraries make it possible to perform searches with more nuance. But the last time I checked, the
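To make the “two levels” point concrete, here is a rough sketch of one possible matching rule, not something iD implements: a query letter matches a name letter when the base letters agree and every diacritic typed in the query is also present on the name letter. The function names (`decompose`, `unitMatches`, `looseIncludes`) are hypothetical, and treating “đ” as “d” plus a combining stroke is an assumption made only so that it fits the same rule.

```javascript
// Split text into units of { base letter, set of combining marks }.
// Assumption: "đ" is rewritten as "d" + U+0335 (combining short stroke overlay),
// because "đ" has no canonical decomposition under NFD.
function decompose(text) {
  const nfd = text.toLowerCase().replace(/đ/g, 'd\u0335').normalize('NFD');
  const units = [];
  for (const ch of nfd) {
    const isMark = /[\u0300-\u036f]/.test(ch);
    if (isMark && units.length > 0) {
      units[units.length - 1].marks.add(ch);
    } else {
      units.push({ base: ch, marks: new Set() });
    }
  }
  return units;
}

// A query unit matches a target unit if the bases agree and every mark
// the user typed is also present on the target letter.
function unitMatches(queryUnit, targetUnit) {
  if (queryUnit.base !== targetUnit.base) return false;
  for (const mark of queryUnit.marks) {
    if (!targetUnit.marks.has(mark)) return false;
  }
  return true;
}

// True if the query occurs somewhere in the target under the rule above.
function looseIncludes(query, target) {
  const q = decompose(query);
  const t = decompose(target);
  for (let start = 0; start + q.length <= t.length; start++) {
    if (q.every((unit, i) => unitMatches(unit, t[start + i]))) return true;
  }
  return false;
}

console.log(looseIncludes('dau', 'dấu'));  // true
console.log(looseIncludes('dau', 'đậu'));  // true
console.log(looseIncludes('đậu', 'dấu'));  // false
console.log(looseIncludes('lô', 'lỗ'));    // true
console.log(looseIncludes('lô', 'lò'));    // false
```

Under this rule, typing more diacritics only ever narrows the results, which is the behavior described above. It does not address ranking or edit distance.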
Now that this is in place, the preexisting diacritic stripping is redundant (and probably not as good for performance, given where it is): lines 415 to 416 in 655c3a6.
@1ec5 Ah, I didn't realize iD was doing this.
I'm told that in languages where accents and other diacritic marks are common, software users are accustomed to leaving them off when searching. iD handles some of these cases incidentally when searching presets, either due to fuzzy matching or generous search terms, but in other cases it clearly fails.
Luckily, it's straightforward in ES6 to strip diacritics from strings before comparing them. This will be a great localization improvement and also let translators reduce repetitive search terms.
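As a minimal sketch of the ES6 approach mentioned above (not necessarily how iD implements it), a string can be decomposed with `String.prototype.normalize('NFD')` and the combining marks removed before comparing. The helper names `stripDiacritics` and `matchesIgnoringDiacritics` are hypothetical.

```javascript
// Decompose accented letters into base letter + combining marks (NFD),
// then drop the marks in the Combining Diacritical Marks block.
function stripDiacritics(text) {
  return text.normalize('NFD').replace(/[\u0300-\u036f]/g, '');
}

// Case-insensitive, diacritic-insensitive substring test.
function matchesIgnoringDiacritics(name, query) {
  const haystack = stripDiacritics(name.toLowerCase());
  const needle = stripDiacritics(query.toLowerCase());
  return haystack.includes(needle);
}

console.log(stripDiacritics('café'));      // "cafe"
console.log(stripDiacritics('sân chơi'));  // "san choi"
console.log(matchesIgnoringDiacritics('Sân chơi', 'san choi'));  // true
```

One caveat with this sketch: letters such as “đ” have no canonical decomposition, so `normalize('NFD')` leaves them unchanged and they would need separate handling.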