Diacritic-independent preset search #8242

Closed
quincylvania opened this issue Dec 9, 2020 · 5 comments
Labels: localization (Adapting iD across languages, regions, and cultures), preset (An issue with an OpenStreetMap preset or tag)

@quincylvania
Collaborator

I'm told that in languages where accents and other diacritic marks are common, software users are accustomed to leaving them off when searching. iD handles some of these cases incidentally when searching presets, either due to fuzzy matching or generous search terms, but in other cases it clearly fails.

Luckily, it's straightforward in ES6 to strip diacritics from strings before comparing them. This will be a great localization improvement and also let translators reduce repetitive search terms.
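A minimal sketch of the ES6 approach mentioned above, assuming Unicode normalization plus a regex over the Combining Diacritical Marks block. The names `stripDiacritics` and `matchesPreset` are hypothetical and are not taken from iD's code.

```javascript
// Hypothetical helper, not iD's actual implementation.
function stripDiacritics(str) {
    // NFD splits precomposed characters into a base letter plus combining marks,
    // e.g. "é" becomes "e" followed by U+0301.
    return str
        .normalize('NFD')
        // Drop every code point in the Combining Diacritical Marks block.
        .replace(/[\u0300-\u036f]/g, '');
}

// Hypothetical comparison: case- and diacritic-insensitive prefix match.
function matchesPreset(query, presetName) {
    const q = stripDiacritics(query.toLowerCase());
    const name = stripDiacritics(presetName.toLowerCase());
    return name.startsWith(q);
}

console.log(stripDiacritics('Café'));        // "Cafe"
console.log(matchesPreset('cafe', 'Café'));  // true
```

Letters such as "đ" or "ø" have no canonical decomposition, so this approach leaves them unchanged; they would need an explicit mapping.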

[Two screenshots attached, captured 2020-12-09]

@quincylvania quincylvania added the localization Adapting iD across languages, regions, and cultures label Dec 9, 2020
@quincylvania quincylvania added this to the 2.20.0 milestone Dec 9, 2020
@quincylvania quincylvania self-assigned this Dec 9, 2020
@quincylvania
Collaborator Author

I forgot to mention that this has come up before in #3979. At the time the solution was just to add search terms, but now we can solve the general case by taking advantage of modern JavaScript. Thanks to @willemarcel for bringing up this issue.

@quincylvania quincylvania added the preset An issue with an OpenStreetMap preset or tag label Dec 9, 2020
@quincylvania
Collaborator Author

quincylvania commented Dec 9, 2020

So in my implementation I made the search ignore diacritics in names altogether, which I'm afraid could cause issues if there are two preset names that are otherwise identical except for different combinations of diacritics. @1ec5 do you have any idea if this would become a problem?

I also left search terms alone, but I should probably make them work the same as the preset name.
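One way to gauge how often such collisions occur would be to group names by their stripped form and report the groups with more than one member. This is only a sketch: `foldName` and `findCollisions` are hypothetical names, and the sample list is made up rather than taken from any real preset data.

```javascript
// Hypothetical helper: fold case and strip combining marks.
function foldName(name) {
    return name.toLowerCase().normalize('NFD').replace(/[\u0300-\u036f]/g, '');
}

// Group names by folded form and keep only groups with more than one distinct name.
function findCollisions(names) {
    const groups = new Map();
    for (const name of names) {
        const key = foldName(name);
        if (!groups.has(key)) groups.set(key, new Set());
        // Store NFC forms so the same name in two encodings counts once.
        groups.get(key).add(name.normalize('NFC'));
    }
    return [...groups]
        .filter(([, members]) => members.size > 1)
        .map(([key, members]) => ({ key, names: [...members] }));
}

// Made-up sample, not a real preset list.
const sampleNames = ['dấu', 'dầu', 'sân chơi'];
console.log(findCollisions(sampleNames));
```

Running something like this over each locale's preset names would show whether the concern is theoretical or common for that language.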

@1ec5
Collaborator

1ec5 commented Dec 10, 2020

> So in my implementation I made the search ignore diacritics in names altogether, which I'm afraid could cause issues if there are two preset names that are otherwise identical except for different combinations of diacritics. @1ec5 do you have any idea if this would become a problem?

This is a concern specifically in Vietnamese, which abounds in minimal pairs after stripping diacritics. There are lots of collisions when stripping diacritics from both the search terms and the preset names, especially when allowing for some edit distance.

I’ve been manually hard-coding diacritic-stripped variants at the end of every list of terms in the Vietnamese localization: see #3979 (comment) and #3159 (comment). That way, the user can get the behavior you’ve implemented automatically so that “dau” will match both “dấu” and “đậu”, but they can easily turn it into a search for “đậu” to narrow down the results. It isn’t perfect, because Vietnamese has two levels of diacritics, marks that change the letter itself and marks that indicate tone: “lô” should match “lỗ” but not “lò”. The weighting was also quite imperfect, because some lists of terms are naturally longer than others. But it was the best I could do with what iD supported at the time.
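The two-level behavior described above could be sketched as follows, assuming the tone marks are the combining grave, acute, tilde, hook above, and dot below, while circumflex, breve, horn, and the letter "đ" belong to the letter itself. All function names are hypothetical, and the sketch compares whole words only, whereas a real search would also need prefix and fuzzy matching.

```javascript
// Combining tone marks: grave, acute, tilde, hook above, dot below.
const TONE_MARKS = /[\u0300\u0301\u0303\u0309\u0323]/g;
// Every mark in the Combining Diacritical Marks block (tones plus circumflex, breve, horn).
const ALL_MARKS = /[\u0300-\u036f]/g;

// Remove tones only, keeping letters such as "ô" or "ư" intact.
function stripTones(str) {
    return str.normalize('NFD').replace(TONE_MARKS, '').normalize('NFC');
}

// Remove every mark and map "đ"/"Đ" (which do not decompose) to "d"/"D".
function stripAll(str) {
    return str
        .normalize('NFD')
        .replace(ALL_MARKS, '')
        .replace(/\u0111/g, 'd')
        .replace(/\u0110/g, 'D');
}

// A query matches when it equals the candidate at the level of detail the user typed:
// exact, without tones, or without any diacritics.
function matchesVietnamese(query, candidate) {
    const q = query.toLowerCase().normalize('NFC');
    const c = candidate.toLowerCase().normalize('NFC');
    return q === c || q === stripTones(c) || q === stripAll(c);
}

console.log(matchesVietnamese('lô', 'lỗ'));   // true
console.log(matchesVietnamese('lô', 'lò'));   // false
console.log(matchesVietnamese('dau', 'đậu')); // true
console.log(matchesVietnamese('đậu', 'dấu')); // false
```

Under these assumptions, a bare “dau” still matches both words, while adding any diacritic to the query narrows the results instead of being discarded.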

With this change, iD does yield better results in some cases when omitting diacritics: “sân chơi” returns “sân chơi” first instead of “sân giải trí” (which has “sân chơi” as a synonym). However, if I type in “dau” expecting “đậu”, there’s no way to filter out “dấu” except by typing exactly “đậu”. So the results will be better sometimes but worse in other cases.

It would actually be quite easy to write a small amount of Vietnamese-specific logic to avoid the bloat from hard-coded diacritic stripping while improving results in every case. However, there didn’t seem to be any appetite for language-specific special cases in the past. On native platforms, the CLDR and ICU libraries make it possible to perform searches with more nuance. But the last time I checked, the Intl API in JavaScript didn’t expose quite enough ICU-based internationalization functionality for diacritic-insensitive search, only sorting.
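For context, `Intl.Collator` can compare two whole strings while ignoring diacritics through its `sensitivity: 'base'` option, but it exposes no substring or prefix search, which may be the gap described above. A small sketch, assuming an English collation (which letters count as the same base letter depends on the locale); `sameIgnoringDiacritics` is a hypothetical name.

```javascript
// sensitivity: 'base' treats strings that differ only by accents or case as equal.
const collator = new Intl.Collator('en', { sensitivity: 'base' });

// Whole-string comparison only; the collator has no "contains" or "starts with".
function sameIgnoringDiacritics(a, b) {
    return collator.compare(a, b) === 0;
}

console.log(sameIgnoringDiacritics('resume', 'Résumé'));  // true
console.log(sameIgnoringDiacritics('resume', 'resumes')); // false
```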

@1ec5
Collaborator

1ec5 commented Dec 23, 2020

Now that this is in place, the preexisting diacritic stripping is redundant (and probably not as good for performance, given where it is):

iD/modules/util/util.js, lines 415 to 416 in 655c3a6:

    a = removeDiacritics(a.toLowerCase());
    b = removeDiacritics(b.toLowerCase());
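The performance point could be illustrated like this: fold each name once when building an index, so the distance function never has to normalize its inputs on every call. This is a hedged sketch rather than iD's code; `fold` is a hypothetical stand-in for `removeDiacritics`, and `editDistance`, `buildIndex`, and `fuzzySearch` are illustrative names.

```javascript
// Hypothetical stand-in for removeDiacritics plus lowercasing.
function fold(str) {
    return str.toLowerCase().normalize('NFD').replace(/[\u0300-\u036f]/g, '');
}

// Plain Levenshtein distance over UTF-16 code units; assumes inputs are already folded.
function editDistance(a, b) {
    if (a.length === 0) return b.length;
    if (b.length === 0) return a.length;
    let prev = Array.from({ length: b.length + 1 }, (_, j) => j);
    for (let i = 1; i <= a.length; i++) {
        const curr = [i];
        for (let j = 1; j <= b.length; j++) {
            const cost = a[i - 1] === b[j - 1] ? 0 : 1;
            curr[j] = Math.min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost);
        }
        prev = curr;
    }
    return prev[b.length];
}

// Fold every name once up front.
function buildIndex(names) {
    return names.map(name => ({ name, folded: fold(name) }));
}

// Fold the query once, then compare against the precomputed forms.
function fuzzySearch(index, query, maxDistance) {
    const q = fold(query);
    return index
        .filter(entry => editDistance(q, entry.folded) <= maxDistance)
        .map(entry => entry.name);
}

const index = buildIndex(['Café', 'Cinema', 'Bench']);
console.log(fuzzySearch(index, 'cafe', 1));
```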

@quincylvania
Collaborator Author

@1ec5 Ah, I didn't realize iD was doing this.
