Diacritic-independent preset search #8242

quincylvania · 2020-12-09T21:30:42Z

I'm told that in languages where accents and other diacritic marks are common, software users are accustomed to leaving them off when searching. iD handles some of these cases incidentally when searching presets, either due to fuzzy matching or generous search terms, but in other cases it clearly fails.

Luckily, it's straightforward in ES6 to strip diacritics from strings before comparing them. This will be a great localization improvement and also let translators reduce repetitive search terms.
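For illustration, a minimal sketch of the ES6 approach, assuming a hypothetical helper named `stripDiacritics` (not necessarily the function iD ended up with). It relies only on the standard `String.prototype.normalize` and a regular expression over the Combining Diacritical Marks block:

```javascript
// Hypothetical helper for illustration; the name is an assumption.
function stripDiacritics(str) {
    // NFD splits each precomposed letter into a base letter plus combining marks.
    // The regex then drops every mark in U+0300..U+036F.
    return str.normalize('NFD').replace(/[\u0300-\u036f]/g, '');
}

// Hypothetical loose comparison: both sides are stripped and lowercased first.
function looseIncludes(name, query) {
    return stripDiacritics(name).toLowerCase()
        .includes(stripDiacritics(query).toLowerCase());
}
```

Letters that have no canonical decomposition, such as “đ”, pass through unchanged, so a real implementation would probably need an explicit mapping for them.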

quincylvania · 2020-12-09T21:56:17Z

I forgot to mention that this has come up before in #3979. At the time the solution was just to add search terms, but now we can solve the general case by taking advantage of modern JavaScript. Thanks to @willemarcel for bringing up this issue.

quincylvania · 2020-12-09T22:17:07Z

So in my implementation I made the search ignore diacritics in names altogether, which I'm afraid could cause issues if there are two preset names that are otherwise identical except for different combinations of diacritics. @1ec5 do you have any idea if this would become a problem?

I also left search terms alone, but I should probably make them work the same as the preset name.

1ec5 · 2020-12-10T03:50:21Z

> So in my implementation I made the search ignore diacritics in names altogether, which I'm afraid could cause issues if there are two preset names that are otherwise identical except for different combinations of diacritics. @1ec5 do you have any idea if this would become a problem?

This is a concern specifically in Vietnamese, which abounds in minimal pairs after stripping diacritics. There are lots of collisions when stripping diacritics from both the search terms and the preset names, especially when allowing for some edit distance.

I’ve been manually hard-coding diacritic-stripped variants at the end of every list of terms in the Vietnamese localization: #3979 (comment) #3159 (comment). That way, the user can get the behavior you’ve implemented automatically so that “dau” will match both “dấu” and “đậu”, but they can easily turn it into a search for “đậu” to narrow down the results. It isn’t perfect, because Vietnamese has two levels of diacritics – “lô” should match “lỗ” but not “lò”. The weighting was also quite imperfect, because some lists of terms are naturally longer than others. But it was the best I could do with what iD supported at the time.

With this change, iD does yield better results in some cases when omitting diacritics: “sân chơi” returns “sân chơi” first instead of “sân giải trí” (which has “sân chơi” as a synonym). However, if I type in “dau” expecting “đậu”, there’s no way to filter out “dấu” except by typing exactly “đậu”. So the results will be better sometimes but worse in other cases.
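As a rough sketch of the behavior described here, assuming a hypothetical `searchNames` function over a plain array of names (iD's real preset index is more involved): a query typed without diacritics is compared against stripped names, while a query that contains diacritics is compared against the names as written.

```javascript
// Hypothetical helper; "đ" is mapped by hand because it has no canonical decomposition.
function stripDiacritics(str) {
    return str.normalize('NFD')
        .replace(/[\u0300-\u036f]/g, '')
        .replace(/đ/g, 'd')
        .replace(/Đ/g, 'D')
        .normalize('NFC');
}

// Hypothetical search over an array of preset names.
function searchNames(query, names) {
    const q = query.toLowerCase().normalize('NFC');
    // If stripping changes nothing, the user typed no diacritics.
    const queryIsPlain = stripDiacritics(q) === q;
    return names.filter(name => {
        const n = name.toLowerCase().normalize('NFC');
        return (queryIsPlain ? stripDiacritics(n) : n).includes(q);
    });
}
```

With the names “dấu”, “đậu”, and “sân chơi”, the query “dau” returns the first two, and only the fully accented “đậu” narrows the results to one.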

It would actually be quite easy to write a small amount of Vietnamese-specific logic to avoid the bloat from hard-coded diacritic stripping while improving results in every case. However, there didn’t seem to be any appetite for language-specific special cases in the past. On native platforms, the CLDR and ICU libraries make it possible to perform searches with more nuance. But the last time I checked, the Intl API in JavaScript didn’t expose quite enough ICU-based internationalization functionality for diacritic-insensitive search, only sorting.
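One possible shape for such Vietnamese-specific logic, as a sketch only: treat tone marks and vowel-quality marks as two separate levels, and strip from the preset name only the levels the user left out of the query. The function names and the choice to pick one level for the whole query are assumptions made for illustration.

```javascript
// Tone marks: grave, acute, tilde, hook above, dot below.
const TONE_MARKS = /[\u0300\u0301\u0303\u0309\u0323]/g;
// All combining marks, including circumflex, breve, and horn.
const ALL_MARKS = /[\u0300-\u036f]/g;

// Removes tones but keeps vowel quality: "lỗ" becomes "lô".
function stripTones(str) {
    return str.normalize('NFD').replace(TONE_MARKS, '').normalize('NFC');
}

// Removes every mark and maps "đ" to "d": "đậu" becomes "dau".
function stripAll(str) {
    return str.normalize('NFD')
        .replace(ALL_MARKS, '')
        .replace(/đ/g, 'd')
        .replace(/Đ/g, 'D')
        .normalize('NFC');
}

// Hypothetical matcher: the query decides how much of the name is stripped.
function matchesVietnamese(query, name) {
    const q = query.toLowerCase().normalize('NFC');
    const n = name.toLowerCase().normalize('NFC');
    if (stripAll(q) === q) return stripAll(n).includes(q);     // no marks typed
    if (stripTones(q) === q) return stripTones(n).includes(q); // vowel quality only
    return n.includes(q);                                      // tones typed: exact
}
```

Under these assumptions “lô” matches “lỗ” but not “lò”, while a bare “lo” matches both. Deciding the level per syllable or per letter would be a refinement of the same idea.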

1ec5 · 2020-12-23T00:24:19Z

Now that this is in place, the preexisting diacritic stripping is redundant (and probably not as good for performance, given where it is):

iD/modules/util/util.js

Lines 415 to 416 in 655c3a6

    
    a = removeDiacritics(a.toLowerCase());
    b = removeDiacritics(b.toLowerCase());

quincylvania · 2020-12-23T14:54:54Z

@1ec5 Ah, I didn't realize iD was doing this.

quincylvania added the "localization" label (Adapting iD across languages, regions, and cultures) Dec 9, 2020

quincylvania added this to the 2.20.0 milestone Dec 9, 2020

quincylvania self-assigned this Dec 9, 2020

quincylvania added the "preset" label (An issue with an OpenStreetMap preset or tag) Dec 9, 2020

quincylvania closed this as completed in b3ad282 Dec 9, 2020

quincylvania added a commit that referenced this issue Dec 21, 2020

Honor diacritics on preset search value but still compare to stripped preset names (re: #8242)

2591a13

quincylvania added a commit that referenced this issue Dec 21, 2020

Fix preset search result sorting (re: #8242)

6cd81df

mbrzakovic mentioned this issue Jul 26, 2021

Update to iD v2.20.0 openstreetmap/openstreetmap-website#3270

Merged

1ec5 mentioned this issue May 28, 2022

"parking lot" should find amenity=parking openstreetmap/id-tagging-schema#461

Open
