[Bug] Spanish analyzer not normalizing all accented words. #1956

svera · 2024-01-08T11:56:50Z

Hi,
I've been working with the Spanish analyzer to index documents in Spanish, and I found an issue with some accented words that are not normalized as they should be. This is easily reproducible using the Bleve text analysis wizard: choose the es analyzer and enter fría or guía in the "Text to analyze" input box. The words are kept as they are; however, the plural forms of these words (guías and frías) are normalized as expected.

Other accented words such as colección or comeré are correctly stemmed to coleccion and comer, respectively.


svera · 2024-01-08T14:49:47Z

I've found where the issue comes from. Looking at bleve/analysis/lang/es/light_stemmer_es.go, the normalization of accented letters only happens if the input is at least 5 characters long, a condition that neither guía nor fría (4 characters each) meets.
The solution would be to always run the accented-character normalization, by moving it to a separate file as is done in the German analyzer, or perhaps to use the asciifolding filter.
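To illustrate the difference, here is a minimal, self-contained sketch. It is not the actual Bleve code: the names `foldAccents`, `guardedNormalize`, `alwaysNormalize` and the constant `minStemLen` are hypothetical, the accent table is an assumption, and the guard is assumed to skip words shorter than 5 characters (which matches guías, at 5 characters, being normalized). Suffix stemming is left out on purpose, so only the accent handling is shown.

```go
package main

import "fmt"

// minStemLen is an assumed threshold: words shorter than this skip
// the guarded code path entirely (hypothetical name and value).
const minStemLen = 5

// foldAccents maps accented vowels to their plain forms.
// The table is an illustrative assumption, not Bleve's exact table.
func foldAccents(word []rune) []rune {
	out := make([]rune, len(word))
	for i, r := range word {
		switch r {
		case 'á', 'à', 'â', 'ä':
			out[i] = 'a'
		case 'é', 'è', 'ê', 'ë':
			out[i] = 'e'
		case 'í', 'ì', 'î', 'ï':
			out[i] = 'i'
		case 'ó', 'ò', 'ô', 'ö':
			out[i] = 'o'
		case 'ú', 'ù', 'û', 'ü':
			out[i] = 'u'
		default:
			out[i] = r
		}
	}
	return out
}

// guardedNormalize models the reported behavior: short words are
// returned before any accent folding takes place.
func guardedNormalize(word string) string {
	runes := []rune(word)
	if len(runes) < minStemLen {
		return word
	}
	return string(foldAccents(runes))
}

// alwaysNormalize models the proposed fix: accent folding runs
// regardless of the word length.
func alwaysNormalize(word string) string {
	return string(foldAccents([]rune(word)))
}

func main() {
	for _, w := range []string{"guía", "fría", "guías", "frías"} {
		fmt.Printf("%s guarded=%s always=%s\n", w, guardedNormalize(w), alwaysNormalize(w))
	}
}
```

Under these assumptions, guía and fría pass through the guarded version unchanged, while the unconditional version folds them to guia and fria, the same way the 5-character plurals are already handled.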

svera · 2024-01-08T17:33:14Z

PR opened addressing this issue: #1957

abhinavdangeti · 2024-01-08T17:34:52Z

Thanks for raising the pull request, @svera. The team will review it soon.

Normalization of accented letters only happens if the input is at least 5 characters long, a condition that, for example, neither `guía` nor `fría` meets. The solution is to always run the accented-character normalization by moving it to a separate file, as is done in the German analyzer. Fixes: #1956

abhinavdangeti mentioned this issue Jan 8, 2024

Fixed spanish accents normalization #1957

Merged

abhinavdangeti closed this as completed in #1957 Jan 10, 2024
