[Bug] Spanish analyzer not normalizing all accented words. #1956

Closed
svera opened this issue Jan 8, 2024 · 3 comments · Fixed by #1957

Comments

@svera
Contributor

svera commented Jan 8, 2024

Hi,
I've been working with the Spanish analyzer to index documents in Spanish, and I found an issue: some accented words are not normalized as they should be. This is easy to reproduce with the Bleve text analysis wizard: choose the es analyzer and enter fría or guía in the "Text to analyze" input box. The words are kept as they are. The plural forms of these words (guías and frías), however, are handled as expected.

Other accented words, such as tentación or comeré, are correctly reduced to tentacion and comer, respectively.

@svera
Contributor Author

svera commented Jan 8, 2024

I've found where the issue comes from. In bleve/analysis/lang/es/light_stemmer_es.go, accented letters are normalized only when the input is at least 5 characters long, a condition that neither guía nor fría (4 characters each) meets.
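
To make the effect of that length guard concrete, here is a minimal sketch. It is not the contents of light_stemmer_es.go: the names `foldAccent`, `guardedNormalize` and `minLen` are made up for illustration, suffix stripping is left out entirely, and the threshold of 5 is an assumption chosen so that guías (5 characters) passes while guía (4 characters) does not, matching the behaviour reported above.

```go
package main

import "fmt"

// minLen is an assumed threshold, picked to match the behaviour
// described in this issue (4-character words are skipped).
const minLen = 5

// foldAccent maps an accented vowel to its unaccented form and
// returns every other rune unchanged.
func foldAccent(r rune) rune {
	switch r {
	case 'á', 'à', 'â', 'ä':
		return 'a'
	case 'é', 'è', 'ê', 'ë':
		return 'e'
	case 'í', 'ì', 'î', 'ï':
		return 'i'
	case 'ó', 'ò', 'ô', 'ö':
		return 'o'
	case 'ú', 'ù', 'û', 'ü':
		return 'u'
	}
	return r
}

// guardedNormalize folds accents only when the term has at least
// minLen runes, so short terms come back untouched.
func guardedNormalize(term string) string {
	runes := []rune(term)
	if len(runes) < minLen {
		return term // early return: accents are never folded
	}
	for i, r := range runes {
		runes[i] = foldAccent(r)
	}
	return string(runes)
}

func main() {
	for _, w := range []string{"guía", "fría", "guías", "frías"} {
		fmt.Printf("%s -> %s\n", w, guardedNormalize(w))
	}
}
```

In this sketch the two singular forms are returned with their accents intact, while the plurals are folded, because the early return happens before any folding is attempted.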
A solution would be to always run the accented-character normalization, either by moving it to a separate file, as is done in the German analyzer, or by using the asciifolding filter.
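
A rough sketch of the first option, with normalization as its own step that always runs before stemming. All names here (`normalizeSpanish`, `stemStub`, `analyze`) are hypothetical, and `stemStub` is only a stand-in that drops a final a, e or o from terms of at least 5 characters; it is not the real stemming rule set. The sketch also assumes ñ should be left as it is.

```go
package main

import (
	"fmt"
	"strings"
)

// accentFolder replaces accented vowels with their plain forms.
// ñ is intentionally not listed (an assumption of this sketch).
var accentFolder = strings.NewReplacer(
	"á", "a", "é", "e", "í", "i", "ó", "o", "ú", "u", "ü", "u",
)

// normalizeSpanish folds accents for every term, whatever its length.
func normalizeSpanish(term string) string {
	return accentFolder.Replace(term)
}

// stemStub is a placeholder for the stemmer: the length guard now
// only controls suffix stripping, not accent folding.
func stemStub(term string) string {
	runes := []rune(term)
	if len(runes) < 5 {
		return term
	}
	switch runes[len(runes)-1] {
	case 'a', 'e', 'o':
		return string(runes[:len(runes)-1])
	}
	return term
}

// analyze runs normalization first and stemming second.
func analyze(term string) string {
	return stemStub(normalizeSpanish(term))
}

func main() {
	for _, w := range []string{"guía", "fría", "comeré", "tentación"} {
		fmt.Printf("%s -> %s\n", w, analyze(w))
	}
}
```

With the two steps separated, short words such as guía and fría get their accents folded even though the stand-in stemmer still skips them.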

@svera
Contributor Author

svera commented Jan 8, 2024

PR opened addressing this issue: #1957

@abhinavdangeti
Member

Thanks for raising the pull request, @svera. The team will review it soon.

abhinavdangeti pushed a commit that referenced this issue Jan 10, 2024
Normalization of accented letters only happens if the input is at least
5 characters long, a condition that, for example, neither `guía` nor
`fría` meets.
The solution is to always execute the accented-character
normalization by moving it to a separate file, as is done in
the German analyzer.

Fixes: #1956