Change text analysis algorithm used to fix stemming issues #653

Open
sarayourfriend opened this issue Feb 7, 2023 · 0 comments
Labels
💻 aspect: code Concerns the software code in the repository ✨ goal: improvement Improvement to an existing user-facing feature 🟨 priority: medium Not blocking but should be addressed soon 🧱 stack: api Related to the Django API

Problem

Stemming issues produce a whole host of weird search results. Some of these are already explicitly covered by overrides in the text analysis configuration, specifically for "animal": https://github.com/WordPress/openverse-api/blob/478f50cdff2834d602a3df8c198f708e010e6685/ingestion_server/ingestion_server/es_mapping.py#L19-L29

Description

I suspect there must be a better way to handle this with ES than manually adding stemmer overrides for individual words. Why does a search for "news", for example, prioritise so many results for "New York" or other "new" things that aren't "news" when we certainly have results that match "news" exactly? (At the moment it seems impossible to construct a query that retrieves only results containing the exact form "news".)

Right now we use the default "snowball" stemmer (Porter algorithm). We could switch to Porter2 (Snowball calls this "English", but ES calls it porter2), which would at least solve some of these stemming issues: https://snowballstem.org/algorithms/english/stemmer.html
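As a rough illustration of what the switch could look like, here is a minimal sketch of Elasticsearch analysis settings expressed as a Python dict. The names `ANALYSIS_SETTINGS`, `stem_overrides`, `english_stemmer` and `custom_english`, and the specific override rules, are assumptions for illustration and are not copied from the project's `es_mapping.py`. The only parts taken from Elasticsearch itself are the `stemmer` filter with `"language": "porter2"` and the `stemmer_override` filter, which has to run before the stemmer it protects words from.

```python
# Hypothetical settings sketch; names and rules are illustrative only,
# not the project's actual configuration.
ANALYSIS_SETTINGS = {
    "filter": {
        # Manual overrides: mapped words are marked as keywords so the
        # stemmer that runs afterwards leaves them alone.
        "stem_overrides": {
            "type": "stemmer_override",
            "rules": [
                "animals => animal",
                "animal => animal",
                "animate => animate",
            ],
        },
        # "porter2" selects the Snowball English (Porter2) algorithm.
        "english_stemmer": {"type": "stemmer", "language": "porter2"},
    },
    "analyzer": {
        "custom_english": {
            "tokenizer": "standard",
            # Order matters: overrides must come before the stemmer.
            "filter": ["lowercase", "stem_overrides", "english_stemmer"],
        }
    },
}
```

Changing the stemmer changes the tokens stored in the index, so existing data would need to be reindexed for the new analyzer to take effect.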

Here's a JavaScript demo of porter2, where you can see that "news" works as expected (porter2 treats it as an exceptional case, which the original Porter algorithm does not). "animate" and "animal", however, still stem down to "anim", so we can't get rid of those stemmer overrides (frustrating!).
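To make the "news" case concrete, here is a deliberately simplified sketch in Python. `porter_step_1a` follows the plural-stripping rules of step 1a of the original Porter algorithm. `porter2_like` is a hypothetical helper that models only the Porter2 behaviour relevant here, namely a list of invariant words that are returned untouched, and then reuses the Porter rule as a stand-in. Neither function is a full stemmer, and the real Porter2 step 1a differs in other details.

```python
def porter_step_1a(word: str) -> str:
    """Plural stripping as in step 1a of the original Porter algorithm."""
    if word.endswith("sses"):
        return word[:-2]  # caresses -> caress
    if word.endswith("ies"):
        return word[:-2]  # ponies -> poni
    if word.endswith("ss"):
        return word       # caress -> caress
    if word.endswith("s"):
        return word[:-1]  # cats -> cat, and also news -> new
    return word


# Words the Snowball English (Porter2) description lists as invariant.
PORTER2_INVARIANTS = {"sky", "news", "howe", "atlas", "cosmos", "bias", "andes"}


def porter2_like(word: str) -> str:
    """Simplified stand-in: check the invariant list, then strip plurals."""
    if word in PORTER2_INVARIANTS:
        return word
    return porter_step_1a(word)
```

With the original rule, "news" and the "new" in "New York" end up as the same token, which is why those results compete. With the invariant check, "news" stays distinct.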

Aside from changing the stemmer algorithm, we could also look for a way to stop stemming from occurring on exact-match searches, or otherwise fix exact matches. Either exact matching isn't working or stemming still occurs: if you search for "news" in quotes, I would expect it not to stem, but currently you still get a ton of "New York" type results. I think we should do this in addition to exploring different English stemming algorithms.
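One possible approach to exact matches, sketched below in Python, is to index an unstemmed sub-field next to the stemmed one and route quoted terms to it. This assumes the search is built on Elasticsearch's `simple_query_string` query, which the issue does not state. The field name `title`, the sub-field name `exact`, the analyzer name `custom_english` and the helper `build_search_body` are all hypothetical. The `fields` multi-field mapping and the `quote_field_suffix` parameter are existing Elasticsearch features.

```python
# Hypothetical mapping: "title" is stemmed, "title.exact" is not.
FIELD_MAPPING = {
    "title": {
        "type": "text",
        "analyzer": "custom_english",  # assumed stemming analyzer
        "fields": {
            # The standard analyzer lowercases but does not stem.
            "exact": {"type": "text", "analyzer": "standard"},
        },
    }
}


def build_search_body(user_query: str) -> dict:
    """Build a query where quoted terms are matched against title.exact."""
    return {
        "query": {
            "simple_query_string": {
                "query": user_query,
                "fields": ["title"],
                # Quoted parts of the query use "<field>.exact" instead.
                "quote_field_suffix": ".exact",
            }
        }
    }
```

For example, `build_search_body('"news"')` would send the quoted term to `title.exact`, where "news" and "new" remain different tokens, while unquoted queries keep using the stemmed field.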


@sarayourfriend sarayourfriend added 🟨 priority: medium Not blocking but should be addressed soon ✨ goal: improvement Improvement to an existing user-facing feature 💻 aspect: code Concerns the software code in the repository labels Feb 7, 2023
@obulat obulat transferred this issue from WordPress/openverse-api Feb 22, 2023
@obulat obulat added 🧱 stack: api Related to the Django API and removed 🧱 stack: backend labels Mar 20, 2023