Change text analysis algorithm used to fix stemming issues #653
Labels
💻 aspect: code
Concerns the software code in the repository
✨ goal: improvement
Improvement to an existing user-facing feature
🟨 priority: medium
Not blocking but should be addressed soon
🧱 stack: api
Related to the Django API
Problem
We have a whole host of weird results you can get due to stemming issues. Some of these are already explicitly covered in the text analysis configuration, specifically for "animal": https://github.com/WordPress/openverse-api/blob/478f50cdff2834d602a3df8c198f708e010e6685/ingestion_server/ingestion_server/es_mapping.py#L19-L29
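For reference, overrides of this kind are expressed with Elasticsearch's `stemmer_override` token filter, which has to sit ahead of the stemmer in the analyzer chain. The following is a minimal sketch, not a copy of the linked file: the filter name, analyzer name, and rule list are assumptions for illustration.

```python
# Illustrative analysis settings. The names "stem_overrides", "english_stemmer"
# and "custom_english", and the rule list, are assumptions rather than the
# contents of es_mapping.py.
ANALYSIS_SETTINGS = {
    "filter": {
        # Map each protected term to a fixed output so that the stemmer
        # later in the chain leaves it alone.
        "stem_overrides": {
            "type": "stemmer_override",
            "rules": [
                "animals => animal",
                "animal => animal",
                "anime => anime",
                "animate => animate",
                "animated => animated",
            ],
        },
        "english_stemmer": {"type": "stemmer", "language": "english"},
    },
    "analyzer": {
        "custom_english": {
            "tokenizer": "standard",
            # The override filter must run before the stemmer to take effect.
            "filter": ["lowercase", "stem_overrides", "english_stemmer"],
        },
    },
}
```

Every new collision between unrelated words needs another rule in this list, which is why the approach does not scale well.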
Description
I suspect there must be a better way to handle this with ES than manually adding stemmer overrides. Why does a search for `news`, for example, prioritise so many results for "New York" or other "new" things that aren't "news", when we certainly have results that exactly match "news"? (At the moment it actually seems impossible to construct a query that retrieves only things with the exact form "news".)
Right now we use the default "snowball" stemmer (Porter algorithm). We could switch to Porter2 (snowball calls this "English", but ES calls it `porter2`) and it would at least solve some of these stemming issues: https://snowballstem.org/algorithms/english/stemmer.html
Here's a JavaScript demo of porter2 and you can see `news` works as expected (it's actually an exceptional condition in porter2 that the original porter algorithm ignores). "animate" and "animal", however, still stem down to "anim", so we can't get rid of those stemmer overrides (frustrating!).
Aside from changing the stemmer algorithm, we could also see if there's a way to stop stemming from occurring on exact-match searches, or fix exact matches. Either they're not working, or stemming still occurs. If you search for `"news"`, I would expect it not to stem, but currently you'll still get a ton of "New York" type results. I think we should do this in addition to exploring different English stemming algorithms.
Additional context
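The two ideas in the description could be sketched roughly as follows, assuming the index settings are built from Python dictionaries. The field name `title`, the `exact` subfield, and the analyzer and filter names are hypothetical. `porter2` as a `stemmer` language and `quote_field_suffix` on `simple_query_string` are existing Elasticsearch options, but whether they fully resolve the "news" case for this index is untested.

```python
# Hypothetical settings: same analyzer shape, with the stemmer switched to porter2.
SETTINGS = {
    "analysis": {
        "filter": {
            "english_stemmer": {"type": "stemmer", "language": "porter2"},
        },
        "analyzer": {
            "custom_english": {
                "tokenizer": "standard",
                "filter": ["lowercase", "english_stemmer"],
            },
        },
    },
}

# Hypothetical mapping: index the field twice, once stemmed and once unstemmed.
MAPPINGS = {
    "properties": {
        "title": {
            "type": "text",
            "analyzer": "custom_english",
            "fields": {
                # The built-in "standard" analyzer does no stemming.
                "exact": {"type": "text", "analyzer": "standard"},
            },
        },
    },
}


def build_query(user_input: str) -> dict:
    """Build a query body that sends quoted phrases to the unstemmed subfield."""
    return {
        "simple_query_string": {
            "query": user_input,
            "fields": ["title"],
            # Quoted text is matched against "title.exact" instead of "title".
            "quote_field_suffix": ".exact",
        }
    }
```

With a mapping like this, an unquoted `news` would still go through the stemmed field, while `"news"` in quotes would be matched against `title.exact`. Adding a subfield requires reindexing, so this would need to be weighed against the cost of a data refresh.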