This repository has been archived by the owner on Jan 13, 2022. It is now read-only.

Phrase relevance improvements #258 #281

Merged
merged 8 commits into from
Apr 26, 2019

@aldenstpage aldenstpage commented Apr 25, 2019

  • Disable the constant_score query filter. This was used to prevent repetitive titles from being ranked disproportionately high by Elasticsearch's BM25 algorithm. For example, an image titled "Nature nature nature nature nature" would be at the top of the results for any "nature" query, and this situation was common across many queries. While using constant_score solved the repetition problem, it really kneecapped the quality of our search in other ways, because it gives every match the same score and so disables a lot of the functionality normally used to rank results.
  • Set the title field mapping to similarity = boolean. This disables full-text relevance scoring for this field only, so a title match counts the same no matter how often the term repeats, and leaves the rest of the fields untouched. That way, the repetition problem is solved, but we can still properly rank results. Read here for more details.
  • Provide low-level access to the Elasticsearch mapping instead of using elasticsearch-dsl to create it. elasticsearch-dsl is nice for querying documents, but I found few options for customizing the document mapping. It doesn't seem to be possible to set the similarity parameter on a field using their document model.
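
For readers following along, here is a minimal sketch of what a hand-written mapping with boolean similarity on the title field might look like. It is not the code from this PR: the function name and the `tags` field are hypothetical, and the typeless `mappings.properties` layout assumes a recent Elasticsearch version (older versions nest `properties` under a document type name).

```python
import json


def build_index_body():
    """Return a hypothetical index body with boolean similarity on `title`."""
    return {
        "mappings": {
            "properties": {
                # Score title matches on presence only, so repeating a term
                # in the title does not raise the document's score.
                "title": {"type": "text", "similarity": "boolean"},
                # Hypothetical second field, left on the default similarity
                # (BM25) so it still contributes to relevance ranking.
                "tags": {"type": "text"},
            }
        }
    }


if __name__ == "__main__":
    # Print the body that would be sent when creating the index.
    print(json.dumps(build_index_body(), indent=2))
```

A plain dict like this can be passed as the request body when the index is created, which sidesteps the document model entirely.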

@aldenstpage aldenstpage requested a review from kgodey April 25, 2019 14:26

@kgodey kgodey left a comment

I read through the Elasticsearch docs a bit and these all look like great changes.

Re: setting the similarity of the title field to boolean, is it possible to keep BM25 and set the k1 parameter to a low value just for the title field? That seems like it would keep the value of using BM25 while avoiding the saturation problem. I don't think we'd want to apply the same low value to the image's description (but we don't seem to be storing a description, so that might be moot).
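
To make the k1 suggestion concrete, here is a rough sketch of the textbook BM25 term-frequency component and of a per-field custom similarity. The function names, the similarity name `title_bm25`, and the value 0.1 are illustrative assumptions, not tested settings; the defaults k1 = 1.2 and b = 0.75 are Elasticsearch's defaults, and the typeless mapping layout assumes a recent Elasticsearch version.

```python
def bm25_tf_component(tf, k1=1.2, b=0.75, doc_len=5, avg_doc_len=5):
    """Textbook BM25 term-frequency factor (IDF omitted).

    As k1 approaches 0, the factor approaches 1 for any tf > 0, so
    repeated terms stop adding score.
    """
    length_norm = 1 - b + b * (doc_len / avg_doc_len)
    return tf * (k1 + 1) / (tf + k1 * length_norm)


def build_tuned_index_body(k1=0.1, b=0.75):
    """Hypothetical index body using a low-k1 BM25 for `title` only."""
    return {
        "settings": {
            "index": {
                "similarity": {
                    # Custom similarity; the name is an assumption.
                    "title_bm25": {"type": "BM25", "k1": k1, "b": b}
                }
            }
        },
        "mappings": {
            "properties": {
                "title": {"type": "text", "similarity": "title_bm25"}
            }
        },
    }


if __name__ == "__main__":
    # Compare a title that mentions the term once with one that repeats it
    # five times, at the default k1 and at a low k1.
    for k1 in (1.2, 0.1):
        once = bm25_tf_component(1, k1=k1)
        repeated = bm25_tf_component(5, k1=k1)
        print(f"k1={k1}: tf=1 -> {once:.3f}, tf=5 -> {repeated:.3f}")
```

With a low k1 the repeated title gains only a small edge over the single mention, which is the middle ground between full BM25 and boolean similarity.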

@aldenstpage

Tuning BM25 instead of switching to boolean similarity might provide better results, but it will take some time to test, because we have to reindex every time we adjust k1. The data set is large enough that reindexing takes several days, so I'd like to revisit this once we've finished #279, when we'll have a smaller, higher-quality dataset available to us.

@aldenstpage aldenstpage merged commit ac8c944 into master Apr 26, 2019
@kgodey kgodey deleted the phrase-relevance-continued branch April 26, 2019 16:54

kgodey commented Apr 27, 2019

I made #288 to track the issue.
