deterministic score tiebreaker#130

Open
missinglink wants to merge 1 commit into master from deterministic-scoring

@missinglink
Member

The current scoring algorithm sorts documents with the exact same score in a non-deterministic way.
This makes the tests brittle and jittery. This PR aims to resolve that by adding a second sorting condition to 'break the tie'.
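
As a rough illustration of what a tiebreaker does (a minimal sketch in plain Python, not the code in this PR), the function below sorts by score first and falls back to a second key when scores are equal. The `score` and `_id` keys on the dictionaries and the `rank` function name are assumptions made for the example.

```python
def rank(docs):
    """Sort by score (highest first), then by _id (ascending) to break ties.

    Assumes each document is a dict with a numeric "score" and a string "_id".
    """
    return sorted(docs, key=lambda d: (-d["score"], d["_id"]))


docs = [
    {"_id": "b", "score": 1.0},
    {"_id": "a", "score": 1.0},
    {"_id": "c", "score": 2.0},
]

# The result no longer depends on the order the documents arrive in.
ranked = [d["_id"] for d in rank(docs)]
ranked_reversed_input = [d["_id"] for d in rank(list(reversed(docs)))]
```

With the second key in place, both input orders should produce the same ranking, which is the property the tests need.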

@emacgillavry

@missinglink I'm hitting this wall when addresses along a street are returned. I'd expect these to come back in numeric order (Dorpstraat 1, Dorpstraat 2, Dorpstraat 3, Dorpstraat 4...), but somehow Dorpstraat 3 only shows up later in the results. Would the _id field depend on the insert order?

@missinglink
Member Author

missinglink commented Nov 23, 2022

Hi @emacgillavry, in the case where results have the exact same score then the order of results is non-deterministic.

It seems that the order is consistent for the same build but inconsistent between builds. I believe this is because of the internal segment sequence assigned to each document rather than the _id.

The linked PR adds _id as a second sorting condition with the aim of making scoring deterministic between builds, but the problem is that any field used for scoring would need the doc_values option enabled.
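
For readers unfamiliar with the idea, a second sorting condition in an Elasticsearch request body might look roughly like the sketch below. This is an assumption-laden illustration, not the diff from the linked PR: the `with_tiebreaker` helper is hypothetical, and the field passed in would need to be sortable in the index mapping.

```python
import copy


def with_tiebreaker(body, field, order="asc"):
    """Return a copy of a search body that sorts by _score, then by `field`.

    Hypothetical helper for illustration; the original body is left untouched.
    """
    result = copy.deepcopy(body)
    result["sort"] = [
        {"_score": {"order": "desc"}},  # primary condition: relevance
        {field: {"order": order}},      # secondary condition: breaks ties
    ]
    return result


body = {"query": {"match_all": {}}}
sorted_body = with_tiebreaker(body, "_id")
```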

Doc values take up a fair bit of RAM, and since the source-id field would have few duplicates, it wouldn't lend itself to compression and would therefore take a lot of RAM.

Using _id also wouldn't solve your specific issue, but using the address house number field in DESC sorting should work.
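
To make the house number idea concrete, here is a small client-side sketch in Python. It is only an illustration under stated assumptions: the `housenumber` and `score` keys, the `house_number_key` and `rank_addresses` names, and the handling of suffixes such as "12A" are all hypothetical, and a real change would apply the sort inside the search query instead.

```python
import re


def house_number_key(value):
    """Extract the leading integer from a house number such as "12A".

    Values without leading digits are treated as 0 (an arbitrary choice).
    """
    match = re.match(r"\d+", value)
    return int(match.group()) if match else 0


def rank_addresses(docs, descending=False):
    """Sort by score (highest first), then numerically by house number."""
    sign = -1 if descending else 1
    return sorted(
        docs,
        key=lambda d: (-d["score"], sign * house_number_key(d["housenumber"])),
    )


# Four addresses on the same street, all with an identical score.
docs = [
    {"street": "Dorpstraat", "housenumber": "1", "score": 1.0},
    {"street": "Dorpstraat", "housenumber": "2", "score": 1.0},
    {"street": "Dorpstraat", "housenumber": "4", "score": 1.0},
    {"street": "Dorpstraat", "housenumber": "3", "score": 1.0},
]

ascending = [d["housenumber"] for d in rank_addresses(docs)]
descending = [d["housenumber"] for d in rank_addresses(docs, descending=True)]
```

The direction is left as a parameter because the thread has not settled on which order is preferable.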

I don't have the bandwidth right now to do the memory and performance testing required to change this, but hopefully that helps to explain what's going on.

@missinglink
Member Author

I'd be interested to see the query you're using to test, and what other geocoding engines do. I'm not sure sorting DESC is actually the best idea: some engines seem to show them in order of importance, so prominent addresses on the street (such as businesses) come first in results.

@emacgillavry

Thanks @missinglink for your explanation! Sorry for having hijacked this issue. We're simply searching (autocomplete and search) for addresses within a locality (&boundary.gid=whosonfirst:locality:) that we've imported using the OpenAddresses importer. Boosting some business addresses based on popularity would be an added benefit. In case these are just residential addresses, we'd like to show these in descending order.
