-
Notifications
You must be signed in to change notification settings - Fork 1.2k
New IndexReaderFunctions.positionLength from the norm #14433
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
I think the history is just that this norm can contain arbitrary value, which before was a suboptimal encoding into a single byte. There was a ValueSource that assumed it was a single byte, so that was moved to only work with TFIDF for backwards compatibility purposes. Elsewhere, norm was extended and generalized to be opaque 64-bit value. Depending upon the Similarity's index-time But the default encoding was also fixed to be practical, by @jpountz, whilst still using a single byte. So in practice all the built-in Similarities use the same encoding and can work with this: it just won't work if you extend Similarity to do something else. Any confusion can be solved with documentation:
Also I would ask if we really need this |
Thanks for the historical context! I can definitely add more docs; I started with the bare minimum. Definitely need to emphasize a dependency on the default I'm not married to the name; do you have ideas? If it has "term" in the name, people may be confused that the argument to the method is a term but it's just the field. Something like termLength would be confusing -- the length of what term? (no). Ultimately I like "position length" because it is the number of positions, the length, of the field. Using "length" in some way, I think, is likely to resonate with people on its use. The code shows how EMPTY is needed (no norms). It mirrors the same for DoubleValues.EMPTY. I found it odd/surprising that it did not exist, and it's a common pattern I expect in Lucene. Is polymorphism really an issue here? I anticipated possible doubt as to the placement of this. It's not a whole-index statistic, but it is related to the others here for there use in relevancy. BTW I could imagine another interesting/useful utility method that takes a string (which is a query in practice), applies the index analyzer, counts the positions, and finally produces a constant and then build a constant LongValueSource from that. This would allow doing ~exact-ish matching of a query when combined with a phrase query targeting the field. Maybe nothing is needed at a Lucene level; it's a small amount of code that could be added at a higher level (like Solr). |
Why not numTerms() instead of positionLength()? |
I'd expect a hypothetical |
What about calling it just "field length", since this is the length as computed for the purpose of length normalization? |
|
Would it make sense in this PR to add a |
*/ | ||
public long decodeNormToLength(long norm) { | ||
return SmallFloat.byte4ToInt((byte) norm); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please remove this method as it is impossible for someone to implement correctly if they customize just one field. The other method is per-field, this one is not.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also the name is wrong, there's nothing that requires this to be a position length. For some scoring methods it is something else such as the number of unique terms.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I should add the field name as an arg. The name is intentional -- if a Similarity can't decode the norm to a position length, it can throw UnsupportedOperationException
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't agree with this being in the similarity api, sorry, that's too hacky.
* | ||
* @see org.apache.lucene.index.LeafReader#getNormValues(String) | ||
*/ | ||
public static LongValuesSource fieldLength(String field) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
please rename this as there is nothing that requires the norm to be this. for example in some scoring methods it is the number of unique terms
I think there is a high-level problem here, as i stated originally, that norm is not any position length. For example it may be based on |
What name would you suggest then, Rob? |
I don't have any suggestion, I don't see the need for users to try to reimplement Similarity with valuesources. |
The need is to incorporate a field's position length in a composable/flexible relevance formula. A LongValues is the way to do that. I understand a Lucene user could write a custom Similarity, which may be good for advanced search teams but most search teams would prefer to use use a flexible expression of some kind (hence a LongValues implementation at the Lucene layer) to trade some potential performance for expedience & simplicity (no-code). Maybe a name like simply |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please don't add such changes to the similarity api that take us backwards. UOE is not an acceptable answer.
I think for now, it is best that such nonsense is contained to TFIDFSimilarity.
I'm trying to get where you're coming from... What is special about TFIDFSimilarity relative to this topic? |
This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the dev@lucene.apache.org list. Thank you for your contribution! |
Description
Introduces
org.apache.lucene.queries.function.IndexReaderFunctions#positionLength
Javadocs: