Skip to content

New IndexReaderFunctions.positionLength from the norm #14433

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 3 commits into
base: main
Choose a base branch
from

Conversation

dsmiley
Copy link
Contributor

@dsmiley dsmiley commented Apr 3, 2025

Description

Introduces org.apache.lucene.queries.function.IndexReaderFunctions#positionLength

Javadocs:

Creates a value source that returns the position length (number of terms) of a field, approximated from the "norm".

@rmuir
Copy link
Member

rmuir commented Apr 3, 2025

I think the history is just that this norm can contain arbitrary value, which before was a suboptimal encoding into a single byte. There was a ValueSource that assumed it was a single byte, so that was moved to only work with TFIDF for backwards compatibility purposes.

Elsewhere, norm was extended and generalized to be opaque 64-bit value. Depending upon the Similarity's index-time computeNorm() implementation, it might not even be possible to decode to a float.

But the default encoding was also fixed to be practical, by @jpountz, whilst still using a single byte. So in practice all the built-in Similarities use the same encoding and can work with this: it just won't work if you extend Similarity to do something else.

Any confusion can be solved with documentation:

  • should be clear that this only works, if your similarity uses the default implementation of computeNorm()
  • don't think PositionLength is a good name, norm is not that (see discountOverlaps as an example).

Also I would ask if we really need this EMPTY instance: it would be good to keep polymorphism under wraps.

@dsmiley
Copy link
Contributor Author

dsmiley commented Apr 4, 2025

Thanks for the historical context!

I can definitely add more docs; I started with the bare minimum. Definitely need to emphasize a dependency on the default computeNorm formula! That documentation should also mention discountOverlaps.

I'm not married to the name; do you have ideas? If it has "term" in the name, people may be confused that the argument to the method is a term but it's just the field. Something like termLength would be confusing -- the length of what term? (no). Ultimately I like "position length" because it is the number of positions, the length, of the field. Using "length" in some way, I think, is likely to resonate with people on its use.

The code shows how EMPTY is needed (no norms). It mirrors the same for DoubleValues.EMPTY. I found it odd/surprising that it did not exist, and it's a common pattern I expect in Lucene. Is polymorphism really an issue here?

I anticipated possible doubt as to the placement of this. It's not a whole-index statistic, but it is related to the others here for there use in relevancy.

BTW I could imagine another interesting/useful utility method that takes a string (which is a query in practice), applies the index analyzer, counts the positions, and finally produces a constant and then build a constant LongValueSource from that. This would allow doing ~exact-ish matching of a query when combined with a phrase query targeting the field. Maybe nothing is needed at a Lucene level; it's a small amount of code that could be added at a higher level (like Solr).

@bruno-roustant
Copy link
Contributor

Why not numTerms() instead of positionLength()?
Inside Similarity.computeNorm(), the value is named numTerms.

@dsmiley
Copy link
Contributor Author

dsmiley commented Apr 4, 2025

I'd expect a hypothetical IndexReaderFunctions.numTerms(field) to return the number of terms in the index for that field. That's not even close to what we want! "Length" should be a component of the name.

@jpountz
Copy link
Contributor

jpountz commented Apr 5, 2025

What about calling it just "field length", since this is the length as computed for the purpose of length normalization?

@dsmiley
Copy link
Contributor Author

dsmiley commented Apr 5, 2025

fieldLength works for me. I'd like fieldPositionLength more as it characterizes the basis of the length (it's not characters). BTW some other methods on this class don't have "field" in the name yet take a field arg and so are a statistic about a field.

@dsmiley
Copy link
Contributor Author

dsmiley commented Apr 7, 2025

Would it make sense in this PR to add a Similarity.decodeNorm(long norm) returning an int of the field position length? It feels like the right thing to add.

*/
public long decodeNormToLength(long norm) {
return SmallFloat.byte4ToInt((byte) norm);
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please remove this method as it is impossible for someone to implement correctly if they customize just one field. The other method is per-field, this one is not.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also the name is wrong, there's nothing that requires this to be a position length. For some scoring methods it is something else such as the number of unique terms.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I should add the field name as an arg. The name is intentional -- if a Similarity can't decode the norm to a position length, it can throw UnsupportedOperationException.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't agree with this being in the similarity api, sorry, that's too hacky.

*
* @see org.apache.lucene.index.LeafReader#getNormValues(String)
*/
public static LongValuesSource fieldLength(String field) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please rename this as there is nothing that requires the norm to be this. for example in some scoring methods it is the number of unique terms

@rmuir
Copy link
Member

rmuir commented Apr 9, 2025

I think there is a high-level problem here, as i stated originally, that norm is not any position length. For example it may be based on FieldInvertState.getMaxTermFrequency() or FieldInvertState.getUniqueTermCount(), there are real scoring methods that use these approaches.

@dsmiley
Copy link
Contributor Author

dsmiley commented Apr 9, 2025

What name would you suggest then, Rob?
There's something to be said for choosing a name that's correct for the vast majority of cases, even if hypothetically a Similarity might do something else.

@rmuir
Copy link
Member

rmuir commented Apr 10, 2025

I don't have any suggestion, I don't see the need for users to try to reimplement Similarity with valuesources.

@dsmiley
Copy link
Contributor Author

dsmiley commented Apr 10, 2025

The need is to incorporate a field's position length in a composable/flexible relevance formula. A LongValues is the way to do that. I understand a Lucene user could write a custom Similarity, which may be good for advanced search teams but most search teams would prefer to use use a flexible expression of some kind (hence a LongValues implementation at the Lucene layer) to trade some potential performance for expedience & simplicity (no-code).

Maybe a name like simply norm would work, documented to be whatever the Similarity says it is. Thus it wouldn't risk misadvertising itself as something it may not be. The method I added to Similarity would then be named just decodeNorm.

Copy link
Member

@rmuir rmuir left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please don't add such changes to the similarity api that take us backwards. UOE is not an acceptable answer.

I think for now, it is best that such nonsense is contained to TFIDFSimilarity.

@dsmiley
Copy link
Contributor Author

dsmiley commented Apr 12, 2025

I'm trying to get where you're coming from...
The method on Similarity is there to keep the entire encoding information centralized to the Similarity so that the ValuesSource needn't pre presumptuous as to its encoding, and not even its meaning since renaming to simply "norm". How does that method "take us backwards"?

What is special about TFIDFSimilarity relative to this topic?

Copy link

This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the dev@lucene.apache.org list. Thank you for your contribution!

@github-actions github-actions bot added the Stale label Apr 27, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants