New IndexReaderFunctions.positionLength from the norm #14433

dsmiley · 2025-04-03T01:53:03Z

Description

Introduces org.apache.lucene.queries.function.IndexReaderFunctions#positionLength

Javadocs:

Creates a value source that returns the position length (number of terms) of a field, approximated from the "norm".

rmuir · 2025-04-03T02:31:34Z

I think the history is just that this norm can contain an arbitrary value, which before was a suboptimal encoding into a single byte. There was a ValueSource that assumed it was a single byte, so that was moved to only work with TFIDF for backwards compatibility purposes.

Elsewhere, norm was extended and generalized to be an opaque 64-bit value. Depending upon the Similarity's index-time computeNorm() implementation, it might not even be possible to decode to a float.

But the default encoding was also fixed to be practical, by @jpountz, whilst still using a single byte. So in practice all the built-in Similarities use the same encoding and can work with this: it just won't work if you extend Similarity to do something else.

Any confusion can be solved with documentation:

- should be clear that this only works, if your similarity uses the default implementation of computeNorm()
- don't think PositionLength is a good name, norm is not that (see discountOverlaps as an example).

Also I would ask if we really need this EMPTY instance: it would be good to keep polymorphism under wraps.
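
To illustrate why a length recovered from a one-byte norm is only approximate, here is a minimal sketch in plain Java. It is a simplified stand-in for a float-like encoding (three mantissa bits plus a shift) and is not Lucene's SmallFloat code; the class name NormByteSketch and the exact cut-offs are assumptions made for illustration.

```java
/** Hypothetical, simplified one-byte length encoding; NOT Lucene's SmallFloat. */
final class NormByteSketch {

  /** Encodes a non-negative length into one byte, keeping only the top 4 significant bits. */
  static byte encode(int length) {
    if (length < 0) {
      throw new IllegalArgumentException("length must be >= 0, got " + length);
    }
    int numBits = 32 - Integer.numberOfLeadingZeros(length);
    if (numBits < 4) {
      // Small values (0..7) are stored as-is.
      return (byte) length;
    }
    int shift = numBits - 4;
    // Keep the 4 most significant bits, then drop the leading one (it is implicit).
    int mantissa = (length >>> shift) & 0x07;
    // Store shift + 1 because a shift field of 0 marks the "stored as-is" case.
    return (byte) (((shift + 1) << 3) | mantissa);
  }

  /** Decodes a byte produced by encode(); low-order bits lost at encode time come back as zero. */
  static int decode(byte b) {
    int i = b & 0xFF;
    int bits = i & 0x07;
    int shift = (i >>> 3) - 1;
    if (shift == -1) {
      return bits;
    }
    return (bits | 0x08) << shift;
  }

  public static void main(String[] args) {
    for (int length : new int[] {7, 15, 17, 100}) {
      System.out.println(length + " -> " + decode(encode(length)));
    }
  }
}
```

In this sketch lengths up to 15 round-trip exactly, 17 decodes to 16, and 100 decodes to 96, which is the kind of rounding a ValueSource built on the norm would expose.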

dsmiley · 2025-04-04T02:43:28Z

Thanks for the historical context!

I can definitely add more docs; I started with the bare minimum. Definitely need to emphasize a dependency on the default computeNorm formula! That documentation should also mention discountOverlaps.

I'm not married to the name; do you have ideas? If it has "term" in the name, people may be confused that the argument to the method is a term but it's just the field. Something like termLength would be confusing -- the length of what term? (no). Ultimately I like "position length" because it is the number of positions, the length, of the field. Using "length" in some way, I think, is likely to resonate with people on its use.

The code shows how EMPTY is needed (no norms). It mirrors the same for DoubleValues.EMPTY. I found it odd/surprising that it did not exist, and it's a common pattern I expect in Lucene. Is polymorphism really an issue here?

I anticipated possible doubt as to the placement of this. It's not a whole-index statistic, but it is related to the others here for their use in relevancy.

BTW I could imagine another interesting/useful utility method that takes a string (which is a query in practice), applies the index analyzer, counts the positions, and builds a constant LongValuesSource from that count. This would allow doing ~exact-ish matching of a query when combined with a phrase query targeting the field. Maybe nothing is needed at a Lucene level; it's a small amount of code that could be added at a higher level (like Solr).
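
The utility imagined in the previous paragraph could look roughly like the following. This is a sketch only: whitespace splitting stands in for the field's index analyzer, each token is assumed to advance the position by one, and QueryLengthSketch, countPositions and matchesFieldLength are hypothetical names, not Lucene or Solr APIs.

```java
/** Hypothetical helper: counts query positions and compares them with a field length. */
final class QueryLengthSketch {

  /** Counts positions in the query text; whitespace splitting is a stand-in for real analysis. */
  static long countPositions(String queryText) {
    String trimmed = queryText.trim();
    if (trimmed.isEmpty()) {
      return 0; // split() on an empty string would otherwise report one token
    }
    return trimmed.split("\\s+").length;
  }

  /** True when the field length (for example, decoded from the norm) equals the query's position count. */
  static boolean matchesFieldLength(long fieldLength, String queryText) {
    return fieldLength == countPositions(queryText);
  }

  public static void main(String[] args) {
    System.out.println(countPositions("quick brown fox"));
    System.out.println(matchesFieldLength(3, "quick brown fox"));
  }
}
```

Because the field length decoded from a norm is rounded for longer fields, such a comparison is only "exact-ish" and is most trustworthy for short fields.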

bruno-roustant · 2025-04-04T08:09:58Z

Why not numTerms() instead of positionLength()?
Inside Similarity.computeNorm(), the value is named numTerms.

dsmiley · 2025-04-04T12:52:35Z

I'd expect a hypothetical IndexReaderFunctions.numTerms(field) to return the number of terms in the index for that field. That's not even close to what we want! "Length" should be a component of the name.

jpountz · 2025-04-05T10:12:59Z

What about calling it just "field length", since this is the length as computed for the purpose of length normalization?

dsmiley · 2025-04-05T13:47:35Z

fieldLength works for me. I'd like fieldPositionLength more as it characterizes the basis of the length (it's not characters). BTW some other methods on this class don't have "field" in the name yet take a field arg and so are a statistic about a field.

dsmiley · 2025-04-07T17:59:18Z

Would it make sense in this PR to add a Similarity.decodeNorm(long norm) returning an int of the field position length? It feels like the right thing to add.

rmuir · 2025-04-09T00:08:23Z

lucene/core/src/java/org/apache/lucene/search/similarities/Similarity.java

+   */
+  public long decodeNormToLength(long norm) {
+    return SmallFloat.byte4ToInt((byte) norm);
+  }


Please remove this method as it is impossible for someone to implement correctly if they customize just one field. The other method is per-field, this one is not.

Also the name is wrong, there's nothing that requires this to be a position length. For some scoring methods it is something else such as the number of unique terms.

dsmiley · 2025-04-09

I should add the field name as an arg. The name is intentional -- if a Similarity can't decode the norm to a position length, it can throw UnsupportedOperationException.

rmuir · 2025-04-11

I don't agree with this being in the similarity api, sorry, that's too hacky.

rmuir · 2025-04-09T00:15:09Z

lucene/queries/src/java/org/apache/lucene/queries/function/IndexReaderFunctions.java

+   *
+   * @see org.apache.lucene.index.LeafReader#getNormValues(String)
+   */
+  public static LongValuesSource fieldLength(String field) {


please rename this as there is nothing that requires the norm to be this. for example in some scoring methods it is the number of unique terms

rmuir · 2025-04-09T00:19:35Z

I think there is a high-level problem here, as I stated originally, that norm is not any position length. For example it may be based on FieldInvertState.getMaxTermFrequency() or FieldInvertState.getUniqueTermCount(), there are real scoring methods that use these approaches.
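
A small plain-Java sketch of the point above: the same field can yield different norm inputs depending on which statistic a similarity chooses. FieldStats and the three functions are hypothetical stand-ins that loosely mirror the statistics mentioned (length, overlaps, unique term count, max term frequency); they are not Lucene classes, and the "length minus overlaps" variant is only assumed to resemble the default behaviour with discountOverlaps enabled.

```java
import java.util.function.ToLongFunction;

/** Hypothetical illustration of different statistics a norm could be based on. */
final class NormBasisSketch {

  /** Stand-in for per-field indexing statistics; not Lucene's FieldInvertState. */
  static final class FieldStats {
    final int length;
    final int numOverlap;
    final int uniqueTermCount;
    final int maxTermFrequency;

    FieldStats(int length, int numOverlap, int uniqueTermCount, int maxTermFrequency) {
      this.length = length;
      this.numOverlap = numOverlap;
      this.uniqueTermCount = uniqueTermCount;
      this.maxTermFrequency = maxTermFrequency;
    }
  }

  // Number of positions, ignoring tokens with a position increment of zero.
  static final ToLongFunction<FieldStats> LENGTH_MINUS_OVERLAPS = s -> s.length - s.numOverlap;
  // Number of distinct terms in the field.
  static final ToLongFunction<FieldStats> UNIQUE_TERMS = s -> s.uniqueTermCount;
  // Frequency of the most frequent term in the field.
  static final ToLongFunction<FieldStats> MAX_TERM_FREQUENCY = s -> s.maxTermFrequency;

  public static void main(String[] args) {
    // "to be or not to be": 6 tokens, no overlaps, 4 distinct terms, "to" and "be" occur twice.
    FieldStats stats = new FieldStats(6, 0, 4, 2);
    System.out.println(LENGTH_MINUS_OVERLAPS.applyAsLong(stats));
    System.out.println(UNIQUE_TERMS.applyAsLong(stats));
    System.out.println(MAX_TERM_FREQUENCY.applyAsLong(stats));
  }
}
```

For this field the three bases give 6, 4 and 2, so a function that labels the decoded norm a "position length" would only be accurate for the first.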

dsmiley · 2025-04-09T00:36:18Z

What name would you suggest then, Rob?
There's something to be said for choosing a name that's correct for the vast majority of cases, even if hypothetically a Similarity might do something else.

rmuir · 2025-04-10T11:47:05Z

I don't have any suggestion, I don't see the need for users to try to reimplement Similarity with valuesources.

dsmiley · 2025-04-10T20:01:27Z

The need is to incorporate a field's position length in a composable/flexible relevance formula. A LongValues is the way to do that. I understand a Lucene user could write a custom Similarity, which may be good for advanced search teams, but most search teams would prefer to use a flexible expression of some kind (hence a LongValues implementation at the Lucene layer) to trade some potential performance for expedience & simplicity (no-code).

Maybe a name like simply norm would work, documented to be whatever the Similarity says it is. Thus it wouldn't risk misadvertising itself as something it may not be. The method I added to Similarity would then be named just decodeNorm.

rmuir · 2025-04-11

Please don't add such changes to the similarity api that take us backwards. UOE is not an acceptable answer.

I think for now, it is best that such nonsense is contained to TFIDFSimilarity.

dsmiley · 2025-04-12T02:14:24Z

I'm trying to get where you're coming from...
The method on Similarity is there to keep the entire encoding information centralized to the Similarity so that the ValuesSource needn't be presumptuous as to its encoding, and not even its meaning since renaming to simply "norm". How does that method "take us backwards"?

What is special about TFIDFSimilarity relative to this topic?

github-actions · 2025-04-27T00:27:26Z

This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the dev@lucene.apache.org list. Thank you for your contribution!

New IndexReaderFunctions.positionLength from the norm

3a4c8a6

github-project-automation bot added this to OpenSearch Lucene & Core Performance Tracking Apr 3, 2025

github-project-automation bot moved this to Open in OpenSearch Lucene & Core Performance Tracking Apr 3, 2025

github-actions bot added module:core/search module:queries labels Apr 3, 2025

dsmiley requested a review from jpountz April 3, 2025 01:53

Rename to fieldLength. Add Similarity.decodeNormToLength

11c1c1c

rmuir reviewed Apr 9, 2025

Rename to norm

3c76770

rmuir requested changes Apr 11, 2025

github-actions bot added the Stale label Apr 27, 2025
