Add l2_norm normalization support to linear retriever #128504

Conversation

mridula-s109
Contributor

Summary

This PR adds support for L2 normalization (l2_norm) to the linear retriever in Elasticsearch.

Changes

  • Implements a new L2ScoreNormalizer class under org.elasticsearch.xpack.rank.linear that normalizes scores so that their L2 norm is 1.
  • Registers l2_norm as a valid normalizer in the linear retriever configuration.
  • Updates YAML REST tests (10_linear_retriever.yml) to cover the new normalization method.
  • Updates documentation to include l2_norm as a supported normalizer option.
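
As a rough illustration of what "normalizes scores so that their L2 norm is 1" means, the sketch below divides each score by the Euclidean norm of the whole score list. It is not the PR's `L2ScoreNormalizer`: the class name `L2NormSketch`, the method `l2Normalize`, and the use of a plain `float[]` instead of Lucene `ScoreDoc` objects are assumptions made for a self-contained example, and it assumes at least one score is non-zero.

```java
import java.util.Arrays;

/** Hypothetical helper for illustration only; not the PR's L2ScoreNormalizer. */
final class L2NormSketch {

    /** Returns a copy of the scores divided by their L2 (Euclidean) norm. */
    static float[] l2Normalize(float[] scores) {
        double sumOfSquares = 0.0;
        for (float s : scores) {
            sumOfSquares += (double) s * s;
        }
        // Assumption: the norm is non-zero; edge cases are out of scope here.
        double norm = Math.sqrt(sumOfSquares);
        float[] normalized = new float[scores.length];
        for (int i = 0; i < scores.length; i++) {
            normalized[i] = (float) (scores[i] / norm);
        }
        return normalized;
    }

    public static void main(String[] args) {
        // The norm of [3, 4] is 5, so the result is approximately [0.6, 0.8].
        System.out.println(Arrays.toString(l2Normalize(new float[] { 3f, 4f })));
    }
}
```

After this scaling, the squares of the normalized scores sum to 1, which puts scores from different sub-retrievers on a comparable scale before they are combined.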

@mridula-s109 mridula-s109 requested review from ioanatia, a team and Copilot May 27, 2025 11:33
@mridula-s109 mridula-s109 added >enhancement auto-backport Automatically create backport pull requests when merged :SearchOrg/Relevance Label for the Search (solution/org) Relevance team v8.19.0 v9.1.0 Team:Search - Relevance The Search organization Search Relevance team labels May 27, 2025
@elasticsearchmachine elasticsearchmachine added the Team:SearchOrg Meta label for the Search Org (Enterprise Search) label May 27, 2025
@elasticsearchmachine
Collaborator

Pinging @elastic/search-eng (Team:SearchOrg)

@elasticsearchmachine
Collaborator

Pinging @elastic/search-relevance (Team:Search - Relevance)

@elasticsearchmachine
Collaborator

Hi @mridula-s109, I've created a changelog YAML for you.

Contributor

@Copilot Copilot AI left a comment

Pull Request Overview

Adds L2 (Euclidean) normalization support for scores in the linear retriever, registers it in the core normalizer lookup, updates REST tests, and expands documentation.

  • Implements L2ScoreNormalizer to normalize score vectors to unit L2 norm.
  • Registers "l2_norm" in ScoreNormalizer.valueOf.
  • Adds YAML REST tests and docs entries for the new normalizer.

Reviewed Changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 1 comment.

| File | Description |
| --- | --- |
| x-pack/plugin/rank-rrf/src/yamlRestTest/resources/rest-api-spec/test/linear/10_linear_retriever.yml | Adds a test scenario for l2_norm normalization |
| x-pack/plugin/rank-rrf/src/main/java/org/elasticsearch/xpack/rank/linear/ScoreNormalizer.java | Registers L2ScoreNormalizer in valueOf |
| x-pack/plugin/rank-rrf/src/main/java/org/elasticsearch/xpack/rank/linear/L2ScoreNormalizer.java | Implements the L2 normalization logic |
| docs/reference/elasticsearch/rest-apis/retrievers.md | Documents l2_norm as a valid normalizer option |

Comments suppressed due to low confidence (1)

x-pack/plugin/rank-rrf/src/main/java/org/elasticsearch/xpack/rank/linear/L2ScoreNormalizer.java:29

  • Add unit tests covering edge cases in normalizeScores, such as when the input array is empty, when all scores are NaN, and when the computed norm is below EPSILON, to ensure the fallback branches behave as expected.
    public ScoreDoc[] normalizeScores(ScoreDoc[] docs) {
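
The edge cases named in this comment (empty input, all-NaN scores, a norm below `EPSILON`) could be guarded roughly as in the sketch below. The page does not show the PR's actual fallback behavior or the value of `EPSILON`, so everything here is an assumption: the threshold `1e-6`, the choice to return scores unchanged in the fallback cases, the decision to skip NaN values when accumulating the norm, and the use of `float[]` rather than `ScoreDoc[]`.

```java
/** Hypothetical sketch of guarded L2 normalization; not the PR's code. */
final class GuardedL2NormSketch {

    // Assumed threshold; the value used in the PR is not shown on this page.
    static final double EPSILON = 1e-6;

    static float[] normalizeScores(float[] scores) {
        float[] result = scores.clone();
        double sumOfSquares = 0.0;
        boolean sawNumber = false;
        for (float s : scores) {
            // Assumption: NaN scores do not contribute to the norm.
            if (Float.isNaN(s) == false) {
                sumOfSquares += (double) s * s;
                sawNumber = true;
            }
        }
        double norm = Math.sqrt(sumOfSquares);
        // Assumed fallback: empty input, all-NaN input, or a near-zero norm
        // returns the scores unchanged instead of dividing by (almost) zero.
        if (sawNumber == false || norm < EPSILON) {
            return result;
        }
        for (int i = 0; i < result.length; i++) {
            if (Float.isNaN(result[i]) == false) {
                result[i] = (float) (result[i] / norm);
            }
        }
        return result;
    }
}
```

Unit tests for a method like this would typically assert one case per branch: a zero-length array, an array of only `Float.NaN`, an array of zeros, and an ordinary array such as `{3f, 4f}`.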

Member

@kderusso kderusso left a comment

Nice work, @mridula-s109! Agreed with @ioanatia's suggestion on additional tests.

Does it make sense to add unit tests for the normalizeScores method too?

Contributor

@Copilot Copilot AI left a comment

Pull Request Overview

This PR adds L2 normalization support to the linear retriever in Elasticsearch by implementing a new L2ScoreNormalizer, updating configuration resolution, and expanding tests and documentation.

  • Introduces L2ScoreNormalizer with L2 norm scaling
  • Updates ScoreNormalizer to recognize "l2_norm"
  • Adds YAML REST tests and documentation changes for the new normalizer

Reviewed Changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated no comments.

| File | Description |
| --- | --- |
| x-pack/plugin/rank-rrf/src/yamlRestTest/resources/rest-api-spec/test/linear/10_linear_retriever.yml | Added YAML tests for verifying L2 normalization behavior |
| x-pack/plugin/rank-rrf/src/test/java/org/elasticsearch/xpack/rank/linear/L2ScoreNormalizerTests.java | Created test cases to validate normalization with typical, zero, and NaN scores |
| x-pack/plugin/rank-rrf/src/main/java/org/elasticsearch/xpack/rank/linear/ScoreNormalizer.java | Updated to support lookup of the new L2 normalizer |
| x-pack/plugin/rank-rrf/src/main/java/org/elasticsearch/xpack/rank/linear/L2ScoreNormalizer.java | New implementation for L2 normalization of scores |
| docs/reference/elasticsearch/rest-apis/retrievers.md | Updated documentation with the "l2_norm" option |
| docs/changelog/128504.yaml | Changelog entry for L2 normalization support |

Contributor

@ioanatia ioanatia left a comment

Just one comment on the Javadoc; once that is addressed, this should be good to go.

Member

@kderusso kderusso left a comment

LGTM pending the Javadoc comment

@mridula-s109 mridula-s109 merged commit 81fba27 into elastic:main Jun 2, 2025
18 checks passed
@elasticsearchmachine
Collaborator

💔 Backport failed

Status Branch Result
8.19 Commit could not be cherrypicked due to conflicts

You can use sqren/backport to manually backport by running `backport --upstream elastic/elasticsearch --pr 128504`

mridula-s109 added a commit that referenced this pull request Jun 2, 2025
mridula-s109 added a commit that referenced this pull request Jun 2, 2025
* New l2 normalizer added

* L2 score normaliser is registered

* test case added to the yaml

* Documentation added

* Resolved checkstyle issues

* Update docs/changelog/128504.yaml

* Update docs/reference/elasticsearch/rest-apis/retrievers.md

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* Score 0 test case added to check for corner cases

* Edited the markdown doc description

* Pruned the comment

* Renamed the variable

* Added comment to the class

* Unit tests added

* Spotless and checkstyle fixed

* Fixed build failure

* Fixed the forbidden test

---------

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
joshua-adams-1 pushed a commit to joshua-adams-1/elasticsearch that referenced this pull request Jun 3, 2025
@ioanatia
Contributor

ioanatia commented Jun 4, 2025

@mridula-s109 let's do a manual backport to 8.19; it looks like the automatic one failed.

@mridula-s109
Contributor Author

@mridula-s109 let's do a manual backport to 8.19; it looks like the automatic one failed.

Yes, on it. Backporting both of these PRs (this one and #128808) now.

mridula-s109 added a commit to mridula-s109/elasticsearch that referenced this pull request Jun 5, 2025
Samiul-TheSoccerFan pushed a commit to Samiul-TheSoccerFan/elasticsearch that referenced this pull request Jun 5, 2025
mridula-s109 added a commit that referenced this pull request Jun 11, 2025
* Add l2_norm normalization support to linear retriever (#128504)

* New l2 normalizer added

* L2 score normaliser is registered

* test case added to the yaml

* Documentation added

* Resolved checkstyle issues

* Update docs/changelog/128504.yaml

* Update docs/reference/elasticsearch/rest-apis/retrievers.md

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* Score 0 test case added to check for corner cases

* Edited the markdown doc description

* Pruned the comment

* Renamed the variable

* Added comment to the class

* Unit tests added

* Spotless and checkstyle fixed

* Fixed build failure

* Fixed the forbidden test

---------

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* Clarify Javadoc for L2ScoreNormalizer (l2_norm) (#128808)

* propagating retrievers to inner retrievers

* Java doc fixed

* Cleaned up

* Update docs/changelog/128808.yaml

* Enhanced comment as stated by the copilot

* Delete docs/changelog/128808.yaml

* Add Cluster Feature for L2 Norm (#129181)

* propagating retrievers to inner retrievers

* test feature taken care of

* Small changes in concurrent multipart upload interfaces (#128977)

Small changes in BlobContainer interface and wrapper.

Relates ES-11815

* Unmute FollowingEngineTests#testProcessOnceOnPrimary() test (#129054)

The reason the test fails is that operations contained _seq_no field with different doc value types (with no skippers and with skippers) and this isn't allowed, since field types need to be consistent in a Lucene index.

The initial operations were generated without knowledge that the index mode was set to logsdb or time_series, so they did not have doc value skippers. However, when the operations were replayed via the following engine, they did have doc value skippers.

The fix is to set `index.seq_no.index_options` to `points_and_doc_values`, so that the initial operations are indexed without doc value skippers.

This test doesn't gain anything from storing seqno with doc value skippers, so there is no loss of testing coverage.

Closes #128541

* [Build] Add support for publishing to maven central (#128659)

This ensures we package an aggregation zip with all artifacts we want to publish to maven central as part of a release.
Running zipAggregation will produce a zip file in the build/nmcp/zip folder. The content of this zip is meant to match the maven artifacts we have currently declared as dra maven artifacts.

* ESQL: Check for errors while loading blocks (#129016)

Runs a sanity check after loading a block of values. Previously we were
doing a quick check if assertions were enabled. Now we do two quick
checks all the time. Better - we attach information about how a block
was loaded when there's a problem.

Relates to #128959

* Make `PhaseCacheManagementTests` project-aware (#129047)

The functionality in `PhaseCacheManagement` was already project-aware,
but these tests were still using deprecated methods.

* Vector test tools (#128934)

This adds some testing tools for verifying vector recall and latency
directly without having to spin up an entire ES node and running a rally
track.

It's pretty barebones and takes inspiration from lucene-util, but I
wanted access to our own formats and tooling to make our lives easier.

Here is an example config file. This will build the initial index, run
queries at num_candidates: 50, then again at num_candidates 100 (without
reindexing, and re-using the cached nearest neighbors).

```
[{
  "doc_vectors" : "path",
  "query_vectors" : "path",
  "num_docs" : 10000,
  "num_queries" : 10,
  "index_type" : "hnsw",
  "num_candidates" : 50,
  "k" : 10,
  "hnsw_m" : 16,
  "hnsw_ef_construction" : 200,
  "index_threads" : 4,
  "reindex" : true,
  "force_merge" : false,
  "vector_space" : "maximum_inner_product",
  "dimensions" : 768
},
{
"doc_vectors" : "path",
"query_vectors" : "path",
"num_docs" : 10000,
"num_queries" : 10,
"index_type" : "hnsw",
"num_candidates" : 100,
"k" : 10,
"hnsw_m" : 16,
"hnsw_ef_construction" : 200,
"vector_space" : "maximum_inner_product",
"dimensions" : 768
}
]
```

To execute:

```
./gradlew :qa:vector:checkVec --args="/Path/to/knn_tester_config.json"
```

Calling `./gradlew :qa:vector:checkVecHelp` gives some guidance on how
to use it, additionally providing a way to run it via java directly
(useful to bypass gradlew guff).

* ES|QL: refactor generative tests (#129028)

* Add a test of LOOKUP JOIN against a time series index (#129007)

Add a spec test of `LOOKUP JOIN` against a time series index.

* Make ILM `ClusterStateWaitStep` project-aware (#129042)

This is part of an iterative process to make ILM project-aware.

* Mute org.elasticsearch.xpack.esql.qa.mixed.MixedClusterEsqlSpecIT test {lookup-join.LookupJoinOnTimeSeriesIndex ASYNC} #129078

* Remove `ClusterState` param from ILM `AsyncBranchingStep` (#129076)

The `ClusterState` parameter of the `asyncPredicate` is not used
anywhere.

* Mute org.elasticsearch.xpack.esql.qa.mixed.MixedClusterEsqlSpecIT test {lookup-join.LookupJoinOnTimeSeriesIndex SYNC} #129082

* Mute org.elasticsearch.upgrades.UpgradeClusterClientYamlTestSuiteIT test {p0=upgraded_cluster/70_ilm/Test Lifecycle Still There And Indices Are Still Managed} #129097

* Mute org.elasticsearch.upgrades.UpgradeClusterClientYamlTestSuiteIT test {p0=upgraded_cluster/90_ml_data_frame_analytics_crud/Get mixed cluster outlier_detection job} #129098

* Mute org.elasticsearch.packaging.test.DockerTests test081SymlinksAreFollowedWithEnvironmentVariableFiles #128867

* Threadpool merge executor is aware of available disk space (#127613)

This PR introduces 3 new settings:
indices.merge.disk.check_interval, indices.merge.disk.watermark.high, and indices.merge.disk.watermark.high.max_headroom
that control if the threadpool merge executor starts executing new merges when the disk space is getting low.

The intent of this change is to avoid the situation where in-progress merges exhaust the available disk space on the node's local filesystem.
To this end, the thread pool merge executor periodically monitors the available disk space, as well as the current disk space estimates required by all in-progress (currently running) merges on the node, and will NOT schedule any new merges if the disk space is getting low (by default below the 5% limit of the total disk space, or 100 GB, whichever is smaller (same as the disk allocation flood stage level)).
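
The default threshold described above ("below the 5% limit of the total disk space, or 100 GB, whichever is smaller") can be worked through with a small sketch. This is not the merge executor's code: the class and method names are hypothetical, 1 GB is assumed to mean 2^30 bytes, and subtracting the in-progress merge estimates from the available space is one plausible reading of the description rather than a confirmed detail.

```java
/** Hypothetical arithmetic sketch; not the thread pool merge executor's code. */
final class MergeDiskHeadroomSketch {

    // Assumption: "GB" means 2^30 bytes.
    static final long GIB = 1024L * 1024L * 1024L;

    /** Smaller of 5% of the total disk space and 100 GB, in bytes. */
    static long lowSpaceThresholdBytes(long totalBytes) {
        long fivePercent = totalBytes / 20;
        long maxHeadroom = 100L * GIB;
        return Math.min(fivePercent, maxHeadroom);
    }

    /**
     * Assumed rule: a new merge may start only if the space left after
     * reserving the estimates for in-progress merges exceeds the threshold.
     */
    static boolean canScheduleNewMerge(long totalBytes, long availableBytes, long inProgressEstimateBytes) {
        return availableBytes - inProgressEstimateBytes > lowSpaceThresholdBytes(totalBytes);
    }
}
```

Under these assumptions, a 1 TiB disk uses the 5% limit (about 51 GiB), while a 10 TiB disk is capped by the 100 GB headroom, since 5% of it would be 512 GiB.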

* Add option to include or exclude vectors from _source retrieval (#128735)

This PR introduces a new include_vectors option to the _source retrieval context.
When set to false, vectors are excluded from the returned _source.
This is especially efficient when used with synthetic source, as it avoids loading vector fields entirely.

By default, vectors remain included unless explicitly excluded.

* Remove direct minScore propagation to inner retrievers

* cleaned up skip

* Mute org.elasticsearch.index.engine.ThreadPoolMergeExecutorServiceDiskSpaceTests testAvailableDiskSpaceMonitorWhenFileSystemStatErrors #129149

* Add transport version for ML inference Mistral chat completion (#129033)

* Add transport version for ML inference Mistral chat completion

* Add changelog for Mistral Chat Completion version fix

* Revert "Add changelog for Mistral Chat Completion version fix"

This reverts commit 7a57416.

* Correct index path validation (#129144)

All we care about is if reindex is true or false. We shouldn't worry
about force merge. Because if reindex is true, we will create the
directory, if it's false, we won't.

* Mute org.elasticsearch.index.engine.ThreadPoolMergeExecutorServiceDiskSpaceTests testUnavailableBudgetBlocksNewMergeTasksFromStartingExecution #129148

* Implemented completion task for Google VertexAI  (#128694)

* Google Vertex AI completion model, response entity and tests

* Fixed GoogleVertexAiServiceTest for Service configuration

* Changelog

* Removed downcasting and using `moveToFirstToken`

* Create GoogleVertexAiChatCompletionResponseHandler for streaming and non streaming responses

* Added unit tests

* PR feedback

* Removed googlevertexaicompletion model. Using just GoogleVertexAiChatCompletionModel for completion and chat completion

* Renamed uri -> nonStreamingUri. Added streamingUri and getters in GoogleVertexAiChatCompletionModel

* Moved rateLimitGroupHashing to subclasses of GoogleVertexAiModel

* Fixed rate limit hash of GoogleVertexAiRerankModel and refactored uri for GoogleVertexAiUnifiedChatCompletionRequest

---------

Co-authored-by: lhoet-google <lhoet@google.com>
Co-authored-by: Jonathan Buttner <56361221+jonathan-buttner@users.noreply.github.com>

* Added cluster feature to yaml

* Node feature added

* Duplicate line - result of merge removed

* Update docs/changelog/129181.yaml

* Update 129181.yaml

---------

Co-authored-by: Tanguy Leroux <tlrx.dev@gmail.com>
Co-authored-by: Martijn van Groningen <martijn.v.groningen@gmail.com>
Co-authored-by: Rene Groeschke <rene@elastic.co>
Co-authored-by: Nik Everett <nik9000@gmail.com>
Co-authored-by: Niels Bauman <33722607+nielsbauman@users.noreply.github.com>
Co-authored-by: Benjamin Trent <ben.w.trent@gmail.com>
Co-authored-by: Luigi Dell'Aquila <luigi.dellaquila@gmail.com>
Co-authored-by: Bogdan Pintea <bogdan.pintea@elastic.co>
Co-authored-by: elasticsearchmachine <58790826+elasticsearchmachine@users.noreply.github.com>
Co-authored-by: Albert Zaharovits <email+github@zalbert.me>
Co-authored-by: Jim Ferenczi <jim.ferenczi@elastic.co>
Co-authored-by: Jan-Kazlouski-elastic <jan.kazlouski@elastic.co>
Co-authored-by: Leonardo Hoet <55866308+leo-hoet@users.noreply.github.com>
Co-authored-by: lhoet-google <lhoet@google.com>
Co-authored-by: Jonathan Buttner <56361221+jonathan-buttner@users.noreply.github.com>

* Remove changelog for 129181, keep only 128504.yaml as the changelog entry

* Remove redundant retrievers.md, documentation is now in retrievers-overview.asciidoc

* updated retriever-overview.asciidoc

* Resolved duplicate tag issue

---------

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Tanguy Leroux <tlrx.dev@gmail.com>
Co-authored-by: Martijn van Groningen <martijn.v.groningen@gmail.com>
Co-authored-by: Rene Groeschke <rene@elastic.co>
Co-authored-by: Nik Everett <nik9000@gmail.com>
Co-authored-by: Niels Bauman <33722607+nielsbauman@users.noreply.github.com>
Co-authored-by: Benjamin Trent <ben.w.trent@gmail.com>
Co-authored-by: Luigi Dell'Aquila <luigi.dellaquila@gmail.com>
Co-authored-by: Bogdan Pintea <bogdan.pintea@elastic.co>
Co-authored-by: elasticsearchmachine <58790826+elasticsearchmachine@users.noreply.github.com>
Co-authored-by: Albert Zaharovits <email+github@zalbert.me>
Co-authored-by: Jim Ferenczi <jim.ferenczi@elastic.co>
Co-authored-by: Jan-Kazlouski-elastic <jan.kazlouski@elastic.co>
Co-authored-by: Leonardo Hoet <55866308+leo-hoet@users.noreply.github.com>
Co-authored-by: lhoet-google <lhoet@google.com>
Co-authored-by: Jonathan Buttner <56361221+jonathan-buttner@users.noreply.github.com>
Labels
auto-backport Automatically create backport pull requests when merged backport pending >enhancement :SearchOrg/Relevance Label for the Search (solution/org) Relevance team Team:Search - Relevance The Search organization Search Relevance team Team:SearchOrg Meta label for the Search Org (Enterprise Search) v8.19.0 v9.1.0
4 participants