
Fix the force merge with Quantization failures when a segment has deleted docs in it #2046

Merged
merged 1 commit into from
Sep 5, 2024

Conversation

navneet1v
Collaborator

@navneet1v navneet1v commented Sep 5, 2024

Description

Fix the force merge with Quantization failures when a segment has deleted docs in it.

Issue:

When documents are deleted from a segment and the segments are then force merged, the FlatVectorValues are merged into a single vector values instance. However, that merged instance's getLiveDocs() does not return the correct live docs; instead it returns the cost, which still includes the deleted docs.

As a result, during quantization we try to access vectors that are no longer present, which leads to an NPE.

org.apache.lucene.index.MergePolicy$MergeException: java.lang.IllegalStateException: this writer hit an unrecoverable error; cannot merge
        at org.opensearch.index.engine.InternalEngine$EngineMergeScheduler$2.doRun(InternalEngine.java:2494) [opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
        at org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:982) [opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
        at org.opensearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:52) [opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
        at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144) [?:?]
        at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642) [?:?]
        at java.base/java.lang.Thread.run(Thread.java:1583) [?:?]
Caused by: java.lang.IllegalStateException: this writer hit an unrecoverable error; cannot merge
        at org.apache.lucene.index.IndexWriter.hasPendingMerges(IndexWriter.java:2436) ~[lucene-core-9.12.0-snapshot-847316d.jar:9.12.0-snapshot-847316d 847316dd1eec2258ccf5da52dca51c01b9df1c4c - 2024-06-27 17:07:28]
        at org.opensearch.index.engine.InternalEngine$EngineMergeScheduler.afterMerge(InternalEngine.java:2452) ~[opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
        at org.opensearch.index.engine.OpenSearchConcurrentMergeScheduler.doMerge(OpenSearchConcurrentMergeScheduler.java:125) ~[opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
        at org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:721) ~[lucene-core-9.12.0-snapshot-847316d.jar:9.12.0-snapshot-847316d 847316dd1eec2258ccf5da52dca51c01b9df1c4c - 2024-06-27 17:07:28]
Caused by: java.lang.NullPointerException: Cannot read field "values" because "this.current" is null
        at org.apache.lucene.codecs.KnnVectorsWriter$MergedVectorValues$MergedFloat32VectorValues.vectorValue(KnnVectorsWriter.java:225) ~[lucene-core-9.12.0-snapshot-847316d.jar:9.12.0-snapshot-847316d 847316dd1eec2258ccf5da52dca51c01b9df1c4c - 2024-06-27 17:07:28]
        at org.opensearch.knn.index.vectorvalues.VectorValueExtractorStrategy$DISIVectorExtractor.extract(VectorValueExtractorStrategy.java:78) ~[?:?]
        at org.opensearch.knn.index.vectorvalues.VectorValueExtractorStrategy.extractFloatVector(VectorValueExtractorStrategy.java:32) ~[?:?]
        at org.opensearch.knn.index.vectorvalues.KNNFloatVectorValues.getVector(KNNFloatVectorValues.java:26) ~[?:?]
        at org.opensearch.knn.index.vectorvalues.KNNFloatVectorValues.getVector(KNNFloatVectorValues.java:19) ~[?:?]
        at org.opensearch.knn.index.quantizationservice.KNNVectorQuantizationTrainingRequest.getVectorAtThePosition(KNNVectorQuantizationTrainingRequest.java:53) ~[?:?]
        at org.opensearch.knn.quantization.quantizer.QuantizerHelper.calculateMeanThresholds(QuantizerHelper.java:35) ~[?:?]
        at org.opensearch.knn.quantization.quantizer.OneBitScalarQuantizer.train(OneBitScalarQuantizer.java:63) ~[?:?]
        at org.opensearch.knn.index.quantizationservice.QuantizationService.train(QuantizationService.java:69) ~[?:?]
        at org.opensearch.knn.index.codec.KNN990Codec.NativeEngines990KnnVectorsWriter.trainAndIndex(NativeEngines990KnnVectorsWriter.java:262) ~[?:?]
        at org.opensearch.knn.index.codec.KNN990Codec.NativeEngines990KnnVectorsWriter.mergeOneField(NativeEngines990KnnVectorsWriter.java:104) ~[?:?]
        at org.apache.lucene.codecs.perfield.PerFieldKnnVectorsFormat$FieldsWriter.mergeOneField(PerFieldKnnVectorsFormat.java:121) ~[lucene-core-9.12.0-snapshot-847316d.jar:9.12.0-snapshot-847316d 847316dd1eec2258ccf5da52dca51c01b9df1c4c - 2024-06-27 17:07:28]
        at org.apache.lucene.codecs.KnnVectorsWriter.merge(KnnVectorsWriter.java:99) ~[lucene-core-9.12.0-snapshot-847316d.jar:9.12.0-snapshot-847316d 847316dd1eec2258ccf5da52dca51c01b9df1c4c - 2024-06-27 17:07:28]
        at org.apache.lucene.index.SegmentMerger.mergeVectorValues(SegmentMerger.java:282) ~[lucene-core-9.12.0-snapshot-847316d.jar:9.12.0-snapshot-847316d 847316dd1eec2258ccf5da52dca51c01b9df1c4c - 2024-06-27 17:07:28]
        at org.apache.lucene.index.SegmentMerger.mergeWithLogging(SegmentMerger.java:325) ~[lucene-core-9.12.0-snapshot-847316d.jar:9.12.0-snapshot-847316d 847316dd1eec2258ccf5da52dca51c01b9df1c4c - 2024-06-27 17:07:28]
        at org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java:171) ~[lucene-core-9.12.0-snapshot-847316d.jar:9.12.0-snapshot-847316d 847316dd1eec2258ccf5da52dca51c01b9df1c4c - 2024-06-27 17:07:28]
        at org.apache.lucene.index.IndexWriter.mergeMiddle(IndexWriter.java:5293) ~[lucene-core-9.12.0-snapshot-847316d.jar:9.12.0-snapshot-847316d 847316dd1eec2258ccf5da52dca51c01b9df1c4c - 2024-06-27 17:07:28]
        at org.apache.lucene.index.IndexWriter.merge(IndexWriter.java:4761) ~[lucene-core-9.12.0-snapshot-847316d.jar:9.12.0-snapshot-847316d 847316dd1eec2258ccf5da52dca51c01b9df1c4c - 2024-06-27 17:07:28]
        at org.apache.lucene.index.IndexWriter$IndexWriterMergeSource.merge(IndexWriter.java:6582) ~[lucene-core-9.12.0-snapshot-847316d.jar:9.12.0-snapshot-847316d 847316dd1eec2258ccf5da52dca51c01b9df1c4c - 2024-06-27 17:07:28]
        at org.apache.lucene.index.ConcurrentMergeScheduler.doMerge(ConcurrentMergeScheduler.java:660) ~[lucene-core-9.12.0-snapshot-847316d.jar:9.12.0-snapshot-847316d 847316dd1eec2258ccf5da52dca51c01b9df1c4c - 2024-06-27 17:07:28]
        at org.opensearch.index.engine.OpenSearchConcurrentMergeScheduler.doMerge(OpenSearchConcurrentMergeScheduler.java:120) ~[opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
        at org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:721) ~[lucene-core-9.12.0-snapshot-847316d.jar:9.12.0-snapshot-847316d 847316dd1eec2258ccf5da52dca51c01b9df1c4c - 2024-06-27 17:07:28]
[2024-09-05T02:44:02,224][WARN ][o.o.i.c.IndicesClusterStateService] [integTest-0] [my-knn-index-1][0] marking and sending shard failed due to [failed recovery]
org.opensearch.indices.recovery.RecoveryFailedException: [my-knn-index-1][0]: Recovery failed on {integTest-0}{FZH6cEHdT167xfsFW4y9Gg}{EmKDYp9tSyaQcRhQxPwCUg}{127.0.0.1}{127.0.0.1:9300}{dimr}{testattr=test, shard_indexing_pressure_enabled=true}
        at org.opensearch.index.shard.IndexShard.lambda$executeRecovery$32(IndexShard.java:3878) [opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
        at org.opensearch.core.action.ActionListener$1.onFailure(ActionListener.java:90) [opensearch-core-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
        at org.opensearch.index.shard.StoreRecovery.lambda$recoveryListener$10(StoreRecovery.java:626) [opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
        at org.opensearch.core.action.ActionListener$1.onFailure(ActionListener.java:90) [opensearch-core-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
        at org.opensearch.core.action.ActionListener.completeWith(ActionListener.java:347) [opensearch-core-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
        at org.opensearch.index.shard.StoreRecovery.recoverFromStore(StoreRecovery.java:123) [opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
        at org.opensearch.index.shard.IndexShard.recoverFromStore(IndexShard.java:2895) [opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
        at org.opensearch.action.ActionRunnable$2.doRun(ActionRunnable.java:89) [opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
        at org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:982) [opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
        at org.opensearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:52) [opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
        at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144) [?:?]
        at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642) [?:?]
        at java.base/java.lang.Thread.run(Thread.java:1583) [?:?]
Caused by: org.opensearch.index.shard.IndexShardRecoveryException: failed recovery
        ... 11 more
Caused by: org.apache.lucene.store.AlreadyClosedException: translog is already closed
        at org.opensearch.index.translog.Translog.ensureOpen(Translog.java:1882) ~[opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
        at org.opensearch.index.translog.Translog.getMaxSeqNo(Translog.java:1959) ~[opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
        at org.opensearch.index.translog.InternalTranslogManager.getMaxSeqNo(InternalTranslogManager.java:398) ~[opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
        at org.opensearch.index.engine.InternalEngine.<init>(InternalEngine.java:337) ~[opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
        at org.opensearch.index.engine.InternalEngine.<init>(InternalEngine.java:223) ~[opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
        at org.opensearch.index.engine.InternalEngineFactory.newReadWriteEngine(InternalEngineFactory.java:43) ~[opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
        at org.opensearch.index.shard.IndexShard.innerOpenEngineAndTranslog(IndexShard.java:2605) ~[opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
        at org.opensearch.index.shard.IndexShard.openEngineAndRecoverFromTranslog(IndexShard.java:2511) ~[opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
        at org.opensearch.index.shard.IndexShard.openEngineAndRecoverFromTranslog(IndexShard.java:2483) ~[opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
        at org.opensearch.index.shard.StoreRecovery.internalRecoverFromStore(StoreRecovery.java:750) ~[opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
        at org.opensearch.index.shard.StoreRecovery.lambda$recoverFromStore$0(StoreRecovery.java:125) ~[opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
        at org.opensearch.core.action.ActionListener.completeWith(ActionListener.java:344) ~[opensearch-core-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
        ... 8 more

With this change, we iterate over the vector values to determine the correct live-doc count, and then pass that count to the quantization service and the NativeIndexWriter.
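The idea behind the fix can be sketched with a simplified, self-contained example. The types below (FakeMergedVectorValues, countLiveDocs) are hypothetical stand-ins, not the actual Lucene or k-NN APIs: they only illustrate why trusting cost() over-counts after deletes, and why fully iterating the merged vector values yields the correct live-doc count.

```java
import java.util.Arrays;
import java.util.List;

public class LiveDocsCount {
    static final int NO_MORE_DOCS = Integer.MAX_VALUE;

    // Hypothetical stand-in for a merged vector-values iterator: iteration
    // only visits live docs, but cost() still reflects the pre-delete total.
    static class FakeMergedVectorValues {
        private final List<Integer> liveDocIds; // doc ids that survived deletes
        private final int cost;                 // includes deleted docs (the bug)
        private int cursor = -1;

        FakeMergedVectorValues(List<Integer> liveDocIds, int cost) {
            this.liveDocIds = liveDocIds;
            this.cost = cost;
        }

        int cost() { return cost; }

        int nextDoc() {
            cursor++;
            return cursor < liveDocIds.size() ? liveDocIds.get(cursor) : NO_MORE_DOCS;
        }
    }

    // The essence of the fix: derive the live-doc count by iterating to
    // exhaustion instead of trusting the reported cost.
    static int countLiveDocs(FakeMergedVectorValues values) {
        int count = 0;
        while (values.nextDoc() != NO_MORE_DOCS) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        // 4 docs were merged, but 2 were deleted; cost() still reports 4.
        FakeMergedVectorValues values =
                new FakeMergedVectorValues(Arrays.asList(0, 2), 4);
        System.out.println("cost = " + values.cost());          // 4: would cause out-of-range vector access
        System.out.println("live = " + countLiveDocs(values));  // 2: safe to pass to quantization
    }
}
```

Using cost() as the training-set size makes quantization training request vectors at positions that no longer exist, which matches the NPE in the stack trace above; counting by iteration avoids that.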

Testing

I added an IT that simulates this bug, but due to some race conditions it does not reproduce the failure 100% of the time. I will follow up to fix the IT and make it stricter.

Manual Testing

I tested manually with the flow below, which always produces the exception without this fix.

Create the index

curl --request PUT \
  --url http://localhost:9200/my-knn-index-1 \
  --header 'Content-Type: application/json' \
  --data '{
	"settings": {
		"index": {
			"number_of_shards": 1,
			"number_of_replicas": 0,
			"refresh_interval": "1s",
			"knn": true
		}
	},
	"mappings": {
		"properties": {
			"my_vector2": {
				"type": "knn_vector",
				"dimension": 4,
				"space_type": "innerproduct"
			},
			"my_vector3": {
				"type": "knn_vector",
				"dimension": 4,
				"space_type": "innerproduct",
				"mode": "on_disk",
				"compression_level": "32x"
			}
		}
	}
}'

Index Data

curl --request POST \
  --url 'http://localhost:9200/_bulk?refresh=' \
  --header 'Content-Type: application/json' \
  --data '{ "index": { "_index": "my-knn-index-1", "_id": "111" } }
{ "my_vector2": [-1.5, -5.5, -4.5, -6.4], "my_vector3": [1.5, 5.5, 4.5, 6.4], "price": 12.2 }
{ "index": { "_index": "my-knn-index-1", "_id": "9111" } }
{ "my_vector2": [1.5, 5.5, 4.5, 6.4], "my_vector3": [-1.5, -5.5, -4.5, -6.4], "price": 8.9, "message": "OpenSearch" }
'

Index the same data again to simulate deletes (re-indexing the same IDs marks the old docs as deleted)

curl --request POST \
  --url 'http://localhost:9200/_bulk?refresh=' \
  --header 'Content-Type: application/json' \
  --data '{ "index": { "_index": "my-knn-index-1", "_id": "111" } }
{ "my_vector2": [-1.5, -5.5, -4.5, -6.4], "my_vector3": [1.5, 5.5, 4.5, 6.4], "price": 12.2 }
{ "index": { "_index": "my-knn-index-1", "_id": "9111" } }
{ "my_vector2": [1.5, 5.5, 4.5, 6.4], "my_vector3": [-1.5, -5.5, -4.5, -6.4], "price": 8.9, "message": "OpenSearch" }
'

Do force merge

curl --request POST \
  --url 'http://localhost:9200/my-knn-index-1/_forcemerge?max_num_segments=1&wait_for_completion=true' \
  --header 'Content-Type: application/json'

Check the logs to see the errors.

Do a search (it fails without this fix)

curl --request POST \
  --url http://localhost:9200/my-knn-index-1/_search \
  --header 'Content-Type: application/json' \
  --data '{
	"query":{
		"knn":{
			"my_vector2": {
				"vector": [1,1,1,1],
				"k": 10
			}
		}
	}
}'

Issue

#1949

Check List

  • New functionality includes testing.
  • New functionality has been documented.
  • API changes companion pull request created.
  • Commits are signed per the DCO using --signoff.
  • Public documentation issue/PR created.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

@navneet1v
Collaborator Author

Checking why CIs are failing

@navneet1v
Collaborator Author

Checking why CIs are failing

CIs were failing because the code was skipping some docs during live-doc counting. The logic is now fixed. Thanks @jmazanec15 for the help in debugging.

…eted docs in it

Signed-off-by: Navneet Verma <navneev@amazon.com>
@Vikasht34
Collaborator

LGTM

@navneet1v
Collaborator Author

waiting on CIs to complete.

@navneet1v navneet1v merged commit da854c9 into opensearch-project:main Sep 5, 2024
29 checks passed
opensearch-trigger-bot bot pushed a commit that referenced this pull request Sep 5, 2024
…eted docs in it (#2046)

Signed-off-by: Navneet Verma <navneev@amazon.com>
(cherry picked from commit da854c9)
opensearch-trigger-bot bot pushed a commit that referenced this pull request Sep 5, 2024
…eted docs in it (#2046)

Signed-off-by: Navneet Verma <navneev@amazon.com>
(cherry picked from commit da854c9)
navneet1v added a commit that referenced this pull request Sep 5, 2024
…eted docs in it (#2046) (#2051)

Signed-off-by: Navneet Verma <navneev@amazon.com>
(cherry picked from commit da854c9)

Co-authored-by: Navneet Verma <navneev@amazon.com>
navneet1v added a commit that referenced this pull request Sep 5, 2024
…eted docs in it (#2046) (#2052)

Signed-off-by: Navneet Verma <navneev@amazon.com>
(cherry picked from commit da854c9)

Co-authored-by: Navneet Verma <navneev@amazon.com>
akashsha1 pushed a commit to akashsha1/k-NN that referenced this pull request Sep 16, 2024
…eted docs in it (opensearch-project#2046)

Signed-off-by: Navneet Verma <navneev@amazon.com>
Signed-off-by: Akash Shankaran <akash.shankaran@intel.com>