[BUG] Snapshot restoration fails due to CorruptIndexException #5337

Open
PaulLesur opened this issue Nov 22, 2022 · 7 comments
Labels: bug (Something isn't working), Indexing & Search

@PaulLesur

Describe the bug
Indices cannot be restored because of a one-byte length difference in the snapshot.

To Reproduce

  • Take a snapshot
  • Try to restore this snapshot (a scripted version of these two steps is sketched below)
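
For reference, a minimal scripted version of these two steps. This is a sketch, not part of the original report: it assumes the Python requests package, an already-registered repository, and basic auth on localhost:9200; the credentials and snapshot name are placeholders.

import requests

HOST = "https://localhost:9200"
AUTH = ("admin", "admin")           # placeholder credentials
REPO = "prod-snapshots-os"          # Azure-backed repository name seen in the logs
SNAPSHOT = "debug-snapshot"         # placeholder snapshot name
INDEX = "backend-api-20220608"

# Step 1: take a snapshot of the index and wait for it to finish.
requests.put(
    f"{HOST}/_snapshot/{REPO}/{SNAPSHOT}?wait_for_completion=true",
    json={"indices": INDEX},
    auth=AUTH,
    verify=False,
)

# Step 2: delete the live index, then restore it from the snapshot.
requests.delete(f"{HOST}/{INDEX}", auth=AUTH, verify=False)
restore = requests.post(
    f"{HOST}/_snapshot/{REPO}/{SNAPSHOT}/_restore?wait_for_completion=true",
    json={"indices": INDEX},
    auth=AUTH,
    verify=False,
)
print(restore.status_code, restore.json())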

Expected behavior
The snapshot should be restored properly.

Host/Environment (please complete the following information):

  • OpenSearch deployed on Kubernetes
    Here is the information about our cluster:
{
  "name" : "opensearch-cluster-master-0",
  "cluster_name" : "opensearch-cluster",
  "cluster_uuid" : "MXWjw2e2RsKpx3jg27pPag",
  "version" : {
    "distribution" : "opensearch",
    "number" : "1.2.4",
    "build_type" : "tar",
    "build_hash" : "e505b10357c03ae8d26d675172402f2f2144ef0f",
    "build_date" : "2022-01-14T03:38:06.881862Z",
    "build_snapshot" : false,
    "lucene_version" : "8.10.1",
    "minimum_wire_compatibility_version" : "6.8.0",
    "minimum_index_compatibility_version" : "6.0.0-beta1"
  },
  "tagline" : "The OpenSearch Project: https://opensearch.org/"
}

Additional context
We use OpenSearch to store our logs. Our OpenSearch cluster runs on Kubernetes using the official Docker image.
We have a cronjob that snapshots the indices every day to Azure Blob Storage and deletes indices older than 5 days. When we want to access old log data we restore the snapshots, but some of the indices in these snapshots fail to restore.
The issue is random and affects roughly one snapshot in 20.

For example, an index named backend-api-20220608:

The index health is red:

health status index                           uuid                   pri rep docs.count docs.deleted store.size pri.store.size
red    open   backend-api-20220608            ZBZeXE7ZSEOq1qW3ev3k3Q   2   1                                                  

The logs in OpenSearch:

[2022-11-22T13:50:25,996][WARN ][o.o.i.c.IndicesClusterStateService] [opensearch-cluster-master-0] [backend-api-20220608][1] marking and sending shard failed due to [failed recovery]
org.opensearch.indices.recovery.RecoveryFailedException: [backend-api-20220608][1]: Recovery failed on {opensearch-cluster-master-0}{KvqTHUvyQxiZAoGKhN5GlA}{S5g3fl2XTv-cMF2COLwmnw}{10.250.3.226}{10.250.3.226:9300}{dimr}{shard_indexing_pressure_enabled=true}
	at org.opensearch.index.shard.IndexShard.lambda$executeRecovery$21(IndexShard.java:3071) [opensearch-1.2.4.jar:1.2.4]
	at org.opensearch.action.ActionListener$1.onFailure(ActionListener.java:86) [opensearch-1.2.4.jar:1.2.4]
	at org.opensearch.index.shard.StoreRecovery.lambda$recoveryListener$6(StoreRecovery.java:421) [opensearch-1.2.4.jar:1.2.4]
	at org.opensearch.action.ActionListener$1.onFailure(ActionListener.java:86) [opensearch-1.2.4.jar:1.2.4]
	at org.opensearch.index.shard.StoreRecovery.lambda$restore$8(StoreRecovery.java:554) [opensearch-1.2.4.jar:1.2.4]
	at org.opensearch.action.ActionListener$1.onFailure(ActionListener.java:86) [opensearch-1.2.4.jar:1.2.4]
	at org.opensearch.repositories.blobstore.BlobStoreRepository.lambda$restoreShard$67(BlobStoreRepository.java:2591) [opensearch-1.2.4.jar:1.2.4]
	at org.opensearch.action.ActionListener$2.onFailure(ActionListener.java:109) [opensearch-1.2.4.jar:1.2.4]
	at org.opensearch.action.ActionListener$1.onFailure(ActionListener.java:86) [opensearch-1.2.4.jar:1.2.4]
	at org.opensearch.action.ActionListener$4.onFailure(ActionListener.java:188) [opensearch-1.2.4.jar:1.2.4]
	at org.opensearch.action.support.GroupedActionListener.onFailure(GroupedActionListener.java:96) [opensearch-1.2.4.jar:1.2.4]
	at org.opensearch.repositories.blobstore.BlobStoreRepository.lambda$fileQueueListener$69(BlobStoreRepository.java:2718) [opensearch-1.2.4.jar:1.2.4]
	at org.opensearch.action.ActionListener$2.onFailure(ActionListener.java:109) [opensearch-1.2.4.jar:1.2.4]
	at org.opensearch.action.ActionRunnable.onFailure(ActionRunnable.java:101) [opensearch-1.2.4.jar:1.2.4]
	at org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.onFailure(ThreadContext.java:780) [opensearch-1.2.4.jar:1.2.4]
	at org.opensearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:52) [opensearch-1.2.4.jar:1.2.4]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1130) [?:?]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:630) [?:?]
	at java.lang.Thread.run(Thread.java:832) [?:?]
Caused by: org.opensearch.index.shard.IndexShardRecoveryException: failed recovery
	... 17 more
Caused by: org.opensearch.index.snapshots.IndexShardRestoreFailedException: restore failed
	... 15 more
Caused by: org.opensearch.index.snapshots.IndexShardRestoreFailedException: failed to restore snapshot [prod-curator-20220609/trKkuPZqRUaRy9q5va08Ow]
	... 13 more
Caused by: org.apache.lucene.index.CorruptIndexException: verification failed (hardware problem?) : expected=giidc5 actual=giidc5 footer=null writtenLength=3300529277 expectedLength=3300529276 (resource=name [_16gz.fdt], length [3300529276], checksum [giidc5], writtenBy [8.10.1]) (resource=VerifyingIndexOutput(_16gz.fdt))
	at org.opensearch.index.store.Store$LuceneVerifyingIndexOutput.verify(Store.java:1260) ~[opensearch-1.2.4.jar:1.2.4]
	at org.opensearch.index.store.Store$LuceneVerifyingIndexOutput.writeByte(Store.java:1278) ~[opensearch-1.2.4.jar:1.2.4]
	at org.opensearch.index.store.Store$LuceneVerifyingIndexOutput.writeBytes(Store.java:1305) ~[opensearch-1.2.4.jar:1.2.4]
	at org.opensearch.repositories.blobstore.BlobStoreRepository$9.restoreFile(BlobStoreRepository.java:2677) ~[opensearch-1.2.4.jar:1.2.4]
	at org.opensearch.repositories.blobstore.BlobStoreRepository$9.lambda$executeOneFileRestore$1(BlobStoreRepository.java:2641) ~[opensearch-1.2.4.jar:1.2.4]
	at org.opensearch.action.ActionRunnable$2.doRun(ActionRunnable.java:86) ~[opensearch-1.2.4.jar:1.2.4]
	at org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:792) ~[opensearch-1.2.4.jar:1.2.4]
	at org.opensearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:50) ~[opensearch-1.2.4.jar:1.2.4]
	... 3 more
[2022-11-22T13:50:26,005][WARN ][o.o.c.r.a.AllocationService] [opensearch-cluster-master-0] failing shard [failed shard, shard [backend-api-20220608][1], node[KvqTHUvyQxiZAoGKhN5GlA], [P], recovery_source[snapshot recovery [KV3Ie4LZRjeMzoDYpUXIMg] from prod-snapshots-os:prod-curator-20220609/trKkuPZqRUaRy9q5va08Ow], s[INITIALIZING], a[id=kXZSRB_aQMmD9MSUJ3oNLQ], unassigned_info[[reason=NEW_INDEX_RESTORED], at[2022-11-22T13:46:22.134Z], delayed=false, details[restore_source[prod-snapshots-os/prod-curator-20220609]], allocation_status[fetching_shard_data]], expected_shard_size[6915572784], message [failed recovery], failure [RecoveryFailedException[[backend-api-20220608][1]: Recovery failed on {opensearch-cluster-master-0}{KvqTHUvyQxiZAoGKhN5GlA}{S5g3fl2XTv-cMF2COLwmnw}{10.250.3.226}{10.250.3.226:9300}{dimr}{shard_indexing_pressure_enabled=true}]; nested: IndexShardRecoveryException[failed recovery]; nested: IndexShardRestoreFailedException[restore failed]; nested: IndexShardRestoreFailedException[failed to restore snapshot [prod-curator-20220609/trKkuPZqRUaRy9q5va08Ow]]; nested: CorruptIndexException[verification failed (hardware problem?) : expected=giidc5 actual=giidc5 footer=null writtenLength=3300529277 expectedLength=3300529276 (resource=name [_16gz.fdt], length [3300529276], checksum [giidc5], writtenBy [8.10.1]) (resource=VerifyingIndexOutput(_16gz.fdt))]; ], markAsStale [true]]
org.opensearch.indices.recovery.RecoveryFailedException: [backend-api-20220608][1]: Recovery failed on {opensearch-cluster-master-0}{KvqTHUvyQxiZAoGKhN5GlA}{S5g3fl2XTv-cMF2COLwmnw}{10.250.3.226}{10.250.3.226:9300}{dimr}{shard_indexing_pressure_enabled=true}
	at org.opensearch.index.shard.IndexShard.lambda$executeRecovery$21(IndexShard.java:3071) ~[opensearch-1.2.4.jar:1.2.4]
	at org.opensearch.action.ActionListener$1.onFailure(ActionListener.java:86) ~[opensearch-1.2.4.jar:1.2.4]
	at org.opensearch.index.shard.StoreRecovery.lambda$recoveryListener$6(StoreRecovery.java:421) ~[opensearch-1.2.4.jar:1.2.4]
	at org.opensearch.action.ActionListener$1.onFailure(ActionListener.java:86) ~[opensearch-1.2.4.jar:1.2.4]
	at org.opensearch.index.shard.StoreRecovery.lambda$restore$8(StoreRecovery.java:554) ~[opensearch-1.2.4.jar:1.2.4]
	at org.opensearch.action.ActionListener$1.onFailure(ActionListener.java:86) ~[opensearch-1.2.4.jar:1.2.4]
	at org.opensearch.repositories.blobstore.BlobStoreRepository.lambda$restoreShard$67(BlobStoreRepository.java:2591) ~[opensearch-1.2.4.jar:1.2.4]
	at org.opensearch.action.ActionListener$2.onFailure(ActionListener.java:109) ~[opensearch-1.2.4.jar:1.2.4]
	at org.opensearch.action.ActionListener$1.onFailure(ActionListener.java:86) ~[opensearch-1.2.4.jar:1.2.4]
	at org.opensearch.action.ActionListener$4.onFailure(ActionListener.java:188) ~[opensearch-1.2.4.jar:1.2.4]
	at org.opensearch.action.support.GroupedActionListener.onFailure(GroupedActionListener.java:96) ~[opensearch-1.2.4.jar:1.2.4]
	at org.opensearch.repositories.blobstore.BlobStoreRepository.lambda$fileQueueListener$69(BlobStoreRepository.java:2718) ~[opensearch-1.2.4.jar:1.2.4]
	at org.opensearch.action.ActionListener$2.onFailure(ActionListener.java:109) ~[opensearch-1.2.4.jar:1.2.4]
	at org.opensearch.action.ActionRunnable.onFailure(ActionRunnable.java:101) ~[opensearch-1.2.4.jar:1.2.4]
	at org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.onFailure(ThreadContext.java:780) ~[opensearch-1.2.4.jar:1.2.4]
	at org.opensearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:52) ~[opensearch-1.2.4.jar:1.2.4]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1130) [?:?]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:630) [?:?]
	at java.lang.Thread.run(Thread.java:832) [?:?]
Caused by: org.opensearch.index.shard.IndexShardRecoveryException: failed recovery
	... 17 more
Caused by: org.opensearch.index.snapshots.IndexShardRestoreFailedException: restore failed
	... 15 more
Caused by: org.opensearch.index.snapshots.IndexShardRestoreFailedException: failed to restore snapshot [prod-curator-20220609/trKkuPZqRUaRy9q5va08Ow]
	... 13 more
Caused by: org.apache.lucene.index.CorruptIndexException: verification failed (hardware problem?) : expected=giidc5 actual=giidc5 footer=null writtenLength=3300529277 expectedLength=3300529276 (resource=name [_16gz.fdt], length [3300529276], checksum [giidc5], writtenBy [8.10.1]) (resource=VerifyingIndexOutput(_16gz.fdt))
	at org.opensearch.index.store.Store$LuceneVerifyingIndexOutput.verify(Store.java:1260) ~[opensearch-1.2.4.jar:1.2.4]
	at org.opensearch.index.store.Store$LuceneVerifyingIndexOutput.writeByte(Store.java:1278) ~[opensearch-1.2.4.jar:1.2.4]
	at org.opensearch.index.store.Store$LuceneVerifyingIndexOutput.writeBytes(Store.java:1305) ~[opensearch-1.2.4.jar:1.2.4]
	at org.opensearch.repositories.blobstore.BlobStoreRepository$9.restoreFile(BlobStoreRepository.java:2677) ~[opensearch-1.2.4.jar:1.2.4]
	at org.opensearch.repositories.blobstore.BlobStoreRepository$9.lambda$executeOneFileRestore$1(BlobStoreRepository.java:2641) ~[opensearch-1.2.4.jar:1.2.4]
	at org.opensearch.action.ActionRunnable$2.doRun(ActionRunnable.java:86) ~[opensearch-1.2.4.jar:1.2.4]
	at org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:792) ~[opensearch-1.2.4.jar:1.2.4]
	at org.opensearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:50) ~[opensearch-1.2.4.jar:1.2.4]
	... 3 more
[2022-11-22T13:50:26,008][INFO ][o.o.c.r.a.AllocationService] [opensearch-cluster-master-0] Cluster health status changed from [YELLOW] to [RED] (reason: [shards failed [[backend-api-20220608][1]]]).
[2022-11-22T13:53:16,032][INFO ][o.o.j.s.JobSweeper       ] [opensearch-cluster-master-0] Running full sweep

The error seems to be here: writtenLength=3300529277 expectedLength=3300529276. The restored file is one byte longer than the length recorded in the snapshot metadata, while the checksum strings shown (expected=giidc5 actual=giidc5) are identical.

What could cause this difference? Is it possible to restore the index without the corrupted data? Is our data lost, or is there a way to recover it?
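
One way to narrow this down is to compare the sizes of the blobs actually stored in the Azure container against the per-file lengths recorded in the shard-level snapshot metadata. A rough sketch using the azure-storage-blob SDK; the connection string, container name, and path prefix are placeholders (the directory id under indices/ is a repository-internal id that has to be looked up in the repository metadata, it is not the cluster index UUID):

from azure.storage.blob import BlobServiceClient

# Length recorded in the snapshot metadata for _16gz.fdt, per the exception.
EXPECTED_LENGTH = 3300529276

client = BlobServiceClient.from_connection_string("<connection-string>")
container = client.get_container_client("<container-name>")

# Shard data lives under indices/<repository index id>/<shard id>/ in a
# blob-store repository; the prefix below is illustrative only.
prefix = "indices/<repository-index-id>/1/"
for blob in container.list_blobs(name_starts_with=prefix):
    # Large Lucene files are split across several part blobs, so the sizes
    # have to be matched by hand against the lengths listed in the shard's
    # snap-*.dat metadata; EXPECTED_LENGTH is the total to look for.
    print(blob.name, blob.size)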

@PaulLesur added the bug (Something isn't working) and untriaged labels on Nov 22, 2022
@andrross changed the title from "[BUG] Snapshot restauration fails due to CorruptIndexException" to "[BUG] Snapshot restoration fails due to CorruptIndexException" on Nov 22, 2022
@PaulLesur (Author)

Hello! Do you think the data in the snapshots is recoverable, or is it lost?

@PaulLesur (Author)

Is there a way to recover the data stored in the snapshot?

@PaulLesur (Author)

🆙

@dblock (Member) commented Mar 3, 2023

Sorry it took so long to reply here. It does look like a data corruption problem, but let's not exclude bugs. I think it needs narrowing down. I would try a few things:

  1. Does a retry fix the restore? (This would point to a networking issue.)
  2. Attempt (twice) to restore immediately after taking a snapshot, to see whether the corruption happens at snapshot time or during restore (a scripted version of this check is sketched below).
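
A sketch of suggestion 2, scripted: take a fresh snapshot of the affected index and immediately attempt two restores into renamed copies. The host, credentials, and snapshot name are placeholders; only the repository and index names come from the report.

import requests

HOST, AUTH = "https://localhost:9200", ("admin", "admin")  # placeholders
REPO, INDEX = "prod-snapshots-os", "backend-api-20220608"
SNAP = "debug-corruption-check"                            # placeholder name

# Snapshot the index right now and wait for completion.
requests.put(
    f"{HOST}/_snapshot/{REPO}/{SNAP}?wait_for_completion=true",
    json={"indices": INDEX},
    auth=AUTH,
    verify=False,
)

# Restore it twice, each time into a renamed copy, so the original index is
# left untouched and the two attempts can be compared.
for attempt in (1, 2):
    r = requests.post(
        f"{HOST}/_snapshot/{REPO}/{SNAP}/_restore?wait_for_completion=true",
        json={
            "indices": INDEX,
            "rename_pattern": "(.+)",
            "rename_replacement": f"restored-{attempt}-$1",
        },
        auth=AUTH,
        verify=False,
    )
    print(f"attempt {attempt}:", r.status_code, r.json())

If both restores of a just-taken snapshot fail with the same CorruptIndexException, the data was corrupted before or during the snapshot; if they succeed, the problem is on the restore path.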

@PaulLesur (Author)

Hello @dblock! Sorry for the delay.

  1. No, we retried several times but the same issue occurred again and again.
  2. We have multiple snapshots containing the same indices; when we try to restore a corrupted index, it is corrupted in all of them.

@dblock (Member) commented Apr 17, 2023

So the issue is that these snapshots are created already corrupted. I would keep narrowing down what causes this, but I'm not sure what to suggest next; maybe @andrross will have an idea. Would upgrading to the latest 1.x, or even 2.x, to see whether the problem persists be an option?

@AmiStrn (Contributor) commented May 10, 2023

Checksums are verified during allocation and during snapshot restore, so the corruption could already be present on disk; after the next snapshot succeeds, that snapshot is then corrupted as well. We had a case where an old snapshot was still recoverable but all of the newer ones were corrupted, and we could not replicate the issue either. Additionally, this only happened on our Azure deployments (also using Azure Blob Storage).

We have yet to figure this one out, as it is rare and hard to replicate. We tried the doc below, but it turned out not to be the same issue; maybe it can still help with yours, @PaulLesur:
https://www.elastic.co/blog/canonical-elastic-and-google-team-up-to-prevent-data-corruption-in-linux
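
As one possible way to act on the checksum point above (a sketch, not something verified in this thread): Lucene checksum verification can be forced at shard startup with the static index.shard.check_on_startup setting, so on-disk corruption would surface before the next snapshot runs. The host and credentials are placeholders, and the check makes reopening large shards noticeably slower.

import requests

HOST, AUTH = "https://localhost:9200", ("admin", "admin")  # placeholders
INDEX = "backend-api-20220608"

# The setting is static, so close the index, change it, then reopen.
requests.post(f"{HOST}/{INDEX}/_close", auth=AUTH, verify=False)
requests.put(
    f"{HOST}/{INDEX}/_settings",
    json={"index.shard.check_on_startup": "checksum"},
    auth=AUTH,
    verify=False,
)
# Reopening triggers the checksum pass; failures show up as shard recovery errors.
requests.post(f"{HOST}/{INDEX}/_open", auth=AUTH, verify=False)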
