[BUG] Snapshot restoration fails due to CorruptIndexException #5337

Open
PaulLesur opened this issue Nov 22, 2022 · 7 comments
Labels: bug (Something isn't working), Indexing & Search

@PaulLesur

Describe the bug
Indices cannot be restored because of a one-byte length difference in the snapshot.

To Reproduce

  • Take a snapshot
  • Try to restore this snapshot (a scripted version of these two steps is sketched below)
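
For reference, a minimal scripted version of these two steps. This is a sketch, not part of the original report: it assumes the Python requests package, an already-registered repository, and basic auth on localhost:9200; the credentials and snapshot name are placeholders.

import requests

HOST = "https://localhost:9200"
AUTH = ("admin", "admin")           # placeholder credentials
REPO = "prod-snapshots-os"          # Azure-backed repository name seen in the logs
SNAPSHOT = "debug-snapshot"         # placeholder snapshot name
INDEX = "backend-api-20220608"

# Step 1: take a snapshot of the index and wait for it to finish.
requests.put(
    f"{HOST}/_snapshot/{REPO}/{SNAPSHOT}?wait_for_completion=true",
    json={"indices": INDEX},
    auth=AUTH,
    verify=False,
)

# Step 2: delete the live index, then restore it from the snapshot.
requests.delete(f"{HOST}/{INDEX}", auth=AUTH, verify=False)
restore = requests.post(
    f"{HOST}/_snapshot/{REPO}/{SNAPSHOT}/_restore?wait_for_completion=true",
    json={"indices": INDEX},
    auth=AUTH,
    verify=False,
)
print(restore.status_code, restore.json())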

Expected behavior
The snapshot should be restored properly.

Host/Environment (please complete the following information):

  • OpenSearch deployed on Kubernetes
    Here is the information about our cluster:
{
  "name" : "opensearch-cluster-master-0",
  "cluster_name" : "opensearch-cluster",
  "cluster_uuid" : "MXWjw2e2RsKpx3jg27pPag",
  "version" : {
    "distribution" : "opensearch",
    "number" : "1.2.4",
    "build_type" : "tar",
    "build_hash" : "e505b10357c03ae8d26d675172402f2f2144ef0f",
    "build_date" : "2022-01-14T03:38:06.881862Z",
    "build_snapshot" : false,
    "lucene_version" : "8.10.1",
    "minimum_wire_compatibility_version" : "6.8.0",
    "minimum_index_compatibility_version" : "6.0.0-beta1"
  },
  "tagline" : "The OpenSearch Project: https://opensearch.org/"
}

Additional context
We use OpenSearch to store our logs. Our OpenSearch cluster runs on Kubernetes using the official Docker image.
We have a cronjob that snapshots the indices every day to Azure Blob Storage and deletes indices older than 5 days. When we want to access old log data we restore the snapshots, but some of the indices in these snapshots fail to restore.
The issue is random and affects roughly one snapshot in 20.

For example, an index named backend-api-20220608:

The index health is red:

health status index                           uuid                   pri rep docs.count docs.deleted store.size pri.store.size
red    open   backend-api-20220608            ZBZeXE7ZSEOq1qW3ev3k3Q   2   1                                                  

The logs in OpenSearch:

[2022-11-22T13:50:25,996][WARN ][o.o.i.c.IndicesClusterStateService] [opensearch-cluster-master-0] [backend-api-20220608][1] marking and sending shard failed due to [failed recovery]
org.opensearch.indices.recovery.RecoveryFailedException: [backend-api-20220608][1]: Recovery failed on {opensearch-cluster-master-0}{KvqTHUvyQxiZAoGKhN5GlA}{S5g3fl2XTv-cMF2COLwmnw}{10.250.3.226}{10.250.3.226:9300}{dimr}{shard_indexing_pressure_enabled=true}
	at org.opensearch.index.shard.IndexShard.lambda$executeRecovery$21(IndexShard.java:3071) [opensearch-1.2.4.jar:1.2.4]
	at org.opensearch.action.ActionListener$1.onFailure(ActionListener.java:86) [opensearch-1.2.4.jar:1.2.4]
	at org.opensearch.index.shard.StoreRecovery.lambda$recoveryListener$6(StoreRecovery.java:421) [opensearch-1.2.4.jar:1.2.4]
	at org.opensearch.action.ActionListener$1.onFailure(ActionListener.java:86) [opensearch-1.2.4.jar:1.2.4]
	at org.opensearch.index.shard.StoreRecovery.lambda$restore$8(StoreRecovery.java:554) [opensearch-1.2.4.jar:1.2.4]
	at org.opensearch.action.ActionListener$1.onFailure(ActionListener.java:86) [opensearch-1.2.4.jar:1.2.4]
	at org.opensearch.repositories.blobstore.BlobStoreRepository.lambda$restoreShard$67(BlobStoreRepository.java:2591) [opensearch-1.2.4.jar:1.2.4]
	at org.opensearch.action.ActionListener$2.onFailure(ActionListener.java:109) [opensearch-1.2.4.jar:1.2.4]
	at org.opensearch.action.ActionListener$1.onFailure(ActionListener.java:86) [opensearch-1.2.4.jar:1.2.4]
	at org.opensearch.action.ActionListener$4.onFailure(ActionListener.java:188) [opensearch-1.2.4.jar:1.2.4]
	at org.opensearch.action.support.GroupedActionListener.onFailure(GroupedActionListener.java:96) [opensearch-1.2.4.jar:1.2.4]
	at org.opensearch.repositories.blobstore.BlobStoreRepository.lambda$fileQueueListener$69(BlobStoreRepository.java:2718) [opensearch-1.2.4.jar:1.2.4]
	at org.opensearch.action.ActionListener$2.onFailure(ActionListener.java:109) [opensearch-1.2.4.jar:1.2.4]
	at org.opensearch.action.ActionRunnable.onFailure(ActionRunnable.java:101) [opensearch-1.2.4.jar:1.2.4]
	at org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.onFailure(ThreadContext.java:780) [opensearch-1.2.4.jar:1.2.4]
	at org.opensearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:52) [opensearch-1.2.4.jar:1.2.4]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1130) [?:?]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:630) [?:?]
	at java.lang.Thread.run(Thread.java:832) [?:?]
Caused by: org.opensearch.index.shard.IndexShardRecoveryException: failed recovery
	... 17 more
Caused by: org.opensearch.index.snapshots.IndexShardRestoreFailedException: restore failed
	... 15 more
Caused by: org.opensearch.index.snapshots.IndexShardRestoreFailedException: failed to restore snapshot [prod-curator-20220609/trKkuPZqRUaRy9q5va08Ow]
	... 13 more
Caused by: org.apache.lucene.index.CorruptIndexException: verification failed (hardware problem?) : expected=giidc5 actual=giidc5 footer=null writtenLength=3300529277 expectedLength=3300529276 (resource=name [_16gz.fdt], length [3300529276], checksum [giidc5], writtenBy [8.10.1]) (resource=VerifyingIndexOutput(_16gz.fdt))
	at org.opensearch.index.store.Store$LuceneVerifyingIndexOutput.verify(Store.java:1260) ~[opensearch-1.2.4.jar:1.2.4]
	at org.opensearch.index.store.Store$LuceneVerifyingIndexOutput.writeByte(Store.java:1278) ~[opensearch-1.2.4.jar:1.2.4]
	at org.opensearch.index.store.Store$LuceneVerifyingIndexOutput.writeBytes(Store.java:1305) ~[opensearch-1.2.4.jar:1.2.4]
	at org.opensearch.repositories.blobstore.BlobStoreRepository$9.restoreFile(BlobStoreRepository.java:2677) ~[opensearch-1.2.4.jar:1.2.4]
	at org.opensearch.repositories.blobstore.BlobStoreRepository$9.lambda$executeOneFileRestore$1(BlobStoreRepository.java:2641) ~[opensearch-1.2.4.jar:1.2.4]
	at org.opensearch.action.ActionRunnable$2.doRun(ActionRunnable.java:86) ~[opensearch-1.2.4.jar:1.2.4]
	at org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:792) ~[opensearch-1.2.4.jar:1.2.4]
	at org.opensearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:50) ~[opensearch-1.2.4.jar:1.2.4]
	... 3 more
[2022-11-22T13:50:26,005][WARN ][o.o.c.r.a.AllocationService] [opensearch-cluster-master-0] failing shard [failed shard, shard [backend-api-20220608][1], node[KvqTHUvyQxiZAoGKhN5GlA], [P], recovery_source[snapshot recovery [KV3Ie4LZRjeMzoDYpUXIMg] from prod-snapshots-os:prod-curator-20220609/trKkuPZqRUaRy9q5va08Ow], s[INITIALIZING], a[id=kXZSRB_aQMmD9MSUJ3oNLQ], unassigned_info[[reason=NEW_INDEX_RESTORED], at[2022-11-22T13:46:22.134Z], delayed=false, details[restore_source[prod-snapshots-os/prod-curator-20220609]], allocation_status[fetching_shard_data]], expected_shard_size[6915572784], message [failed recovery], failure [RecoveryFailedException[[backend-api-20220608][1]: Recovery failed on {opensearch-cluster-master-0}{KvqTHUvyQxiZAoGKhN5GlA}{S5g3fl2XTv-cMF2COLwmnw}{10.250.3.226}{10.250.3.226:9300}{dimr}{shard_indexing_pressure_enabled=true}]; nested: IndexShardRecoveryException[failed recovery]; nested: IndexShardRestoreFailedException[restore failed]; nested: IndexShardRestoreFailedException[failed to restore snapshot [prod-curator-20220609/trKkuPZqRUaRy9q5va08Ow]]; nested: CorruptIndexException[verification failed (hardware problem?) : expected=giidc5 actual=giidc5 footer=null writtenLength=3300529277 expectedLength=3300529276 (resource=name [_16gz.fdt], length [3300529276], checksum [giidc5], writtenBy [8.10.1]) (resource=VerifyingIndexOutput(_16gz.fdt))]; ], markAsStale [true]]
org.opensearch.indices.recovery.RecoveryFailedException: [backend-api-20220608][1]: Recovery failed on {opensearch-cluster-master-0}{KvqTHUvyQxiZAoGKhN5GlA}{S5g3fl2XTv-cMF2COLwmnw}{10.250.3.226}{10.250.3.226:9300}{dimr}{shard_indexing_pressure_enabled=true}
	at org.opensearch.index.shard.IndexShard.lambda$executeRecovery$21(IndexShard.java:3071) ~[opensearch-1.2.4.jar:1.2.4]
	at org.opensearch.action.ActionListener$1.onFailure(ActionListener.java:86) ~[opensearch-1.2.4.jar:1.2.4]
	at org.opensearch.index.shard.StoreRecovery.lambda$recoveryListener$6(StoreRecovery.java:421) ~[opensearch-1.2.4.jar:1.2.4]
	at org.opensearch.action.ActionListener$1.onFailure(ActionListener.java:86) ~[opensearch-1.2.4.jar:1.2.4]
	at org.opensearch.index.shard.StoreRecovery.lambda$restore$8(StoreRecovery.java:554) ~[opensearch-1.2.4.jar:1.2.4]
	at org.opensearch.action.ActionListener$1.onFailure(ActionListener.java:86) ~[opensearch-1.2.4.jar:1.2.4]
	at org.opensearch.repositories.blobstore.BlobStoreRepository.lambda$restoreShard$67(BlobStoreRepository.java:2591) ~[opensearch-1.2.4.jar:1.2.4]
	at org.opensearch.action.ActionListener$2.onFailure(ActionListener.java:109) ~[opensearch-1.2.4.jar:1.2.4]
	at org.opensearch.action.ActionListener$1.onFailure(ActionListener.java:86) ~[opensearch-1.2.4.jar:1.2.4]
	at org.opensearch.action.ActionListener$4.onFailure(ActionListener.java:188) ~[opensearch-1.2.4.jar:1.2.4]
	at org.opensearch.action.support.GroupedActionListener.onFailure(GroupedActionListener.java:96) ~[opensearch-1.2.4.jar:1.2.4]
	at org.opensearch.repositories.blobstore.BlobStoreRepository.lambda$fileQueueListener$69(BlobStoreRepository.java:2718) ~[opensearch-1.2.4.jar:1.2.4]
	at org.opensearch.action.ActionListener$2.onFailure(ActionListener.java:109) ~[opensearch-1.2.4.jar:1.2.4]
	at org.opensearch.action.ActionRunnable.onFailure(ActionRunnable.java:101) ~[opensearch-1.2.4.jar:1.2.4]
	at org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.onFailure(ThreadContext.java:780) ~[opensearch-1.2.4.jar:1.2.4]
	at org.opensearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:52) ~[opensearch-1.2.4.jar:1.2.4]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1130) [?:?]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:630) [?:?]
	at java.lang.Thread.run(Thread.java:832) [?:?]
Caused by: org.opensearch.index.shard.IndexShardRecoveryException: failed recovery
	... 17 more
Caused by: org.opensearch.index.snapshots.IndexShardRestoreFailedException: restore failed
	... 15 more
Caused by: org.opensearch.index.snapshots.IndexShardRestoreFailedException: failed to restore snapshot [prod-curator-20220609/trKkuPZqRUaRy9q5va08Ow]
	... 13 more
Caused by: org.apache.lucene.index.CorruptIndexException: verification failed (hardware problem?) : expected=giidc5 actual=giidc5 footer=null writtenLength=3300529277 expectedLength=3300529276 (resource=name [_16gz.fdt], length [3300529276], checksum [giidc5], writtenBy [8.10.1]) (resource=VerifyingIndexOutput(_16gz.fdt))
	at org.opensearch.index.store.Store$LuceneVerifyingIndexOutput.verify(Store.java:1260) ~[opensearch-1.2.4.jar:1.2.4]
	at org.opensearch.index.store.Store$LuceneVerifyingIndexOutput.writeByte(Store.java:1278) ~[opensearch-1.2.4.jar:1.2.4]
	at org.opensearch.index.store.Store$LuceneVerifyingIndexOutput.writeBytes(Store.java:1305) ~[opensearch-1.2.4.jar:1.2.4]
	at org.opensearch.repositories.blobstore.BlobStoreRepository$9.restoreFile(BlobStoreRepository.java:2677) ~[opensearch-1.2.4.jar:1.2.4]
	at org.opensearch.repositories.blobstore.BlobStoreRepository$9.lambda$executeOneFileRestore$1(BlobStoreRepository.java:2641) ~[opensearch-1.2.4.jar:1.2.4]
	at org.opensearch.action.ActionRunnable$2.doRun(ActionRunnable.java:86) ~[opensearch-1.2.4.jar:1.2.4]
	at org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:792) ~[opensearch-1.2.4.jar:1.2.4]
	at org.opensearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:50) ~[opensearch-1.2.4.jar:1.2.4]
	... 3 more
[2022-11-22T13:50:26,008][INFO ][o.o.c.r.a.AllocationService] [opensearch-cluster-master-0] Cluster health status changed from [YELLOW] to [RED] (reason: [shards failed [[backend-api-20220608][1]]]).
[2022-11-22T13:53:16,032][INFO ][o.o.j.s.JobSweeper       ] [opensearch-cluster-master-0] Running full sweep

The error seems to be here: writtenLength=3300529277 expectedLength=3300529276. The restored file is one byte longer than the length recorded in the snapshot metadata, while the checksum strings shown (expected=giidc5 actual=giidc5) are identical.

What could cause this difference? Is it possible to restore the index without the corrupted data? Is our data lost, or is there a way to recover it?
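
One way to narrow this down is to compare the sizes of the blobs actually stored in the Azure container against the per-file lengths recorded in the shard-level snapshot metadata. A rough sketch using the azure-storage-blob SDK; the connection string, container name, and path prefix are placeholders (the directory id under indices/ is a repository-internal id that has to be looked up in the repository metadata, it is not the cluster index UUID):

from azure.storage.blob import BlobServiceClient

# Length recorded in the snapshot metadata for _16gz.fdt, per the exception.
EXPECTED_LENGTH = 3300529276

client = BlobServiceClient.from_connection_string("<connection-string>")
container = client.get_container_client("<container-name>")

# Shard data lives under indices/<repository index id>/<shard id>/ in a
# blob-store repository; the prefix below is illustrative only.
prefix = "indices/<repository-index-id>/1/"
for blob in container.list_blobs(name_starts_with=prefix):
    # Large Lucene files are split across several part blobs, so the sizes
    # have to be matched by hand against the lengths listed in the shard's
    # snap-*.dat metadata; EXPECTED_LENGTH is the total to look for.
    print(blob.name, blob.size)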

@PaulLesur added the bug (Something isn't working) and untriaged labels on Nov 22, 2022
@andrross changed the title from "[BUG] Snapshot restauration fails due to CorruptIndexException" to "[BUG] Snapshot restoration fails due to CorruptIndexException" on Nov 22, 2022
@PaulLesur (Author)

Hello! Do you think the data in the snapshots is recoverable, or is it lost?

@PaulLesur (Author)

Is there a way to recover the data stored in the snapshot?

@PaulLesur (Author)

🆙

@dblock (Member) commented Mar 3, 2023

Sorry it took so long to reply here. It does look like a data corruption problem, but let's not exclude bugs. I think it needs narrowing down. I would try a few things:

  1. Does a retry fix the restore? (This would point to a networking issue.)
  2. Attempt (twice) to restore immediately after taking a snapshot, to see whether the corruption happens at snapshot time or during restore (a scripted version of this check is sketched below).
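
A sketch of suggestion 2, scripted: take a fresh snapshot of the affected index and immediately attempt two restores into renamed copies. The host, credentials, and snapshot name are placeholders; only the repository and index names come from the report.

import requests

HOST, AUTH = "https://localhost:9200", ("admin", "admin")  # placeholders
REPO, INDEX = "prod-snapshots-os", "backend-api-20220608"
SNAP = "debug-corruption-check"                            # placeholder name

# Snapshot the index right now and wait for completion.
requests.put(
    f"{HOST}/_snapshot/{REPO}/{SNAP}?wait_for_completion=true",
    json={"indices": INDEX},
    auth=AUTH,
    verify=False,
)

# Restore it twice, each time into a renamed copy, so the original index is
# left untouched and the two attempts can be compared.
for attempt in (1, 2):
    r = requests.post(
        f"{HOST}/_snapshot/{REPO}/{SNAP}/_restore?wait_for_completion=true",
        json={
            "indices": INDEX,
            "rename_pattern": "(.+)",
            "rename_replacement": f"restored-{attempt}-$1",
        },
        auth=AUTH,
        verify=False,
    )
    print(f"attempt {attempt}:", r.status_code, r.json())

If both restores of a just-taken snapshot fail with the same CorruptIndexException, the data was corrupted before or during the snapshot; if they succeed, the problem is on the restore path.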

@PaulLesur (Author)

Hello @dblock! Sorry for the delay.

  1. No, we retried several times but the same issue occurred again and again.
  2. We have multiple snapshots containing the same indices; when we try to restore a corrupted index, it is corrupted in all of them.

@dblock (Member) commented Apr 17, 2023

So the issue is that these snapshots are created already corrupted. I would keep narrowing down what causes this, but I'm not sure what to suggest next; maybe @andrross will have an idea. Would upgrading to the latest 1.x, or even 2.x, to see whether the problem persists be an option?

@AmiStrn (Contributor) commented May 10, 2023

Checksums are verified during allocation and during snapshot restore, so the corruption could already be present on disk; after the next snapshot succeeds, that snapshot is then corrupted as well. We had a case where an old snapshot was still recoverable but all of the newer ones were corrupted, and we could not replicate the issue either. Additionally, this only happened on our Azure deployments (also using Azure Blob Storage).

We have yet to figure this one out, as it is rare and hard to replicate. We tried the doc below, but it turned out not to be the same issue; maybe it can still help with yours, @PaulLesur:
https://www.elastic.co/blog/canonical-elastic-and-google-team-up-to-prevent-data-corruption-in-linux
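
As one possible way to act on the checksum point above (a sketch, not something verified in this thread): Lucene checksum verification can be forced at shard startup with the static index.shard.check_on_startup setting, so on-disk corruption would surface before the next snapshot runs. The host and credentials are placeholders, and the check makes reopening large shards noticeably slower.

import requests

HOST, AUTH = "https://localhost:9200", ("admin", "admin")  # placeholders
INDEX = "backend-api-20220608"

# The setting is static, so close the index, change it, then reopen.
requests.post(f"{HOST}/{INDEX}/_close", auth=AUTH, verify=False)
requests.put(
    f"{HOST}/{INDEX}/_settings",
    json={"index.shard.check_on_startup": "checksum"},
    auth=AUTH,
    verify=False,
)
# Reopening triggers the checksum pass; failures show up as shard recovery errors.
requests.post(f"{HOST}/{INDEX}/_open", auth=AUTH, verify=False)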
