
Adding search node to the cluster breaks snapshot repository #13024

Open
rlevytskyi opened this issue Apr 2, 2024 · 8 comments
Labels
bug Something isn't working Other Storage:Snapshots

Comments

rlevytskyi commented Apr 2, 2024

Describe the bug

We have a cluster of ten nodes that has been running for a long time.
We use an SMB share to store our snapshot repository. For many years it has received daily snapshots, and some of them have been restored successfully.
After testing the searchable snapshot feature on a test installation, we decided to roll it out to our production system.
However, it turned out that adding a node.roles: [search] node makes the snapshot repository inoperable. Here is the output:
```
% curl logs:9200/_snapshot/searchable/_verify -XPOST | jq .
{
  "error": {
    "root_cause": [
      {
        "type": "repository_verification_exception",
        "reason": "[searchable] cannot delete test data at "
      }
    ],
    "type": "repository_verification_exception",
    "reason": "[searchable] cannot delete test data at ",
    "caused_by": {
      "type": "directory_not_empty_exception",
      "reason": "/usr/share/opensearch/searchable/tests-WuWXrwFrTd-BiVX9VfEkTw"
    }
  },
  "status": 500
}
```
At the same time, on any node, including the new search node:

```
% ssh <any_node> sudo docker exec <container_name> 'ls -l /usr/share/opensearch/searchable/tests-WuWXrwFrTd-BiVX9VfEkTw'
total 0
```

Just by switching the new search node off:

```
% curl logs:9200/_snapshot/searchable/_verify -XPOST | jq .
{ "nodes": { <all 8 data/master nodes here> } }
```

That is, the directory is definitely empty.

Related component

Other

To Reproduce

Just add a node.roles: [ search ] node to a cluster of existing '[data]', '[master]', and '[ ]' nodes and observe the weird verification error.

Expected behavior

Adding a search node should not affect the existing cluster.

Additional Details

Plugins
Security with Keycloak SAML

Host/Environment (please complete the following information):

  • OS: Oracle Enterprise Linux 8.9

Additional context
Master node log excerpt in the next comment.

@rlevytskyi rlevytskyi added bug Something isn't working untriaged labels Apr 2, 2024
@github-actions github-actions bot added the Other label Apr 2, 2024
rlevytskyi (Author)

Here is the related master node log excerpt:
```
[2024-04-02T17:12:40,341][WARN ][o.o.r.RepositoriesService] [v4-master.company.com] [searchable] failed to finish repository verification
org.opensearch.repositories.RepositoryVerificationException: [searchable] cannot delete test data at 
    at org.opensearch.repositories.blobstore.BlobStoreRepository.endVerification(BlobStoreRepository.java:1986) ~[opensearch-2.12.0.jar:2.12.0]
    at org.opensearch.repositories.RepositoriesService$3.lambda$doRun$1(RepositoriesService.java:383) [opensearch-2.12.0.jar:2.12.0]
    at org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:854) [opensearch-2.12.0.jar:2.12.0]
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) [?:?]
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) [?:?]
    at java.base/java.lang.Thread.run(Thread.java:840) [?:?]
Caused by: java.nio.file.DirectoryNotEmptyException: /usr/share/opensearch/searchable/tests-evLsfSTJQAOl5JGzamyFwQ
    at java.base/sun.nio.fs.UnixFileSystemProvider.implDelete(UnixFileSystemProvider.java:246) ~[?:?]
    at java.base/sun.nio.fs.AbstractFileSystemProvider.delete(AbstractFileSystemProvider.java:105) ~[?:?]
    at java.base/java.nio.file.Files.delete(Files.java:1152) ~[?:?]
    at org.opensearch.common.blobstore.fs.FsBlobContainer$1.postVisitDirectory(FsBlobContainer.java:141) ~[opensearch-2.12.0.jar:2.12.0]
    at org.opensearch.common.blobstore.fs.FsBlobContainer$1.postVisitDirectory(FsBlobContainer.java:137) ~[opensearch-2.12.0.jar:2.12.0]
    at java.base/java.nio.file.Files.walkFileTree(Files.java:2828) ~[?:?]
    at java.base/java.nio.file.Files.walkFileTree(Files.java:2882) ~[?:?]
    at org.opensearch.common.blobstore.fs.FsBlobContainer.delete(FsBlobContainer.java:137) ~[opensearch-2.12.0.jar:2.12.0]
    at org.opensearch.repositories.blobstore.BlobStoreRepository.endVerification(BlobStoreRepository.java:1984) ~[opensearch-2.12.0.jar:2.12.0]
    ... 5 more
```

rlevytskyi (Author) commented Apr 3, 2024

To collect more information, I set logger.level: DEBUG on all four master-eligible nodes and restarted them.
And now my repository verified successfully...

UPD: a few minutes later, it failed to verify again.

UPD2: it seems that verification succeeds only intermittently, roughly 5 to 10 percent of the time.

peternied (Member)

[Triage - attendees 1 2 3 4 5 6 7 8]
@rlevytskyi Thanks for filing this issue

rlevytskyi (Author)

I wrote a simple Python program to check the contents of the snapshot directory every 0.01 seconds.
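
The script itself is not attached here, but a minimal sketch of such a poller (simplified, not the exact program; the repository path is taken from the error messages above) would be:

```python
#!/usr/bin/env python3
# Minimal sketch of a directory poller (illustration only, not the original script).
import os
import time

# Repository mount point, taken from the error messages above.
REPO_DIR = "/usr/share/opensearch/searchable"

while True:
    # Verification scratch directories are named tests-<random id>.
    for name in os.listdir(REPO_DIR):
        if not name.startswith("tests-"):
            continue
        path = os.path.join(REPO_DIR, name)
        try:
            entries = os.listdir(path)
        except (FileNotFoundError, NotADirectoryError):
            continue  # directory was removed between the two listings
        print(f"{time.time():.3f} {name}: {entries}")
    time.sleep(0.01)
```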

Here is its output when the search node is stopped and the repo verifies successfully:
dir-ok.txt

Here is its output when the search node is connected and this error is returned:

```
"type": "directory_not_empty_exception",
"reason": "/usr/share/opensearch/searchable/tests-s4d0kZgWQL6dXHjTnIcMiw"
```

dir-fail.txt

That is, two files remain there for an extra couple of seconds.

rlevytskyi (Author)

In case anybody is interested, I managed to get around this issue.

Our cluster is running Oracle Linux Server 8.9 with the ol8_UEKR6 kernel (5.4.17).
The new node is running Oracle Linux Server 8.9 with the ol8_UEKR7 kernel (5.15.0).

We upgraded all our nodes to 5.15 and backups broke (repository verification failed).

Then I restarted all nodes on the 5.4 kernel and found repository verification working properly again.

Then I downgraded the new node's kernel to 5.4 and added it to the cluster, and it works properly now.

dblock (Member) commented Apr 19, 2024

Thanks for narrowing it down! Still, what is the root cause of the verification failure?

rlevytskyi (Author)

I was unable to identify the exact cause.
What I have noticed is that with the 5.4 kernel there is some time discrepancy (sometimes up to 2 seconds) between the system clock and the hardware clock of the VMs (per timedatectl output).
With the 5.15 kernel, the system clock and the hardware clock show the same time.
However, with the 5.4 kernel the snapshot simply works as it is supposed to, and we are now successfully rolling out the searchable snapshot feature.

ksyblast commented Oct 30, 2024

Hello team,

We are working together with @rlevytskyi and have tried to identify the root cause of the failure, unfortunately without any meaningful results.

Our lab:

  • snapshot repository backed by SMB storage on TrueNAS
  • searchable snapshot node running kernel 5.15
  • other cluster nodes running kernel 5.4

We have identified that the search node can successfully create the test files in the repository. Afterwards, the files are supposed to be deleted by the master node, which verifies that the master can delete what the search node created. However, with kernel 5.15 the master node apparently cannot delete the files created by the search node, and verification fails. Rolling back to 5.4 fixes the issue.

It is probably some synchronization problem between the two nodes, where the master node tries to delete the files too early, or the files are still "locked" by the search node; see the sketch below.
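
Judging by the stack trace earlier in this issue, the master removes the test directory via a bottom-up file-tree walk: files are deleted first, and the directory itself is deleted in postVisitDirectory once it should be empty. A rough Python analogue of that sequence (an illustration only, not the actual OpenSearch code) shows where a stale SMB directory listing, or a file still visible to the master, would make the final step fail:

```python
import os

def delete_tree(path: str) -> None:
    # Bottom-up walk: delete files first, then each directory once its
    # contents are gone (the equivalent of postVisitDirectory in the trace).
    for root, dirs, files in os.walk(path, topdown=False):
        for name in files:
            os.remove(os.path.join(root, name))
        # If the SMB client still shows an entry here (stale cache, or a
        # file that appeared after os.walk listed the directory), this
        # raises OSError with ENOTEMPTY, the same condition Java surfaces
        # as DirectoryNotEmptyException.
        os.rmdir(root)
```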

Yet it is still unclear how this can be related to the kernel version. We tried playing with SMB options (disabled caching, tried different protocol versions) with the same result. In addition, the files can be successfully created and deleted on the share by any node; the problem appears only during the repository verification procedure.

We have also tried booting all the nodes into kernel 5.15, but that also resulted in the error.

Since we are not Java developers, it is not easy for us to analyze the code and understand the repository verification algorithm well enough to find the root cause. We would be grateful for any tips on where to dig deeper.

Thanks!
