[BUG] Segrep replication gets stuck in MemorySegmentIndexInput close #16180

Open
Bukhtawar opened this issue Oct 3, 2024 · 4 comments
Labels
bug Something isn't working Storage:Performance

Comments

@Bukhtawar
Collaborator

Bukhtawar commented Oct 3, 2024

Describe the bug

Although foreign memory access is a preview feature in JDK 21, it was enabled in Lucene as part of apache/lucene#12294. This causes MemorySegmentIndexInput#close to get stuck, especially when multiple threads attempt to close at the same time.
As a result, all generic threads get stuck in a similar stack:

"opensearch[64cbf99ff6d808441bd31748af4095c9][generic][T#114]" #421 [26493] daemon prio=5 os_prio=0 cpu=5226080.42ms elapsed=1125628.72s tid=0x0000ffdca8095c60 nid=26493 waiting on condition  [0x0000ffdb5aefd000]
   java.lang.Thread.State: RUNNABLE
        at jdk.internal.misc.ScopedMemoryAccess.closeScope0(java.base@21.0.4/Native Method)
        at jdk.internal.misc.ScopedMemoryAccess.closeScope(java.base@21.0.4/ScopedMemoryAccess.java:87)
        at jdk.internal.foreign.SharedSession.justClose(java.base@21.0.4/SharedSession.java:87)
        at jdk.internal.foreign.MemorySessionImpl.close(java.base@21.0.4/MemorySessionImpl.java:242)
        at jdk.internal.foreign.MemorySessionImpl$1.close(java.base@21.0.4/MemorySessionImpl.java:88)
        at org.apache.lucene.store.MemorySegmentIndexInput.close(MemorySegmentIndexInput.java:494)
        at org.opensearch.index.store.Store$MetadataSnapshot.checksumFromLuceneFile(Store.java:1242)
        at org.opensearch.index.store.Store$MetadataSnapshot.loadMetadata(Store.java:1182)
        at org.opensearch.index.store.Store.getSegmentMetadataMap(Store.java:391)
        at org.opensearch.index.shard.IndexShard.getSegmentMetadataMap(IndexShard.java:1980)
        at org.opensearch.indices.replication.SegmentReplicationTarget.getFiles(SegmentReplicationTarget.java:204)
        at org.opensearch.indices.replication.SegmentReplicationTarget.lambda$startReplication$2(SegmentReplicationTarget.java:182)
        at org.opensearch.indices.replication.SegmentReplicationTarget$$Lambda/0x00000080037bc248.accept(Unknown Source)
        at org.opensearch.core.action.ActionListener$1.onResponse(ActionListener.java:82)
        at org.opensearch.common.util.concurrent.ListenableFuture$1.doRun(ListenableFuture.java:126)
        at org.opensearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:52)
        at org.opensearch.common.util.concurrent.OpenSearchExecutors$DirectExecutorService.execute(OpenSearchExecutors.java:412)
        at org.opensearch.common.util.concurrent.ListenableFuture.notifyListener(ListenableFuture.java:120)
        at org.opensearch.common.util.concurrent.ListenableFuture.addListener(ListenableFuture.java:82)
        at org.opensearch.action.StepListener.whenComplete(StepListener.java:95)
        at org.opensearch.indices.replication.SegmentReplicationTarget.startReplication(SegmentReplicationTarget.java:181)
        at org.opensearch.indices.replication.SegmentReplicationTargetService.start(SegmentReplicationTargetService.java:585)
        at org.opensearch.indices.replication.SegmentReplicationTargetService$ReplicationRunner.doRun(SegmentReplicationTargetService.java:571)
        at org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:950)
        at org.opensearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:52)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(java.base@21.0.4/ThreadPoolExecutor.java:1144)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(java.base@21.0.4/ThreadPoolExecutor.java:642)
        at java.lang.Thread.runWith(java.base@21.0.4/Thread.java:1596)
        at java.lang.Thread.run(java.base@21.0.4/Thread.java:1583)

Related component

Storage:Performance

To Reproduce

  1. Go to '...'
  2. Click on '....'
  3. Scroll down to '....'
  4. See error

Expected behavior

The close should not be impacted, and threads should be released immediately.

Additional Details

Host/Environment (please complete the following information):

  • OS : 2.15
  • JDK: 21

@Bukhtawar
Collaborator Author

Related issue #15902

@Bukhtawar
Collaborator Author

It looks like a workaround here is to disable memory segments by passing the following JVM option:

-Dorg.apache.lucene.store.MMapDirectory.enableMemorySegments=false
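
To confirm that the flag actually reached the JVM, the system property can be read back at startup. This is a minimal sketch, not OpenSearch or Lucene code: the class and method names are hypothetical, and it assumes that an unset property means memory segments stay enabled.

```java
// Hypothetical helper for illustration; not part of Lucene or OpenSearch.
final class MemorySegmentsFlagCheck {
    // Property name taken from the workaround above.
    static final String PROP = "org.apache.lucene.store.MMapDirectory.enableMemorySegments";

    // Assumption: an unset property means memory segments stay enabled.
    static boolean memorySegmentsEnabled(String rawValue) {
        return rawValue == null || Boolean.parseBoolean(rawValue);
    }

    public static void main(String[] args) {
        String raw = System.getProperty(PROP);
        System.out.println(PROP + " = " + raw);
        System.out.println("memory segments enabled (assumed semantics): "
                + memorySegmentsEnabled(raw));
    }
}
```

Running this with the same `-D` option as the node should report the property as `false`; if it prints `null`, the option did not reach the JVM.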

@reta
Collaborator

reta commented Oct 3, 2024

@Bukhtawar could you please check with #15333? The implementation has changed in Apache Lucene regarding the use of the FFM (Foreign Function & Memory) APIs.

@Bukhtawar
Collaborator Author

Thanks @reta, I am aware of that work. This issue is for versions 2.17 and older running on Lucene 9.7+ with JDK 21, which will see the problem; the mitigation there is to disable memory segments.
