Conversation

@chatman (Contributor) commented Jul 3, 2025

Fix CuVSResources usage for multithreaded queries

* Added resources as a constructor argument for Builder classes in CagraIndex, HNSWIndex, TieredIndex and BruteForceIndex
* Added a multithreaded CAGRA stability test (which fails with improper resources usage). This test uses CheckedCuVSResources.

copy-pr-bot bot commented Jul 3, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@chatman (Contributor, Author) commented Jul 4, 2025

Edit: deleted the comment, since I was earlier assuming this was a benign issue with just error reporting, but it turns out to be a real issue.

@chatman chatman changed the title from "[Java] Multithreaded querying via different resources fails with ugly message" to "[Java][Fix] Multithreaded querying fails without synchronization" on Jul 7, 2025
@chatman (Contributor, Author) commented Jul 7, 2025

While adding the synchronized here makes it work, it just proves that there are some underlying concurrency issues that need solving. I don't believe this is the final fix.

@ldematte (Contributor) commented Jul 7, 2025

Hey @chatman, I observed a similar issue in benchmarks; my gut feeling is that this is not a synchronization issue, but rather a memory exhaustion problem that becomes apparent when you are "hammering" the API with enough calls. It may be that the synchronization is simply slowing things down enough to let the resources be freed and make the test pass.

In my case, this was due to the fact that datasets were freed "too late"; now the arena to which they are bound is freed much earlier, and I don't have the same problem anymore.
I was benchmarking the index side, but I think the search side is suffering from the same issues.
Can you try #1069 (without synchronized) and see if you still have the same issue?

@ldematte (Contributor) commented Jul 7, 2025

To give you some context: #1069 gets rid of the "global allocations" on the CuVSResources arena.
It may still not be enough; if I read the message correctly, it is cuvsRMMAlloc that is failing, right?
We pair it correctly with cuvsRMMFree already, but I guess if you run enough searches with enough concurrency, you can exhaust device memory.
To see if the issue is memory exhaustion vs synchronization, you can wrap cuvsRMMAlloc/cuvsRMMFree in some function that saves (in a static) the number of bytes currently allocated, and see what this number is when your tests blow up. If the number is far from your device free memory, then this is really a synchronization issue. If it's near your device capacity, it is likely you are "just" using too much memory and you need to throttle your searches.

Ummm actually that might not be enough, as it will not account for memory allocated internally by cuvsCagraSearch :/
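
(For illustration, a minimal sketch of the byte-tracking idea described above -- the alloc/free hooks below are placeholders standing in for the real native cuvsRMMAlloc/cuvsRMMFree calls, which are not shown here:)

import java.util.concurrent.atomic.AtomicLong;
import java.util.function.LongUnaryOperator;

final class RmmAllocationTracker {
  private static final AtomicLong bytesInUse = new AtomicLong();

  // Wraps the real allocation call (placeholder hook) and records the size.
  static long alloc(long bytes, LongUnaryOperator cuvsRmmAlloc) {
    long ptr = cuvsRmmAlloc.applyAsLong(bytes);
    bytesInUse.addAndGet(bytes);
    return ptr;
  }

  // Wraps the real free call (placeholder hook) and subtracts the size.
  static void free(long bytes, Runnable cuvsRmmFree) {
    cuvsRmmFree.run();
    bytesInUse.addAndGet(-bytes);
  }

  // Log this when a test blows up: far below free device memory suggests a
  // synchronization bug; near device capacity suggests genuine memory exhaustion.
  static long currentBytes() {
    return bytesInUse.get();
  }
}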

@chatman (Contributor, Author) commented Jul 7, 2025

Can you try #1069 (without synchronized) and see if you still have the same issue?

Sure, trying!

@ldematte (Contributor) commented Jul 7, 2025

You can try to keep an eye on nvidia-smi as your tests run; not ideal, but it should give you a feel for whether you are reaching your GPU memory limits.

@chatman (Contributor, Author) commented Jul 7, 2025

my gut feeling is this is not a synchronization issue, but rather a memory exhaustion problem that becomes apparent when you are "hammering" the API with enough calls

FYI, in my testing (test added in this PR), even with numThreads=2, the search() fails without the synchronized.

@chatman (Contributor, Author) commented Jul 7, 2025

You can try and keep an eye via nvidia-smi as your tests run; not ideal, but should give you a feel if you are reaching your GPU memory limits

I have a 48GB GPU, the index I was trying has barely 25k vectors, and there were just a few search queries. I'm pretty sure the RMM out_of_memory is misreported.

int vectorDimension = numQueries > 0 ? query.getQueryVectors()[0].length : 0;
Arena arena = resources.getArena();
// Use the shared arena from resources instead of creating a confined arena
// to avoid thread safety issues when multiple threads perform searches
Review comment (Contributor):

I don't think this is true; if this is a memory issue, using the shared arena will make things worse, as it ties the allocated memory to the lifetime of the resource (allocations will be freed "too late").

Also, a local confined arena is inherently thread safe, as it is "private" to the thread stack calling this function.

(And also, the localArena is allocated here, but never used -- we use the Arena from Arena arena = resources.getArena(); a few lines below, so effectively the code is the same -- BTW I think this is a mistake, and I'm fixing it in #1069)


To be clear, I'm not saying there aren't threading issues here, just saying that the localArena is not the culprit.
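
(For reference, a confined arena is plain java.lang.foreign usage and looks like the sketch below; the buffer handling is illustrative and the native search call is a placeholder:)

import java.lang.foreign.Arena;
import java.lang.foreign.MemorySegment;
import java.lang.foreign.ValueLayout;

final class ConfinedArenaSketch {
  // A confined arena is owned by the creating thread: only that thread may allocate
  // from it or close it, so a per-call confined arena cannot race with other threads,
  // and its memory is freed when the call returns rather than when the resources close.
  static void searchWithLocalArena(float[] queryVector) {
    try (Arena localArena = Arena.ofConfined()) {
      MemorySegment querySeg = localArena.allocate(ValueLayout.JAVA_FLOAT, queryVector.length);
      for (int i = 0; i < queryVector.length; i++) {
        querySeg.setAtIndex(ValueLayout.JAVA_FLOAT, i, queryVector[i]);
      }
      // ... pass querySeg to the native search call (placeholder) ...
    } // segment freed deterministically here
  }
}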

@ldematte (Contributor) commented Jul 7, 2025

I have a 48GB GPU, and the index I was trying has barely 25k vectors, and just a few search queries. I'm pretty sure it is a misreporting regarding RMM out_of_memory.

Yes, with these numbers it is very likely not a memory exhaustion issue. It could be a memory corruption issue, which might be caused by synchronization problems.

While adding the synchronized here makes it work, it just proves that there are some underlying concurrency issues that need solving. I don't believe this is the final fix.

I agree; we need to understand what is breaking and fix it.

@chatman (Contributor, Author) commented Jul 7, 2025

Can you try #1069 (without synchronized) and see if you still have the same issue?

Sure, trying!

The test I added here passes in your #1069 branch without any changes. That's a good sign :-) Looking deeper.

Oh no, I was mistaken:

[INFO] Running com.nvidia.cuvs.CagraMultiThreadStabilityIT
[TEST-CagraMultiThreadStabilityIT.testSynchronizedSearchRequirement-seed#[9A6FE6A6989BED73]] INFO com.nvidia.cuvs.CuVSTestCase - Test seed: 9A6FE6A6989BED73
[TEST-CagraMultiThreadStabilityIT.testSynchronizedSearchRequirement-seed#[9A6FE6A6989BED73]] INFO com.nvidia.cuvs.CagraMultiThreadStabilityIT - Multi-threaded stability test initialized
[TEST-CagraMultiThreadStabilityIT.testSynchronizedSearchRequirement-seed#[9A6FE6A6989BED73]] INFO com.nvidia.cuvs.CagraMultiThreadStabilityIT - Starting synchronized search requirement test:
[TEST-CagraMultiThreadStabilityIT.testSynchronizedSearchRequirement-seed#[9A6FE6A6989BED73]] INFO com.nvidia.cuvs.CagraMultiThreadStabilityIT -   Dataset: 10000x256
[TEST-CagraMultiThreadStabilityIT.testSynchronizedSearchRequirement-seed#[9A6FE6A6989BED73]] INFO com.nvidia.cuvs.CagraMultiThreadStabilityIT -   Threads: 16, Queries per thread: 50
[TEST-CagraMultiThreadStabilityIT.testSynchronizedSearchRequirement-seed#[9A6FE6A6989BED73]] INFO com.nvidia.cuvs.CagraMultiThreadStabilityIT -   This test demonstrates why synchronized keyword is required
[TEST-CagraMultiThreadStabilityIT.testSynchronizedSearchRequirement-seed#[9A6FE6A6989BED73]] INFO com.nvidia.cuvs.CagraMultiThreadStabilityIT - Creating CAGRA index for synchronization test...
[695196][18:48:35:165747][info  ] optimizing graph
[695196][18:48:35:207521][info  ] Graph optimized, creating index
Aborted (core dumped)
[INFO] 
[INFO] Results:
[INFO] 
[INFO] Tests run: 0, Failures: 0, Errors: 0, Skipped: 0
[INFO] 
[WARNING] Corrupted channel by directly writing to native stream in forked JVM 1. See FAQ web page and the dump file /home/ishan/code/cuvs/java/cuvs-java/target/failsafe-reports/2025-07-07T18-48-33_996-jvmRun1.dumpstream
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time:  12.621 s
[INFO] Finished at: 2025-07-07T18:48:36+05:30
[INFO] ------------------------------------------------------------------------

There's a JVM crash, but still there's BUILD SUCCESS (which fooled me).


float[][] dataset = generateRandomDataset(dataSize, dimensions);

try (CuVSResources resources = CuVSResources.create()) {
Review comment (Contributor):

Ah yes, now I can see it.
I don't think this will work, not with the current cuvs implementation. We were discussing this internally, and I think we also got some (preliminary) confirmation from the cuvs developers: you will need one resource per thread. Or, in other words, you can access a resource from different threads, but you cannot access it concurrently.
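
(For illustration, a minimal sketch of the one-resource-per-thread pattern; it relies only on the CuVSResources.create() factory shown in the snippet above, while the executor scaffolding and the task body are placeholders:)

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

import com.nvidia.cuvs.CuVSResources;

final class PerThreadResourcesSketch {
  // Each worker thread creates and closes its own CuVSResources; the index can be
  // shared, but a resources object is never used by two threads at the same time.
  static void runSearches(int numThreads, Runnable buildAndRunQueries) throws Exception {
    ExecutorService pool = Executors.newFixedThreadPool(numThreads);
    try {
      List<Future<?>> futures = new ArrayList<>();
      for (int i = 0; i < numThreads; i++) {
        futures.add(pool.submit(() -> {
          try (CuVSResources resources = CuVSResources.create()) {
            buildAndRunQueries.run(); // placeholder: build queries with `resources` and search
          } catch (Throwable t) {
            throw new RuntimeException(t);
          }
        }));
      }
      for (Future<?> f : futures) {
        f.get(); // propagate any failure from a worker
      }
    } finally {
      pool.shutdown();
    }
  }
}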

@ldematte (Contributor) commented Jul 7, 2025

There's a JVM crash, but still there's BUILD SUCCESS (which fooled me).

Yes, the JUnit test runner will do that; you need to explicitly handle failures (and sometimes you cannot, or it's hard to do).

@chatman (Contributor, Author) commented Jul 7, 2025

We were discussing this internally and I think we also got some (preliminary) confirmation from cuvs developers:
you will need one resource per-thread. Or, in other words, you can access a resource via different threads, but you cannot access them concurrently.

FYI, I'm trying to reason this out, and I guess that the search() calls need to be executed in separate CUDA streams in order to succeed (one stream per call). There's cudaStreamCreate or something like that, and I'm wondering if that is needed / whether that will help. Currently, the same CUDA stream is being shared across multiple threads, and hence failing. Does that make sense @ldematte @benfred @cjnolet?

@cjnolet (Member) commented Jul 7, 2025

Currently, the same cuda stream is being shared across multiple threads, and hence failing.

Are you explicitly creating a stream, or are you using the stream that gets created in the underlying raft::resources by default? If the latter, you are not actually using the same stream across raft::resources. We are very careful to note that raft::resources is not claimed to be thread safe. That being said, we often provide a default option of adding a synchronization after every cuVS function when an explicit resources object has not been provided. It wouldn't help this case, but I'd go as far as to argue that this is not a valid usage of raft::resources, given the documentation. I'm not sure it's up to cuVS to protect against this when it's already documented. We can certainly highlight the documentation better within cuVS itself.

@ldematte (Contributor) commented Jul 8, 2025

I'd go as far as to argue that this is not a valid usage of raft::resources, given the documentation

I think you are right, that is my impression too.

We can certainly highlight the documentation better within cuVS itself.

That would be very helpful, especially for C/Java users where raft::resources is not explicitly exposed (it is hidden in the API implementation).

@ldematte (Contributor) commented Jul 8, 2025

@cjnolet @chatman @ChrisHegarty I drafted a PR that might help: #1089. It makes it possible to check that the threading constraints for raft::resources are respected.

It explicitly models the concept of an access scope for a resource; the Javadoc there explains the threading constraints. It also adds an (optional) decorator to CuVSResources that enforces the one-thread-at-a-time behaviour (I'm using that in tests -- at the moment it is test-only, but it could become a utility in the main library if we want, or just example code that cuvs-java users can borrow).
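
(For illustration, the kind of single-owner check such a checking decorator can delegate to -- this is not the actual #1089 code, and the class name is made up:)

import java.util.concurrent.atomic.AtomicReference;

final class SingleThreadAccessGuard {
  private final AtomicReference<Thread> owner = new AtomicReference<>();

  // Call before touching the wrapped CuVSResources; throws if another thread is using it.
  void enter() {
    Thread me = Thread.currentThread();
    Thread holder = owner.compareAndExchange(null, me);
    if (holder != null && holder != me) {
      throw new IllegalStateException(
          "resources accessed by " + me.getName() + " while in use by " + holder.getName());
    }
  }

  // Call after the operation completes (typically in a finally block).
  void exit() {
    owner.compareAndSet(Thread.currentThread(), null);
  }
}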

@chatman (Contributor, Author) commented Jul 30, 2025

@mythrocks This is now ready for your review. @narangvivek10 and I have tested this extensively using the provided CagraMultithreadedStabilityIT, as well as tested using Lucene.

While earlier (without separate resources per thread) we were encountering lots of "out_of_memory" errors or crashes, that is no longer the case in our testing.

*
* @param resources the CuVSResources instance to use for this query (must not be shared between threads)
*/
public Builder(CuVSResources resources) {
Review comment (Contributor):

I was about to ask why this seems to break the fluent-interface semantics of the Builder.

It looks like the intent is to require specifying the CuVSResources when building a query. Yep, that makes sense to do. 👍
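
(For illustration, a sketch of how the resources-aware Builder could be used; CagraQuery, the fluent setters, and index.search() are assumed from the existing cuvs-java examples and may not match the final API exactly:)

import com.nvidia.cuvs.CagraIndex;
import com.nvidia.cuvs.CagraQuery;
import com.nvidia.cuvs.CuVSResources;
import com.nvidia.cuvs.SearchResults;

final class ResourcesAwareQuerySketch {
  // The calling thread creates its own resources and passes them to the query builder,
  // so the resources are never shared with another thread.
  static SearchResults search(CagraIndex index, float[][] queryVectors) throws Throwable {
    try (CuVSResources resources = CuVSResources.create()) {
      CagraQuery query = new CagraQuery.Builder(resources) // resources supplied up front
          .withQueryVectors(queryVectors)
          .withTopK(10)
          .build();
      return index.search(query);
    }
  }
}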

@mythrocks (Contributor) commented:

/ok to test 6bd62ec


/**
* Builder helps configure and create an instance of BruteForceQuery.
* Builder helps configure and create an instance of HnswQuery.
Review comment (Contributor):


D'oh! 👍

@mythrocks (Contributor) left a comment:

This LGTM. Pending tests.

@mythrocks (Contributor) commented:

Hmm. Looks like there are Java test failures. :/

We could consider pushing this to an early merge to 25.10.

@chatman (Contributor, Author) commented Jul 31, 2025

Error: Errors:
Error: CagraBuildAndSearchIT.testIndexingAndSearchingFlowInDifferentThreads ? ThreadLeak 1 thread leaked from TEST scope at testIndexingAndSearchingFlowInDifferentThreads(com.nvidia.cuvs.CagraBuildAndSearchIT):

  1. Thread[id=2057, name=pool-207-thread-1, state=WAITING, group=TGRP-CagraBuildAndSearchIT]
    at java.base/jdk.internal.misc.Unsafe.park(Native Method)
    at java.base/java.util.concurrent.locks.LockSupport.park(LockSupport.java:221)
    at java.base/java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:754)
    at java.base/java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:990)
    at java.base/java.util.concurrent.locks.ReentrantLock$Sync.lock(ReentrantLock.java:153)
    at java.base/java.util.concurrent.locks.ReentrantLock.lock(ReentrantLock.java:322)
    at java.base/java.util.concurrent.ThreadPoolExecutor.processWorkerExit(ThreadPoolExecutor.java:1002)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1158)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)
    at java.base/java.lang.Thread.run(Thread.java:1570)

This is a very strange failure: the test passed but there's a thread leak. I'm not sure this is related to the change here. I tried running this test separately (with the seed as well as without it), as well as the full test suite, 25 times each, and I'm unable to reproduce this failure.

@mythrocks Can you please trigger the test again to see if it reproduces or just an intermittent failure?

@mythrocks (Contributor) commented:

/ok to test 6bd62ec

@mythrocks (Contributor) commented:

I'm re-running it now. I'll keep an eye out for the result, and see about taking a closer look today.

@mythrocks (Contributor) commented:

The re-run did pass. Looks like a stray failure that, at worst, affects the test, not the functionality.

I'll check for the viability of this being merged to branch-25.08.

@mythrocks (Contributor) commented Jul 31, 2025

I just checked. We are properly in code-freeze now. Let's retarget to 25.10.

We can check if the test failures happen to recur in the new dev branch.

@chatman (Contributor, Author) commented Jul 31, 2025

I just checked. We are properly in code-freeze now. Let's retarget to 25.10.

Ouch, we got locked out, I guess. Thanks for looking into this. I'll re-target this PR for 25.10 now.

@chatman chatman changed the base branch from branch-25.08 to branch-25.10 July 31, 2025 17:25
@mythrocks (Contributor) commented:

Thanks for retargeting, @chatman. Would you mind please merging the latest commits from branch-25.10? I don't have write access to the SearchScale branches. :/

@cjnolet (Member) commented Jul 31, 2025

@chatman can you please merge the current branch-25.10 into your branch here?

@chatman (Contributor, Author) commented Jul 31, 2025

@mythrocks @cjnolet I just hit the update button. Haven't verified that it compiles or passes the tests on this branch.

@cjnolet (Member) commented Jul 31, 2025

/ok to test b0bf5bd

@mythrocks (Contributor) commented:

/merge

@rapids-bot rapids-bot bot merged commit 81e22d3 into rapidsai:branch-25.10 Aug 1, 2025
53 checks passed
@github-project-automation github-project-automation bot moved this from In Progress to Done in Elasticsearch + cuVS Team Aug 1, 2025
rapids-bot bot pushed a commit that referenced this pull request Aug 5, 2025
As a follow-up to #1089 and #1082, this PR introduces a decorator to ensure that access to a shared CuVSResources is synchronized. This could be useful in cases where application logic is complex and it's not easy to efficiently and cleanly guarantee that a CuVSResources is accessed by at most one thread at any given time.

Authors:
  - Lorenzo Dematté (https://github.com/ldematte)
  - MithunR (https://github.com/mythrocks)

Approvers:
  - MithunR (https://github.com/mythrocks)

URL: #1209
lowener pushed a commit to lowener/cuvs that referenced this pull request Aug 11, 2025
…idsai#1082)

Fix CuVSResources usage for multithreaded queries
    
    * Added resources as a constructor argument for Builder classes in CagraIndex, HNSWIndex, TieredIndex and BruteForceIndex
    * Added a multithreaded CAGRA stability test (which fails with improper resources usage). This test uses CheckedCuVSResources.

Authors:
  - Ishan Chattopadhyaya (https://github.com/chatman)

Approvers:
  - MithunR (https://github.com/mythrocks)

URL: rapidsai#1082
lowener pushed a commit to lowener/cuvs that referenced this pull request Aug 11, 2025
)

As a follow-up to rapidsai#1089 and rapidsai#1082, this PR introduces a decorator to ensure that access to a shared CuVSResources is synchronized. This could be useful in cases where application logic is complex and it's not easy to efficiently and cleanly guarantee that a CuVSResources is accessed by at most one thread at any given time.

Authors:
  - Lorenzo Dematté (https://github.com/ldematte)
  - MithunR (https://github.com/mythrocks)

Approvers:
  - MithunR (https://github.com/mythrocks)

URL: rapidsai#1209
enp1s0 pushed a commit to enp1s0/cuvs that referenced this pull request Aug 22, 2025
…idsai#1082)

Fix CuVSResources usage for multithreaded queries
    
    * Added resources as a constructor argument for Builder classes in CagraIndex, HNSWIndex, TieredIndex and BruteForceIndex
    * Added a multithreaded CAGRA stability test (which fails with improper resources usage). This test uses CheckedCuVSResources.

Authors:
  - Ishan Chattopadhyaya (https://github.com/chatman)

Approvers:
  - MithunR (https://github.com/mythrocks)

URL: rapidsai#1082
enp1s0 pushed a commit to enp1s0/cuvs that referenced this pull request Aug 22, 2025
)

As a follow-up to rapidsai#1089 and rapidsai#1082, this PR introduces a decorator to ensure that access to a shared CuVSResources is synchronized. This could be useful in cases where application logic is complex and it's not easy to efficiently and cleanly guarantee that a CuVSResources is accessed by at most one thread at any given time.

Authors:
  - Lorenzo Dematté (https://github.com/ldematte)
  - MithunR (https://github.com/mythrocks)

Approvers:
  - MithunR (https://github.com/mythrocks)

URL: rapidsai#1209
Labels: bug (Something isn't working), non-breaking (Introduces a non-breaking change)
Project status: Done