Conversation

@chatman (Contributor) commented Jul 3, 2025

Fix CuVSResources usage for multithreaded queries

* Added resources as a constructor argument for Builder classes in CagraIndex, HNSWIndex, TieredIndex and BruteForceIndex
* Added a multithreaded CAGRA stability test (which fails with improper resources usage). This test uses CheckedCuVSResources.

copy-pr-bot bot commented Jul 3, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@chatman (Contributor, Author) commented Jul 4, 2025

Edit: deleted the comment, since I was earlier assuming this was a benign issue with just error reporting, but it turns out to be a real issue.

@chatman chatman changed the title from "[Java] Multithreaded querying via different resources fails with ugly message" to "[Java][Fix] Multithreaded querying fails without synchronization" on Jul 7, 2025
@chatman (Contributor, Author) commented Jul 7, 2025

While adding the synchronized here makes it work, it just proves that there are some underlying concurrency issues that need solving. I don't believe this is the final fix.

@ldematte (Contributor) commented Jul 7, 2025

Hey @chatman, I observed a similar issue in benchmarks; my gut feeling is that this is not a synchronization issue, but rather a memory exhaustion problem that becomes apparent when you are "hammering" the API with enough calls. It may be that the synchronization is simply slowing things down enough to let the resources be freed and make the test pass.

In my case, this was due to the fact that datasets were freed "too late"; now the arena to which they are bound is freed much earlier, and I don't have the same problem anymore.
I was benchmarking the index side, but I think the search side is suffering from the same issues.
Can you try #1069 (without synchronized) and see if you still have the same issue?

@ldematte (Contributor) commented Jul 7, 2025

To give you some context: #1069 gets rid of the "global allocations" on the CuVSResources arena.
It may still not be enough; if I read the message correctly, it is cuvsRMMAlloc that is failing, right?
We pair it correctly with cuvsRMMFree already, but I guess if you run enough searches with enough concurrency, you can exhaust device memory.
To see if the issue is memory exhaustion vs synchronization, you can wrap cuvsRMMAlloc/cuvsRMMFree in some function that saves (in a static) the number of bytes currently allocated, and see what this number is when your tests blow up. If the number is far from your device free memory, then this is really a synchronization issue. If it's near your device capacity, it is likely you are "just" using too much memory and you need to throttle your searches.

Ummm actually that might not be enough, as it will not account for memory allocated internally by cuvsCagraSearch :/
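
(For illustration, a minimal sketch of the byte-tracking idea described above -- the alloc/free hooks below are placeholders standing in for the real native cuvsRMMAlloc/cuvsRMMFree calls, which are not shown here:)

import java.util.concurrent.atomic.AtomicLong;
import java.util.function.LongUnaryOperator;

final class RmmAllocationTracker {
  private static final AtomicLong bytesInUse = new AtomicLong();

  // Wraps the real allocation call (placeholder hook) and records the size.
  static long alloc(long bytes, LongUnaryOperator cuvsRmmAlloc) {
    long ptr = cuvsRmmAlloc.applyAsLong(bytes);
    bytesInUse.addAndGet(bytes);
    return ptr;
  }

  // Wraps the real free call (placeholder hook) and subtracts the size.
  static void free(long bytes, Runnable cuvsRmmFree) {
    cuvsRmmFree.run();
    bytesInUse.addAndGet(-bytes);
  }

  // Log this when a test blows up: far below free device memory suggests a
  // synchronization bug; near device capacity suggests genuine memory exhaustion.
  static long currentBytes() {
    return bytesInUse.get();
  }
}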

@chatman (Contributor, Author) commented Jul 7, 2025

Can you try #1069 (without synchronized) and see if you still have the same issue?

Sure, trying!

@ldematte (Contributor) commented Jul 7, 2025

You can try to keep an eye on nvidia-smi as your tests run; not ideal, but it should give you a feel for whether you are reaching your GPU memory limits.

@chatman (Contributor, Author) commented Jul 7, 2025

my gut feeling is this is not a synchronization issue, but rather a memory exhaustion problem that becomes apparent when you are "hammering" the API with enough calls

FYI, in my testing (test added in this PR), even with numThreads=2, the search() fails without the synchronized.

@chatman (Contributor, Author) commented Jul 7, 2025

You can try and keep an eye via nvidia-smi as your tests run; not ideal, but should give you a feel if you are reaching your GPU memory limits

I have a 48GB GPU, the index I was trying has barely 25k vectors, and there were just a few search queries. I'm pretty sure the RMM out_of_memory is misreported.

int vectorDimension = numQueries > 0 ? query.getQueryVectors()[0].length : 0;
Arena arena = resources.getArena();
// Use the shared arena from resources instead of creating a confined arena
// to avoid thread safety issues when multiple threads perform searches
Review comment (Contributor):

I don't think this is true; if this is a memory issue, using the shared arena will make things worse, as it ties the allocated memory to the lifetime of the resource (allocations will be freed "too late").

Also, a local confined arena is inherently thread safe, as it is "private" to the thread stack calling this function.

(And also, the localArena is allocated here, but never used -- we use the Arena from Arena arena = resources.getArena(); a few lines below, so effectively the code is the same -- BTW I think this is a mistake, and I'm fixing it in #1069)


To be clear, I'm not saying there aren't threading issues here, just saying that the localArena is not the culprit.
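
(For reference, a confined arena is plain java.lang.foreign usage and looks like the sketch below; the buffer handling is illustrative and the native search call is a placeholder:)

import java.lang.foreign.Arena;
import java.lang.foreign.MemorySegment;
import java.lang.foreign.ValueLayout;

final class ConfinedArenaSketch {
  // A confined arena is owned by the creating thread: only that thread may allocate
  // from it or close it, so a per-call confined arena cannot race with other threads,
  // and its memory is freed when the call returns rather than when the resources close.
  static void searchWithLocalArena(float[] queryVector) {
    try (Arena localArena = Arena.ofConfined()) {
      MemorySegment querySeg = localArena.allocate(ValueLayout.JAVA_FLOAT, queryVector.length);
      for (int i = 0; i < queryVector.length; i++) {
        querySeg.setAtIndex(ValueLayout.JAVA_FLOAT, i, queryVector[i]);
      }
      // ... pass querySeg to the native search call (placeholder) ...
    } // segment freed deterministically here
  }
}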

@ldematte (Contributor) commented Jul 7, 2025

I have a 48GB GPU, and the index I was trying has barely 25k vectors, and just a few search queries. I'm pretty sure it is a misreporting regarding RMM out_of_memory.

Yes, with these numbers it is very likely not a memory exhaustion issue. It could be a memory corruption issue, which might be caused by synchronization problems.

While adding the synchronized here makes it work, it just proves that there are some underlying concurrency issues that need solving. I don't believe this is the final fix.

I agree; we need to understand what is breaking and fix it.

@chatman (Contributor, Author) commented Jul 7, 2025

Can you try #1069 (without synchronized) and see if you still have the same issue?

Sure, trying!

The test I added here passes in your #1069 branch without any changes. That's a good sign :-) Looking deeper.

Oh no, I was mistaken:

[INFO] Running com.nvidia.cuvs.CagraMultiThreadStabilityIT
[TEST-CagraMultiThreadStabilityIT.testSynchronizedSearchRequirement-seed#[9A6FE6A6989BED73]] INFO com.nvidia.cuvs.CuVSTestCase - Test seed: 9A6FE6A6989BED73
[TEST-CagraMultiThreadStabilityIT.testSynchronizedSearchRequirement-seed#[9A6FE6A6989BED73]] INFO com.nvidia.cuvs.CagraMultiThreadStabilityIT - Multi-threaded stability test initialized
[TEST-CagraMultiThreadStabilityIT.testSynchronizedSearchRequirement-seed#[9A6FE6A6989BED73]] INFO com.nvidia.cuvs.CagraMultiThreadStabilityIT - Starting synchronized search requirement test:
[TEST-CagraMultiThreadStabilityIT.testSynchronizedSearchRequirement-seed#[9A6FE6A6989BED73]] INFO com.nvidia.cuvs.CagraMultiThreadStabilityIT -   Dataset: 10000x256
[TEST-CagraMultiThreadStabilityIT.testSynchronizedSearchRequirement-seed#[9A6FE6A6989BED73]] INFO com.nvidia.cuvs.CagraMultiThreadStabilityIT -   Threads: 16, Queries per thread: 50
[TEST-CagraMultiThreadStabilityIT.testSynchronizedSearchRequirement-seed#[9A6FE6A6989BED73]] INFO com.nvidia.cuvs.CagraMultiThreadStabilityIT -   This test demonstrates why synchronized keyword is required
[TEST-CagraMultiThreadStabilityIT.testSynchronizedSearchRequirement-seed#[9A6FE6A6989BED73]] INFO com.nvidia.cuvs.CagraMultiThreadStabilityIT - Creating CAGRA index for synchronization test...
[695196][18:48:35:165747][info  ] optimizing graph
[695196][18:48:35:207521][info  ] Graph optimized, creating index
Aborted (core dumped)
[INFO] 
[INFO] Results:
[INFO] 
[INFO] Tests run: 0, Failures: 0, Errors: 0, Skipped: 0
[INFO] 
[WARNING] Corrupted channel by directly writing to native stream in forked JVM 1. See FAQ web page and the dump file /home/ishan/code/cuvs/java/cuvs-java/target/failsafe-reports/2025-07-07T18-48-33_996-jvmRun1.dumpstream
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time:  12.621 s
[INFO] Finished at: 2025-07-07T18:48:36+05:30
[INFO] ------------------------------------------------------------------------

There's a JVM crash, but still there's BUILD SUCCESS (which fooled me).


float[][] dataset = generateRandomDataset(dataSize, dimensions);

try (CuVSResources resources = CuVSResources.create()) {
Review comment (Contributor):

Ah yes, now I can see it.
I don't think this will work, not with the current cuvs implementation. We were discussing this internally, and I think we also got some (preliminary) confirmation from the cuvs developers: you will need one resource per thread. Or, in other words, you can access a resource from different threads, but you cannot access it concurrently.
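
(For illustration, a minimal sketch of the one-resource-per-thread pattern; it relies only on the CuVSResources.create() factory shown in the snippet above, while the executor scaffolding and the task body are placeholders:)

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

import com.nvidia.cuvs.CuVSResources;

final class PerThreadResourcesSketch {
  // Each worker thread creates and closes its own CuVSResources; the index can be
  // shared, but a resources object is never used by two threads at the same time.
  static void runSearches(int numThreads, Runnable buildAndRunQueries) throws Exception {
    ExecutorService pool = Executors.newFixedThreadPool(numThreads);
    try {
      List<Future<?>> futures = new ArrayList<>();
      for (int i = 0; i < numThreads; i++) {
        futures.add(pool.submit(() -> {
          try (CuVSResources resources = CuVSResources.create()) {
            buildAndRunQueries.run(); // placeholder: build queries with `resources` and search
          } catch (Throwable t) {
            throw new RuntimeException(t);
          }
        }));
      }
      for (Future<?> f : futures) {
        f.get(); // propagate any failure from a worker
      }
    } finally {
      pool.shutdown();
    }
  }
}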

@ldematte (Contributor) commented Jul 7, 2025

There's a JVM crash, but still there's BUILD SUCCESS (which fooled me).

Yes, the JUnit test runner will do that; you need to explicitly handle failures (and sometimes you cannot, or it's hard to do).

@chatman (Contributor, Author) commented Jul 7, 2025

We were discussing this internally and I think we also got some (preliminary) confirmation from cuvs developers:
you will need one resource per-thread. Or, in other words, you can access a resource via different threads, but you cannot access them concurrently.

FYI, I'm trying to reason this out, and I guess that the search() calls need to be executed in separate CUDA streams in order to succeed (one stream per call). There's cudaStreamCreate or something like that, and I'm wondering if that is needed / whether that will help. Currently, the same CUDA stream is being shared across multiple threads, and hence failing. Does that make sense @ldematte @benfred @cjnolet?

@cjnolet (Member) commented Jul 7, 2025

Currently, the same cuda stream is being shared across multiple threads, and hence failing.

Are you explicitly creating a stream, or are you using the stream that gets created in the underlying raft::resources by default? If the latter, you are not actually using the same stream across raft::resources. We are very careful to note that raft::resources is not claimed to be thread safe. That being said, we often provide a default option of adding a synchronization after every cuVS function when an explicit resources object has not been provided. It wouldn't help this case, but I'd go as far as to argue that this is not a valid usage of raft::resources, given the documentation. I'm not sure it's up to cuVS to protect against this when it's already documented. We can certainly highlight the documentation better within cuVS itself.

@ldematte (Contributor) commented Jul 8, 2025

I'd go as far as to argue that this is not a valid usage of raft::resources, given the documentation

I think you are right, that is my impression too.

We can certainly highlight the documentation better within cuVS itself.

That would be very helpful, especially for C/Java users where raft::resources is not explicitly exposed (it is hidden in the API implementation).

@ldematte (Contributor) commented Jul 8, 2025

@cjnolet @chatman @ChrisHegarty I drafted a PR that might help: #1089. It makes it possible to check that the threading constraints for raft::resources are respected.

It explicitly models the concept of an access scope for a resource; the Javadoc there explains the threading constraints. It also adds an (optional) decorator to CuVSResources that enforces the one-thread-at-a-time behaviour (I'm using that in tests -- at the moment it is test-only, but it could become a utility in the main library if we want, or just example code that cuvs-java users can borrow).
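
(For illustration, the kind of single-owner check such a checking decorator can delegate to -- this is not the actual #1089 code, and the class name is made up:)

import java.util.concurrent.atomic.AtomicReference;

final class SingleThreadAccessGuard {
  private final AtomicReference<Thread> owner = new AtomicReference<>();

  // Call before touching the wrapped CuVSResources; throws if another thread is using it.
  void enter() {
    Thread me = Thread.currentThread();
    Thread holder = owner.compareAndExchange(null, me);
    if (holder != null && holder != me) {
      throw new IllegalStateException(
          "resources accessed by " + me.getName() + " while in use by " + holder.getName());
    }
  }

  // Call after the operation completes (typically in a finally block).
  void exit() {
    owner.compareAndSet(Thread.currentThread(), null);
  }
}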

@chatman (Contributor, Author) commented Jul 30, 2025

@mythrocks This is now ready for your review. @narangvivek10 and I have tested this extensively using the provided CagraMultithreadedStabilityIT, as well as tested using Lucene.

While earlier (without separate resources per thread) we were encountering lots of "out_of_memory" errors or crashes, that is no longer the case in our testing.

*
* @param resources the CuVSResources instance to use for this query (must not be shared between threads)
*/
public Builder(CuVSResources resources) {
Review comment (Contributor):

I was about to ask why this seems to break the fluent-interface semantics of the Builder.

It looks like the intent is to require specifying the CuVSResources when building a query. Yep, that makes sense to do. 👍
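
(For illustration, a sketch of how the resources-aware Builder could be used; CagraQuery, the fluent setters, and index.search() are assumed from the existing cuvs-java examples and may not match the final API exactly:)

import com.nvidia.cuvs.CagraIndex;
import com.nvidia.cuvs.CagraQuery;
import com.nvidia.cuvs.CuVSResources;
import com.nvidia.cuvs.SearchResults;

final class ResourcesAwareQuerySketch {
  // The calling thread creates its own resources and passes them to the query builder,
  // so the resources are never shared with another thread.
  static SearchResults search(CagraIndex index, float[][] queryVectors) throws Throwable {
    try (CuVSResources resources = CuVSResources.create()) {
      CagraQuery query = new CagraQuery.Builder(resources) // resources supplied up front
          .withQueryVectors(queryVectors)
          .withTopK(10)
          .build();
      return index.search(query);
    }
  }
}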

@mythrocks (Contributor) commented:

/ok to test 6bd62ec


/**
* Builder helps configure and create an instance of BruteForceQuery.
* Builder helps configure and create an instance of HnswQuery.
Review comment (Contributor):


D'oh! 👍

@mythrocks (Contributor) left a comment:

This LGTM. Pending tests.

@mythrocks (Contributor) commented:

Hmm. Looks like there are Java test failures. :/

We could consider pushing this to an early merge to 25.10.

@chatman (Contributor, Author) commented Jul 31, 2025

Error: Errors:
Error: CagraBuildAndSearchIT.testIndexingAndSearchingFlowInDifferentThreads ? ThreadLeak 1 thread leaked from TEST scope at testIndexingAndSearchingFlowInDifferentThreads(com.nvidia.cuvs.CagraBuildAndSearchIT):

  1. Thread[id=2057, name=pool-207-thread-1, state=WAITING, group=TGRP-CagraBuildAndSearchIT]
    at java.base/jdk.internal.misc.Unsafe.park(Native Method)
    at java.base/java.util.concurrent.locks.LockSupport.park(LockSupport.java:221)
    at java.base/java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:754)
    at java.base/java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:990)
    at java.base/java.util.concurrent.locks.ReentrantLock$Sync.lock(ReentrantLock.java:153)
    at java.base/java.util.concurrent.locks.ReentrantLock.lock(ReentrantLock.java:322)
    at java.base/java.util.concurrent.ThreadPoolExecutor.processWorkerExit(ThreadPoolExecutor.java:1002)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1158)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)
    at java.base/java.lang.Thread.run(Thread.java:1570)

This is a very strange failure: the test passed but there's a thread leak. I'm not sure this is related to the change here. I tried running this test separately (with the seed as well as without it), as well as the full test suite, 25 times each, and I'm unable to reproduce this failure.

@mythrocks Can you please trigger the test again to see if it reproduces or just an intermittent failure?

@mythrocks (Contributor) commented:

/ok to test 6bd62ec

@mythrocks (Contributor) commented:

I'm re-running it now. I'll keep an eye out for the result, and see about taking a closer look today.

@mythrocks (Contributor) commented:

The re-run did pass. Looks like a stray failure that, at worst, affects the test, not the functionality.

I'll check for the viability of this being merged to branch-25.08.

@mythrocks (Contributor) commented Jul 31, 2025

I just checked. We are properly in code-freeze now. Let's retarget to 25.10.

We can check if the test failures happen to recur in the new dev branch.

@chatman (Contributor, Author) commented Jul 31, 2025

I just checked. We are properly in code-freeze now. Let's retarget to 25.10.

Ouch, we got locked out, I guess. Thanks for looking into this. I'll re-target this PR for 25.10 now.

@chatman chatman changed the base branch from branch-25.08 to branch-25.10 July 31, 2025 17:25
@mythrocks (Contributor) commented:

Thanks for retargeting, @chatman. Would you mind please merging the latest commits from branch-25.10? I don't have write access to the SearchScale branches. :/

@cjnolet (Member) commented Jul 31, 2025

@chatman can you please merge the current branch-25.10 into your branch here?

@chatman (Contributor, Author) commented Jul 31, 2025

@mythrocks @cjnolet I just hit the update button. Haven't verified that it compiles or passes the tests on this branch.

@cjnolet (Member) commented Jul 31, 2025

/ok to test b0bf5bd

@mythrocks (Contributor) commented:

/merge

@rapids-bot rapids-bot bot merged commit 81e22d3 into rapidsai:branch-25.10 Aug 1, 2025
53 checks passed
@github-project-automation github-project-automation bot moved this from In Progress to Done in Elasticsearch + cuVS Team Aug 1, 2025
rapids-bot bot pushed a commit that referenced this pull request Aug 5, 2025
As a follow-up to #1089 and #1082, this PR introduces a decorator to ensure that access to a shared CuVSResources is synchronized. This could be useful in cases where application logic is complex and it's not easy to efficiently and cleanly guarantee that a CuVSResources is accessed by at most one thread at any given time.

Authors:
  - Lorenzo Dematté (https://github.com/ldematte)
  - MithunR (https://github.com/mythrocks)

Approvers:
  - MithunR (https://github.com/mythrocks)

URL: #1209
lowener pushed a commit to lowener/cuvs that referenced this pull request Aug 11, 2025
…idsai#1082)

Fix CuVSResources usage for multithreaded queries
    
    * Added resources as a constructor argument for Builder classes in CagraIndex, HNSWIndex, TieredIndex and BruteForceIndex
    * Added a multithreaded CAGRA stability test (which fails with improper resources usage). This test uses CheckedCuVSResources.

Authors:
  - Ishan Chattopadhyaya (https://github.com/chatman)

Approvers:
  - MithunR (https://github.com/mythrocks)

URL: rapidsai#1082
lowener pushed a commit to lowener/cuvs that referenced this pull request Aug 11, 2025
)

As a follow-up to rapidsai#1089 and rapidsai#1082, this PR introduces a decorator to ensure that access to a shared CuVSResources is synchronized. This could be useful in cases where application logic is complex and it's not easy to efficiently and cleanly guarantee that a CuVSResources is accessed by at most one thread at any given time.

Authors:
  - Lorenzo Dematté (https://github.com/ldematte)
  - MithunR (https://github.com/mythrocks)

Approvers:
  - MithunR (https://github.com/mythrocks)

URL: rapidsai#1209
enp1s0 pushed a commit to enp1s0/cuvs that referenced this pull request Aug 22, 2025
…idsai#1082)

Fix CuVSResources usage for multithreaded queries
    
    * Added resources as a constructor argument for Builder classes in CagraIndex, HNSWIndex, TieredIndex and BruteForceIndex
    * Added a multithreaded CAGRA stability test (which fails with improper resources usage). This test uses CheckedCuVSResources.

Authors:
  - Ishan Chattopadhyaya (https://github.com/chatman)

Approvers:
  - MithunR (https://github.com/mythrocks)

URL: rapidsai#1082
enp1s0 pushed a commit to enp1s0/cuvs that referenced this pull request Aug 22, 2025
)

As a follow-up to rapidsai#1089 and rapidsai#1082, this PR introduces a decorator to ensure that access to a shared CuVSResources is synchronized. This could be useful in cases where application logic is complex and it's not easy to efficiently and cleanly guarantee that a CuVSResources is accessed by at most one thread at any given time.

Authors:
  - Lorenzo Dematté (https://github.com/ldematte)
  - MithunR (https://github.com/mythrocks)

Approvers:
  - MithunR (https://github.com/mythrocks)

URL: rapidsai#1209
Labels: bug (Something isn't working), non-breaking (Introduces a non-breaking change)
Project status: Done