-
Notifications
You must be signed in to change notification settings - Fork 131
[Java][Fix] Multithreaded querying fails without synchronization #1082
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Java][Fix] Multithreaded querying fails without synchronization #1082
Conversation
Edit: deleted the comment, since I was earlier assuming this was a benign issue with just error reporting, but turns out to be a real issue. |
While adding the |
Hey @chatman, I observed a similar issue in benchmarks; my gut feeling is this is not a synchronization issue, but rather a memory exhaustion problem that becomes apparent when you are "hammering" the API with enough calls. It can be that the synchronization is simply slowing things down enough to let the resources free and make the test pass. In my case, this was due to the fact that datasets were freed "too late"; now the arena on which they are bound is freed much earlier, and I don't have the same problem anymore. |
To give you some context: #1069 gets rid of the "global allocations" on the CuVSResources arena. Ummm actually that might not be enough, as it will not account for memory allocated internally by |
Sure, trying! |
You can try and keep an eye via |
FYI, in my testing (test added in this PR), even with numThreads=2, the search() fails without the |
I have a 48GB GPU, and the index I was trying has barely 25k vectors, and just a few search queries. I'm pretty sure it is a misreporting regarding RMM out_of_memory. |
int vectorDimension = numQueries > 0 ? query.getQueryVectors()[0].length : 0; | ||
Arena arena = resources.getArena(); | ||
// Use the shared arena from resources instead of creating a confined arena | ||
// to avoid thread safety issues when multiple threads perform searches |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think this is true; if this is a memory issue, using the shared arena will make things worse as it will bound the allocated memory to the lifetime of the resource (they will be freed "too late").
Also, a local confined arena is inherently thread safe, as it is "private" to the thread stack calling this function.
(And also, the localArena is allocated here, but never used -- we use the Arena from Arena arena = resources.getArena();
a few lines below, so effectively the code is the same -- BTW I think this is a mistake, and I'm fixing it in #1069)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To be clear, I'm not saying there aren't threading issues here, just saying that the localArena is not the culprit.
Yes, with this numbers if very likely not a memory exhaustion issue. It can be a memory corruption issue, that might be caused by synchronization issues.
I agree, we need to understand what is breaking and fix it |
Oh no, I was mistaken:
There's a JVM crash, but still there's BUILD SUCCESS (which fooled me). |
|
||
float[][] dataset = generateRandomDataset(dataSize, dimensions); | ||
|
||
try (CuVSResources resources = CuVSResources.create()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah yes, now I can see it.
I don't think this will work, not with the current cuvs implementation. We were discussing this internally and I think we also got some (preliminary) confirmation from cuvs developers: you will need one resource per-thread. Or, in other words, you can access a resource via different threads, but you cannot access them concurrently.
Yes, the JUnit test runner will do that, you need to explicitly handle failures (and sometimes you cannot/it's hard to do) |
FYI I'm trying to reason this out, and I guess that the search() calls need to be executed in a separate cuda streams in order to succeed (one stream per call). There's |
Are you explicitly creating a stream or are you using the stream that gets created in the underlying |
I think you are right, that is my impression too.
That would be very helpful, especially for C/Java users where raft::resources is not explicitly exposed (it is hidden in the API implementation). |
@cjnolet @chatman @ChrisHegarty I drafted a PR that might help: #1089. It introduces the possibility to check that the threading constraints for It explicitly model the concept of access scope for a resource; the Javadoc there explains the threading constraints. It also adds a (optional) decorator to |
@mythrocks This is now ready for your review. @narangvivek10 and I have tested this extensively using the provided CagraMultithreadedStabilityIT, as well as tested using Lucene. While earlier (without separate resources per thread) we were encountering lots of "out_of_memory" errors or crashes, now it is no longer the case as per our testing. |
* | ||
* @param resources the CuVSResources instance to use for this query (must not be shared between threads) | ||
*/ | ||
public Builder(CuVSResources resources) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was about to ask why this seems to be breaking the fluent-interface semantics of the Builder.
It looks like this is to necessitate specifying the CuVSResources
when building a query. Yep, that makes sense to do. 👍
/ok to test 6bd62ec |
|
||
/** | ||
* Builder helps configure and create an instance of BruteForceQuery. | ||
* Builder helps configure and create an instance of HnswQuery. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
D'oh! 👍
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This LGTM. Pending tests.
Hmm. Looks like there are Java test failures. :/ We could consider pushing this to an early merge to 25.10. |
This is a very strange failure. That test passed but there's a thread leak. I'm not sure that this is related to the change here. I tried running this test separately (with the seed as well as without it) as well as the full tests suite 25 times each, and I'm unable to reproduce this failure. @mythrocks Can you please trigger the test again to see if it reproduces or just an intermittent failure? |
/ok to test 6bd62ec |
I'm re-running it now. I'll keep an eye out for the result, and see about taking a closer look today. |
The re-run did pass. Looks like a stray failure that, at worst, affects the test, not the functionality. I'll check for the viability of this being merged to |
I just checked. We are properly in code-freeze now. Let's retarget to 25.10. We can check if the test failures happen to recur in the new dev branch. |
Ouch, we got lucked out, I guess. Thanks for looking into this. I'll re-target this PR for 25.10 now. |
Thanks for retargeting, @chatman. Would you mind please merging the latest commits from |
@chatman can you please merge the current |
@mythrocks @cjnolet I just hit the update button. Haven't verified that it compiles or passes the tests on this branch. |
/ok to test b0bf5bd |
/merge |
As a follow up to #1089 and #1082, this PR introduces a decorator to ensure that access to a shared CuVSResource is synchronized. This could be useful in cases where application logic is complex and it's not easy to efficiently and cleanly guarantee that access to a CuVSResources is done by at most one thread at each given time. Authors: - Lorenzo Dematté (https://github.com/ldematte) - MithunR (https://github.com/mythrocks) Approvers: - MithunR (https://github.com/mythrocks) URL: #1209
…idsai#1082) Fix CuVSResources usage for multithreaded queries * Added resources as a constructor argument for Builder classes in CagraIndex, HNSWIndex, TieredIndex and BruteForceIndex * Added a multithreaded CAGRA stability test (which fails with improper resources usage). This test uses CheckedCuVSResources. Authors: - Ishan Chattopadhyaya (https://github.com/chatman) Approvers: - MithunR (https://github.com/mythrocks) URL: rapidsai#1082
) As a follow up to rapidsai#1089 and rapidsai#1082, this PR introduces a decorator to ensure that access to a shared CuVSResource is synchronized. This could be useful in cases where application logic is complex and it's not easy to efficiently and cleanly guarantee that access to a CuVSResources is done by at most one thread at each given time. Authors: - Lorenzo Dematté (https://github.com/ldematte) - MithunR (https://github.com/mythrocks) Approvers: - MithunR (https://github.com/mythrocks) URL: rapidsai#1209
…idsai#1082) Fix CuVSResources usage for multithreaded queries * Added resources as a constructor argument for Builder classes in CagraIndex, HNSWIndex, TieredIndex and BruteForceIndex * Added a multithreaded CAGRA stability test (which fails with improper resources usage). This test uses CheckedCuVSResources. Authors: - Ishan Chattopadhyaya (https://github.com/chatman) Approvers: - MithunR (https://github.com/mythrocks) URL: rapidsai#1082
) As a follow up to rapidsai#1089 and rapidsai#1082, this PR introduces a decorator to ensure that access to a shared CuVSResource is synchronized. This could be useful in cases where application logic is complex and it's not easy to efficiently and cleanly guarantee that access to a CuVSResources is done by at most one thread at each given time. Authors: - Lorenzo Dematté (https://github.com/ldematte) - MithunR (https://github.com/mythrocks) Approvers: - MithunR (https://github.com/mythrocks) URL: rapidsai#1209
Fix CuVSResources usage for multithreaded queries