-
-
Notifications
You must be signed in to change notification settings - Fork 2.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
DefaultRwLock accumulates write-waiters, eventually fails to write lock #13163
Comments
I'm working on a PR and I've attached a diff here with a very minimal fix and test. I think more general tests should be added as well with the PR, since there are currently no tests at all for RwLock, and that's what I'll be working on. The (fairly obvious) bug is that WRITE is never subtracted, only added, by the lock method. The fix is to subtract it before lock returns, without changing any other code or behavior. Note that IS_WRITING is set before subtracting WRITE, so the "writing" state checked by other methods ( Let me know if for some reason you don't want me to proceed, e.g., if RwLock is not intended to work (I say this only because there are no tests). |
The reason that the reproduction at the top hangs is because the write count overflows into the read count (they're adjacent bit fields in a single integer), and if the read count is non-zero the lock method will wait for a semaphore that is never fired. |
You may ask, why not fix it by subtracting WRITE in the unlock method? The reason is that WRITE is not added in the tryLock method, and we can't have unbalanced additions and subtractions. This is also good justification for subtracting it before the lock method returns -- locked with a zero write count is already a known valid state. |
cc @kprotty |
This RwLock was incorrectly ported from this repo. For the //const state = @atomicRmw(usize, &rwl.state, .Or, IS_WRITING, .SeqCst);
const state = @atomicRmw(usize, &rwl.state, .Add, IS_WRITING -% WRITER, .SeqCst); A while back I started rewriting some of the sync primitives but stuff came up and never got to RwLock. If it does end up being fixed, I'd suggest taking that opportunity to make it smaller & based on Futex like this algorithm or the writer-preferring version of it Rust stdlib ended up using. |
Thanks for commenting on this @kprotty!
That makes sense and is a direct port of the C++ code, so should not be risky. I'll try that instead of the separate I have another question for you. Let's say we decide to keep this implementation and fix it as above, at least temporarily. I also noticed that the C++ implementation uses Acquire/Release instead of SeqCst, and I think this would make a big performance difference. Do you see any reason not to use Acquire/Release (matching the C++ code) in the Zig implementation? |
There's a lot of orderings that could be weakened in the implementation, and SeqCst isn't expensive. On x86, all atomic rmw instructions use the In regards to that line in particular, I believe it can be Acquire? It at least needs to have the READER sub in |
I didn't mean to substitute AcqRel for all SeqCst, I meant to use the same orderings (Acquire, Release or AcqRel) that are used in the C++ implementation, in all cases. Looking further, in the C++ code, AcqRel is used in all rmw cases so there is no difference there. Acquire is used only for loads prior to use of AcqRel, so I guess you're right and the difference would be very minor. Thanks for your reply. |
Here is a diff with the modified fix (from kprotty) along with a test for the specific problem and general testing of RwLock (there were no tests previously). I think I should create a PR with this diff now, and that a more efficient rewrite (as kprotty suggests) be done separately. Sound Ok? @andrewrk |
I think you could've went ahead and created the PR. Then, the review could happen there as it's easier to do so vs. a text file. |
…ails to write lock (#13180) * Fix for: DefaultRwLock accumulates write-waiters, eventually fails to write lock #13163 * Comment out debug.print at the end of the last test. * Code formatting * - use equality test after lock/unlock rather than peeking into internals. however, this is still implementation specific and only done for DefaultRwLock. - add num_reads maximum to ensure that reader threads stop if writer threads are starved - use relaxed orderings for the read atomic counter - don't check at the end for non-zero read ops, since the reader threads may only run once if they are starved * More review changes - Monotonic is sufficient for incrementing the reads counter
Zig Version
0.10.0-dev.4280+c3d67c5c4
Steps to Reproduce
Run the following test and expect it to run to completion (not hang).
Let it run for several minutes since it takes some time to run that many iterations, but eventually it will stop consuming CPU and hang waiting for a semaphore in the
DefaultRwLock.lock
method.Expected Behavior
Prints the "About to lockShared..." message after several minutes and finishes running (no errors).
Actual Behavior
After several minutes it will stop consuming CPU and hang waiting for a semaphore in the
DefaultRwLock.lock
method. Although a large number (2^30 I believe) write locks must be taken before the problem occurs, it could definitely occur in a long running program.Note that the Zig version doesn't matter. It fails on latest master, 0.9.0, and probably every other version, since the DefaultRwLock code has not changed and is neglecting to decrement the writes-waiting counter. I'll provide more info and a potential fix later in the issue.
The problem was originally found by
@Curious Thing
on Discord here:https://discord.com/channels/605571803288698900/1028554539210645586
The text was updated successfully, but these errors were encountered: