Conversation
In #13670, by mistake I made the spin loop infinite; that is now fixed. As a result, the numbers I had provided in that PR for SemaphoreSlim were skewed, and fixing it caused the throughput to get even lower. To compensate, I have found and fixed one culprit for the low-throughput problem:
- Every release wakes up a waiter. Effectively, when there is a thread acquiring and releasing the semaphore, waiters don't get to remain in a wait state.
- Added a field to keep track of how many waiters were pulsed to wake but have not yet woken, and took that into account in Release() so as not to wake up more waiters than necessary.
- Retuned and increased the number of spin iterations. The total spin delay is still less than before the above PR.
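The pulsed-waiter accounting described above can be sketched in Java monitor terms (synchronized/wait/notify standing in for Monitor.Enter/Wait/Pulse). This is a minimal illustrative sketch, not the actual SemaphoreSlim code; all names here are made up for the example:

```java
// Illustrative sketch of the "count of waiters pulsed to wake" idea.
// Not the real SemaphoreSlim implementation.
public class SlimSemaphoreSketch {
    private final Object lockObj = new Object();
    private int count;        // current semaphore count
    private int waitCount;    // threads blocked waiting
    private int pulsedToWake; // pulsed to wake but not yet awake

    public SlimSemaphoreSketch(int initialCount) {
        count = initialCount;
    }

    public void acquire() throws InterruptedException {
        synchronized (lockObj) {
            waitCount++;
            boolean waited = false;
            try {
                while (count == 0) {
                    waited = true;
                    lockObj.wait();
                }
            } finally {
                waitCount--;
                // Conservatively acknowledge a pulse on any wakeup, since we
                // can't tell whether this particular thread was the one pulsed.
                if (waited && pulsedToWake > 0) {
                    pulsedToWake--;
                }
            }
            count--;
        }
    }

    public void release(int releaseCount) {
        synchronized (lockObj) {
            count += releaseCount;
            // Discount waiters already pulsed, so a hot acquire/release loop
            // doesn't keep waking waiters that haven't had a chance to run yet.
            int waitersToNotify = Math.min(count, waitCount) - pulsedToWake;
            for (int i = 0; i < waitersToNotify; i++) {
                pulsedToWake++;
                lockObj.notify();
            }
        }
    }

    public int availablePermits() {
        synchronized (lockObj) {
            return count;
        }
    }
}
```

The key point is in release(): waiters that were already pulsed but have not yet re-entered the lock are subtracted before deciding how many more to notify.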
@dotnet-bot test CentOS7.1 x64 Debug Build and Test

This also fixes the CoreFX threading test hangs seen in dotnet/corefx#23784
@@ -56,7 +56,13 @@ public class SemaphoreSlim : IDisposable
// The number of synchronously waiting threads; it is set to zero in the constructor, incremented before blocking the
// thread, and decremented back after that. It is used as a flag for the release call to know whether there are
// waiting threads in the monitor or not.
private volatile int m_waitCount;
> volatile
Why did you remove the volatile on this field?
It's always accessed from inside a lock (m_lockObj)
Are we sure accessing it inside the lock is sufficient? My understanding is that the lock is implemented using the managed thread ID with some compare-exchange mechanism, so theoretically there is a possibility that the non-volatile field will not get updated inside the lock block. I am not sure though :-)
There's at least an interlocked operation at acquiring and releasing the lock, so it should get updated properly; otherwise it would be a bug in the lock. Another possibility is that volatile ensures some kind of ordering inside the lock, but I don't see anything like that: all of the state updates and most of the reads are inside the lock, so changing the ordering would not observe any difference in the values of the fields.
Here are a couple of enter/leave paths:
- https://github.com/dotnet/coreclr/blob/master/src/vm/syncblk.inl#L47
- https://github.com/dotnet/coreclr/blob/master/src/vm/syncblk.inl#L218
These are the C++ versions of the assembly code for the fast paths that actually run; there are interlocked operations on enter and exit. I also checked all of the contention paths and wait paths, and they also issue an interlocked operation.
int waitersToNotify = Math.Min(releaseCount, waitCount);
for (int i = 0; i < waitersToNotify; i++)
Debug.Assert(m_countOfWaitersPulsedToWake <= waitCount);
int waitersToNotify = Math.Min(currentCount, waitCount) - m_countOfWaitersPulsedToWake;
> int waitersToNotify = Math.Min(currentCount, waitCount)
Instead of keeping track of the pulsed count, can't we just do
Min(currentCount, waitCount, releaseCount)? That should give the minimum number of threads we can pulse awake.
The issue I am seeing with m_countOfWaitersPulsedToWake is that it does not really reflect the pulse count all the time.
I don't think that's sufficient, for instance suppose the max semaphore count is 1 and there are three threads:
- T1 acquires the semaphore
- Semaphore count is 0, T2 is waiting, T3 is waiting
- T1 releases the semaphore - count is 1, releaseCount is 1, waitCount is 2. One waiter is pulsed, say it's T2.
- T1 acquires and releases the semaphore again - T2 hasn't woken up yet, count is 1, releaseCount is 1, waitCount is 2. Another waiter is pulsed, this time it will be T3.
As long as the last step repeats, no waiter should be pulsed until the waiter pulsed on the first release wakes up and acknowledges the wake. With Min(currentCount, waitCount, releaseCount), while T1 repeatedly acquires and releases the semaphore, all waiters would be pulsed unnecessarily when only one of them can take the semaphore anyway.
The new field tracks, as best it can with the information available, how many waiters were pulsed but have not yet woken, so that number can be discounted on the next release. It is a conservative number: a waiter that times out also acknowledges a wakeup as though it had been pulsed, because the information is not there to say whether it actually was. That just means an extra waiter may be pulsed; it shouldn't affect correctness.
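The difference between the two formulas in the scenario above can be checked with a small arithmetic sketch (method names are illustrative, not the actual SemaphoreSlim code):

```java
// Comparing the two candidate formulas from the discussion on the
// scenario: count is 1, two waiters, one waiter already pulsed.
public class PulseMath {
    // The approach in the PR: discount waiters already pulsed
    // but not yet awake.
    static int withPulsedTracking(int currentCount, int waitCount,
                                  int pulsedToWake) {
        return Math.max(0, Math.min(currentCount, waitCount) - pulsedToWake);
    }

    // The suggested alternative: min of the three counts.
    static int minOfThree(int currentCount, int waitCount,
                          int releaseCount) {
        return Math.min(Math.min(currentCount, waitCount), releaseCount);
    }

    public static void main(String[] args) {
        // min-of-three still pulses another waiter on the repeat release...
        System.out.println(minOfThree(1, 2, 1));         // 1
        // ...while discounting the pulsed count pulses nobody new.
        System.out.println(withPulsedTracking(1, 2, 1)); // 0
    }
}
```

This matches the T1/T2/T3 walkthrough: with min-of-three, every release by T1 pulses one more waiter, while the pulsed-count version pulses none until an already-pulsed waiter wakes.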
LGTM

Thanks!

LGTM2