
Fix SemaphoreSlim throughput #13766

Merged
merged 1 commit into from Sep 5, 2017

Conversation

@kouvel kouvel commented Sep 2, 2017

In #13670, I mistakenly made the spin loop infinite; that is now fixed.

As a result, the SemaphoreSlim numbers I had provided in that PR were skewed, and fixing the spin loop caused the throughput to drop even lower. To compensate, I found and fixed one culprit for the low throughput:

  • Every release wakes up a waiter. Effectively, when one thread is repeatedly acquiring and releasing the semaphore, waiters don't get to remain in a wait state.
  • Added a field to keep track of how many waiters were pulsed to wake but have not yet woken, and took that into account in Release() so as not to wake up more waiters than necessary (a sketch follows the table below).
  • Retuned and increased the number of spin iterations. The total spin delay is still less than it was before the above PR.
Spin                                  Left score     Right score     ∆ Score %
------------------------------------  -------------  --------------  ---------
SemaphoreSlimLatency 1Pc              237.56 ±0.61%   249.42 ±0.25%      4.99%
SemaphoreSlimLatency 1Pc Delay        209.73 ±1.97%   167.61 ±0.84%    -20.08%
SemaphoreSlimLatency 2Pc              233.03 ±0.70%   235.25 ±0.22%      0.95%
SemaphoreSlimLatency 2Pc Delay        177.67 ±2.44%   171.36 ±0.74%     -3.55%
SemaphoreSlimThroughput 1Pc           214.28 ±9.92%  3112.52 ±0.93%   1352.52%
SemaphoreSlimWaitDrainRate 1Pc        404.05 ±6.84%   486.19 ±1.64%     20.33%
SemaphoreSlimWaitDrainRate 1Pc Delay  271.85 ±8.78%   424.44 ±1.65%     56.13%
SemaphoreSlimWaitDrainRate 2Pc        628.02 ±1.68%   623.87 ±1.58%     -0.66%
SemaphoreSlimWaitDrainRate 2Pc Delay  658.39 ±1.54%   651.09 ±1.41%     -1.11%
------------------------------------  -------------  --------------  ---------
Total                                 300.66 ±3.89%   423.66 ±1.03%     40.91%
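
Below is a minimal sketch of the Release()-side accounting described in the second bullet above. Only m_countOfWaitersPulsedToWake, the Debug.Assert, and the Math.Min(currentCount, waitCount) computation come from the diff in this PR; the surrounding structure (the lock, the pulse loop, and the local variables) is an illustrative assumption, not the shipped code.

// Sketch: inside Release(int releaseCount), while holding lock (m_lockObj).
// currentCount is the semaphore count after adding releaseCount;
// waitCount is the number of threads blocked in Monitor.Wait on m_lockObj.
Debug.Assert(m_countOfWaitersPulsedToWake <= waitCount);

// Discount waiters that were pulsed by an earlier Release() but have not woken
// up yet, so they are not pulsed again unnecessarily.
int waitersToNotify = Math.Min(currentCount, waitCount) - m_countOfWaitersPulsedToWake;
if (waitersToNotify > 0)
{
    m_countOfWaitersPulsedToWake += waitersToNotify;
    for (int i = 0; i < waitersToNotify; i++)
    {
        Monitor.Pulse(m_lockObj);
    }
}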

@kouvel kouvel added area-System.Threading bug Product bug (most likely) labels Sep 2, 2017
@kouvel kouvel added this to the 2.1.0 milestone Sep 2, 2017
@kouvel kouvel self-assigned this Sep 2, 2017
@kouvel kouvel requested a review from tarekgh September 2, 2017 21:24
@kouvel kouvel (Author) commented Sep 5, 2017

@dotnet-bot test CentOS7.1 x64 Debug Build and Test

@kouvel kouvel (Author) commented Sep 5, 2017

This also fixes the CoreFX threading test hangs seen in dotnet/corefx#23784

@@ -56,7 +56,13 @@ public class SemaphoreSlim : IDisposable
// The number of synchronously waiting threads, it is set to zero in the constructor and increments before blocking the
// threading and decrements it back after that. It is used as flag for the release call to know if there are
// waiting threads in the monitor or not.
private volatile int m_waitCount;
Member (commenting on the volatile modifier in the diff above)

Why did you remove the volatile on this field?

Author

It's always accessed from inside a lock (m_lockObj)

Member

Are we sure accessing it inside the lock is sufficient? My understanding is that the lock is implemented using the managed thread ID with some compare-exchange mechanism, so theoretically there is a possibility that the non-volatile field will not get updated inside the lock block. I am not sure though :-)

Author

There's at least an interlocked operation on acquiring and releasing the lock, so the field should get updated properly; otherwise it would be a bug in the lock. Another possibility is that volatile was ensuring some kind of ordering inside the lock, but I don't see anything like that: all of the state updates and most of the reads are inside the lock, so changing the ordering would not observe any difference in the values of the fields.
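
For illustration only (this is not code from the PR; the class and member names are made up): the pattern being discussed is a plain, non-volatile field that is only ever read and written while holding the lock, relying on the fences implied by the interlocked operations in Monitor.Enter/Exit.

internal sealed class WaitCountExample
{
    private readonly object m_lockObj = new object();
    private int m_waitCount; // non-volatile: only ever accessed under m_lockObj

    public void OnBeforeWait()
    {
        lock (m_lockObj) // Monitor.Enter/Exit perform interlocked operations (full fences)
        {
            m_waitCount++;
        }
    }

    public bool HasWaiters()
    {
        lock (m_lockObj) // taking the lock makes writes made under it by other threads visible
        {
            return m_waitCount > 0;
        }
    }
}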

Author

Here are a couple of enter/leave paths:

These are C++ versions of the assembly code for the fast paths that actually run; there are interlocked operations on enter and exit. I also checked all of the contention paths and wait paths; they issue an interlocked operation as well.

int waitersToNotify = Math.Min(releaseCount, waitCount);
for (int i = 0; i < waitersToNotify; i++)
Debug.Assert(m_countOfWaitersPulsedToWake <= waitCount);
int waitersToNotify = Math.Min(currentCount, waitCount) - m_countOfWaitersPulsedToWake;
Member (commenting on: int waitersToNotify = Math.Min(currentCount, waitCount))

Instead of keeping track of the pulsed count, can't we just do

Min(currentCount, waitCount, releaseCount)? That should give the minimum number of threads we can pulse awake.
The issue I am seeing with m_countOfWaitersPulsedToWake is that it does not really reflect the pulse count all the time.

Author

I don't think that's sufficient. For instance, suppose the max semaphore count is 1 and there are three threads:

  • T1 acquires the semaphore
  • Semaphore count is 0, T2 is waiting, T3 is waiting
  • T1 releases the semaphore - count is 1, releaseCount is 1, waitCount is 2. One waiter is pulsed, say it's T2.
  • T1 acquires and releases the semaphore again - T2 hasn't woken up yet, count is 1, releaseCount is 1, waitCount is 2. Another waiter is pulsed, this time it will be T3.

As long as the last step repeats, no more waiters should be pulsed until the waiter pulsed on the first release wakes up and acknowledges the wake. Otherwise, while T1 repeatedly acquires and releases the semaphore, all of the waiters will be pulsed unnecessarily, even though only one of them can acquire the semaphore anyway.

The new field tracks, as best as it can with the information available, how many waiters were pulsed but have not woken up yet, so that that number can be discounted on the next release. It is a conservative number: a thread that times out will also acknowledge the wakeup as though it had been pulsed, because there is no information to say whether it actually was. That just means an extra waiter may be pulsed later; it shouldn't affect correctness.
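
A hypothetical sketch of the waiter-side acknowledgment described above; apart from m_lockObj, m_waitCount, and m_countOfWaitersPulsedToWake, the names and structure are assumptions rather than the shipped code. The pulsed counter is decremented whenever a waiter comes out of Monitor.Wait, whether it was actually pulsed or it timed out, which keeps the count conservative.

// Sketch: inside Wait(timeout), while holding lock (m_lockObj), when the count is 0.
m_waitCount++;
try
{
    bool signaled = Monitor.Wait(m_lockObj, remainingTimeoutMs);

    // Acknowledge the wake even on timeout: there is no way to tell whether this
    // thread was one of the pulsed waiters, so decrement conservatively. At worst,
    // a later Release() pulses one extra waiter; correctness is unaffected.
    if (m_countOfWaitersPulsedToWake > 0)
    {
        m_countOfWaitersPulsedToWake--;
    }

    if (!signaled)
    {
        return false; // timed out without acquiring the semaphore
    }
}
finally
{
    m_waitCount--;
}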

@tarekgh tarekgh (Member) commented Sep 5, 2017

LGTM

@kouvel kouvel (Author) commented Sep 5, 2017

Thanks!

@kouvel kouvel merged commit e75f090 into dotnet:master Sep 5, 2017
@kouvel kouvel deleted the SemaphoreSlimFix branch September 5, 2017 21:52
@stephentoub stephentoub (Member)

LGTM2
