Right now the libgreen scheduler implements a work stealing algorithm in tandem with a global sleeper list in order to fan out work among all schedulers when there is work to do.
Sadly, this scheme has pathological performance in simple situations like lots of tasks contending on a mutex. The reason for this is that every time a task gets off the mutex, a remote scheduler is woken up, via a syscall, to run it. The remote scheduler runs the task, but then immediately goes back to sleep because the task blocks on the mutex again.
This process of waking up a remote scheduler is very expensive and shows painfully in benchmarks. The problem essentially boils down to every task enqueue requiring a syscall, because on under-loaded workloads there is almost always a scheduler asleep. If the system is fully loaded with work, none of this is a problem because no remote scheduler ever needs to be woken up.
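To make the pattern concrete, here is a minimal sketch of the pathological enqueue path. This is not libgreen's actual code: the names (`SleeperList`, `WakeHandle`) are illustrative, and a `Condvar` stands in for the real mechanism, where the sleeping scheduler blocks in `epoll` on a pipe and the enqueuer wakes it with a `write` syscall. The point is only that on an under-loaded system, every enqueue pays a remote wakeup.

```rust
use std::collections::VecDeque;
use std::sync::{Arc, Condvar, Mutex};
use std::thread;

// Hypothetical stand-ins: in libgreen the wakeup is a write(2) to a
// pipe that the sleeping scheduler is epoll-waiting on.
type WakeHandle = Arc<(Mutex<bool>, Condvar)>;

#[derive(Default)]
struct SleeperList {
    inner: Mutex<VecDeque<WakeHandle>>,
}

impl SleeperList {
    // A scheduler with no work registers itself and then blocks.
    fn sleep(&self) -> WakeHandle {
        let handle: WakeHandle = Arc::new((Mutex::new(false), Condvar::new()));
        self.inner.lock().unwrap().push_back(handle.clone());
        handle
    }

    // The enqueue path: if any scheduler is asleep, it gets woken.
    // In the real scheduler that wakeup is the expensive syscall.
    // This demo spins until a sleeper has re-registered.
    fn enqueue(&self) {
        loop {
            if let Some(h) = self.inner.lock().unwrap().pop_front() {
                *h.0.lock().unwrap() = true;
                h.1.notify_one(); // models write(2) -> remote epoll wakeup
                return;
            }
            thread::yield_now();
        }
    }
}

fn main() {
    let list = Arc::new(SleeperList::default());
    let remote = list.clone();

    // A "remote scheduler": wakes, runs one task, and goes straight
    // back to sleep because the task blocks on the mutex again.
    let sched = thread::spawn(move || {
        for _ in 0..100 {
            let h = remote.sleep();
            let mut woken = h.0.lock().unwrap();
            while !*woken {
                woken = h.1.wait(woken).unwrap();
            }
            // ...run the task; it immediately blocks again...
        }
    });

    // Every one of these enqueues pays a remote wakeup, because the
    // system is never loaded enough to keep the scheduler awake.
    for _ in 0..100 {
        list.enqueue();
    }
    sched.join().unwrap();
}
```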
Right now, there are two ideas for mitigating the cost of these syscalls:
- Exponential backoff when sleeping. Right now, whenever a scheduler goes to sleep, it falls back into `epoll` and gets woken up via `write`. Additionally, it puts a handle on a global sleeper list for other schedulers to grab and then wake up. Instead, a scheduler could go through some sort of exponential backoff before sleeping; only after the backoff completed would a handle be placed on the sleeper list. The goal would be to avoid `write` syscalls: the handles wouldn't actually be on the list, but other schedulers would still successfully steal work. This exponential backoff would need to be tunable, and it probably shouldn't happen if there are active I/O handles. (A sketch of this appears after the list.)
- Schedulers should only wake up two other schedulers. Right now there is a global sleeper list where any scheduler can wake up any other scheduler. In addition to being a point of contention, this means that if any scheduler is sleeping, a task enqueue will be a syscall. On systems with lots of cores, I would imagine it's fairly likely that small programs don't fill up all the cores. To mitigate this, we would arrange the pool of schedulers as a ring, where each scheduler is responsible for waking up its left and right neighbors, but that's it. With this scheme, if one scheduler had lots of serial work, it would occasionally wake up its neighbors, but nothing more. Additionally, those neighbors would likely still be in the exponential backoff stage before going to sleep, so there would be no syscalls at all. (A sketch of the ring appears after the list.)
I'm guessing that combining these two strategies will allow us to still distribute work to all cores quickly if there is a truly parallel workload, and also work well on many-task workloads that are mostly sequential (due to a mutex or some such). This is just a guess though, and definitely needs investigation.