Investigate improving scheduler work-stealing efficiency #11730

Closed
Description

@alexcrichton

Right now the libgreen scheduler implements a work-stealing algorithm in tandem with a global sleeper list in order to fan work out among all schedulers whenever there is work to do.

Sadly, this scheme has pathological performance in simple situations like lots of tasks contending on a mutex. The reason is that every time a task gets off the mutex, a remote scheduler is woken up (via a syscall) to run it. The remote scheduler runs the task, but then immediately goes back to sleep because the task blocks on the mutex again.

This process of waking up a remote scheduler is very expensive and shows up painfully in benchmarks. The problem essentially boils down to every task enqueue requiring a syscall: on some workloads there is almost always a scheduler asleep because the system isn't fully loaded. If the system is saturated with work, none of this is a problem, because no remote scheduler ever needs to be woken up.
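
To make the cost concrete, the enqueue path described above might look roughly like the sketch below. All names here (`SchedulerPool`, `SleeperHandle`, and so on) are hypothetical stand-ins, not libgreen's actual API; the point is that as long as any handle sits on the sleeper list, every enqueue ends in a wakeup syscall.

```rust
use std::collections::VecDeque;
use std::sync::Mutex;

struct Task;

struct SleeperHandle; // stand-in for a handle that can wake a parked scheduler

impl SleeperHandle {
    fn wake(&self) {
        // In the real scheme this is a write(2) to an fd that the sleeping
        // scheduler's epoll loop is blocked on: the expensive syscall.
    }
}

struct SchedulerPool {
    sleepers: Mutex<VecDeque<SleeperHandle>>, // the global sleeper list
}

impl SchedulerPool {
    // Hot path: whenever a task becomes runnable, consult the sleeper list.
    // If *any* scheduler is asleep, this enqueue costs a syscall.
    fn enqueue(&self, _task: Task) {
        // ... push `_task` onto the local work-stealing deque ...
        if let Some(handle) = self.sleepers.lock().unwrap().pop_front() {
            handle.wake();
        }
    }
}
```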

Right now, there are two ideas for mitigating this syscall cost.

  1. Exponential backoff when sleeping. Right now, whenever a scheduler goes to sleep, it falls back into epoll and gets woken up via write. Additionally, it puts a handle on a global sleeper list for other schedulers to grab and then wake up. Instead, a scheduler could go through some sort of exponential backoff before sleeping; only after the backoff completed would a handle be placed on the sleeper list. The goal is to avoid write syscalls: because the handles wouldn't actually be on the list, no wakeup would be issued, yet other schedulers could still successfully steal work. This backoff would need to be tunable, and it probably shouldn't happen if there are active I/O handles. (A sketch follows this list.)
  2. Schedulers should only wake up their two neighbors. Right now there is a global sleeper list through which any scheduler can wake up any other scheduler. In addition to being a point of contention, this means that if any scheduler is sleeping, every task enqueue is a syscall. On systems with lots of cores, it's fairly likely that small programs don't fill up all the cores. To mitigate this, we would arrange the pool of schedulers as a ring, where each scheduler is responsible for waking up only its left and right neighbors. With this scheme, a scheduler with lots of serial work would occasionally wake up its neighbors, but nothing more. Additionally, its neighbors would likely still be in the exponential-backoff stage before going to sleep, so there would be no syscalls at all. (See the ring sketch below.)
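
Here is a minimal sketch of idea (1), assuming hypothetical `try_steal` and `sleep_in_epoll` callbacks; the constants are placeholders for whatever tuning the investigation settles on:

```rust
use std::hint::spin_loop;

const MAX_BACKOFF_ROUNDS: u32 = 6; // tunable
const SPIN_BASE: u32 = 100;        // spin iterations in the first round; tunable

fn idle_loop(mut try_steal: impl FnMut() -> bool, mut sleep_in_epoll: impl FnMut()) {
    let mut round = 0;
    loop {
        if try_steal() {
            round = 0; // found work with no syscalls at all
            continue;
        }
        if round < MAX_BACKOFF_ROUNDS {
            // Spin exponentially longer each round while staying *off* the
            // sleeper list: enqueuing schedulers see no one to wake and skip
            // the write(2), while stealing still works during this window.
            for _ in 0..SPIN_BASE * (1 << round) {
                spin_loop();
            }
            round += 1;
        } else {
            // Backoff exhausted: publish a handle on the sleeper list and
            // block in epoll (the only place syscalls remain).
            sleep_in_epoll();
            round = 0;
        }
    }
}
```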
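
And a sketch of idea (2); `Ring` and the `wake` callback are likewise hypothetical stand-ins:

```rust
struct Ring {
    n: usize, // number of schedulers in the pool
}

impl Ring {
    // Left and right neighbors of scheduler `id` on the ring.
    fn neighbors(&self, id: usize) -> (usize, usize) {
        ((id + self.n - 1) % self.n, (id + 1) % self.n)
    }
}

// On enqueue, instead of scanning a global sleeper list, a scheduler pokes at
// most its two neighbors. A neighbor still in its backoff phase notices the
// new work by stealing, with no syscall at all.
fn notify_neighbors(ring: &Ring, id: usize, wake: impl Fn(usize)) {
    let (left, right) = ring.neighbors(id);
    wake(left);
    wake(right);
}
```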

I'm guessing that combining these two strategies will still let us distribute work to all cores quickly when there is a truly parallel workload, while also working well on many-task workloads that are mostly sequential (due to a mutex or some such). This is just a guess, though, and definitely needs investigation.

Labels

A-runtime (Area: std's runtime and "pre-main" init for handling backtraces, unwinds, stack overflows)
    I-slow (Issue: Problems and improvements with respect to performance of generated code)
    P-medium (Medium priority)
