perf: provide safepoints for pre-emptive scheduling #72

lithdew · 2021-07-02T11:00:48Z

I'd like to raise this as more of a discussion, though it seems like there may be a way to mark safepoints in assembly code that the Go scheduler may pre-empt on.

@aclements (comment): ... with some extra annotations in the assembly to indicate registers containing pointers it will become preemptible without any extra work or run-time overhead to reach an explicit safe-point.

https://go.googlesource.com/proposal/+/master/design/24543/safe-points-everywhere.md: By default, the runtime cannot safely preempt assembly code since it won‘t know what registers contain pointers. As a follow-on to the work on safe-points everywhere, we should audit assembly in the standard library for non-preemptible loops and annotate them with register maps. In most cases this should be trivial since most assembly never constructs a pointer that isn’t shadowed by an argument, so it can simply claim there are no pointers in registers. We should also document in the Go assembly guide how to do this for user code.

Given that crypto code in general is CPU-bound, and that there isn't an option to i.e. offload the execution of such code into a separate thread pool away from I/O-bound tasks, providing safepoints would improve the overall latency of all tasks in Go programs that rely on crypto.

Where this would prove valuable is that I've been working on a Go program that makes heavy use of network I/O and ed25519 batch verification - the biggest bottleneck of the program right now is that noticeably, almost all of the goroutines in the program that are I/O-bound are being starved and driven nearly to a complete halt whenever there is any sort of crypto code being executed in a separate goroutine. I noticed this behavior by recording traces of the program and opening them up in Go's trace viewer.

The Zig/C equivalent of the same program, which makes use of dedicated thread pools for I/O-bound code and CPU-bound code, has significantly lower latencies and scheduler contention in comparison.

Now, I'd like this to be more of a discussion because I have two open questions:

The possibility of marking safepoints in assembly code is only documented in the Go proposal, though I haven't seen any examples of assembly code doing it in i.e. the Go standard library so far. Is the ability to mark safepoints actually possible right now in the latest version of Go?
Are there any security implications to encouraging pre-emption for such crypto-related code?

Yawning · 2021-07-02T14:38:35Z

Oh boy.

Note: I just quickly skimmed the related issues, so my understanding of what the runtime does is surface-level at best). I assume that edwardsMultiscalarMulPippengerVartimeVector is the routine that's causing the most grief, since that is what does the heavy lifting for the batch sizes that you are doing.

The possibility of marking safepoints in assembly code is only documented in the Go proposal, though I haven't seen any examples of assembly code doing it in i.e. the Go standard library so far. Is the ability to mark safepoints actually possible right now in the latest version of Go?

As far as I can tell, this is not possible, might not be needed, and in may not help that much. From what I can tell by skimming the issues, the runtime's (new-ish) async preemption mechanism is based on periodically sending SIGURG to the process (once every 10? ms), and even prior to the introduction of the async preemption mechanism, function calls are preemption points (specifically when a stack frame is allocated).

For the sake of my implementation sanity, the underlying operations in what I assume is the problematic routine are already broken up into sub-routines, and while the multiscalar multiply does have a number of tight loops, each loop iteration already has numerous occasions for the scheduler to do the right thing (since each call to AddExtendedCached and SubExtendedCached, calls at least one assembly routine that uses a stack frame vecMul_AVX2).

One thing that you could try is to insert explicit call(s) to runtime.Gosched(). Intuitively I suspect that the following loops are good candidates to investigate, though the frequency of the explicit-yield probably needs benchmarking as to not overly-impact non-concurrent performance.

curve25519-voi/curve/scalar_mul_pippenger.go

Line 177 in fc681bf

for _, scalar := range scalars {

curve25519-voi/curve/scalar_mul_pippenger.go

Line 184 in fc681bf

for i, point := range points {

curve25519-voi/curve/scalar_mul_pippenger.go

Line 198 in fc681bf

for i := 0; i < size; i++ {

Are there any security implications to encouraging pre-emption for such crypto-related code?

Technically yes. The routines that would need additional pre-emption points are also used for things that work on secret material, and so additional copies of sensitive intermediaries will get spilled somewhere. That said, such things are outside of the threat model of this library, due it it being an extremely difficult problem to solve in general, especially with Go.

(Note: I am technically out of the office for the next week or so. Responses may be delayed, but I am interested in getting this library to work well for you.)

Yawning · 2021-07-02T14:58:21Z

Addendum (which naturally occurred to me after I turned off my laptop....)

and that there isn't an option to i.e. offload the execution of such code into a separate thread pool away from I/O-bound tasks

You could try doing just that as well like thus:

Bump up the number of OS threads the scheduler will create (runtime.GOMAXPROCS).
Spawn a pool of dedicated crypto workers, and give each of them a dedicated OS thread (runtime.LockOSThread/runtime.UnlockOSThread).

This should push the responsibility of "making the right thing happen" to being the kernel's problem rather than the Go scheduler's.

lithdew · 2021-07-02T15:16:12Z

Thanks for the quick response! I'll report how things go after inserting a few explicit yield points in the areas you suggested, and after offloading all crypto code to a dedicated goroutine pool for CPU-bound tasks.

The idea of creating a goroutine pool dedicated to executing code by making use of runtime.{Lock, Unlock}OSThread() never occurred to me 🤔.

lithdew · 2021-07-02T17:36:42Z

EDIT: Whether or not the solution below actually solves the problem of I/O-bound goroutines being starved by CPU-bound code, take it with a grain of salt. I'm going to try to run my benchmarks on a few different environments and take some time to look at the trace logs just to make sure. IIRC I heard long ago that there were certain cases/instances where runtime.LockOSThread() may not actually dedicate a single goroutine to a single thread the way we'd necessarily want it to.

I went for the latter approach of offloading the execution of all crypto code to a dedicated goroutine pool for CPU-bound tasks, and it looks all of the I/O-bound goroutines are no longer being starved 😌.

The pool is a bit makeshift (i.e. hardcoded constants, forcefully allocates GOMAXPROCS additional processes), though here's the code for it in case it may be helpful.

package rheia

import (
	"runtime"
	"sync"
	"sync/atomic"
)

const PoolWorkerPipelineSize = 1

type PoolTask struct {
	wg *sync.WaitGroup
	fn func()
}

type Pool struct {
	closed uint32
	queue  struct {
		sync.Mutex
		cond    sync.Cond
		entries []PoolTask
	}
	wg sync.WaitGroup
}

func NewPool() *Pool {
	pool := new(Pool)
	pool.queue.cond.L = &pool.queue.Mutex
	return pool
}

func (p *Pool) Close() {
	if !atomic.CompareAndSwapUint32(&p.closed, 0, 1) {
		return
	}
	p.queue.cond.Broadcast()
	p.wg.Wait()
}

func (p *Pool) Run() {
	count := runtime.GOMAXPROCS(0)
	runtime.GOMAXPROCS(count * 2)

	p.wg.Add(count)
	for i := 0; i < count; i++ {
		go func() {
			runtime.LockOSThread()
			defer runtime.UnlockOSThread()

			defer p.wg.Done()

			for {
				p.queue.Lock()
				for atomic.LoadUint32(&p.closed) != 1 && len(p.queue.entries) == 0 {
					p.queue.cond.Wait()
				}
				count, overflow := len(p.queue.entries), false
				if count > PoolWorkerPipelineSize {
					count, overflow = PoolWorkerPipelineSize, true
				}
				tasks := p.queue.entries[:count]
				p.queue.entries = p.queue.entries[count:]
				if overflow {
					p.queue.cond.Signal()
				}
				p.queue.Unlock()

				if atomic.LoadUint32(&p.closed) == 1 && len(tasks) == 0 {
					return
				}

				for _, task := range tasks {
					task.fn()
					task.wg.Done()
				}
			}
		}()
	}
}

var poolTaskPool = sync.Pool{New: func() interface{} { return PoolTask{wg: new(sync.WaitGroup)}}}

func (p *Pool) Submit(fn func()) {
	if atomic.LoadUint32(&p.closed) == 1 {
		fn()
		return
	}

	task := poolTaskPool.Get().(PoolTask)
	defer poolTaskPool.Put(task)

	task.wg.Add(1)
	task.fn = fn

	p.queue.Lock()
	p.queue.entries = append(p.queue.entries, task)
	p.queue.cond.Signal()
	p.queue.Unlock()

	task.wg.Wait()
}

lithdew · 2021-07-02T18:52:44Z

After looking at the trace logs closer and talking with some people about the issue on Discord, unfortunately, the approach of using runtime.{Lock, Unlock}OSThread() to create a goroutine pool for CPU-bound tasks does not prevent the I/O-bound goroutines from starvation.

With the main reason being this particular comment:

@ianlancetaylor (comment): Another, probably more important, issue is that the Go scheduler only permits up to GOMAXPROCS threads to be running Go code at one time. It enforces that restriction even on goroutines that are locked to threads using LockOSThread. So the scheduler will also preempt locked threads in order to let other goroutines run.

Yes. LockOSThread does not affect goroutine scheduling. I guess if people find this to be confusing we could add a sentence to the docs.

Because of this limitation in the scheduler, it is still possible that CPU-bound goroutines that are locked to OS threads can still starve I/O-bound goroutines.

Perhaps mending the crypto code to cooperatively yield to the scheduler in certain tight loops, or using cgo and spawning a dedicated thread pool, might be the only two options left on the table.

Yawning · 2021-07-02T21:27:58Z

Hmm. That's unfortunate, though I am somewhat confused since a combination of increasing GOMAXPROCS and locking the workers to a subset of them should guarantee that there always are OS level threads available to do I/O bound work (Ian Taylor's comment and the linked issue seemed to be concerned with the reverse situation, where they want CPU bound work to be prioritized, which isn't guaranteed).

The situation should be something like "there are count *2 OS threads, of which only count are allowed to execute the go routines dedicated to batch verification (but they can also execute other things)". This is all sorts of awful if the intention was to prevent I/O bound tasks from starving CPU bound ones, but I still don't see a reason why the reverse isn't possible.

Perhaps mending the crypto code to cooperatively yield to the scheduler in certain tight loops, or using cgo and spawning a dedicated thread pool, might be the only two options left on the table.

Another thing that may be worth investigation is to break up each large batch into sub-batches, with explicit yields between each sub-batch. It is marginally more book-keeping in the caller-side, but it is functionally similar to inserting yields into the batch verification routine itself, just with a coarser yield interval. The underlying algorithm(s) do favor batches being as large as possible though.

I might have more ideas if I could see how the rest of the application is put together as well.

Yawning · 2021-07-14T10:29:34Z

TLDR: Closing due to lack of required functionality in the toolchain, and the compiler should be doing the right thing anyway.

I'm going to close this as "not much I can do about it" for the following reasons:

There is no current way to annotate assembly routines for such a thing.
The assembly used in this context does not have execution time that should reach the point of having scheduling issues anyway.
The expensive vectorized assembly routines use a stack frame (so the compiler should be auto-inserting a safe point already).
The multiscalar multiply used during large batch verification has a large number of function calls (so the compiler should be auto-inserting a safe point already).

If someone comes up with a sensible location for explicit scheduler yields that does not overly hurt performance, I'm more than happy to take a PR. As a side note, I am still unconvinced that using runtime calls to limit which OS threads can do batch verifcation won't work, but at this point this is more "think about how the caller does things" than how this library behaves.

Yawning added the enhancement New feature or request label Jul 2, 2021

Yawning closed this as completed Jul 14, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

perf: provide safepoints for pre-emptive scheduling #72

perf: provide safepoints for pre-emptive scheduling #72

lithdew commented Jul 2, 2021 •

edited

Loading

Yawning commented Jul 2, 2021

Yawning commented Jul 2, 2021

lithdew commented Jul 2, 2021

lithdew commented Jul 2, 2021 •

edited

Loading

lithdew commented Jul 2, 2021 •

edited

Loading

Yawning commented Jul 2, 2021

Yawning commented Jul 14, 2021 •

edited

Loading

perf: provide safepoints for pre-emptive scheduling #72

perf: provide safepoints for pre-emptive scheduling #72

Comments

lithdew commented Jul 2, 2021 • edited Loading

Yawning commented Jul 2, 2021

Yawning commented Jul 2, 2021

lithdew commented Jul 2, 2021

lithdew commented Jul 2, 2021 • edited Loading

lithdew commented Jul 2, 2021 • edited Loading

Yawning commented Jul 2, 2021

Yawning commented Jul 14, 2021 • edited Loading

lithdew commented Jul 2, 2021 •

edited

Loading

lithdew commented Jul 2, 2021 •

edited

Loading

lithdew commented Jul 2, 2021 •

edited

Loading

Yawning commented Jul 14, 2021 •

edited

Loading