perf: provide safepoints for pre-emptive scheduling #72
Oh boy. (Note: I just quickly skimmed the related issues, so my understanding of what the runtime does is surface-level at best.) I assume that …
As far as I can tell, this is not possible, might not be needed, and may not help that much. From what I can tell by skimming the issues, the runtime's (new-ish) async preemption mechanism is based on periodically sending SIGURG to the process (once every ~10 ms?), and even prior to the introduction of the async preemption mechanism, function calls were preemption points (specifically when a stack frame is allocated). For the sake of my implementation sanity, the underlying operations in what I assume is the problematic routine are already broken up into sub-routines, and while the multiscalar multiply does have a number of tight loops, each loop iteration already has numerous occasions for the scheduler to do the right thing (since each call to …).
One thing that you could try is to insert explicit call(s) to … at these locations:
curve25519-voi/curve/scalar_mul_pippenger.go, line 177 (at fc681bf)
curve25519-voi/curve/scalar_mul_pippenger.go, line 184 (at fc681bf)
curve25519-voi/curve/scalar_mul_pippenger.go, line 198 (at fc681bf)
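The exact call being suggested did not survive extraction, but from the rest of the thread it is evidently an explicit scheduler yield. A minimal sketch of the idea, assuming `runtime.Gosched()` as the yield and using a stand-in loop (the function name and the `yieldEvery` interval are hypothetical, not from this library):

```go
package main

import (
	"fmt"
	"runtime"
)

// yieldEvery is a hypothetical tuning knob; a real interval would need to be
// benchmarked against the cost of one iteration of the tight loop.
const yieldEvery = 64

// sumWithYields stands in for a tight CPU-bound loop (e.g. one of the
// Pippenger bucket loops); it calls runtime.Gosched periodically so the
// scheduler gets a chance to run other goroutines.
func sumWithYields(nums []uint64) uint64 {
	var acc uint64
	for i, n := range nums {
		acc += n
		if i%yieldEvery == yieldEvery-1 {
			runtime.Gosched() // cooperative yield point
		}
	}
	return acc
}

func main() {
	nums := make([]uint64, 1000)
	for i := range nums {
		nums[i] = uint64(i)
	}
	fmt.Println(sumWithYields(nums)) // 499500
}
```

`runtime.Gosched` only yields the current P; it does not block, so the cost per call is small, but placing it inside the innermost loop would still hurt throughput, which is why the linked lines (between sub-phases) are the suggested spots.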
Technically yes. The routines that would need additional pre-emption points are also used for things that work on secret material, so additional copies of sensitive intermediates would get spilled somewhere. That said, such things are outside of the threat model of this library, due to it being an extremely difficult problem to solve in general, especially with Go. (Note: I am technically out of the office for the next week or so. Responses may be delayed, but I am interested in getting this library to work well for you.)
Addendum (which naturally occurred to me after I turned off my laptop...):
You could try doing just that as well, like thus:
This should push the responsibility of "making the right thing happen" to the kernel rather than the Go scheduler.
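The code block that accompanied "like thus" did not survive extraction. Based on the rest of the thread (the pool below uses `runtime.LockOSThread`, and a later comment mentions raising GOMAXPROCS and locking workers to a subset of threads), a sketch of what was likely meant — pinning the CPU-bound work to a dedicated, locked OS thread so the kernel arbitrates between it and the I/O threads; `runPinned` and the GOMAXPROCS bump are my assumptions, not the original snippet:

```go
package main

import (
	"fmt"
	"runtime"
)

// runPinned executes cpuWork on a goroutine whose OS thread is locked for the
// duration, so preemption between this thread and the threads serving
// I/O-bound goroutines becomes the kernel's job rather than the Go
// scheduler's.
func runPinned(cpuWork func()) {
	done := make(chan struct{})
	go func() {
		runtime.LockOSThread()
		defer runtime.UnlockOSThread()
		cpuWork()
		close(done)
	}()
	<-done
}

func main() {
	// Hypothetical tuning: leave headroom so I/O goroutines keep a free P
	// while the pinned thread churns.
	runtime.GOMAXPROCS(runtime.GOMAXPROCS(0) + 1)

	var acc uint64
	runPinned(func() {
		for i := uint64(0); i < 1_000_000; i++ {
			acc += i
		}
	})
	fmt.Println(acc)
}
```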
Thanks for the quick response! I'll report how things go after inserting a few explicit yield points in the areas you suggested, and after offloading all crypto code to a dedicated goroutine pool for CPU-bound tasks. The idea of creating a goroutine pool dedicated to executing such code by making use of …
EDIT: Take the solution below with a grain of salt; I'm not yet sure it actually solves the problem of I/O-bound goroutines being starved by CPU-bound code. I'm going to try to run my benchmarks on a few different environments and take some time to look at the trace logs just to make sure. IIRC I heard long ago that there were certain cases/instances where …
I went with the latter approach of offloading the execution of all crypto code to a dedicated goroutine pool for CPU-bound tasks, and it looks like all of the I/O-bound goroutines are no longer being starved 😌. The pool is a bit makeshift (i.e. hardcoded constants, forcefully allocates GOMAXPROCS additional threads), but here's the code for it in case it may be helpful.
package rheia
import (
"runtime"
"sync"
"sync/atomic"
)
const PoolWorkerPipelineSize = 1
// PoolTask pairs a function with the WaitGroup its submitter blocks on.
type PoolTask struct {
wg *sync.WaitGroup
fn func()
}
// Pool runs submitted tasks on a fixed set of worker goroutines, each locked
// to its own OS thread.
type Pool struct {
closed uint32
queue struct {
sync.Mutex
cond sync.Cond
entries []PoolTask
}
wg sync.WaitGroup
}
func NewPool() *Pool {
pool := new(Pool)
pool.queue.cond.L = &pool.queue.Mutex
return pool
}
// Close marks the pool as closed, wakes all workers, and waits for them to
// drain their remaining tasks and exit.
func (p *Pool) Close() {
if !atomic.CompareAndSwapUint32(&p.closed, 0, 1) {
return
}
// Broadcast while holding the queue lock: otherwise a worker that is between
// its closed-check and cond.Wait could miss the wakeup and sleep forever,
// deadlocking the wg.Wait below.
p.queue.Lock()
p.queue.cond.Broadcast()
p.queue.Unlock()
p.wg.Wait()
}
func (p *Pool) Run() {
// Read the worker count before doubling GOMAXPROCS: the original number of
// Ps stays free for I/O-bound goroutines while `count` workers occupy
// locked OS threads.
count := runtime.GOMAXPROCS(0)
runtime.GOMAXPROCS(count * 2)
p.wg.Add(count)
for i := 0; i < count; i++ {
go func() {
runtime.LockOSThread()
defer runtime.UnlockOSThread()
defer p.wg.Done()
for {
p.queue.Lock()
for atomic.LoadUint32(&p.closed) != 1 && len(p.queue.entries) == 0 {
p.queue.cond.Wait()
}
n, overflow := len(p.queue.entries), false
if n > PoolWorkerPipelineSize {
n, overflow = PoolWorkerPipelineSize, true
}
tasks := p.queue.entries[:n]
p.queue.entries = p.queue.entries[n:]
if overflow {
p.queue.cond.Signal()
}
p.queue.Unlock()
if atomic.LoadUint32(&p.closed) == 1 && len(tasks) == 0 {
return
}
for _, task := range tasks {
task.fn()
task.wg.Done()
}
}
}()
}
}
// poolTaskPool recycles PoolTask values (and their WaitGroups) across Submit calls.
var poolTaskPool = sync.Pool{New: func() interface{} { return PoolTask{wg: new(sync.WaitGroup)} }}
// Submit enqueues fn and blocks until a worker has executed it; if the pool
// is already closed, fn runs inline on the caller's goroutine.
func (p *Pool) Submit(fn func()) {
if atomic.LoadUint32(&p.closed) == 1 {
fn()
return
}
task := poolTaskPool.Get().(PoolTask)
defer poolTaskPool.Put(task)
task.wg.Add(1)
task.fn = fn
p.queue.Lock()
p.queue.entries = append(p.queue.entries, task)
p.queue.cond.Signal()
p.queue.Unlock()
task.wg.Wait()
}
After looking at the trace logs closer and talking with some people about the issue on Discord: unfortunately, the approach of using … does not pan out, the main reason being this particular comment:
Because of this limitation in the scheduler, it is still possible for CPU-bound goroutines that are locked to OS threads to starve I/O-bound goroutines. Perhaps amending the crypto code to cooperatively yield to the scheduler in certain tight loops, or using cgo and spawning a dedicated thread pool, might be the only two options left on the table.
Hmm. That's unfortunate, though I am somewhat confused, since a combination of increasing GOMAXPROCS and locking the workers to a subset of them should guarantee that there are always OS-level threads available to do I/O-bound work (Ian Taylor's comment and the linked issue seemed to be concerned with the reverse situation, where they want CPU-bound work to be prioritized, which isn't guaranteed). The situation should be something like "there are …"
Another thing that may be worth investigating is to break up each large batch into sub-batches, with explicit yields between each sub-batch. It is marginally more book-keeping on the caller side, but it is functionally similar to inserting yields into the batch verification routine itself, just with a coarser yield interval. The underlying algorithm(s) do favor batches being as large as possible, though. I might have more ideas if I could see how the rest of the application is put together as well.
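A sketch of that caller-side sub-batching, with a hypothetical `verify` callback standing in for a batch-verification call and the sub-batch size as the tunable trade-off (larger favors the batching algorithm, smaller favors scheduling latency):

```go
package main

import (
	"fmt"
	"runtime"
)

// verifySubBatched splits items into sub-batches of at most n and yields to
// the scheduler between sub-batches. verify stands in for a batch
// verification routine (e.g. ed25519 batch verify over the sub-batch).
func verifySubBatched(items []int, n int, verify func([]int) bool) bool {
	ok := true
	for start := 0; start < len(items); start += n {
		end := start + n
		if end > len(items) {
			end = len(items)
		}
		ok = ok && verify(items[start:end])
		runtime.Gosched() // explicit yield between sub-batches
	}
	return ok
}

func main() {
	calls := 0
	allOK := verifySubBatched(make([]int, 10), 4, func(batch []int) bool {
		calls++
		return true
	})
	fmt.Println(allOK, calls) // 10 items in sub-batches of 4 -> 3 calls
}
```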
TL;DR: Closing due to lack of required functionality in the toolchain, and because the compiler should be doing the right thing anyway. I'm going to close this as "not much I can do about it" for the following reasons:
If someone comes up with a sensible location for explicit scheduler yields that does not overly hurt performance, I'm more than happy to take a PR. As a side note, I am still unconvinced that using runtime calls to limit which OS threads can do batch verification won't work, but at this point this is more "think about how the caller does things" than how this library behaves.
I'd like to raise this as more of a discussion, though it seems like there may be a way to mark safepoints in assembly code that the Go scheduler can pre-empt on.
Given that crypto code in general is CPU-bound, and that there isn't an option to, e.g., offload the execution of such code into a separate thread pool away from I/O-bound tasks, providing safepoints would improve the overall latency of all tasks in Go programs that rely on crypto.
Where this would prove valuable: I've been working on a Go program that makes heavy use of network I/O and ed25519 batch verification. The biggest bottleneck right now is that almost all of the I/O-bound goroutines in the program are starved and driven nearly to a complete halt whenever any crypto code is executing in a separate goroutine. I noticed this behavior by recording traces of the program and opening them up in Go's trace viewer.
The Zig/C equivalent of the same program, which makes use of dedicated thread pools for I/O-bound code and CPU-bound code, has significantly lower latencies and scheduler contention in comparison.
Now, I'd like this to be more of a discussion because I have two open questions: