### Describe the bug
The `tokio::sync::Notify` used for shutdown signaling in `HttpProxy::handle_new_request()` introduces a shared `Mutex<WaiterList>` that is locked twice per request per connection. On high-core-count NUMA machines (tested on 128 cores), this becomes the dominant performance bottleneck: ~65% of off-CPU time is spent in `do_futex` waiting for this single lock, preventing CPU utilization from scaling.
The affected code is in `pingora-proxy/src/lib.rs`, lines 219–226:
```rust
let res = tokio::select! {
    biased;
    res = downstream_session.read_request() => { res }
    _ = self.shutdown.notified() => {
        return None;
    }
};
```

`self.shutdown` is a single `tokio::sync::Notify` (line 117) shared across the entire `HttpProxy` instance. Internally, `Notify` maintains a `Mutex<WaiterList>`. Every time `select!` is entered:

- `poll_notified()` locks the `Mutex` to register a waiter node.
- When `read_request()` resolves first, dropping the `Notified` future locks the `Mutex` again to remove the waiter node.
This means every HTTP request acquires and releases this global `Mutex` twice, even though shutdown almost never actually fires.
On multi-socket NUMA systems the problem is amplified: the `Mutex` cache line bounces across NUMA nodes via QPI/UPI (CAS latency jumps from ~15 ns local to ~80-150 ns cross-node), and `futex(FUTEX_WAKE)` may wake threads on remote nodes, causing persistent cache-line ping-pong.
### Pingora info

- Pingora version: commit `9a4eee3` (main branch)
- Rust version: `cargo 1.87.0 (99624be96 2025-05-06)`
- Operating system version: Linux (128-core, multi-socket NUMA)
### Steps to reproduce

- Deploy pingora as an HTTP reverse proxy on a high-core-count machine (e.g. 128 cores, multi-socket NUMA).
- Run a high-concurrency benchmark with keep-alive connections (e.g. `wrk -t128 -c10000`).
- Collect an off-CPU flame graph:

  ```
  perf record -g --call-graph dwarf -e sched:sched_switch -p <pingora_pid> -- sleep 30
  ```

- Observe that CPU utilization plateaus well below 100% despite available cores.
### Expected results
CPU utilization should scale proportionally with the number of cores. Threads should spend the vast majority of their time processing requests, not waiting on internal locks.
### Observed results

The off-CPU flame graph shows that the majority of idle time is lock contention inside `tokio::sync::Notify`:

| Call stack | Off-CPU % |
|---|---|
| `Notified::poll_notified` → `Mutex::lock` → `do_futex` | 34.88% |
| `Notified::drop` → `Mutex::lock` → `do_futex` | 30.78% |
| **Subtotal (`Notify` lock contention)** | **65.66%** |
| `__lll_lock_wait_private` (glibc malloc arena lock) | 11.71% |
| **Total lock-waiting off-CPU** | **~77%** |
128 threads contending on a single `Mutex` means that at any given moment up to 127 of them are parked in `futex_wait_queue_me()`, sleeping instead of processing requests. This is the direct cause of CPU utilization failing to scale.
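The "one holder, everyone else waiting" dynamic can be sketched with std threads (illustrative only; `contention_demo` is a hypothetical helper, and `try_lock` is used so the waiters report failure instead of actually parking in `futex_wait_queue_me()`):

```rust
use std::sync::{Arc, Mutex};
use std::thread;

// While one thread holds the lock, every other thread's acquisition
// attempt fails; under real blocking lock() they would park in futex_wait.
fn contention_demo(waiters: usize) -> (usize, bool) {
    let m = Arc::new(Mutex::new(()));
    let guard = m.lock().unwrap(); // one "winner" holds the lock
    let handles: Vec<_> = (0..waiters)
        .map(|_| {
            let m = Arc::clone(&m);
            thread::spawn(move || m.try_lock().is_err())
        })
        .collect();
    // Count how many waiters observed the lock as held.
    let blocked = handles
        .into_iter()
        .map(|h| h.join().unwrap())
        .filter(|&was_blocked| was_blocked)
        .count();
    drop(guard); // only after release can anyone else proceed
    let free_after_release = m.try_lock().is_ok();
    (blocked, free_after_release)
}

fn main() {
    // All 8 waiters find the lock held; it frees only after release.
    println!("{:?}", contention_demo(8)); // prints (8, true)
}
```

With 128 real threads and a blocking `lock()`, the same serialization shows up as the `do_futex` towers in the flame graph above.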
### Additional context

#### Proposed fix: replace `Notify` with an `AtomicBool` flag check

Since shutdown is a one-shot, one-directional signal (it transitions once from `false` to `true` and never reverts), a lock-free `AtomicBool` check is sufficient. The field `shutdown_flag: Arc<AtomicBool>` already exists (line 118) but is not used in `handle_new_request()`.
Before (current: two `Mutex` lock/unlock cycles per request):

```rust
let res = tokio::select! {
    biased;
    res = downstream_session.read_request() => { res }
    _ = self.shutdown.notified() => {
        return None;
    }
};
```

After (proposed: zero locks, one `Relaxed` load per request):

```rust
if self.shutdown_flag.load(Ordering::Relaxed) {
    return None;
}

let res = downstream_session.read_request().await;

if self.shutdown_flag.load(Ordering::Relaxed) {
    return None;
}
```

Why this works:

- A `Relaxed` load reads from the local CPU cache, so all 128 threads can check the flag simultaneously with zero contention.
- `shutdown_flag` is already set in `http_cleanup()` (line 1182) before `notify_waiters()`.
- Shutdown is not latency-critical: the current `select!` with `biased` already polls `read_request()` first, so it does not provide a strict immediate-abort guarantee either.
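A minimal std-only sketch of the one-shot flag pattern (names like `shutdown_demo` are hypothetical; in pingora the flag would be `shutdown_flag` and the setter `http_cleanup()`):

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;
use std::thread;

// One-shot, one-directional shutdown: the flag goes false -> true exactly
// once. Workers pay only a Relaxed load per loop iteration; no shared lock.
fn shutdown_demo(workers: usize) -> bool {
    let flag = Arc::new(AtomicBool::new(false));
    let handles: Vec<_> = (0..workers)
        .map(|_| {
            let flag = Arc::clone(&flag);
            thread::spawn(move || {
                // Hot path: check the flag, then "handle a request".
                while !flag.load(Ordering::Relaxed) {
                    std::hint::spin_loop();
                }
            })
        })
        .collect();
    flag.store(true, Ordering::Relaxed); // set once, never reverts
    // Every worker eventually observes the flag and exits cleanly.
    handles.into_iter().all(|h| h.join().is_ok())
}

fn main() {
    println!("{}", shutdown_demo(8)); // prints true
}
```

Because the flag only ever transitions in one direction, a worker that reads a stale `false` merely handles one more request before exiting, which is exactly the graceful-shutdown behavior discussed below.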
Trade-off: the `select!` approach can abort a blocked `read_request()` mid-flight on shutdown, while the flag approach waits for the current `read_request()` to complete. For graceful shutdown this is typically acceptable (and arguably better, since it avoids dropping in-flight requests). If immediate connection termination is desired, closing the listener or applying a read timeout is a more appropriate mechanism.
#### NUMA amplification detail
On a 4-socket NUMA system (common for 128-core configurations), the single `Mutex` cache line bounces across all 4 nodes. Average cross-node CAS latency is ~120 ns (vs. ~15 ns local), and because the waiter-list nodes are scattered across nodes, every list traversal (registering or unregistering a waiter) becomes a sequence of remote memory accesses. The result is super-linear degradation: longer lock hold times → longer queues → longer waits → a positive feedback loop.