
Notify-based shutdown in HttpProxy causes severe lock contention on multi-core / NUMA systems #844

@Masterlvng

Description

Describe the bug

The tokio::sync::Notify used for shutdown signaling in HttpProxy::handle_new_request() introduces a shared Mutex<WaiterList> that is locked twice per request per connection. On high-core-count NUMA machines (tested on 128 cores), this becomes the dominant performance bottleneck — ~65% of off-CPU time is spent in do_futex waiting for this single lock, preventing CPU utilization from scaling.

The affected code is in pingora-proxy/src/lib.rs Lines 219–226:

let res = tokio::select! {
    biased;
    res = downstream_session.read_request() => { res }
    _ = self.shutdown.notified() => {
        return None;
    }
};

self.shutdown is a single tokio::sync::Notify (line 117) shared across the entire HttpProxy instance. Internally, Notify maintains a Mutex<WaiterList>. Every time select! is entered:

  1. poll_notified() locks the Mutex to register a waiter node.
  2. When read_request() resolves first, dropping the Notified future locks the Mutex again to remove the waiter node.

This means every HTTP request acquires and releases this global Mutex twice, even though shutdown almost never actually fires.

On multi-socket NUMA systems the problem is amplified: the Mutex cache line bounces across NUMA nodes via QPI/UPI (CAS latency jumps from ~15ns local to ~80-150ns cross-node), and futex(FUTEX_WAKE) may wake threads on remote nodes, causing persistent cache line ping-pong.

Pingora info

Pingora version: commit 9a4eee3 (main branch)
Rust version: cargo 1.87.0 (99624be96 2025-05-06)
Operating system version: Linux (128-core, multi-socket NUMA)

Steps to reproduce

  1. Deploy pingora as an HTTP reverse proxy on a high-core-count machine (e.g. 128 cores, multi-socket NUMA).
  2. Run a high-concurrency benchmark with keep-alive connections (e.g. wrk -t128 -c10000).
  3. Collect an off-CPU flame graph:
    perf record -g --call-graph dwarf -e sched:sched_switch -p <pingora_pid> -- sleep 30
  4. Observe that CPU utilization plateaus well below 100% despite available cores.

Expected results

CPU utilization should scale proportionally with the number of cores. Threads should spend the vast majority of their time processing requests, not waiting on internal locks.

Observed results

Off-CPU flame graph shows the majority of idle time is lock contention inside tokio::sync::Notify:

| Call stack | Off-CPU % |
| --- | --- |
| Notified::poll_notified → Mutex::lock → do_futex | 34.88% |
| Notified::drop → Mutex::lock → do_futex | 30.78% |
| Subtotal (Notify lock contention) | 65.66% |
| __lll_lock_wait_private (glibc malloc arena lock) | 11.71% |
| Total lock-waiting off-CPU | ~77% |

128 threads contending on a single Mutex means at any given moment, up to 127 threads are parked in futex_wait_queue_me() — sleeping instead of processing requests. This is the direct cause of CPU not scaling up.

Additional context

Proposed fix: replace Notify with AtomicBool flag check

Since shutdown is a one-shot, one-directional signal (transitions once from false to true and never reverts), a lock-free AtomicBool check is sufficient. The field shutdown_flag: Arc<AtomicBool> already exists (line 118) but is not used in handle_new_request().

Before (current — 2 Mutex lock/unlock per request):

let res = tokio::select! {
    biased;
    res = downstream_session.read_request() => { res }
    _ = self.shutdown.notified() => {
        return None;
    }
};

After (proposed — zero locks, one Relaxed load per request):

if self.shutdown_flag.load(Ordering::Relaxed) {
    return None;
}
let res = downstream_session.read_request().await;
if self.shutdown_flag.load(Ordering::Relaxed) {
    return None;
}

Why this works:

  • Relaxed load reads from the local CPU cache — all 128 threads read simultaneously with zero contention.
  • shutdown_flag is already set in http_cleanup() (line 1182) before notify_waiters().
  • Shutdown is not latency-critical. The current select! with biased already polls read_request() first, so it doesn't provide strict immediate-abort guarantees either.

Trade-off: The select! approach can abort a blocked read_request() mid-flight on shutdown. The flag approach waits for the current read_request() to complete. For graceful shutdown this is typically acceptable (and arguably better — avoids dropping in-flight requests). If immediate connection termination is desired, closing the listener or applying a read timeout is a more appropriate mechanism.

NUMA amplification detail

On 4-socket NUMA (common for 128-core configs), the single Mutex cache line bounces across 4 nodes. Average cross-node CAS latency is ~120ns (vs ~15ns local), and the waiter list nodes scattered across nodes make every list traversal (register/unregister waiter) a sequence of remote memory accesses. This creates a super-linear degradation: longer lock hold time → longer queues → longer waits → positive feedback loop.
