### Describe the bug
The `tokio::sync::Notify` used for shutdown signaling in `HttpProxy::handle_new_request()` introduces a shared `Mutex<WaiterList>` that is locked twice per request per connection. On high-core-count NUMA machines (tested on 128 cores), this becomes the dominant performance bottleneck: ~65% of off-CPU time is spent in `do_futex` waiting for this single lock, preventing CPU utilization from scaling.
The affected code is in `pingora-proxy/src/lib.rs`, lines 219–226:
```rust
let res = tokio::select! {
    biased;
    res = downstream_session.read_request() => { res }
    _ = self.shutdown.notified() => {
        return None;
    }
};
```

`self.shutdown` is a single `tokio::sync::Notify` (line 117) shared across the entire `HttpProxy` instance. Internally, `Notify` maintains a `Mutex<WaiterList>`. Every time `select!` is entered:

- `poll_notified()` locks the `Mutex` to register a waiter node.
- When `read_request()` resolves first, dropping the `Notified` future locks the `Mutex` again to remove the waiter node.
This means every HTTP request acquires and releases this global `Mutex` twice, even though shutdown almost never actually fires.
On multi-socket NUMA systems the problem is amplified: the `Mutex` cache line bounces across NUMA nodes via QPI/UPI (CAS latency jumps from ~15 ns local to ~80-150 ns cross-node), and `futex(FUTEX_WAKE)` may wake threads on remote nodes, causing persistent cache-line ping-pong.
### Pingora info

- Pingora version: commit `9a4eee3` (main branch)
- Rust version: `cargo 1.87.0 (99624be96 2025-05-06)`
- Operating system version: Linux (128-core, multi-socket NUMA)
### Steps to reproduce

- Deploy pingora as an HTTP reverse proxy on a high-core-count machine (e.g. 128 cores, multi-socket NUMA).
- Run a high-concurrency benchmark with keep-alive connections (e.g. `wrk -t128 -c10000`).
- Collect an off-CPU flame graph:

  ```
  perf record -g --call-graph dwarf -e sched:sched_switch -p <pingora_pid> -- sleep 30
  ```

- Observe that CPU utilization plateaus well below 100% despite available cores.
### Expected results
CPU utilization should scale proportionally with the number of cores. Threads should spend the vast majority of their time processing requests, not waiting on internal locks.
### Observed results

The off-CPU flame graph shows that the majority of idle time is lock contention inside `tokio::sync::Notify`:

| Call stack | Off-CPU % |
|---|---|
| `Notified::poll_notified` → `Mutex::lock` → `do_futex` | 34.88% |
| `Notified::drop` → `Mutex::lock` → `do_futex` | 30.78% |
| **Subtotal (`Notify` lock contention)** | **65.66%** |
| `__lll_lock_wait_private` (glibc malloc arena lock) | 11.71% |
| **Total lock-waiting off-CPU** | **~77%** |
128 threads contending on a single `Mutex` means that at any given moment up to 127 of them are parked in `futex_wait_queue_me()`, sleeping instead of processing requests. This is the direct cause of CPU utilization failing to scale.
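The "one holder, everyone else waiting" dynamic can be sketched with std threads (illustrative only; `contention_demo` is a hypothetical helper, and `try_lock` is used so the waiters report failure instead of actually parking in `futex_wait_queue_me()`):

```rust
use std::sync::{Arc, Mutex};
use std::thread;

// While one thread holds the lock, every other thread's acquisition
// attempt fails; under real blocking lock() they would park in futex_wait.
fn contention_demo(waiters: usize) -> (usize, bool) {
    let m = Arc::new(Mutex::new(()));
    let guard = m.lock().unwrap(); // one "winner" holds the lock
    let handles: Vec<_> = (0..waiters)
        .map(|_| {
            let m = Arc::clone(&m);
            thread::spawn(move || m.try_lock().is_err())
        })
        .collect();
    // Count how many waiters observed the lock as held.
    let blocked = handles
        .into_iter()
        .map(|h| h.join().unwrap())
        .filter(|&was_blocked| was_blocked)
        .count();
    drop(guard); // only after release can anyone else proceed
    let free_after_release = m.try_lock().is_ok();
    (blocked, free_after_release)
}

fn main() {
    // All 8 waiters find the lock held; it frees only after release.
    println!("{:?}", contention_demo(8)); // prints (8, true)
}
```

With 128 real threads and a blocking `lock()`, the same serialization shows up as the `do_futex` towers in the flame graph above.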
### Additional context

#### Proposed fix: replace `Notify` with an `AtomicBool` flag check

Since shutdown is a one-shot, one-directional signal (it transitions once from `false` to `true` and never reverts), a lock-free `AtomicBool` check is sufficient. The field `shutdown_flag: Arc<AtomicBool>` already exists (line 118) but is not used in `handle_new_request()`.
Before (current: two `Mutex` lock/unlock cycles per request):

```rust
let res = tokio::select! {
    biased;
    res = downstream_session.read_request() => { res }
    _ = self.shutdown.notified() => {
        return None;
    }
};
```

After (proposed: zero locks, one `Relaxed` load per request):

```rust
if self.shutdown_flag.load(Ordering::Relaxed) {
    return None;
}

let res = downstream_session.read_request().await;

if self.shutdown_flag.load(Ordering::Relaxed) {
    return None;
}
```

Why this works:

- A `Relaxed` load reads from the local CPU cache, so all 128 threads can check the flag simultaneously with zero contention.
- `shutdown_flag` is already set in `http_cleanup()` (line 1182) before `notify_waiters()`.
- Shutdown is not latency-critical: the current `select!` with `biased` already polls `read_request()` first, so it does not provide a strict immediate-abort guarantee either.
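A minimal std-only sketch of the one-shot flag pattern (names like `shutdown_demo` are hypothetical; in pingora the flag would be `shutdown_flag` and the setter `http_cleanup()`):

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;
use std::thread;

// One-shot, one-directional shutdown: the flag goes false -> true exactly
// once. Workers pay only a Relaxed load per loop iteration; no shared lock.
fn shutdown_demo(workers: usize) -> bool {
    let flag = Arc::new(AtomicBool::new(false));
    let handles: Vec<_> = (0..workers)
        .map(|_| {
            let flag = Arc::clone(&flag);
            thread::spawn(move || {
                // Hot path: check the flag, then "handle a request".
                while !flag.load(Ordering::Relaxed) {
                    std::hint::spin_loop();
                }
            })
        })
        .collect();
    flag.store(true, Ordering::Relaxed); // set once, never reverts
    // Every worker eventually observes the flag and exits cleanly.
    handles.into_iter().all(|h| h.join().is_ok())
}

fn main() {
    println!("{}", shutdown_demo(8)); // prints true
}
```

Because the flag only ever transitions in one direction, a worker that reads a stale `false` merely handles one more request before exiting, which is exactly the graceful-shutdown behavior discussed below.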
Trade-off: the `select!` approach can abort a blocked `read_request()` mid-flight on shutdown, while the flag approach waits for the current `read_request()` to complete. For graceful shutdown this is typically acceptable (and arguably better, since it avoids dropping in-flight requests). If immediate connection termination is desired, closing the listener or applying a read timeout is a more appropriate mechanism.
#### NUMA amplification detail
On a 4-socket NUMA system (common for 128-core configurations), the single `Mutex` cache line bounces across all 4 nodes. Average cross-node CAS latency is ~120 ns (vs. ~15 ns local), and because the waiter-list nodes are scattered across nodes, every list traversal (registering or unregistering a waiter) becomes a sequence of remote memory accesses. The result is super-linear degradation: longer lock hold times → longer queues → longer waits → a positive feedback loop.