Silent FPM master death causing persistent broken pipe errors (~1-2% of requests) #2077

**Description**

We're seeing persistent `WriteFailedException: Failed to write request to socket [broken pipe]` errors at a ~1-2% error rate across all Lambda instances. The FPM **master process** is silently dying between requests — no error logged, no child exit warnings, no crash messages. Bref recovers by restarting FPM, but the current invocation returns a 500.

The error pattern on a single Lambda instance:

1. Request N completes normally (e.g., 220ms)
2. Request N+1 arrives ~6ms later
3. `WriteFailedException` immediately — the FPM socket is already dead
4. Bref restarts FPM, next request works

**It's the master dying, not the child.** We searched all CloudWatch logs for `child exited on signal` and found zero matches. To verify that this logging works, we pulled the Docker image, started FPM, and sent SIGSEGV to the child: the master correctly logged `WARNING: [pool default] child 10 exited on signal 11 (SIGSEGV)` to stderr via `--force-stderr`. Because a child crash would have been logged this way, the absence of these messages in production indicates that the master itself is dying.

Errors also spontaneously drop to near-zero (~0.01%) for 7-14 hours without any deployment, then climb back to ~1-2%. This suggests state corruption that self-heals through Lambda instance recycling.

**How to reproduce**

We haven't found a reliable reproduction — it's intermittent and only manifests in production under real traffic. It has persisted since our initial Bref migration and across PHP 8.1 and 8.4.

- **Bref**: 2.4.18 (Docker images, not layers)
- **Docker image**: `bref/php-84-fpm:2`
- **PHP**: 8.4.18
- **AWS service**: Lambda (via API Gateway HTTP API)
- **Lambda config**: 1024 MB memory, 28s timeout
- **FPM config**: Bref defaults (`pm=static`, `max_children=1`, `log_limit=8192`)
- **Extensions**: redis (Predis pure PHP client), intl, opcache, pcntl, posix, pdo_mysql
- **Framework**: Laravel 10

**What we've ruled out**

- SIGPIPE (#1854) — zero matches in CloudWatch
- JIT segfaults (#842) — `opcache.jit = disable`, verified from image
- OOM — memory well within Lambda limits
- Excimer profiler — removed from Dockerfile entirely, errors persist
- Sentry tracing/profiling — fully disabled, errors persist
- Redis/Valkey — Predis runs in child only, master never touches it
- Large responses — no correlation between response size and errors
- stderr log_limit — max log message is ~2KB, well under the 8192 limit
- Cold starts — orders of magnitude fewer than error count
- PHP version — reproduced on both 8.1 and 8.4

**Questions**

- Do you have any theories on what could cause the FPM master to die silently? We're happy to run any diagnostic steps or share additional logs/data.
- Is there any known interaction between Lambda freeze/thaw and FPM? Could the cgroup freezer leave the master in a broken state after thaw?
- We noticed the catch block in [FpmHandler.php](https://github.com/brefphp/bref/blob/master/src/FpmRuntime/FpmHandler.php#L162) (line 162) calls `$this->stop()` before capturing `proc_get_status($this->fpm)`, so the exit code and termination signal are lost. Would you accept a PR that logs the exit code and termination signal on the failure path?
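
To make the ordering point in the last question concrete, here is a minimal standalone sketch, not Bref code. It relies only on documented PHP behavior: `proc_get_status()` reports `running`, `signaled`, `termsig`, and `exitcode` for a process started with `proc_open()`, and on PHP versions before 8.3 only the first call made after the process has exited returns the real values. The `sleep` child is a stand-in for the FPM master, and `describeExit()` is a hypothetical helper name. Whether the status is still retrievable at the point where Bref's catch block runs is an assumption that would need checking against the actual handler.

```php
<?php
// Sketch only: describeExit() and the demo child are hypothetical, not Bref APIs.
// proc_open(), proc_terminate(), proc_get_status() and proc_close() are standard PHP.

/**
 * Turn a proc_get_status() array into a one-line log message.
 */
function describeExit(array $status): string
{
    if ($status['running']) {
        return sprintf('process %d is still running', $status['pid']);
    }
    if ($status['signaled']) {
        return sprintf('process %d was killed by signal %d', $status['pid'], $status['termsig']);
    }
    return sprintf('process %d exited with code %d', $status['pid'], $status['exitcode']);
}

// Stand-in for the FPM master: a long-running child started without a shell
// (array command form, PHP 7.4+), so the reported PID is the child itself.
$process = proc_open(['sleep', '30'], [], $pipes);
if ($process === false) {
    throw new RuntimeException('could not start the demo process');
}

// Simulate a silent death by sending SIGKILL (signal 9).
proc_terminate($process, 9);

// Poll until the death is observed, with a 5 second safety limit.
// The loop keeps the FIRST status that reports running=false, because
// later calls may no longer carry the real exit information on PHP < 8.3.
$deadline = microtime(true) + 5.0;
do {
    usleep(10000);
    $status = proc_get_status($process);
} while ($status['running'] && microtime(true) < $deadline);

// Log before any cleanup, which is the ordering proposed in the question.
$message = describeExit($status);
error_log($message);

// Cleanup comes last; after this the status can no longer be queried.
proc_close($process);
```

In the handler, the equivalent change would be to read and log the status before the call to `$this->stop()`, so that a signal such as SIGKILL or SIGSEGV on the master shows up in CloudWatch next to the `WriteFailedException`.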

