Silent FPM master death causing persistent broken pipe errors (~1-2% of requests) #2077

@sid-stripe

Description

We're seeing persistent WriteFailedException: Failed to write request to socket [broken pipe] errors at a ~1-2% error rate across all Lambda instances. The FPM master process is silently dying between requests — no error logged, no child exit warnings, no crash messages. Bref recovers by restarting FPM, but the current invocation returns a 500.

The error pattern on a single Lambda instance:

  1. Request N completes normally (e.g., 220ms)
  2. Request N+1 arrives ~6ms later
  3. WriteFailedException immediately — the FPM socket is already dead
  4. Bref restarts FPM, next request works

We believe it's the master dying, not the child. We searched all CloudWatch logs for "child exited on signal" and found zero matches. We verified that this logging works by pulling the Docker image, starting FPM, and sending SIGSEGV to the child: the master correctly logged "WARNING: [pool default] child 10 exited on signal 11 (SIGSEGV)" to stderr via --force-stderr. Since a child crash would be logged by a living master, the absence of these messages in production strongly suggests that the master itself is dying.

Errors also spontaneously drop to near-zero (~0.01%) for 7-14 hours without any deployment, then climb back to ~1-2%. This suggests state corruption that self-heals through Lambda instance recycling.

How to reproduce

We haven't found a reliable reproduction — it's intermittent and only manifests in production under real traffic. It has persisted since our initial Bref migration and across PHP 8.1 and 8.4.

  • Bref: 2.4.18 (Docker images, not layers)
  • Docker image: bref/php-84-fpm:2
  • PHP: 8.4.18
  • AWS service: Lambda (via API Gateway HTTP API)
  • Lambda config: 1024 MB memory, 28s timeout
  • FPM config: Bref defaults (pm=static, max_children=1, log_limit=8192)
  • Extensions: redis (Predis pure PHP client), intl, opcache, pcntl, posix, pdo_mysql
  • Framework: Laravel 10

What we've ruled out

  • SIGPIPE (see #1854, "PHP-FPM children die with SIGPIPE since 8.3.10"): zero matches in CloudWatch
  • JIT segfaults (see #842, "FastCgiCommunicationFailed on every second request"): opcache.jit = disable, verified from image
  • OOM: memory well within Lambda limits
  • Excimer profiler: removed from Dockerfile entirely, errors persist
  • Sentry tracing/profiling: fully disabled, errors persist
  • Redis/Valkey: Predis runs in child only, master never touches it
  • Large responses: no correlation between response size and errors
  • stderr log_limit: max log message is ~2KB, well under the 8192 limit
  • Cold starts: orders of magnitude fewer than error count
  • PHP version: reproduced on both 8.1 and 8.4

Questions

  • Do you have any theories on what could cause the FPM master to die silently? We're happy to run any
    diagnostic steps or share additional logs/data.
  • Is there any known interaction between Lambda freeze/thaw and FPM? Could the cgroup freezer leave the master in a broken state after thaw?
  • We noticed the catch block in FpmHandler.php (line 162) calls $this->stop() before capturing
    proc_get_status($this->fpm), so the exit code and termination signal are lost. Would you accept a PR adding
    this logging on the failure path?
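
To make the last question concrete, here is a minimal sketch of the ordering we have in mind: read proc_get_status() first, then clean up. This is not Bref's code. describeExit() is a hypothetical helper, and the self-killing shell is only a stand-in for an FPM master that dies on its own (signal 9 is chosen purely for illustration; we do not know what actually terminates the master).

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper, not part of Bref: turn a proc_get_status() array
 * into a one-line description of how the process ended.
 */
function describeExit(array $status): string
{
    if ($status['running']) {
        return 'still running';
    }
    if ($status['signaled']) {
        return sprintf('killed by signal %d', $status['termsig']);
    }
    return sprintf('exited with code %d', $status['exitcode']);
}

// Stand-in for an FPM master that dies on its own: a shell that sends
// SIGKILL (signal 9) to itself. No descriptors are redirected.
$process = proc_open(['sh', '-c', 'kill -9 $$'], [], $pipes);
if (!is_resource($process)) {
    throw new RuntimeException('proc_open() failed');
}

// Poll until the child is gone (5 second cap), keeping the status array from
// the call that first reports running=false. Before PHP 8.3 only that first
// call returns the real exit code, so it must not be thrown away.
$deadline = microtime(true) + 5.0;
do {
    $status = proc_get_status($process);
    if ($status['running']) {
        usleep(10000);
    }
} while ($status['running'] && microtime(true) < $deadline);

// Capture the summary first, and only then release the process handle.
$summary = describeExit($status);
proc_close($process);

fwrite(STDERR, "process ended: {$summary}\n");
```

In FpmHandler the equivalent change would be to take the status of $this->fpm at the top of the catch block, log it, and call $this->stop() afterwards; the exact log format is open for discussion.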
