Description
We're seeing persistent `WriteFailedException: Failed to write request to socket [broken pipe]` errors at a ~1-2% error rate across all Lambda instances. The FPM master process is silently dying between requests — no error logged, no child-exit warnings, no crash messages. Bref recovers by restarting FPM, but the current invocation returns a 500.
The error pattern on a single Lambda instance:
- Request N completes normally (e.g., 220ms)
- Request N+1 arrives ~6ms later and fails immediately with `WriteFailedException` — the FPM socket is already dead
- Bref restarts FPM, and the next request works
It's the master dying, not the child. We searched all CloudWatch logs for `child exited on signal` — zero matches. We verified this logging works by pulling the Docker image, starting FPM, and sending SIGSEGV to the child: the master correctly logged `WARNING: [pool default] child 10 exited on signal 11 (SIGSEGV)` to stderr via `--force-stderr`. The absence of these messages in production confirms the master itself is dying.
Errors also spontaneously drop to near-zero (~0.01%) for 7-14 hours without any deployment, then climb back to ~1-2%. This suggests state corruption that self-heals through Lambda instance recycling.
How to reproduce
We haven't found a reliable reproduction — it's intermittent and only manifests in production under real traffic. It has persisted since our initial Bref migration and across PHP 8.1 and 8.4.
- Bref: 2.4.18 (Docker images, not layers)
- Docker image: `bref/php-84-fpm:2`
- PHP: 8.4.18
- AWS service: Lambda (via API Gateway HTTP API)
- Lambda config: 1024 MB memory, 28s timeout
- FPM config: Bref defaults (`pm=static`, `max_children=1`, `log_limit=8192`)
- Extensions: redis (Predis pure-PHP client), intl, opcache, pcntl, posix, pdo_mysql
- Framework: Laravel 10
What we've ruled out
- SIGPIPE (PHP-FPM children die with SIGPIPE since 8.3.10 #1854) — zero matches in CloudWatch
- JIT segfaults (FastCgiCommunicationFailed on every second request #842) — `opcache.jit = disable`, verified from the image
- OOM — memory well within Lambda limits
- Excimer profiler — removed from Dockerfile entirely, errors persist
- Sentry tracing/profiling — fully disabled, errors persist
- Redis/Valkey — Predis runs in child only, master never touches it
- Large responses — no correlation between response size and errors
- stderr `log_limit` — the largest log message is ~2 KB, well under the 8192-byte limit
- Cold starts — orders of magnitude fewer than error count
- PHP version — reproduced on both 8.1 and 8.4
Questions
- Do you have any theories on what could cause the FPM master to die silently? We're happy to run any diagnostic steps or share additional logs/data.
- Is there any known interaction between Lambda freeze/thaw and FPM? Could the cgroup freezer leave the master in a broken state after thaw?
- We noticed the catch block in FpmHandler.php (line 162) calls `$this->stop()` before capturing `proc_get_status($this->fpm)`, so the exit code and termination signal are lost by the time anything could log them. Would you accept a PR that captures and logs them on the failure path?
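To illustrate the point (this is a standalone sketch, not Bref code — process name and timings are arbitrary): `proc_get_status()` still reports the termination signal as long as it is called before the handle is reaped, which is why we'd like to capture it before `stop()` runs.

```php
<?php
// Standalone demo: proc_get_status() preserves the termination signal
// if queried before the process handle is closed/reaped.
$proc = proc_open(['sleep', '30'], [], $pipes);
$pid  = proc_get_status($proc)['pid'];

posix_kill($pid, SIGSEGV);   // simulate a master dying on a signal
usleep(100_000);             // give the kernel a moment to deliver it

$status = proc_get_status($proc);
printf(
    "running=%s signaled=%s termsig=%d\n",
    $status['running'] ? 'yes' : 'no',
    $status['signaled'] ? 'yes' : 'no',
    $status['termsig'],   // 11 (SIGSEGV) in this demo
);

proc_close($proc);
```

With equivalent logging in the `catch` block, the production errors would at least tell us whether the master exited cleanly or was killed by a signal, and which one.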