Description
Affects ~25-35% of exec calls for fast-completing commands.
Summary
The exec WebSocket endpoint has a race condition where fast-completing commands (those finishing in under ~200ms) produce zero output. The server starts the exec process before the WebSocket-to-process stdio relay is fully established. When the process exits before the relay is ready, the server closes the WebSocket with code 1000 (normal close) and an empty reason, having sent zero data messages. The client receives no stdout, no stderr, and no EXIT stream message.
Steps to Reproduce
- Create or acquire a sprite (e.g., `pool-agents-000`).
- Open a WebSocket connection to `wss://api.sprites.dev/v1/sprites/{name}/exec?cmd=bash&args=-c&args=echo%20ok`.
- Observe that the connection succeeds.
- Wait for messages and the close frame.
- Repeat 40+ times to observe the failure rate.
Minimal reproduction command:
echo ok
Failure rate: Approximately 30% of calls for echo ok (n=40).
Note: Any trivially fast command reproduces the issue. No special client configuration, concurrency, or call cadence is required.
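For anyone who wants to reproduce the measurement, here is a minimal sketch of a harness. It assumes the `websockets` asyncio client (v13 or later) and that the caller supplies whatever authentication headers the API requires; the helper names `exec_url`, `count_messages`, and `failure_rate` are made up for this example and are not part of sprites-py.

```python
import asyncio
from urllib.parse import quote, urlencode

API_HOST = "api.sprites.dev"  # host taken from this report


def exec_url(sprite: str, argv: list[str]) -> str:
    """Build the exec URL: argv[0] becomes cmd, the rest become repeated args."""
    params = [("cmd", argv[0])] + [("args", a) for a in argv[1:]]
    query = urlencode(params, quote_via=quote)  # encode spaces as %20, not +
    return f"wss://{API_HOST}/v1/sprites/{quote(sprite, safe='')}/exec?{query}"


async def count_messages(url: str, headers: dict[str, str]) -> int:
    """Open one exec connection and count messages received before the close."""
    # Imported here so the URL helper works without the library installed.
    from websockets.asyncio.client import connect

    received = 0
    async with connect(url, additional_headers=headers) as ws:
        async for _message in ws:  # iteration ends on a normal (1000) close
            received += 1
    return received


async def failure_rate(url: str, headers: dict[str, str], calls: int = 40) -> float:
    """Fraction of sequential calls that delivered zero messages."""
    silent = 0
    for _ in range(calls):
        if await count_messages(url, headers) == 0:
            silent += 1
    return silent / calls
```

Calling `asyncio.run(failure_rate(exec_url("pool-agents-000", ["bash", "-c", "echo ok"]), headers))` with valid headers should give a figure comparable to the one above, if the harness matches how the original numbers were collected.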
Expected Behavior
Every successful WebSocket connection to the exec endpoint should deliver:
- One or more binary data messages on the stdout stream (stream_id=1) containing the command's standard output.
- Optionally, binary data messages on the stderr stream (stream_id=2).
- A final EXIT stream message (stream_id=3) containing the process exit code.
- WebSocket close.
For echo ok, the expected output is exactly 2 binary messages: one stdout message containing ok\n, and one EXIT message with exit code 0.
Actual Behavior
In approximately 30% of calls:
- The WebSocket connection succeeds (handshake completes normally).
- The server sends zero binary data messages -- no stdout, no stderr, no EXIT stream message.
- The server closes the WebSocket with close code 1000 (normal closure) and an empty reason string.
- Total elapsed time is ~0.3-0.5 seconds (same as successful calls), confirming the WebSocket handshake always succeeds.
From the client's perspective, the command appears to have never executed: exit code defaults to a failure value, and both stdout and stderr are empty.
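A small sketch of how a client could tell this failure apart from a normal result. It works on frames that have already been decoded into `(stream_id, payload)` pairs, because the wire framing is not described in this report; `classify_call` and its labels are hypothetical names.

```python
from typing import Iterable, Tuple

STDOUT, STDERR, EXIT = 1, 2, 3  # stream ids named in this report


def classify_call(frames: Iterable[Tuple[int, bytes]], close_code: int) -> str:
    """Label one exec call from its decoded (stream_id, payload) frames."""
    frames = list(frames)
    if not frames and close_code == 1000:
        return "silent-close"  # the failure described in this report
    if any(stream_id == EXIT for stream_id, _ in frames):
        return "complete"
    return "missing-exit"  # some data arrived but no EXIT frame
```

Treating a missing EXIT frame as "outcome unknown" rather than as a failed command avoids reporting a bogus exit code to callers.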
Root Cause Analysis
The failure is a server-side race condition in the relay setup between the WebSocket and the exec process's stdio:
- Client opens WebSocket -- succeeds (always).
- Server starts the exec process on the VM.
- Server begins setting up a relay between the WebSocket and the process stdio.
- Race condition: For fast commands (e.g., `echo ok` completes in <1ms), the process exits BEFORE the relay is fully established.
- Server sees the process has exited and closes the WebSocket with code 1000 and an empty reason.
- Client receives zero data messages.
The critical issue is that the server does not buffer process output during relay setup, so any output produced before the relay is ready is lost.
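The hypothesized race can be illustrated with a toy model. This is not the server's code, which we have not seen. It replays two events in time order (process output and relay attachment) and shows what a client would receive with and without buffering. All class and function names are invented for the illustration, and the 150ms relay setup time in the usage note is only a value inside the estimated range.

```python
class UnbufferedRelay:
    """Toy relay: anything written before attach() is dropped."""

    def __init__(self):
        self.attached = False
        self.forwarded = []

    def attach(self):
        self.attached = True

    def write(self, chunk: bytes):
        if self.attached:
            self.forwarded.append(chunk)


class BufferedRelay(UnbufferedRelay):
    """Toy relay that holds early output and flushes it on attach()."""

    def __init__(self):
        super().__init__()
        self.pending = []

    def attach(self):
        super().attach()
        self.forwarded.extend(self.pending)
        self.pending.clear()

    def write(self, chunk: bytes):
        if self.attached:
            self.forwarded.append(chunk)
        else:
            self.pending.append(chunk)


def run(relay, process_ms: float, relay_ready_ms: float) -> list:
    """Replay the two events in time order and return what the client sees."""
    events = sorted([(process_ms, "output"), (relay_ready_ms, "attach")])
    for _, kind in events:
        if kind == "attach":
            relay.attach()
        else:
            relay.write(b"ok\n")
    return relay.forwarded
```

In this model, `run(UnbufferedRelay(), 1, 150)` returns an empty list (the fast command loses its output), while `run(UnbufferedRelay(), 200, 150)` and `run(BufferedRelay(), 1, 150)` both deliver `b"ok\n"`.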
Evidence
1. Sleep prefix test
Adding a sleep prefix before the command progressively eliminates the failure. A 200ms sleep is sufficient to give the relay time to establish.
| Command | Failure Rate | Sample Size |
|---|---|---|
| `echo ok` | 30.0% | 40 |
| `sleep 0.05 && echo ok` | 15.0% | 40 |
| `sleep 0.1 && echo ok` | 2.5% | 40 |
| `sleep 0.2 && echo ok` | 0.0% | 40 |
| `sleep 0.5 && echo ok` | 0.0% | 40 |
This strongly indicates the server-side relay requires approximately 100-200ms to fully establish after the WebSocket connection is accepted.
2. Payload size test
Payloads up to 10KB fail at roughly the baseline rate; only the largest command, which takes noticeably longer to process, exhibits a lower failure rate:
| Payload | Failure Rate | Sample Size |
|---|---|---|
| Trivial (`echo ok`) | 26.7% | 30 |
| 1KB (`echo AAA...`) | 26.7% | 30 |
| 10KB | 33.3% | 30 |
| 50KB | 3.3% | 30 |
The 50KB payload takes measurably longer for bash to parse, giving the relay time to set up.
3. Deep WebSocket instrumentation
We instrumented the WebSocket I/O loop at the client side (50 calls, 21 failures):
- All 21 failures: `ConnectionClosed` exception with code=1000, reason='', and exactly 0 binary messages received before the close frame. No EXIT stream message was delivered.
- All 29 successes: Exactly 2 binary messages received (1 stdout message + 1 EXIT message with exit code 0).
- 0 exceptions from WebSocket connect/handshake -- the connection always establishes successfully.
4. Concurrency and rate limiting ruled out
- Inter-call delay has no effect: rapid-fire calls (32% failure) vs. 200ms-spaced calls (44% failure) show no improvement with spacing.
- Concurrent calls (batches of 5) produce the same failure rate as sequential calls (36.7% vs. 30.0%, within noise for n=30).
- The failure is per-call and independent, not related to client-side concurrency or rate limiting.
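As a rough check on the "within noise" claim, here is a sketch of a two-proportion z-score. It assumes both the concurrent and the sequential runs used n=30, so 36.7% corresponds to 11 failures and 30.0% to 9; if the sequential figure came from a different sample size, the inputs need adjusting.

```python
import math


def two_proportion_z(fail_a: int, n_a: int, fail_b: int, n_b: int) -> float:
    """z-score for the difference between two failure rates (unpooled SE)."""
    p_a, p_b = fail_a / n_a, fail_b / n_b
    se = math.sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    return (p_a - p_b) / se


# Assumed counts: 11/30 concurrent failures vs. 9/30 sequential failures.
z = two_proportion_z(11, 30, 9, 30)
print(f"z = {z:.2f}")  # |z| below about 1.96 is not significant at the 5% level
```

With samples this small, differences of several percentage points between runs are expected even if the underlying failure rate is identical.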
Client Environment
- SDK: sprites-py (Python)
- WebSocket library: websockets v16.0
- Python: 3.10.12
- Client OS: macOS
- API endpoint: `https://api.sprites.dev`
Suggested Fix
The server should ensure the WebSocket-to-process relay is fully established before the exec process begins executing, or buffer the process's stdout/stderr output until the relay is ready to forward it. Specifically:
- Option A (preferred): Set up the relay pipes/channels first, then start the process. This ensures no output can be produced before the relay is ready to capture it.
- Option B: Buffer all process output server-side and only begin forwarding once the relay is confirmed ready. Flush the buffer through the relay before allowing the WebSocket to close.
- Option C: If the process exits before the relay is ready, read the process's buffered stdout/stderr and send it through the WebSocket before closing.
In all cases, the server should always send the EXIT stream message with the process's actual exit code before closing the WebSocket, regardless of how quickly the process completes.
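To make Option A concrete, here is a minimal sketch in Python using `asyncio` subprocess pipes. The server's actual language and architecture are unknown to us, so this only illustrates the ordering: pipes exist before the process starts, both streams are drained to EOF, and EXIT is sent last. The `send` callback stands in for the WebSocket writer, and encoding the exit code as decimal text is an assumption made for the example.

```python
import asyncio
from typing import Awaitable, Callable

STDOUT, STDERR, EXIT = 1, 2, 3  # stream ids named in this report
Send = Callable[[int, bytes], Awaitable[None]]  # hypothetical WebSocket sender


async def pump(reader: asyncio.StreamReader, stream_id: int, send: Send) -> None:
    """Forward one pipe until EOF; the pipe retains output written before we read."""
    while chunk := await reader.read(65536):
        await send(stream_id, chunk)


async def run_exec(argv: list[str], send: Send) -> int:
    """Start a process with pipes already in place, drain them, then send EXIT."""
    proc = await asyncio.create_subprocess_exec(
        *argv, stdout=asyncio.subprocess.PIPE, stderr=asyncio.subprocess.PIPE
    )
    # Drain both streams to EOF before reporting the exit code.
    await asyncio.gather(
        pump(proc.stdout, STDOUT, send), pump(proc.stderr, STDERR, send)
    )
    code = await proc.wait()
    # Assumption: exit code sent as decimal text; the real wire format may differ.
    await send(EXIT, str(code).encode())
    return code
```

Because the pipes are created as part of spawning the process, output from a command that exits immediately stays in the pipe until it is read, so process speed no longer matters.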
Current Client-Side Workaround
We are working around this issue by:
- Prefixing all exec commands with `sleep 0.2 &&` to give the relay time to establish.
- Maintaining retry logic (3 attempts with 0.5s delay between retries) as a safety net.
This adds latency and complexity to every exec call and does not address the root cause.
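For reference, the workaround looks roughly like the sketch below. `ExecResult`, `ExecFn`, and `exec_with_workaround` are illustrative names rather than sprites-py APIs; the `run` argument is whatever coroutine performs a single exec call and returns `exit_code=None` when no EXIT frame arrived.

```python
import asyncio
from typing import Awaitable, Callable, NamedTuple, Optional


class ExecResult(NamedTuple):
    exit_code: Optional[int]  # None when no EXIT frame arrived
    stdout: bytes
    stderr: bytes


ExecFn = Callable[[str], Awaitable[ExecResult]]  # hypothetical single exec call


async def exec_with_workaround(
    run: ExecFn, command: str, attempts: int = 3, delay_s: float = 0.5
) -> ExecResult:
    """Prefix the command with a sleep and retry when no EXIT frame arrives."""
    padded = f"sleep 0.2 && {command}"
    result = await run(padded)
    for _ in range(attempts - 1):
        if result.exit_code is not None:
            break
        await asyncio.sleep(delay_s)
        result = await run(padded)
    return result
```

If the root cause analysis above is right, the command does run on the server even when its output is lost, so a retry executes it again; that makes this approach safe only for idempotent commands.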