apparent race condition for fast finishing commands #16

@netguy204

Affects ~25-35% of exec calls for fast-completing commands.

Summary

The exec WebSocket endpoint has a race condition where fast-completing commands (those finishing in under ~200ms) produce zero output. The server starts the exec process before the WebSocket-to-process stdio relay is fully established. When the process exits before the relay is ready, the server closes the WebSocket with code 1000 (normal close) and an empty reason, having sent zero data messages. The client receives no stdout, no stderr, and no EXIT stream message.

Steps to Reproduce

  1. Create or acquire a sprite (e.g., pool-agents-000).
  2. Open a WebSocket connection to wss://api.sprites.dev/v1/sprites/{name}/exec?cmd=bash&args=-c&args=echo%20ok.
  3. Observe that the connection succeeds.
  4. Wait for messages and the close frame.
  5. Repeat 40+ times to observe the failure rate.

Minimal reproduction command:

echo ok

Failure rate: Approximately 30% of calls for echo ok (n=40).

Note: Any trivially fast command reproduces the issue. No special client configuration, concurrency, or call cadence is required.
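
The steps above can be scripted. The following is a minimal sketch, not the harness used for the measurements in this report. exec_url rebuilds the URL from step 2, and failure_rate takes a caller-supplied run_once callable (a hypothetical name) that performs one exec call and returns the binary messages received before the close frame. Authentication and the WebSocket client itself are left to the caller.

```python
from typing import Callable, List
from urllib.parse import quote, urlencode

API_WS_BASE = "wss://api.sprites.dev/v1/sprites"  # host and path taken from step 2


def exec_url(sprite_name: str, cmd: str, args: List[str]) -> str:
    """Build the exec WebSocket URL, repeating `args` once per argument."""
    query = [("cmd", cmd)] + [("args", a) for a in args]
    # quote_via=quote encodes spaces as %20 (urlencode's default would use '+').
    encoded = urlencode(query, quote_via=quote)
    return f"{API_WS_BASE}/{quote(sprite_name, safe='')}/exec?{encoded}"


def failure_rate(run_once: Callable[[str], List[bytes]], url: str, n: int = 40) -> float:
    """Call `run_once` n times; a call fails when it yields zero binary messages."""
    failures = sum(1 for _ in range(n) if len(run_once(url)) == 0)
    return failures / n
```

Keeping run_once injectable means the counting logic can be checked offline with a fake before pointing it at the real endpoint.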

Expected Behavior

Every successful WebSocket connection to the exec endpoint should deliver:

  1. One or more binary data messages on the stdout stream (stream_id=1) containing the command's standard output.
  2. Optionally, binary data messages on the stderr stream (stream_id=2).
  3. A final EXIT stream message (stream_id=3) containing the process exit code.
  4. WebSocket close.

For echo ok, the expected output is exactly 2 binary messages: one stdout message containing ok\n, and one EXIT message with exit code 0.

Actual Behavior

In approximately 30% of calls:

  • The WebSocket connection succeeds (handshake completes normally).
  • The server sends zero binary data messages -- no stdout, no stderr, no EXIT stream message.
  • The server closes the WebSocket with close code 1000 (normal closure) and an empty reason string.
  • Total elapsed time is ~0.3-0.5 seconds (same as successful calls), which indicates that failed calls are not connection failures or timeouts.

From the client's perspective, the command appears to have never executed: exit code defaults to a failure value, and both stdout and stderr are empty.
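
A client can tell this failure apart from a genuine empty result by checking for the signature described above. The sketch below assumes the client has already decoded each binary message into a stream ID and a payload. The Frame type and both function names are hypothetical, and the wire encoding of the stream ID and exit code is not covered by this report, so it is deliberately left out.

```python
from dataclasses import dataclass
from typing import List

STDOUT, STDERR, EXIT = 1, 2, 3  # stream IDs as listed under Expected Behavior


@dataclass
class Frame:
    stream_id: int
    payload: bytes  # payload encoding is not specified in this report


def is_silent_close(frames: List[Frame], close_code: int, close_reason: str) -> bool:
    """True for the failure signature: no frames, then a normal close with no reason."""
    return not frames and close_code == 1000 and close_reason == ""


def is_complete(frames: List[Frame]) -> bool:
    """True when the last frame is on the EXIT stream, i.e. an exit code was delivered."""
    return bool(frames) and frames[-1].stream_id == EXIT
```

Treating a missing EXIT frame as "unknown" rather than as a failed command avoids reporting a bogus exit code to the caller.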

Root Cause Analysis

The failure is a server-side race condition in the relay setup between the WebSocket and the exec process's stdio:

  1. Client opens WebSocket -- succeeds (always).
  2. Server starts the exec process on the VM.
  3. Server begins setting up a relay between the WebSocket and the process stdio.
  4. Race condition: For fast commands (e.g., echo ok completes in <1ms), the process exits BEFORE the relay is fully established.
  5. Server sees the process has exited and closes the WebSocket with code 1000 and an empty reason.
  6. Client receives zero data messages.

The critical issue is that the server does not buffer process output during relay setup, so any output produced before the relay is ready is lost.
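
The ordering problem can be illustrated with a toy model. This is not the server's code, which we have not seen; ToyRelay and run are hypothetical names, and the model only captures the one property that matters here: whether output produced before the relay is attached is kept or dropped.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ToyRelay:
    """Toy model of the relay; not the server's actual implementation."""
    buffer_until_attached: bool
    attached: bool = False
    pending: List[bytes] = field(default_factory=list)
    delivered: List[bytes] = field(default_factory=list)

    def on_output(self, chunk: bytes) -> None:
        if self.attached:
            self.delivered.append(chunk)
        elif self.buffer_until_attached:
            self.pending.append(chunk)  # held until attach() flushes it
        # otherwise the chunk is dropped, which models the reported bug

    def attach(self) -> None:
        self.attached = True
        self.delivered.extend(self.pending)
        self.pending.clear()


def run(events: List[str], buffer_until_attached: bool) -> List[bytes]:
    """Replay an ordered list of events and return what the client would receive."""
    relay = ToyRelay(buffer_until_attached)
    for event in events:
        if event == "relay_ready":
            relay.attach()
        elif event == "process_output":
            relay.on_output(b"ok\n")
    return relay.delivered
```

With buffering off, the order process_output then relay_ready delivers nothing, matching the observed failure; with buffering on, the same order still delivers the output.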

Evidence

1. Sleep prefix test

Adding a sleep prefix before the command progressively eliminates the failure. A 200ms sleep is sufficient to give the relay time to establish.

Command                  Failure Rate   Sample Size
echo ok                  30.0%          40
sleep 0.05 && echo ok    15.0%          40
sleep 0.1 && echo ok     2.5%           40
sleep 0.2 && echo ok     0.0%           40
sleep 0.5 && echo ok     0.0%           40

This strongly indicates the server-side relay requires approximately 100-200ms to fully establish after the WebSocket connection is accepted.

2. Payload size test

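Payload size has little effect until the command is large enough to slow bash down: payloads up to 10KB fail at about the same rate as the trivial command, and only the 50KB payload fails markedly less often: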

Payload              Failure Rate   Sample Size
Trivial (echo ok)    26.7%          30
1KB (echo AAA...)    26.7%          30
10KB                 33.3%          30
50KB                 3.3%           30

The 50KB payload takes measurably longer for bash to parse, giving the relay time to set up.

3. Deep WebSocket instrumentation

We instrumented the WebSocket I/O loop on the client side (50 calls, 21 failures):

  • All 21 failures: ConnectionClosed exception with code=1000, reason='', and exactly 0 binary messages received before the close frame. No EXIT stream message was delivered.
  • All 29 successes: Exactly 2 binary messages received (1 stdout message + 1 EXIT message with exit code 0).
  • 0 exceptions from WebSocket connect/handshake -- the connection always establishes successfully.

4. Concurrency and rate limiting ruled out

  • Inter-call delay has no effect: rapid-fire calls (32% failure) vs. 200ms-spaced calls (44% failure) show no improvement with spacing.
  • Concurrent calls (batches of 5) produce the same failure rate as sequential calls (36.7% vs. 30.0%, within noise for n=30).
  • The failure is per-call and independent, not related to client-side concurrency or rate limiting.
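
The "within noise" judgment above can be checked with a standard Wilson score interval. The sketch below is illustrative only: the failure counts 11 and 9 are back-computed from 36.7% and 30.0% on the assumption that both runs used n=30, which the report does not state explicitly for the sequential run.

```python
import math
from typing import Tuple


def wilson_interval(failures: int, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%)."""
    p = failures / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return max(0.0, center - half), min(1.0, center + half)


def overlaps(a: Tuple[float, float], b: Tuple[float, float]) -> bool:
    """True when two closed intervals share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]


concurrent = wilson_interval(11, 30)  # 36.7% of 30, assumed count
sequential = wilson_interval(9, 30)   # 30.0% of 30, assumed count
print(overlaps(concurrent, sequential))
```

The same function is a reminder that 0 failures in 40 calls does not prove a 0% failure rate; the upper bound of that interval is still several percent.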

Client Environment

  • SDK: sprites-py (Python)
  • WebSocket library: websockets v16.0
  • Python: 3.10.12
  • Client OS: macOS
  • API endpoint: https://api.sprites.dev

Suggested Fix

The server should ensure the WebSocket-to-process relay is fully established before the exec process begins executing, or buffer the process's stdout/stderr output until the relay is ready to forward it. Specifically:

  1. Option A (preferred): Set up the relay pipes/channels first, then start the process. This ensures no output can be produced before the relay is ready to capture it.
  2. Option B: Buffer all process output server-side and only begin forwarding once the relay is confirmed ready. Flush the buffer through the relay before allowing the WebSocket to close.
  3. Option C: If the process exits before the relay is ready, read the process's buffered stdout/stderr and send it through the WebSocket before closing.

In all cases, the server should always send the EXIT stream message with the process's actual exit code before closing the WebSocket, regardless of how quickly the process completes.
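
As a rough local illustration of Option A and Option C, the sketch below uses Python's asyncio subprocess API. It is not a proposal for the server's actual code, and run_with_late_reader is a hypothetical name. It shows that when the stdout pipe exists before the child starts, a reader that attaches late still receives the full output and the exit code, because the output waits in the pipe buffer after the child has exited.

```python
import asyncio
import sys
from typing import Tuple


async def run_with_late_reader(delay_s: float = 0.2) -> Tuple[bytes, int]:
    # stdout=PIPE creates the pipe before the child starts, so output written
    # by a fast child stays in the pipe buffer until someone reads it.
    proc = await asyncio.create_subprocess_exec(
        sys.executable, "-c", "print('ok')",
        stdout=asyncio.subprocess.PIPE,
    )
    # Stand-in for slow relay setup: the child has usually exited by now.
    await asyncio.sleep(delay_s)
    stdout, _ = await proc.communicate()  # drains the pipe, then reaps the child
    return stdout, proc.returncode


if __name__ == "__main__":
    print(asyncio.run(run_with_late_reader()))
```

A child that writes more than the pipe buffer holds would block until the reader attaches rather than lose data, which is the behavior the exec endpoint needs.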

Current Client-Side Workaround

We are working around this issue by:

  1. Prefixing all exec commands with sleep 0.2 && to give the relay time to establish.
  2. Maintaining retry logic (3 attempts with 0.5s delay between retries) as a safety net.

This adds latency and complexity to every exec call and does not address the root cause.
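
For reference, the workaround has roughly the following shape. This is a simplified sketch rather than our production code: exec_once is a hypothetical callable that runs one exec call and returns the exit code (None when no EXIT message arrived) together with stdout.

```python
import time
from typing import Callable, Optional, Tuple

ExecFn = Callable[[str], Tuple[Optional[int], bytes]]  # returns (exit_code, stdout)


def exec_with_workaround(
    exec_once: ExecFn,
    command: str,
    attempts: int = 3,
    retry_delay_s: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> Tuple[Optional[int], bytes]:
    """Prefix the command with a 200ms sleep and retry when no EXIT arrived."""
    padded = f"sleep 0.2 && {command}"
    result: Tuple[Optional[int], bytes] = (None, b"")
    for attempt in range(attempts):
        result = exec_once(padded)
        if result[0] is not None:  # an exit code means the EXIT message arrived
            return result
        if attempt < attempts - 1:
            sleep(retry_delay_s)
    return result
```

If the root cause analysis above is correct, the command does run on the server during a failed attempt, so each retry runs it again; the retry path is only safe for idempotent commands.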
