Skip to content

Fix flaky TestUpstreamManagerDetectsChromiumAndRestart (raise cold-start timeout)#283

Closed
rgarcia wants to merge 1 commit into
mainfrom
fix/devtoolsproxy-upstream-detection
Closed

Fix flaky TestUpstreamManagerDetectsChromiumAndRestart (raise cold-start timeout)#283
rgarcia wants to merge 1 commit into
mainfrom
fix/devtoolsproxy-upstream-detection

Conversation

@rgarcia

@rgarcia rgarcia commented Jun 10, 2026

Copy link
Copy Markdown
Contributor

Problem

TestUpstreamManagerDetectsChromiumAndRestart (server/lib/devtoolsproxy/proxy_test.go) fails intermittently in CI (blocking #282):

proxy_test.go: chromium log file contents:
  ...ERROR:dbus/bus.cc:405] Failed to connect to the bus: Could not parse server address...
proxy_test.go: did not detect initial upstream for port 40185; got: ""
--- FAIL: TestUpstreamManagerDetectsChromiumAndRestart (20.15s)

Pre-existing flake on main, independent of #282 (which only touches chromedriverproxy + e2e).

Root cause (corrected after investigation)

The test launches a real headless Chromium and waits for UpstreamManager to scrape the DevTools listening on ws://... line from the browser log. The wait was capped at 20s per launch.

The real cause is Chromium cold-start latency with a long tail on shared CI runners — the wait times out, the line is not missing. Hard evidence:

1. CI duration distribution across recent main runs of this exact test (same /usr/bin/chromium invocation):

Result Durations
PASS (warm) 6.4s, 6.6s, 6.7s, 6.8s
PASS (slow) 10.1s, 14.3s, 15.4s, 16.96s
FAIL 20.14s, 20.15s, 20.16s

The failures cluster at exactly the 20s timeout — a classic too-tight-deadline signature, with PASS times already creeping to 17s on slow runners.

2. Reproduced with the runner's exact browser. GitHub's ubuntu-latest (24.04) /usr/bin/chromium is a symlink to an open-source Chromium 148.0.7778.0 snapshot build (installed by runner-images from chromium-browser-snapshots; confirmed via the runner-images README + install-google-chrome.sh). I downloaded that exact snapshot and ran the exact test command:

stdbuf -oL -eL <chromium> --headless=new --remote-debugging-address=127.0.0.1 \
  --remote-debugging-port=40185 --no-first-run ... about:blank

Result: the DevTools listening on ws://127.0.0.1:40185/... line is printed reliably (~0.5s warm, 20/20 iterations). So the earlier "this Chromium build never prints the line" theory was wrong.

3. The dbus errors are a red herring. The failure log's dbus/bus.cc:405 Could not parse server address is benign — I reproduced it by setting a malformed DBUS_SESSION_BUS_ADDRESS, and Chromium still printed the DevTools line despite the dbus errors. dbus has nothing to do with CDP/DevTools startup.

So: on a contended runner, the first Chromium launch occasionally takes >20s to bring up its DevTools HTTP server, and the test's 20s waitForCondition expires first.

Fix

  • Raise the per-launch detection wait to 60s (named browserDetectTimeout, with the rationale documented) so the slow cold-start tail clears while still failing fast if the browser genuinely never comes up. This is the actual flake fix.
  • Secondary hygiene (not the root cause): findBrowserBinary now honors CHROMIUM_BIN/CHROME_BIN and prefers the deterministic chrome/google-chrome binaries, and server-test.yaml pins CHROMIUM_BIN to setup-chrome's chrome-path. This makes the browser-under-test deterministic across runner-image updates; it does not by itself fix the latency flake.

Test output

$ go build ./... && go vet ./lib/devtoolsproxy/      # clean
# run with the EXACT runner Chromium 148.0.7778.0 snapshot:
$ CHROMIUM_BIN=.../chrome-linux/chrome go test ./lib/devtoolsproxy/ -run ...AndRestart -v
--- PASS: TestUpstreamManagerDetectsChromiumAndRestart (1.28s)

Relationship to #282

#282 is blocked only by this pre-existing flake. After this merges, #282 should rebase on main to pick up the timeout + workflow changes and get a green CI run.

🤖 Generated with Claude Code

… timeout)

TestUpstreamManagerDetectsChromiumAndRestart launches a real headless
Chromium and waits for UpstreamManager to scrape the "DevTools listening on
ws://..." line. The wait was capped at 20s per browser launch.

Root cause: Chromium's cold-start latency has a long tail on GitHub's shared
ubuntu-latest runners. Across recent main runs this exact launch printed the
DevTools line in ~6s on a warm runner but took 10-17s on a contended one, and
occasionally exceeded 20s — failing at exactly ~20.15s, i.e. the wait timed
out, not because the line was missing. The dbus errors in the failure log
("Failed to connect to the bus") are benign noise unrelated to CDP startup
(reproduced: Chromium prints the DevTools line even with a broken
DBUS_SESSION_BUS_ADDRESS). It is a pre-existing flake on main, independent of
any product change.

Evidence (recent main "Test for the server/ directory" runs):
  PASS 6.4s / 6.6s / 6.7s / 6.8s (warm) ... PASS 10.1s / 14.3s / 15.4s / 17.0s
  (slow) ... FAIL 20.14s / 20.15s / 20.16s (timeout). Reproduced locally with
  the runner's exact Chromium 148.0.7778.0 snapshot build: the line is printed
  reliably (~0.5s warm); the failures are latency, not a missing line.

Fix:
- Raise the per-launch detection wait to 60s (named browserDetectTimeout with
  rationale) so the slow cold-start tail clears while still failing fast if the
  browser truly never comes up. This is the actual flake fix.
- findBrowserBinary now honors CHROMIUM_BIN/CHROME_BIN and prefers the
  deterministic chrome/google-chrome binaries; server-test.yaml pins
  CHROMIUM_BIN to setup-chrome's chrome-path. This is hygiene (deterministic
  browser across runner-image updates), not the root cause.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@rgarcia rgarcia force-pushed the fix/devtoolsproxy-upstream-detection branch from c90ff59 to 60e07ab Compare June 10, 2026 16:20
@rgarcia rgarcia changed the title Fix flaky TestUpstreamManagerDetectsChromiumAndRestart (pin Chrome for Testing) Fix flaky TestUpstreamManagerDetectsChromiumAndRestart (raise cold-start timeout) Jun 10, 2026
@rgarcia

rgarcia commented Jun 11, 2026

Copy link
Copy Markdown
Contributor Author

Closing — rolled the real fix (raise the UpstreamManager detect timeout 20s→60s) into #282. The browser-binary pinning (CHROMIUM_BIN + candidate reorder) explored here turned out to be unnecessary: the failure was a too-tight timeout on slow CI runners, not a browser that doesn't emit the DevTools line. #282 carries the timeout change as test(devtoolsproxy): raise UpstreamManager detect timeout 20s -> 60s.

@rgarcia rgarcia closed this Jun 11, 2026
rgarcia added a commit that referenced this pull request Jun 11, 2026
…cess skip, devtoolsproxy timeout (#282)

Several related e2e/server fixes (the shared/upstream-appropriate pieces
of the hypeman-e2e work, plus a pre-existing CI flake).

## 1. ChromeDriver proxy forwards a non-loopback Host/Origin (HTTP 500)
ChromeDriver (Chrome 111+) rejects requests whose `Host`/`Origin` isn't
localhost/an IP (`"Host header or origin header ... not whitelisted or
localhost"`, HTTP 500). The `:9224` proxy did `r.Out.Host = r.In.Host`,
forwarding the inbound host; over a direct `127.0.0.1` hit (docker)
that's loopback so it works, but behind an ingress it 500s every request
(looks like a slow/never-ready chromedriver). Fix: keep `SetURL`'s
loopback host and strip `Origin`, in both the reverse-proxy path and
`handleCreateSession`. Adds a regression test.

## 2. Docker exec output includes stream-multiplexing headers
`dockerBackend.Exec` returned the raw multiplexed Docker stream. Use
testcontainers `exec.Multiplexed()` for clean combined stdout+stderr.

## 3. Backend.SupportsHostAccess + skip host-access tests on backends
without it
Adds `Backend.SupportsHostAccess()` (docker=true, hypeman=false);
`TestContainer.Start` `t.Skip`s when a test requests `HostAccess` on a
backend that can't bridge to the runner's loopback. Keyed on the
request, so a test that doesn't ask for HostAccess is never skipped.

## 4. devtoolsproxy: raise UpstreamManager detect timeout 20s -> 60s
`TestUpstreamManagerDetectsChromiumAndRestart` is a pre-existing flake
on `main`: it launches a real browser and waits for the `DevTools
listening on ws://...` line, but chromium cold-start has a long tail on
shared CI runners (recent runs: ~6s warm, 15-17s contended, occasionally
>20s — failing at exactly ~20.15s, a timeout not a missing line). Raise
the wait to 60s. (Supersedes #283, now closed; the browser-binary
pinning explored there was unnecessary — the timeout is the real fix.)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

---------

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant