Fix flaky TestUpstreamManagerDetectsChromiumAndRestart (raise cold-start timeout)#283
Closed
rgarcia wants to merge 1 commit into
Closed
Fix flaky TestUpstreamManagerDetectsChromiumAndRestart (raise cold-start timeout)#283rgarcia wants to merge 1 commit into
rgarcia wants to merge 1 commit into
Conversation
… timeout)
TestUpstreamManagerDetectsChromiumAndRestart launches a real headless
Chromium and waits for UpstreamManager to scrape the "DevTools listening on
ws://..." line. The wait was capped at 20s per browser launch.
Root cause: Chromium's cold-start latency has a long tail on GitHub's shared
ubuntu-latest runners. Across recent main runs this exact launch printed the
DevTools line in ~6s on a warm runner but took 10-17s on a contended one, and
occasionally exceeded 20s — failing at exactly ~20.15s, i.e. the wait timed
out, not because the line was missing. The dbus errors in the failure log
("Failed to connect to the bus") are benign noise unrelated to CDP startup
(reproduced: Chromium prints the DevTools line even with a broken
DBUS_SESSION_BUS_ADDRESS). It is a pre-existing flake on main, independent of
any product change.
Evidence (recent main "Test for the server/ directory" runs):
PASS 6.4s / 6.6s / 6.7s / 6.8s (warm) ... PASS 10.1s / 14.3s / 15.4s / 17.0s
(slow) ... FAIL 20.14s / 20.15s / 20.16s (timeout). Reproduced locally with
the runner's exact Chromium 148.0.7778.0 snapshot build: the line is printed
reliably (~0.5s warm); the failures are latency, not a missing line.
Fix:
- Raise the per-launch detection wait to 60s (named browserDetectTimeout with
rationale) so the slow cold-start tail clears while still failing fast if the
browser truly never comes up. This is the actual flake fix.
- findBrowserBinary now honors CHROMIUM_BIN/CHROME_BIN and prefers the
deterministic chrome/google-chrome binaries; server-test.yaml pins
CHROMIUM_BIN to setup-chrome's chrome-path. This is hygiene (deterministic
browser across runner-image updates), not the root cause.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
c90ff59 to
60e07ab
Compare
Contributor
Author
|
Closing — rolled the real fix (raise the UpstreamManager detect timeout 20s→60s) into #282. The browser-binary pinning (CHROMIUM_BIN + candidate reorder) explored here turned out to be unnecessary: the failure was a too-tight timeout on slow CI runners, not a browser that doesn't emit the DevTools line. #282 carries the timeout change as |
rgarcia
added a commit
that referenced
this pull request
Jun 11, 2026
…cess skip, devtoolsproxy timeout (#282) Several related e2e/server fixes (the shared/upstream-appropriate pieces of the hypeman-e2e work, plus a pre-existing CI flake). ## 1. ChromeDriver proxy forwards a non-loopback Host/Origin (HTTP 500) ChromeDriver (Chrome 111+) rejects requests whose `Host`/`Origin` isn't localhost/an IP (`"Host header or origin header ... not whitelisted or localhost"`, HTTP 500). The `:9224` proxy did `r.Out.Host = r.In.Host`, forwarding the inbound host; over a direct `127.0.0.1` hit (docker) that's loopback so it works, but behind an ingress it 500s every request (looks like a slow/never-ready chromedriver). Fix: keep `SetURL`'s loopback host and strip `Origin`, in both the reverse-proxy path and `handleCreateSession`. Adds a regression test. ## 2. Docker exec output includes stream-multiplexing headers `dockerBackend.Exec` returned the raw multiplexed Docker stream. Use testcontainers `exec.Multiplexed()` for clean combined stdout+stderr. ## 3. Backend.SupportsHostAccess + skip host-access tests on backends without it Adds `Backend.SupportsHostAccess()` (docker=true, hypeman=false); `TestContainer.Start` `t.Skip`s when a test requests `HostAccess` on a backend that can't bridge to the runner's loopback. Keyed on the request, so a test that doesn't ask for HostAccess is never skipped. ## 4. devtoolsproxy: raise UpstreamManager detect timeout 20s -> 60s `TestUpstreamManagerDetectsChromiumAndRestart` is a pre-existing flake on `main`: it launches a real browser and waits for the `DevTools listening on ws://...` line, but chromium cold-start has a long tail on shared CI runners (recent runs: ~6s warm, 15-17s contended, occasionally >20s — failing at exactly ~20.15s, a timeout not a missing line). Raise the wait to 60s. (Supersedes #283, now closed; the browser-binary pinning explored there was unnecessary — the timeout is the real fix.) 🤖 Generated with [Claude Code](https://claude.com/claude-code) --------- Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Problem
TestUpstreamManagerDetectsChromiumAndRestart(server/lib/devtoolsproxy/proxy_test.go) fails intermittently in CI (blocking #282):Pre-existing flake on
main, independent of #282 (which only toucheschromedriverproxy+ e2e).Root cause (corrected after investigation)
The test launches a real headless Chromium and waits for
UpstreamManagerto scrape theDevTools listening on ws://...line from the browser log. The wait was capped at 20s per launch.The real cause is Chromium cold-start latency with a long tail on shared CI runners — the wait times out, the line is not missing. Hard evidence:
1. CI duration distribution across recent
mainruns of this exact test (same/usr/bin/chromiuminvocation):The failures cluster at exactly the 20s timeout — a classic too-tight-deadline signature, with PASS times already creeping to 17s on slow runners.
2. Reproduced with the runner's exact browser. GitHub's
ubuntu-latest(24.04)/usr/bin/chromiumis a symlink to an open-source Chromium 148.0.7778.0 snapshot build (installed byrunner-imagesfromchromium-browser-snapshots; confirmed via the runner-images README +install-google-chrome.sh). I downloaded that exact snapshot and ran the exact test command:Result: the
DevTools listening on ws://127.0.0.1:40185/...line is printed reliably (~0.5s warm, 20/20 iterations). So the earlier "this Chromium build never prints the line" theory was wrong.3. The dbus errors are a red herring. The failure log's
dbus/bus.cc:405 Could not parse server addressis benign — I reproduced it by setting a malformedDBUS_SESSION_BUS_ADDRESS, and Chromium still printed the DevTools line despite the dbus errors. dbus has nothing to do with CDP/DevTools startup.So: on a contended runner, the first Chromium launch occasionally takes >20s to bring up its DevTools HTTP server, and the test's 20s
waitForConditionexpires first.Fix
browserDetectTimeout, with the rationale documented) so the slow cold-start tail clears while still failing fast if the browser genuinely never comes up. This is the actual flake fix.findBrowserBinarynow honorsCHROMIUM_BIN/CHROME_BINand prefers the deterministicchrome/google-chromebinaries, andserver-test.yamlpinsCHROMIUM_BINtosetup-chrome'schrome-path. This makes the browser-under-test deterministic across runner-image updates; it does not by itself fix the latency flake.Test output
Relationship to #282
#282 is blocked only by this pre-existing flake. After this merges, #282 should rebase on
mainto pick up the timeout + workflow changes and get a green CI run.🤖 Generated with Claude Code