test(l1): add restart stall reproduction test using eth-docker #6151
pablodeymo wants to merge 10 commits into main
Add a Python script and Makefile targets to reproduce the snap sync restart stall bug reported on Discord, where ethrex stalls downloading headers after a restart.

The test uses eth-docker (with ethrex + Prysm) and runs in two phases:
- Phase 1: Fresh snap sync from scratch, wait for completion and block progress
- Phase 2: Stop only the execution client (keeping consensus + volumes), restart it, and monitor for header download stall

The restart phase is repeated multiple times (default 3) since the stall is intermittent. Slack notifications are sent at each phase transition using the same webhook pattern as the existing multisync monitoring.

The --configure flag auto-writes eth-docker's .env for ethrex + Prysm with the specified network and fee recipient address.
🤖 Kimi Code Review
Review Summary
This PR adds a comprehensive restart stall test for ethrex using eth-docker. The code is well-structured and follows good practices for testing infrastructure. Here are my findings:
✅ Strengths
🤖 Codex Code Review
Findings
Notes
Automated review by OpenAI Codex · custom prompt
🤖 Claude Code Review
Now I have enough context. Let me review the PR thoroughly.
Code Review: PR #6151 - Restart Stall Reproduction Test
Overview
This PR adds a Python test script to reproduce an intermittent restart stall bug where ethrex stops downloading headers after a restart. The test automates the reproduction using eth-docker with ethrex + Prysm.
Critical Issues
1. Timeout Calculation Bug (restart_stall_test.py:36)
    SYNC_TIMEOUT = int(os.environ.get("SYNC_TIMEOUT", 8 * 60)) * 60  # default 8h
Issue: The default calculation is wrong.
Impact: User-provided environment variables will be interpreted in seconds, not hours, causing premature timeouts.
Fix: Either:
Same issue exists for:
2. Race Condition in Block Progress Monitoring (restart_stall_test.py:274-290)
    if block is None:
        if time.time() - last_block_time > stall_timeout:
            print(f" Node stopped responding for {fmt_time(stall_timeout)}")
            return False, last_block - initial_block
    elif block > last_block:
        last_block = block
        last_block_time = time.time()
    elif time.time() - last_block_time > stall_timeout:
        print(f" Block stalled at {last_block} for {fmt_time(stall_timeout)}")
        return False, last_block - initial_block
Issue: When
Recommendation: Add consistent sleep interval even when block is None.
3. Incomplete Error Handling in
Pull request overview
Adds an automated reproduction script to detect intermittent ethrex restart stalls when running inside eth-docker, plus Makefile targets to run it consistently from tooling/sync.
Changes:
- Introduce tooling/sync/restart_stall_test.py to orchestrate a fresh sync, then perform repeated execution-client restarts while monitoring RPC block progress and saving logs.
- Add restart-stall-test and restart-stall-test-skip-sync Makefile targets to run the script with standard defaults.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 10 comments.
| File | Description |
|---|---|
| tooling/sync/restart_stall_test.py | New eth-docker-based restart-stall reproduction script with RPC polling, log capture, and optional Slack notifications. |
| tooling/sync/Makefile | Adds convenience targets and variables for running the restart-stall test. |
…ompts The ./ethd terminate command has an interactive Yes/No confirmation that blocks the script when running non-interactively in tmux. Replace it with docker compose down -v and docker compose up -d which work without prompts.
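A minimal sketch of the non-interactive replacement described in this commit. The name `docker_compose_in_ethd` and its `check` parameter appear elsewhere in this PR, but the body below is an assumption, and `compose_command` and `reset_stack` are hypothetical helpers added only for illustration.

```python
import subprocess


def compose_command(*args):
    """Build the argv list for a docker compose invocation."""
    return ["docker", "compose", *args]


def docker_compose_in_ethd(eth_docker_dir, *args, check=False):
    """Run docker compose from the eth-docker directory.

    With check=True a non-zero exit raises subprocess.CalledProcessError,
    so a failed step stops the test instead of being silently ignored.
    """
    return subprocess.run(
        compose_command(*args),
        cwd=eth_docker_dir,
        check=check,
        capture_output=True,
        text=True,
    )


def reset_stack(eth_docker_dir):
    """Prompt-free stand-in for ./ethd terminate followed by ./ethd up."""
    docker_compose_in_ethd(eth_docker_dir, "down", "-v", check=True)
    docker_compose_in_ethd(eth_docker_dir, "up", "-d", check=True)
```

Because neither command asks for confirmation, `reset_stack` can run unattended, for example inside a tmux session.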
eth-docker doesn't publish the EL RPC port to the host by default. Adding el-shared.yml maps port 8545 to localhost so the monitoring script can poll eth_syncing and eth_blockNumber via RPC.
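The polling this commit enables can be sketched roughly as follows. Only the method names eth_syncing and eth_blockNumber and port 8545 come from the PR; the URL constant, `build_request`, `parse_result`, and the body of `rpc_call` are assumptions made for illustration, not the script's actual code.

```python
import json
import urllib.request

RPC_URL = "http://localhost:8545"  # assumed: the port published by el-shared.yml


def build_request(method, params=None, request_id=1):
    """Build a JSON-RPC 2.0 request body as bytes."""
    payload = {"jsonrpc": "2.0", "method": method, "params": params or [], "id": request_id}
    return json.dumps(payload).encode("utf-8")


def parse_result(raw):
    """Return the "result" field of a JSON-RPC response, or None on any error."""
    try:
        body = json.loads(raw)
    except ValueError:  # malformed JSON or undecodable bytes
        return None
    if not isinstance(body, dict) or "error" in body:
        return None
    return body.get("result")


def rpc_call(url, method, params=None, timeout=5):
    """POST one JSON-RPC request; return its result, or None if unreachable."""
    req = urllib.request.Request(
        url,
        data=build_request(method, params),
        headers={"Content-Type": "application/json"},
    )
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return parse_result(resp.read())
    except OSError:  # URLError and socket timeouts are OSError subclasses
        return None
```

Note that eth_syncing returns `false` once the node is synced, so callers need to distinguish `False` (synced) from `None` (no answer) rather than relying on truthiness.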
Greptile Overview
Greptile Summary
Confidence Score: 3/5
| Filename | Overview |
|---|---|
| tooling/sync/Makefile | Adds restart-stall-test and restart-stall-test-skip-sync targets to run the new eth-docker-based reproduction script with configurable network/count/eth-docker directory. |
| tooling/sync/restart_stall_test.py | Introduces a Python automation script for reproducing restart-related sync stalls in eth-docker; main issues are inconsistent timeout semantics and lack of failure checking for subprocess calls/log collection assumptions. |
Sequence Diagram
sequenceDiagram
participant User
participant Make as Makefile (tooling/sync)
participant Script as restart_stall_test.py
participant EthD as eth-docker (./ethd)
participant Docker as docker compose
participant RPC as Ethrex JSON-RPC
participant Slack as Slack Webhook
participant FS as Local FS (logs)
User->>Make: make restart-stall-test
Make->>Script: python3 restart_stall_test.py --configure ...
alt --configure
Script->>EthD: read default.env
Script->>EthD: write .env overrides (COMPOSE_FILE, NETWORK, ...)
end
alt Phase 1 (unless --skip-phase1)
Script->>EthD: ./ethd terminate
Script->>EthD: ./ethd up
loop until NODE_STARTUP_TIMEOUT
Script->>RPC: eth_blockNumber
RPC-->>Script: blockNumber/timeout
end
loop until SYNC_TIMEOUT
Script->>RPC: eth_syncing
RPC-->>Script: false/obj/timeout
end
loop for BLOCK_PROCESSING_DURATION
Script->>RPC: eth_blockNumber
RPC-->>Script: increasing/stall
end
Script->>Docker: docker compose logs execution/consensus
Docker-->>Script: logs
Script->>FS: write *_phase1.log
Script->>Slack: notify phase1 complete/fail
end
loop restart_count
Script->>Docker: docker compose stop execution
Script->>Docker: docker compose start execution
loop until NODE_STARTUP_TIMEOUT
Script->>RPC: eth_blockNumber
RPC-->>Script: blockNumber/timeout
end
loop until restart monitor timeout
Script->>RPC: eth_syncing
Script->>RPC: eth_blockNumber
RPC-->>Script: progress/stall/unresponsive
end
Script->>Docker: docker compose logs execution/consensus
Docker-->>Script: logs
Script->>FS: write *_restartN.log
Script->>Slack: notify on stall/unresponsive
end
Script->>FS: write summary.txt
Script->>Slack: final summary
Pass PYTHONUNBUFFERED=1 so output appears immediately when piped to tee in tmux. Add RESTART_TEST_FEE_RECIPIENT variable for the Ethereum address.
- Fix docstring to say prysm.yml (not lighthouse.yml) matching actual config
- Anchor LOGS_DIR to script directory instead of cwd
- Standardize SYNC_TIMEOUT to seconds (was minutes * 60, now direct seconds)
- Extract BLOCK_STALL_TIMEOUT constant (was hard-coded 10*60)
- Add check parameter to docker_compose_in_ethd, fail fast on non-zero exit
- Retry initial block number fetch in wait_for_block_progress instead of or 0
- Pass stall_timeout as parameter to monitor_restart_for_stall instead of using global RESTART_STALL_TIMEOUT, fixing the mismatch between function timeout and stall detection threshold
- Distinguish "timeout" (still progressing) from "stall" (no progress) in monitor_restart_for_stall return values
- Log Slack notification failures instead of bare except pass
- Use docker_compose_in_ethd for save_ethd_logs to ensure correct cwd/context
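One way the "everything in seconds" convention from this commit could look is sketched below. The helper `env_seconds` is hypothetical; the 8h and 10-minute defaults are taken from this PR, but the actual script may read its environment differently.

```python
import os


def env_seconds(name, default_seconds, environ=None):
    """Read a timeout in seconds from the environment.

    The default and the override share one unit, so setting
    SYNC_TIMEOUT=600 means 600 seconds, never 600 minutes.
    """
    environ = os.environ if environ is None else environ
    raw = environ.get(name)
    if raw is None or raw.strip() == "":
        return default_seconds
    return int(raw)


SYNC_TIMEOUT = env_seconds("SYNC_TIMEOUT", 8 * 60 * 60)            # default 8h
BLOCK_STALL_TIMEOUT = env_seconds("BLOCK_STALL_TIMEOUT", 10 * 60)  # default 10 min
```

Keeping the multiplication inside the default, not around the `int(...)` call, is what avoids the unit mismatch flagged in review.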
The configure_eth_docker function was not setting CHECKPOINT_SYNC_URL, causing Prysm to use the default.env value (hoodi) even when NETWORK=mainnet, resulting in a fatal fork mismatch and crash loop.
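The fix above amounts to deriving NETWORK and CHECKPOINT_SYNC_URL together. A rough sketch under stated assumptions: the URLs are placeholders (not real endpoints), and `apply_env_overrides` and `overrides_for` are hypothetical names, not the functions used by configure_eth_docker.

```python
# Placeholder URLs: substitute a real checkpoint-sync endpoint per network.
CHECKPOINT_SYNC_URLS = {
    "mainnet": "https://checkpoint.mainnet.example.org",
    "hoodi": "https://checkpoint.hoodi.example.org",
}


def overrides_for(network):
    """Build overrides so NETWORK and CHECKPOINT_SYNC_URL always agree."""
    if network not in CHECKPOINT_SYNC_URLS:
        raise ValueError(f"no checkpoint URL known for {network!r}")
    return {"NETWORK": network, "CHECKPOINT_SYNC_URL": CHECKPOINT_SYNC_URLS[network]}


def apply_env_overrides(default_env_text, overrides):
    """Return .env text with matching KEY=value lines replaced, others appended."""
    remaining = dict(overrides)
    out = []
    for line in default_env_text.splitlines():
        key = line.split("=", 1)[0].strip()
        is_assignment = "=" in line and not line.lstrip().startswith("#")
        if is_assignment and key in remaining:
            out.append(f"{key}={remaining.pop(key)}")
        else:
            out.append(line)  # comments, blanks, and untouched keys pass through
    # Keys that default.env did not define are appended at the end.
    out.extend(f"{key}={value}" for key, value in remaining.items())
    return "\n".join(out) + "\n"
```

Raising on an unknown network, instead of silently keeping the default URL, is a design choice meant to surface a mismatch before Prysm starts.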
Slack webhooks are loaded regardless of where the script is launched from.
can optionally wipe all data volumes (EL, consensus, validator) and force a fresh snap sync from scratch. Includes wipe_data_volumes() helper that removes containers and volumes while preserving JWT, and a restart-stall-test-wipe Makefile target.
cycle now wipes all volumes and forces a fresh snap sync by default. Add --keep-data flag for the old stop/start behavior without wipe.
cycle wipes volumes, snap syncs from scratch, and verifies block progress. Use --restart-count N to limit cycles. Summary is printed and saved on exit.
    return "unresponsive", f"Node never responded after {fmt_time(elapsed)}"

    # Monitor: is it syncing? Is it making progress?
    last_block = rpc_block_number(rpc_url) or 0

nit: rpc_block_number(rpc_url) or 0 conflates two states: node at block 0 (valid) and RPC unreachable (returns None). With the block-0 fix in rpc_block_number, this should use an explicit None check:

    last_block = rpc_block_number(rpc_url)
    if last_block is None:
        last_block = 0

    docker_compose_in_ethd(eth_docker_dir, "rm", "-f", "-s", "execution", "consensus", check=True)

    project = os.path.basename(eth_docker_dir)
    volumes = [
Docker Compose normalizes the project name: it lowercases it, strips non-alphanumeric characters (except - and _), and replaces path separators. os.path.basename returns the raw directory name, so if it contains uppercase or special chars (e.g. Eth-Docker → project eth-docker), the volume names won't match and removal silently fails.
Suggestion: use docker compose config --format json to get the actual project name, or normalize to match Compose's behavior:

    import re
    project = re.sub(r'[^a-z0-9_-]', '', os.path.basename(eth_docker_dir).lower())

Or list/inspect the actual volumes:

    result = subprocess.run(["docker", "volume", "ls", "--filter", f"name={project}_", "--format", "{{.Name}}"], capture_output=True, text=True)

    def rpc_block_number(url: str):
        result = rpc_call(url, "eth_blockNumber")
        if result:
int("0x0", 16) is 0, which is falsy — so when the node is at block 0 (e.g. right after a wipe+restart), this returns None as if the RPC call failed.
Suggestion:
    def rpc_block_number(url: str):
        result = rpc_call(url, "eth_blockNumber")
        if result is not None:
            return int(result, 16)
        return None
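The difference between "block 0" and "no answer" can be illustrated with a small sketch. It assumes the RPC layer hands back either a hex string or None; `parse_block_number` and `initial_block` are hypothetical names, and how rpc_call actually represents results is not shown in this PR.

```python
def parse_block_number(result):
    """Convert an eth_blockNumber result to int; None means the call failed."""
    if result is None:
        return None
    return int(result, 16)


def initial_block(result):
    """Starting block for monitoring: a failed call falls back to 0 explicitly."""
    block = parse_block_number(result)
    return 0 if block is None else block
```

Callers that need to know whether the node answered should test `block is not None`; any truthiness test (`if block:` or `block or 0`) treats a node at block 0 the same as an unreachable one.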
Motivation
A Discord user reported that ethrex sometimes stalls while downloading headers after a restart when running inside eth-docker. This is intermittent and hard to catch manually, so we need an automated reproduction test.
Description
Adds tooling/sync/restart_stall_test.py, a Python script that automates the full reproduction cycle using eth-docker with ethrex (execution) + Prysm (consensus).
How it works
Phase 1: Fresh snap sync
- Remove existing containers and volumes (docker compose down -v)
- Start the stack (docker compose up -d)
- Poll the eth_syncing RPC until sync completes (8h timeout)
Phase 2: Continuous restart cycles (runs forever, Ctrl+C to stop)
Each cycle:
- Stop and remove the containers (docker compose rm -f -s)
- Remove the ethrex-el-data, prysmconsensus-data, and prysmvalidator-data volumes (preserves JWT)
- Start the clients again (docker compose up -d execution consensus)
On Ctrl+C, prints a summary of all completed cycles and saves it to summary.txt.
Use --restart-count N to limit to N cycles instead of running forever.
Phase 2 with --keep-data: Quick restart (no wipe)
- Stop only the execution client (docker compose stop execution), keeping consensus running and volumes intact
- Start it again (docker compose start execution)
- Monitor eth_blockNumber progress and detect if the node stops advancing
- The result is one of: ok (caught up), stall (no progress), timeout (still progressing but didn't catch up in time), or unresponsive (node never came back)
This mode is useful for testing restart recovery without re-syncing.
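The four outcomes listed for the restart monitor can be sketched as a small classification loop. This is a simplified illustration, not the script's monitor_restart_for_stall: `monitor_restart`, its parameters, and the injected `clock` and `sleep` are assumptions, and "unresponsive" is decided here by the stall timeout, whereas the real script uses a separate startup timeout.

```python
import time


def monitor_restart(get_block, is_synced, stall_timeout, total_timeout,
                    check_interval=1.0, clock=time.monotonic, sleep=time.sleep):
    """Classify a restart as "ok", "stall", "timeout", or "unresponsive".

    get_block() returns the current block number, or None if the RPC is down.
    is_synced() returns True once the node reports it has caught up.
    """
    start = clock()
    last_block = None       # highest block seen so far; None until first answer
    last_progress = start   # time of the last observed block increase
    while True:
        now = clock()
        block = get_block()
        if block is not None and (last_block is None or block > last_block):
            last_block = block
            last_progress = now
        if block is not None and is_synced():
            return "ok"
        if now - last_progress > stall_timeout:
            # No progress for stall_timeout: never answered, or stopped advancing.
            return "unresponsive" if last_block is None else "stall"
        if now - start > total_timeout:
            return "timeout"  # still progressing, but did not catch up in time
        sleep(check_interval)  # sleep on every pass, even when the RPC is down
```

Injecting `clock` and `sleep` keeps the loop testable without real waiting, and passing `stall_timeout` as a parameter avoids depending on a module-level constant.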
Configuration
The --configure flag auto-writes eth-docker's .env from default.env with:
- COMPOSE_FILE=prysm.yml:ethrex.yml:el-shared.yml (el-shared.yml publishes the RPC port to the host for monitoring)
- NETWORK (default: hoodi)
- CHECKPOINT_SYNC_URL (auto-set based on network: mainnet, hoodi, holesky, sepolia)
- FEE_RECIPIENT (optional, via --fee-recipient)
- ETHREX_DOCKER_REPO=ghcr.io/lambdaclass/ethrex
- .env
Uses docker compose directly instead of ./ethd commands to avoid interactive confirmation prompts when running non-interactively (e.g. in tmux).
Timeouts (all configurable via env vars, all in seconds)
- SYNC_TIMEOUT
- BLOCK_PROCESSING_DURATION
- BLOCK_STALL_TIMEOUT
- RESTART_STALL_TIMEOUT
- NODE_STARTUP_TIMEOUT
- CHECK_INTERVAL
Makefile targets
- make restart-stall-test
- make restart-stall-test-skip-sync
- make restart-stall-test-keep-data
Variables: ETH_DOCKER_DIR (default: ~/eth-docker), RESTART_TEST_NETWORK (default: hoodi), RESTART_TEST_COUNT (default: 0 = infinite), RESTART_TEST_FEE_RECIPIENT (optional).
Files changed
- tooling/sync/restart_stall_test.py
- tooling/sync/Makefile: restart-stall-test, restart-stall-test-skip-sync, and restart-stall-test-keep-data targets
How to Test
Prerequisites: eth-docker cloned at ~/eth-docker, Docker installed.
Logs are saved to tooling/sync/restart_stall_logs/run_YYYYMMDD_HHMMSS/.