test(l1): add restart stall reproduction test using eth-docker by pablodeymo · Pull Request #6151 · lambdaclass/ethrex

pablodeymo · 2026-02-06T18:46:48Z

Motivation

A Discord user reported that ethrex sometimes stalls while downloading headers after a restart when running inside eth-docker. This is intermittent and hard to catch manually, so we need an automated reproduction test.

Description

Adds tooling/sync/restart_stall_test.py — a Python script that automates the full reproduction cycle using eth-docker with ethrex (execution) + Prysm (consensus).

How it works

Phase 1 — Fresh snap sync:

Stops any existing eth-docker containers and removes volumes (docker compose down -v)
Starts eth-docker (docker compose up -d)
Polls eth_syncing RPC until sync completes (8h timeout)
Verifies block progress post-sync (22m of new blocks)
Sends Slack notification on completion

Phase 2 — Continuous restart cycles (runs forever, Ctrl+C to stop):

Each cycle:

Removes execution and consensus containers (docker compose rm -f -s)
Removes all data volumes: ethrex-el-data, prysmconsensus-data, prysmvalidator-data (preserves JWT)
Recreates containers with fresh volumes (docker compose up -d execution consensus)
Waits for node RPC to come up
Waits for full snap sync completion (8h timeout)
Verifies block progress post-sync (22m)
Reports PASS/FAIL with sync time
Sends Slack notification on failure
Saves execution + consensus logs
Repeats from step 1

On Ctrl+C, prints a summary of all completed cycles and saves it to summary.txt.

Use --restart-count N to limit to N cycles instead of running forever.

Phase 2 with --keep-data — Quick restart (no wipe):

Stops only the execution client (docker compose stop execution), keeping consensus running and volumes intact
Restarts the execution client (docker compose start execution)
Monitors for stall: watches eth_blockNumber progress, detects if the node stops advancing
Reports result: ok (caught up), stall (no progress), timeout (still progressing but didn't catch up in time), or unresponsive (node never came back)

This mode is useful for testing restart recovery without re-syncing.

Configuration

The --configure flag auto-writes eth-docker's .env from default.env with:

COMPOSE_FILE=prysm.yml:ethrex.yml:el-shared.yml (el-shared.yml publishes the RPC port to the host for monitoring)
NETWORK (default: hoodi)
CHECKPOINT_SYNC_URL (auto-set based on network: mainnet, hoodi, holesky, sepolia)
FEE_RECIPIENT (optional, via --fee-recipient)
ETHREX_DOCKER_REPO=ghcr.io/lambdaclass/ethrex
Slack webhook URLs from the local .env

Uses docker compose directly instead of ./ethd commands to avoid interactive confirmation prompts when running non-interactively (e.g. in tmux).

Timeouts (all configurable via env vars, all in seconds)

Variable	Default	Description
`SYNC_TIMEOUT`	8 hours	Max time for initial sync
`BLOCK_PROCESSING_DURATION`	22 min	Post-sync block progress verification
`BLOCK_STALL_TIMEOUT`	10 min	Time without progress during initial sync to declare stall
`RESTART_STALL_TIMEOUT`	15 min	Time without progress after restart to declare stall
`NODE_STARTUP_TIMEOUT`	5 min	Time for node to respond after restart
`CHECK_INTERVAL`	10 sec	RPC polling interval

Makefile targets

Target	Description
`make restart-stall-test`	Full test: configure, sync, then continuous restart cycles (Ctrl+C to stop)
`make restart-stall-test-skip-sync`	Skip phase 1 (node must already be synced)
`make restart-stall-test-keep-data`	Restart without wiping data (stop/start only)

Variables: ETH_DOCKER_DIR (default: ~/eth-docker), RESTART_TEST_NETWORK (default: hoodi), RESTART_TEST_COUNT (default: 0 = infinite), RESTART_TEST_FEE_RECIPIENT (optional).

Files changed

File	Change
`tooling/sync/restart_stall_test.py`	Main script with Phase 1/2 logic, continuous cycling, data wipe, Slack notifications
`tooling/sync/Makefile`	Add `restart-stall-test`, `restart-stall-test-skip-sync`, and `restart-stall-test-keep-data` targets

How to Test

Prerequisites: eth-docker cloned at ~/eth-docker, Docker installed.

cd tooling/sync

# Full test: configure, sync, then loop forever (Ctrl+C to stop)
make restart-stall-test \
  RESTART_TEST_NETWORK=mainnet \
  RESTART_TEST_FEE_RECIPIENT=0xYourAddress

# Limit to 5 cycles
make restart-stall-test \
  RESTART_TEST_NETWORK=mainnet \
  RESTART_TEST_COUNT=5

# Skip initial sync (node already synced)
make restart-stall-test-skip-sync \
  RESTART_TEST_NETWORK=mainnet

# Quick restart without wipe (stop/start only)
make restart-stall-test-keep-data \
  RESTART_TEST_NETWORK=mainnet

# Or directly with Python
PYTHONUNBUFFERED=1 python3 restart_stall_test.py \
  --eth-docker-dir ~/eth-docker \
  --configure \
  --network mainnet \
  --fee-recipient 0xYourAddress \
  --ethrex-dir ~/ethrex

Logs are saved to tooling/sync/restart_stall_logs/run_YYYYMMDD_HHMMSS/.

Add a Python script and Makefile targets to reproduce the snap sync restart stall bug reported on Discord, where ethrex stalls downloading headers after a restart. The test uses eth-docker (with ethrex + Prysm) and runs in two phases: - Phase 1: Fresh snap sync from scratch, wait for completion and block progress - Phase 2: Stop only the execution client (keeping consensus + volumes), restart it, and monitor for header download stall The restart phase is repeated multiple times (default 3) since the stall is intermittent. Slack notifications are sent at each phase transition using the same webhook pattern as the existing multisync monitoring. The --configure flag auto-writes eth-docker's .env for ethrex + Prysm with the specified network and fee recipient address.

github-actions · 2026-02-06T18:47:23Z

🤖 Kimi Code Review

Review Summary

This PR adds a comprehensive restart stall test for ethrex using eth-docker. The code is well-structured and follows good practices for testing infrastructure. Here are my findings:

✅ Strengths

Clean separation between Makefile targets and Python implementation
Good error handling and logging throughout
Proper timeout handling with configurable environment variables
Comprehensive monitoring of sync progress and stall detection
Slack integration for notifications
Detailed logging and summary generation

⚠️ Issues Found

1. Race Condition in Phase 1 (restart_stall_test.py:396-398)

print("Starting eth-docker...")
ethd(eth_docker_dir, "up")
time.sleep(30)  # Arbitrary sleep

Issue: Fixed 30-second sleep doesn't guarantee containers are ready. Use wait_for_node() immediately after starting.

2. Potential Path Traversal (restart_stall_test.py:86-87)

env_file = os.path.join(eth_docker_dir, ".env")

Issue: eth_docker_dir comes from user input and isn't validated. Add path validation:

eth_docker_dir = os.path.abspath(args.eth_docker_dir)
if not eth_docker_dir.startswith(os.path.expanduser("~")):
    raise ValueError("eth-docker-dir must be within home directory")

3. Resource Leak on Early Exit (restart_stall_test.py:492-495)

Issue: When phase1_fresh_sync fails, containers aren't properly cleaned up. Add cleanup in error paths.

4. Missing Docker Health Check (restart_stall_test.py:178-189)

Issue: wait_for_node() only checks RPC availability, not actual sync status. Consider adding:

def wait_for_docker_health(eth_docker_dir: str, service: str, timeout: int) -> bool:
    """Wait for docker service to be healthy."""

5. Inconsistent Timeout Constants (restart_stall_test.py:23-27)

SYNC_TIMEOUT = int(os.environ.get("SYNC_TIMEOUT", 8 * 60)) * 60  # 480 minutes

Issue: Comment says 8h but calculation gives 8 minutes. Should be:

SYNC_TIMEOUT = int(os.environ.get("SYNC_TIMEOUT", 8)) * 3600  # 8 hours

6. Missing Type Hints (restart_stall_test.py:1-597)

Issue: Several functions lack return type annotations. Add for consistency:

def fmt_time(secs: float) -> str:
def git_info(cwd: str = None) -> tuple[str, str]:

🔧 Minor Improvements

Makefile: Add .PHONY declarations for new targets
Python: Use pathlib.Path consistently instead of os.path
Logging: Consider using Python's logging module instead of print statements
Configuration: Validate network parameter against known networks
Error Handling: Add more specific exception types instead of generic Exception

🎯 Security Recommendations

Input Validation: Sanitize all user-provided paths and parameters
Command Injection: Use subprocess.run() with list arguments (already done ✅)
File Permissions: Ensure log directory has appropriate permissions
Network Security: Consider validating RPC URLs and adding authentication

The code is production-ready with the above fixes applied. The test framework provides excellent observability for debugging sync restart issues.

Automated review by Kimi (Moonshot AI) · custom prompt

github-actions · 2026-02-06T18:48:18Z

🤖 Codex Code Review

Findings

In tooling/sync/restart_stall_test.py:34-39, the comment says “timeouts (in seconds)”, but SYNC_TIMEOUT multiplies env value by 60 while others don’t. This makes SYNC_TIMEOUT effectively minutes while others are seconds; users setting SYNC_TIMEOUT in seconds will get 60x longer waits. Suggest: either remove * 60 or update comment and env var name to indicate minutes.
In tooling/sync/restart_stall_test.py:156-170, tooling/sync/restart_stall_test.py:370-377, tooling/sync/restart_stall_test.py:407-415, and tooling/sync/restart_stall_test.py:211-216, subprocess return codes are ignored. If ethd or docker compose fails, the script can keep going and misclassify failures as stalls. Suggest: check returncode, print stderr, and sys.exit(1) (or raise) on failure.

Notes

This PR only adds tooling (Makefile + Python test script). No EVM/gas/accounting, consensus, trie, RLP, or validation logic changes.

Automated review by OpenAI Codex · custom prompt

github-actions · 2026-02-06T18:48:45Z

🤖 Claude Code Review

Now I have enough context. Let me review the PR thoroughly.

Code Review: PR #6151 - Restart Stall Reproduction Test

Overview

This PR adds a Python test script to reproduce an intermittent restart stall bug where ethrex stops downloading headers after a restart. The test automates the reproduction using eth-docker with ethrex + Prysm.

Critical Issues

1. Timeout Calculation Bug (restart_stall_test.py:36)

SYNC_TIMEOUT = int(os.environ.get("SYNC_TIMEOUT", 8 * 60)) * 60  # default 8h

Issue: The default calculation is wrong. 8 * 60 = 480, then * 60 = 28,800 seconds = 8 hours ✓. But if a user sets SYNC_TIMEOUT=8 (expecting 8 hours), they'll get 8 * 60 = 480 seconds = 8 minutes ✗.

Impact: User-provided environment variables will be interpreted in seconds, not hours, causing premature timeouts.

Fix: Either:

Document that env vars must be in seconds, OR
Change to: SYNC_TIMEOUT = int(os.environ.get("SYNC_TIMEOUT", 8 * 3600))

Same issue exists for:

BLOCK_PROCESSING_DURATION (line 36) - mixes minutes and seconds
RESTART_STALL_TIMEOUT (line 37)
NODE_STARTUP_TIMEOUT (line 38)

2. Race Condition in Block Progress Monitoring (restart_stall_test.py:274-290)

if block is None:
    if time.time() - last_block_time > stall_timeout:
        print(f"  Node stopped responding for {fmt_time(stall_timeout)}")
        return False, last_block - initial_block
elif block > last_block:
    last_block = block
    last_block_time = time.time()
elif time.time() - last_block_time > stall_timeout:
    print(f"  Block stalled at {last_block} for {fmt_time(stall_timeout)}")
    return False, last_block - initial_block

Issue: When block is None but hasn't exceeded stall_timeout, the code continues without updating last_status_print or sleeping. This can lead to tight loops if RPC becomes intermittently unavailable.

Recommendation: Add consistent sleep interval even when block is None.

3. Incomplete Error Handling in `configure_eth_docker` (restart_stall_test.py:43-101)

with open(default_env) as f:
    lines = f.readlines()

Issue: No exception handling for file I/O operations. If default.env is unreadable (permissions), the script crashes without cleanup.

Recommendation: Add try-except around file operations with meaningful error messages.

Medium Priority Issues

4. Hardcoded Dockerfile Name (restart_stall_test.py:63)

"ETHREX_DOCKERFILE": "Dockerfile.binary",

Issue: This hardcodes a specific Dockerfile. If the project renames this file or uses a different build strategy, the test breaks silently.

Recommendation: Make this configurable or verify the file exists in ethrex repo.

5. Silent Slack Notification Failures (restart_stall_test.py:196)

try:
    requests.post(url, json={"blocks": blocks}, timeout=10)
except Exception:
    pass

Issue: All Slack failures are silently swallowed. Users won't know if notifications failed due to network issues, invalid webhook URLs, or API errors.

Recommendation: Log failures at minimum: except Exception as e: print(f" [Slack notification failed: {e}]")

6. Logs Collection Can Fail Silently (restart_stall_test.py:217)

except Exception as e:
    print(f"  Failed to save {service} logs: {e}")

Good: Error is printed. Concern: After a stall detection, logs are critical for debugging. Script continues even if log collection fails, potentially losing valuable diagnostic data.

Recommendation: Consider warning users more prominently or making log collection failures non-silent in CI environments.

7. Potential Docker Zombie Processes

The script doesn't ensure cleanup on Ctrl+C or exceptions. If the script crashes during Phase 1 or 2, eth-docker containers may be left running.

Recommendation: Add signal handlers or use try/finally blocks to ensure ethd down or at least print instructions for manual cleanup.

Minor Issues / Style

8. Type Safety with `rpc_call` Return (restart_stall_test.py:138-144)

def rpc_call(url: str, method: str, params=None):
    try:
        # ...
        return resp.json().get("result")
    except Exception:
        return None

Issue: Returns None on any exception (network, JSON parsing, missing "result" key). Callers can't distinguish between "sync complete" (False) and "error" (None).

Impact: In wait_for_sync (line 243), the logic treats None (error) differently than False (synced), which is good. But the broad except Exception masks all errors (DNS, timeout, invalid JSON).

Recommendation: Consider more specific exception handling or logging errors for debugging.

9. Inconsistent Time Units in Comments

Line 36: 8 * 60)) * 60 # default 8h - confusing mix of multiplication
Line 36: 22 * 60)) # default 22m - units in seconds but comment says minutes

Recommendation: Standardize comments to match actual units after calculation.

10. Magic Number: 15 Minute Stall Detection (restart_stall_test.py:331)

elif time.time() - last_progress_time > RESTART_STALL_TIMEOUT:

But the timeout is defined as 15 minutes (line 37). For header download stalls in Ethereum, 15 minutes seems reasonable, but this should be justified or made more configurable based on network conditions.

11. Makefile: Missing Default Value Documentation

RESTART_TEST_NETWORK ?= hoodi
RESTART_TEST_COUNT ?= 3
ETH_DOCKER_DIR ?= $(HOME)/eth-docker

Good: Uses sensible defaults. Minor: The Makefile help text doesn't document these variables. Users might not know they can override them.

Security Considerations

12. Shell Injection Risk in `subprocess.run` (restart_stall_test.py:156-168)

def ethd(eth_docker_dir: str, *args) -> subprocess.CompletedProcess:
    cmd = ["./ethd"] + list(args)
    return subprocess.run(cmd, cwd=eth_docker_dir, capture_output=True, text=True)

Assessment: ✓ Safe - uses list form of subprocess (no shell=True), so no injection risk.

13. Environment Variable Injection

The script loads .env files and passes values to eth-docker config. If .env contains malicious content, it could be passed to Docker commands.

Assessment: Low risk - eth-docker is a trusted tool, and users control their own .env. But worth noting for defense-in-depth.

Positive Aspects

✅ Well-structured phased approach - Clean separation of sync vs restart testing
✅ Comprehensive logging - Saves logs after each phase
✅ Good timeout configurability - All timeouts exposed as env vars
✅ Proper use of subprocess - No shell injection vulnerabilities
✅ Detailed progress reporting - Users can monitor test progress
✅ Slack integration - Good for CI/CD notification
✅ Git context in notifications - Includes branch/commit for debugging

Recommendations Summary

CRITICAL: Fix timeout calculation inconsistency (issue docs: add milestones #1)
HIGH: Add error handling for file operations (issue build: add Github actions boilerplate #3)
MEDIUM: Log Slack notification failures (issue feat: add basic RPC api. #5)
MEDIUM: Add cleanup handlers for Ctrl+C (issue Add Dockerfile #7)
LOW: Improve error visibility in RPC calls (issue build: add Dockerfile. #8)
LOW: Document Makefile variables (issue feat: add fmt, test and clippy to CI #11)

Ethereum-Specific Considerations

Header Download Stall: The test correctly targets a real issue in sync clients. 15-minute stall timeout is reasonable for detecting genuine problems vs. temporary network hiccups.
Consensus + Execution Split: Good test design - only restarts execution layer while keeping consensus running, which mirrors real restart scenarios.
RPC Monitoring: Uses standard eth_syncing and eth_blockNumber - appropriate for detecting stalls.

Overall Assessment

The code is functionally sound and addresses a real problem with automated testing. The main issues are around edge case handling (timeouts, errors) rather than core logic. With the critical timeout fix and improved error handling, this is a valuable addition to the tooling suite.

Recommendation: Request changes for issue #1 (timeout calculation), then approve after fix.

Automated review by Claude (Anthropic) · custom prompt

Copilot

Pull request overview

Adds an automated reproduction script to detect intermittent ethrex restart stalls when running inside eth-docker, plus Makefile targets to run it consistently from tooling/sync.

Changes:

Introduce tooling/sync/restart_stall_test.py to orchestrate a fresh sync, then perform repeated execution-client restarts while monitoring RPC block progress and saving logs.
Add restart-stall-test and restart-stall-test-skip-sync Makefile targets to run the script with standard defaults.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 10 comments.

File	Description
tooling/sync/restart_stall_test.py	New eth-docker-based restart-stall reproduction script with RPC polling, log capture, and optional Slack notifications.
tooling/sync/Makefile	Adds convenience targets and variables for running the restart-stall test.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

tooling/sync/restart_stall_test.py

…ompts The ./ethd terminate command has an interactive Yes/No confirmation that blocks the script when running non-interactively in tmux. Replace it with docker compose down -v and docker compose up -d which work without prompts.

eth-docker doesn't publish the EL RPC port to the host by default. Adding el-shared.yml maps port 8545 to localhost so the monitoring script can poll eth_syncing and eth_blockNumber via RPC.

greptile-apps · 2026-02-06T19:01:30Z

Greptile Overview

Greptile Summary

Adds a new tooling/sync/restart_stall_test.py script to reproduce an intermittent “stall after restart” issue when running ethrex under eth-docker (ethrex EL + Prysm CL).
Extends tooling/sync/Makefile with targets to run the script end-to-end (configure eth-docker, fresh sync, then repeated restart cycles) or skip the initial sync.
Script drives eth-docker via ./ethd and docker compose, monitors ethrex via JSON-RPC (eth_syncing, eth_blockNumber), saves logs per phase/restart, and optionally posts Slack notifications.

Confidence Score: 3/5

This PR is mergeable after fixing a few correctness issues in the test script that can lead to misleading results.
The changes are isolated to tooling, but the new script currently has inconsistent timeout semantics and does not check subprocess failures, which can cause false stall/unresponsive reports and unreliable automation output.
tooling/sync/restart_stall_test.py

Important Files Changed

Filename	Overview
tooling/sync/Makefile	Adds `restart-stall-test` and `restart-stall-test-skip-sync` targets to run the new eth-docker-based reproduction script with configurable network/count/eth-docker directory.
tooling/sync/restart_stall_test.py	Introduces a Python automation script for reproducing restart-related sync stalls in eth-docker; main issues are inconsistent timeout semantics and lack of failure checking for subprocess calls/log collection assumptions.

Sequence Diagram

sequenceDiagram
    participant User
    participant Make as Makefile (tooling/sync)
    participant Script as restart_stall_test.py
    participant EthD as eth-docker (./ethd)
    participant Docker as docker compose
    participant RPC as Ethrex JSON-RPC
    participant Slack as Slack Webhook
    participant FS as Local FS (logs)

    User->>Make: make restart-stall-test
    Make->>Script: python3 restart_stall_test.py --configure ...

    alt --configure
        Script->>EthD: read default.env
        Script->>EthD: write .env overrides (COMPOSE_FILE, NETWORK, ...)
    end

    alt Phase 1 (unless --skip-phase1)
        Script->>EthD: ./ethd terminate
        Script->>EthD: ./ethd up
        loop until NODE_STARTUP_TIMEOUT
            Script->>RPC: eth_blockNumber
            RPC-->>Script: blockNumber/timeout
        end
        loop until SYNC_TIMEOUT
            Script->>RPC: eth_syncing
            RPC-->>Script: false/obj/timeout
        end
        loop for BLOCK_PROCESSING_DURATION
            Script->>RPC: eth_blockNumber
            RPC-->>Script: increasing/stall
        end
        Script->>Docker: docker compose logs execution/consensus
        Docker-->>Script: logs
        Script->>FS: write *_phase1.log
        Script->>Slack: notify phase1 complete/fail
    end

    loop restart_count
        Script->>Docker: docker compose stop execution
        Script->>Docker: docker compose start execution
        loop until NODE_STARTUP_TIMEOUT
            Script->>RPC: eth_blockNumber
            RPC-->>Script: blockNumber/timeout
        end
        loop until restart monitor timeout
            Script->>RPC: eth_syncing
            Script->>RPC: eth_blockNumber
            RPC-->>Script: progress/stall/unresponsive
        end
        Script->>Docker: docker compose logs execution/consensus
        Docker-->>Script: logs
        Script->>FS: write *_restartN.log
        Script->>Slack: notify on stall/unresponsive
    end

    Script->>FS: write summary.txt
    Script->>Slack: final summary

greptile-apps

_{2 files reviewed, 3 comments}

_{Edit Code Review Agent Settings | Greptile}

tooling/sync/restart_stall_test.py

Pass PYTHONUNBUFFERED=1 so output appears immediately when piped to tee in tmux. Add RESTART_TEST_FEE_RECIPIENT variable for the Ethereum address.

- Fix docstring to say prysm.yml (not lighthouse.yml) matching actual config - Anchor LOGS_DIR to script directory instead of cwd - Standardize SYNC_TIMEOUT to seconds (was minutes * 60, now direct seconds) - Extract BLOCK_STALL_TIMEOUT constant (was hard-coded 10*60) - Add check parameter to docker_compose_in_ethd, fail fast on non-zero exit - Retry initial block number fetch in wait_for_block_progress instead of or 0 - Pass stall_timeout as parameter to monitor_restart_for_stall instead of using global RESTART_STALL_TIMEOUT, fixing the mismatch between function timeout and stall detection threshold - Distinguish "timeout" (still progressing) from "stall" (no progress) in monitor_restart_for_stall return values - Log Slack notification failures instead of bare except pass - Use docker_compose_in_ethd for save_ethd_logs to ensure correct cwd/context

The configure_eth_docker function was not setting CHECKPOINT_SYNC_URL, causing Prysm to use the default.env value (hoodi) even when NETWORK=mainnet, resulting in a fatal fork mismatch and crash loop.

Slack webhooks are loaded regardless of where the script is launched from.

can optionally wipe all data volumes (EL, consensus, validator) and force a fresh snap sync from scratch. Includes wipe_data_volumes() helper that removes containers and volumes while preserving JWT, and a restart-stall-test-wipe Makefile target.

cycle now wipes all volumes and forces a fresh snap sync by default. Add --keep-data flag for the old stop/start behavior without wipe.

cycle wipes volumes, snap syncs from scratch, and verifies block progress. Use --restart-count N to limit cycles. Summary is printed and saved on exit.

ElFantasma · 2026-02-10T01:04:00Z

tooling/sync/restart_stall_test.py

+        return "unresponsive", f"Node never responded after {fmt_time(elapsed)}"
+
+    # Monitor: is it syncing? Is it making progress?
+    last_block = rpc_block_number(rpc_url) or 0


nit: rpc_block_number(rpc_url) or 0 conflates two states: node at block 0 (valid) and RPC unreachable (returns None). With the block-0 fix in rpc_block_number, this should use explicit None check:

last_block = rpc_block_number(rpc_url) if last_block is None: last_block = 0

ElFantasma · 2026-02-11T18:52:07Z

tooling/sync/restart_stall_test.py

+    docker_compose_in_ethd(eth_docker_dir, "rm", "-f", "-s", "execution", "consensus", check=True)
+
+    project = os.path.basename(eth_docker_dir)
+    volumes = [


Docker Compose normalizes the project name: it lowercases it, strips non-alphanumeric characters (except - and _), and replaces path separators. os.path.basename returns the raw directory name, so if it contains uppercase or special chars (e.g. Eth-Docker → project eth-docker), the volume names won't match and removal silently fails.

Suggestion: use docker compose config --format json to get the actual project name, or normalize to match Compose's behavior:

import re project = re.sub(r'[^a-z0-9_-]', '', os.path.basename(eth_docker_dir).lower())

Or list/inspect the actual volumes:

result = subprocess.run(["docker", "volume", "ls", "--filter", f"name={project}_", "--format", "{{.Name}}"], capture_output=True, text=True)

ElFantasma · 2026-02-11T18:53:01Z

tooling/sync/restart_stall_test.py

+
+def rpc_block_number(url: str):
+    result = rpc_call(url, "eth_blockNumber")
+    if result:


int("0x0", 16) is 0, which is falsy — so when the node is at block 0 (e.g. right after a wipe+restart), this returns None as if the RPC call failed.

Suggestion:

def rpc_block_number(url: str): result = rpc_call(url, "eth_blockNumber") if result is not None: return int(result, 16) return None

pablodeymo requested a review from a team as a code owner February 6, 2026 18:46

Copilot AI review requested due to automatic review settings February 6, 2026 18:46

github-actions bot assigned pablodeymo Feb 6, 2026

github-actions bot added the L1 Ethereum client label Feb 6, 2026

github-project-automation bot added this to ethrex_l1 Feb 6, 2026

Copilot started reviewing on behalf of pablodeymo February 6, 2026 18:47 View session

pablodeymo changed the title ~~tooling(l1): add restart stall reproduction test using eth-docker~~ test(l1): add restart stall reproduction test using eth-docker Feb 6, 2026

Copilot AI reviewed Feb 6, 2026

View reviewed changes

pablodeymo added 2 commits February 6, 2026 15:55

Add el-shared.yml to COMPOSE_FILE to expose RPC port to host

c1317fb

eth-docker doesn't publish the EL RPC port to the host by default. Adding el-shared.yml maps port 8545 to localhost so the monitoring script can poll eth_syncing and eth_blockNumber via RPC.

greptile-apps bot reviewed Feb 6, 2026

View reviewed changes

tooling/sync/restart_stall_test.py Outdated Show resolved Hide resolved

tooling/sync/restart_stall_test.py Outdated Show resolved Hide resolved

tooling/sync/restart_stall_test.py Outdated Show resolved Hide resolved

pablodeymo added 3 commits February 6, 2026 16:01

Add fee-recipient and PYTHONUNBUFFERED to Makefile restart-stall targets

0e83992

Pass PYTHONUNBUFFERED=1 so output appears immediately when piped to tee in tmux. Add RESTART_TEST_FEE_RECIPIENT variable for the Ethereum address.

Add CHECKPOINT_SYNC_URL to eth-docker configuration based on network.

29c2880

The configure_eth_docker function was not setting CHECKPOINT_SYNC_URL, causing Prysm to use the default.env value (hoodi) even when NETWORK=mainnet, resulting in a fatal fork mismatch and crash loop.

azteca1998 approved these changes Feb 9, 2026

View reviewed changes

pablodeymo added 4 commits February 9, 2026 16:16

Resolve .env path relative to script directory instead of CWD so

192ee9e

Slack webhooks are loaded regardless of where the script is launched from.

Make data wipe the default behavior in restart stall test. Each Phase 2

7edb2a8

cycle now wipes all volumes and forces a fresh snap sync by default. Add --keep-data flag for the old stop/start behavior without wipe.

Loop restart cycles indefinitely by default (Ctrl+C to stop). Each

b7dabc9

cycle wipes volumes, snap syncs from scratch, and verifies block progress. Use --restart-count N to limit cycles. Summary is printed and saved on exit.

ElFantasma reviewed Feb 11, 2026

View reviewed changes

Conversation

pablodeymo commented Feb 6, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Motivation

Description

How it works

Configuration

Timeouts (all configurable via env vars, all in seconds)

Makefile targets

Files changed

How to Test

Uh oh!

github-actions bot commented Feb 6, 2026

🤖 Kimi Code Review

Review Summary

✅ Strengths

⚠️ Issues Found

1. Race Condition in Phase 1 (restart_stall_test.py:396-398)

2. Potential Path Traversal (restart_stall_test.py:86-87)

3. Resource Leak on Early Exit (restart_stall_test.py:492-495)

4. Missing Docker Health Check (restart_stall_test.py:178-189)

5. Inconsistent Timeout Constants (restart_stall_test.py:23-27)

6. Missing Type Hints (restart_stall_test.py:1-597)

🔧 Minor Improvements

🎯 Security Recommendations

Uh oh!

github-actions bot commented Feb 6, 2026

🤖 Codex Code Review

Uh oh!

github-actions bot commented Feb 6, 2026

🤖 Claude Code Review

Code Review: PR #6151 - Restart Stall Reproduction Test

Overview

Critical Issues

1. Timeout Calculation Bug (restart_stall_test.py:36)

2. Race Condition in Block Progress Monitoring (restart_stall_test.py:274-290)

3. Incomplete Error Handling in configure_eth_docker (restart_stall_test.py:43-101)

Medium Priority Issues

4. Hardcoded Dockerfile Name (restart_stall_test.py:63)

5. Silent Slack Notification Failures (restart_stall_test.py:196)

6. Logs Collection Can Fail Silently (restart_stall_test.py:217)

7. Potential Docker Zombie Processes

Minor Issues / Style

8. Type Safety with rpc_call Return (restart_stall_test.py:138-144)

9. Inconsistent Time Units in Comments

10. Magic Number: 15 Minute Stall Detection (restart_stall_test.py:331)

11. Makefile: Missing Default Value Documentation

Security Considerations

12. Shell Injection Risk in subprocess.run (restart_stall_test.py:156-168)

13. Environment Variable Injection

Positive Aspects

Recommendations Summary

Ethereum-Specific Considerations

Overall Assessment

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

greptile-apps bot commented Feb 6, 2026

Greptile Overview

Greptile Summary

Confidence Score: 3/5

Important Files Changed

Sequence Diagram

Uh oh!

greptile-apps bot left a comment

Choose a reason for hiding this comment

Uh oh!

pablodeymo commented Feb 6, 2026 •

edited

Loading

3. Incomplete Error Handling in `configure_eth_docker` (restart_stall_test.py:43-101)

8. Type Safety with `rpc_call` Return (restart_stall_test.py:138-144)

12. Shell Injection Risk in `subprocess.run` (restart_stall_test.py:156-168)