Skip to content

test(l1): add restart stall reproduction test using eth-docker#6151

Open
pablodeymo wants to merge 10 commits intomainfrom
tooling/restart-stall-test
Open

test(l1): add restart stall reproduction test using eth-docker#6151
pablodeymo wants to merge 10 commits intomainfrom
tooling/restart-stall-test

Conversation

@pablodeymo
Copy link
Contributor

@pablodeymo pablodeymo commented Feb 6, 2026

Motivation

A Discord user reported that ethrex sometimes stalls while downloading headers after a restart when running inside eth-docker. This is intermittent and hard to catch manually, so we need an automated reproduction test.

Description

Adds tooling/sync/restart_stall_test.py — a Python script that automates the full reproduction cycle using eth-docker with ethrex (execution) + Prysm (consensus).

How it works

Phase 1 — Fresh snap sync:

  1. Stops any existing eth-docker containers and removes volumes (docker compose down -v)
  2. Starts eth-docker (docker compose up -d)
  3. Polls eth_syncing RPC until sync completes (8h timeout)
  4. Verifies block progress post-sync (22m of new blocks)
  5. Sends Slack notification on completion

Phase 2 — Continuous restart cycles (runs forever, Ctrl+C to stop):

Each cycle:

  1. Removes execution and consensus containers (docker compose rm -f -s)
  2. Removes all data volumes: ethrex-el-data, prysmconsensus-data, prysmvalidator-data (preserves JWT)
  3. Recreates containers with fresh volumes (docker compose up -d execution consensus)
  4. Waits for node RPC to come up
  5. Waits for full snap sync completion (8h timeout)
  6. Verifies block progress post-sync (22m)
  7. Reports PASS/FAIL with sync time
  8. Sends Slack notification on failure
  9. Saves execution + consensus logs
  10. Repeats from step 1

On Ctrl+C, prints a summary of all completed cycles and saves it to summary.txt.

Use --restart-count N to limit to N cycles instead of running forever.

Phase 2 with --keep-data — Quick restart (no wipe):

  1. Stops only the execution client (docker compose stop execution), keeping consensus running and volumes intact
  2. Restarts the execution client (docker compose start execution)
  3. Monitors for stall: watches eth_blockNumber progress, detects if the node stops advancing
  4. Reports result: ok (caught up), stall (no progress), timeout (still progressing but didn't catch up in time), or unresponsive (node never came back)

This mode is useful for testing restart recovery without re-syncing.

Configuration

The --configure flag auto-writes eth-docker's .env from default.env with:

  • COMPOSE_FILE=prysm.yml:ethrex.yml:el-shared.yml (el-shared.yml publishes the RPC port to the host for monitoring)
  • NETWORK (default: hoodi)
  • CHECKPOINT_SYNC_URL (auto-set based on network: mainnet, hoodi, holesky, sepolia)
  • FEE_RECIPIENT (optional, via --fee-recipient)
  • ETHREX_DOCKER_REPO=ghcr.io/lambdaclass/ethrex
  • Slack webhook URLs from the local .env

Uses docker compose directly instead of ./ethd commands to avoid interactive confirmation prompts when running non-interactively (e.g. in tmux).

Timeouts (all configurable via env vars, all in seconds)

Variable Default Description
SYNC_TIMEOUT 8 hours Max time for initial sync
BLOCK_PROCESSING_DURATION 22 min Post-sync block progress verification
BLOCK_STALL_TIMEOUT 10 min Time without progress during initial sync to declare stall
RESTART_STALL_TIMEOUT 15 min Time without progress after restart to declare stall
NODE_STARTUP_TIMEOUT 5 min Time for node to respond after restart
CHECK_INTERVAL 10 sec RPC polling interval

Makefile targets

Target Description
make restart-stall-test Full test: configure, sync, then continuous restart cycles (Ctrl+C to stop)
make restart-stall-test-skip-sync Skip phase 1 (node must already be synced)
make restart-stall-test-keep-data Restart without wiping data (stop/start only)

Variables: ETH_DOCKER_DIR (default: ~/eth-docker), RESTART_TEST_NETWORK (default: hoodi), RESTART_TEST_COUNT (default: 0 = infinite), RESTART_TEST_FEE_RECIPIENT (optional).

Files changed

File Change
tooling/sync/restart_stall_test.py Main script with Phase 1/2 logic, continuous cycling, data wipe, Slack notifications
tooling/sync/Makefile Add restart-stall-test, restart-stall-test-skip-sync, and restart-stall-test-keep-data targets

How to Test

Prerequisites: eth-docker cloned at ~/eth-docker, Docker installed.

cd tooling/sync

# Full test: configure, sync, then loop forever (Ctrl+C to stop)
make restart-stall-test \
  RESTART_TEST_NETWORK=mainnet \
  RESTART_TEST_FEE_RECIPIENT=0xYourAddress

# Limit to 5 cycles
make restart-stall-test \
  RESTART_TEST_NETWORK=mainnet \
  RESTART_TEST_COUNT=5

# Skip initial sync (node already synced)
make restart-stall-test-skip-sync \
  RESTART_TEST_NETWORK=mainnet

# Quick restart without wipe (stop/start only)
make restart-stall-test-keep-data \
  RESTART_TEST_NETWORK=mainnet

# Or directly with Python
PYTHONUNBUFFERED=1 python3 restart_stall_test.py \
  --eth-docker-dir ~/eth-docker \
  --configure \
  --network mainnet \
  --fee-recipient 0xYourAddress \
  --ethrex-dir ~/ethrex

Logs are saved to tooling/sync/restart_stall_logs/run_YYYYMMDD_HHMMSS/.

Add a Python script and Makefile targets to reproduce the snap sync restart
stall bug reported on Discord, where ethrex stalls downloading headers after
a restart.

The test uses eth-docker (with ethrex + Prysm) and runs in two phases:
- Phase 1: Fresh snap sync from scratch, wait for completion and block progress
- Phase 2: Stop only the execution client (keeping consensus + volumes),
  restart it, and monitor for header download stall

The restart phase is repeated multiple times (default 3) since the stall is
intermittent. Slack notifications are sent at each phase transition using the
same webhook pattern as the existing multisync monitoring.

The --configure flag auto-writes eth-docker's .env for ethrex + Prysm with
the specified network and fee recipient address.
@pablodeymo pablodeymo requested a review from a team as a code owner February 6, 2026 18:46
Copilot AI review requested due to automatic review settings February 6, 2026 18:46
@github-actions github-actions bot added the L1 Ethereum client label Feb 6, 2026
@github-actions
Copy link

github-actions bot commented Feb 6, 2026

🤖 Kimi Code Review

Review Summary

This PR adds a comprehensive restart stall test for ethrex using eth-docker. The code is well-structured and follows good practices for testing infrastructure. Here are my findings:

Strengths

  • Clean separation between Makefile targets and Python implementation
  • Good error handling and logging throughout
  • Proper timeout handling with configurable environment variables
  • Comprehensive monitoring of sync progress and stall detection
  • Slack integration for notifications
  • Detailed logging and summary generation

⚠️ Issues Found

1. Race Condition in Phase 1 (restart_stall_test.py:396-398)

print("Starting eth-docker...")
ethd(eth_docker_dir, "up")
time.sleep(30)  # Arbitrary sleep

Issue: Fixed 30-second sleep doesn't guarantee containers are ready. Use wait_for_node() immediately after starting.

2. Potential Path Traversal (restart_stall_test.py:86-87)

env_file = os.path.join(eth_docker_dir, ".env")

Issue: eth_docker_dir comes from user input and isn't validated. Add path validation:

eth_docker_dir = os.path.abspath(args.eth_docker_dir)
if not eth_docker_dir.startswith(os.path.expanduser("~")):
    raise ValueError("eth-docker-dir must be within home directory")

3. Resource Leak on Early Exit (restart_stall_test.py:492-495)

Issue: When phase1_fresh_sync fails, containers aren't properly cleaned up. Add cleanup in error paths.

4. Missing Docker Health Check (restart_stall_test.py:178-189)

Issue: wait_for_node() only checks RPC availability, not actual sync status. Consider adding:

def wait_for_docker_health(eth_docker_dir: str, service: str, timeout: int) -> bool:
    """Wait for docker service to be healthy."""

5. Inconsistent Timeout Constants (restart_stall_test.py:23-27)

SYNC_TIMEOUT = int(os.environ.get("SYNC_TIMEOUT", 8 * 60)) * 60  # 480 minutes

Issue: Comment says 8h but calculation gives 8 minutes. Should be:

SYNC_TIMEOUT = int(os.environ.get("SYNC_TIMEOUT", 8)) * 3600  # 8 hours

6. Missing Type Hints (restart_stall_test.py:1-597)

Issue: Several functions lack return type annotations. Add for consistency:

def fmt_time(secs: float) -> str:
def git_info(cwd: str = None) -> tuple[str, str]:

🔧 Minor Improvements

  1. Makefile: Add .PHONY declarations for new targets
  2. Python: Use pathlib.Path consistently instead of os.path
  3. Logging: Consider using Python's logging module instead of print statements
  4. Configuration: Validate network parameter against known networks
  5. Error Handling: Add more specific exception types instead of generic Exception

🎯 Security Recommendations

  1. Input Validation: Sanitize all user-provided paths and parameters
  2. Command Injection: Use subprocess.run() with list arguments (already done ✅)
  3. File Permissions: Ensure log directory has appropriate permissions
  4. Network Security: Consider validating RPC URLs and adding authentication

The code is production-ready with the above fixes applied. The test framework provides excellent observability for debugging sync restart issues.


Automated review by Kimi (Moonshot AI) · custom prompt

@pablodeymo pablodeymo changed the title tooling(l1): add restart stall reproduction test using eth-docker test(l1): add restart stall reproduction test using eth-docker Feb 6, 2026
@github-actions
Copy link

github-actions bot commented Feb 6, 2026

🤖 Codex Code Review

Findings

  • In tooling/sync/restart_stall_test.py:34-39, the comment says “timeouts (in seconds)”, but SYNC_TIMEOUT multiplies env value by 60 while others don’t. This makes SYNC_TIMEOUT effectively minutes while others are seconds; users setting SYNC_TIMEOUT in seconds will get 60x longer waits. Suggest: either remove * 60 or update comment and env var name to indicate minutes.
  • In tooling/sync/restart_stall_test.py:156-170, tooling/sync/restart_stall_test.py:370-377, tooling/sync/restart_stall_test.py:407-415, and tooling/sync/restart_stall_test.py:211-216, subprocess return codes are ignored. If ethd or docker compose fails, the script can keep going and misclassify failures as stalls. Suggest: check returncode, print stderr, and sys.exit(1) (or raise) on failure.

Notes

  • This PR only adds tooling (Makefile + Python test script). No EVM/gas/accounting, consensus, trie, RLP, or validation logic changes.

Automated review by OpenAI Codex · custom prompt

@github-actions
Copy link

github-actions bot commented Feb 6, 2026

🤖 Claude Code Review

Now I have enough context. Let me review the PR thoroughly.

Code Review: PR #6151 - Restart Stall Reproduction Test

Overview

This PR adds a Python test script to reproduce an intermittent restart stall bug where ethrex stops downloading headers after a restart. The test automates the reproduction using eth-docker with ethrex + Prysm.

Critical Issues

1. Timeout Calculation Bug (restart_stall_test.py:36)

SYNC_TIMEOUT = int(os.environ.get("SYNC_TIMEOUT", 8 * 60)) * 60  # default 8h

Issue: The default calculation is wrong. 8 * 60 = 480, then * 60 = 28,800 seconds = 8 hours ✓. But if a user sets SYNC_TIMEOUT=8 (expecting 8 hours), they'll get 8 * 60 = 480 seconds = 8 minutes ✗.

Impact: User-provided environment variables will be interpreted in seconds, not hours, causing premature timeouts.

Fix: Either:

  • Document that env vars must be in seconds, OR
  • Change to: SYNC_TIMEOUT = int(os.environ.get("SYNC_TIMEOUT", 8 * 3600))

Same issue exists for:

  • BLOCK_PROCESSING_DURATION (line 36) - mixes minutes and seconds
  • RESTART_STALL_TIMEOUT (line 37)
  • NODE_STARTUP_TIMEOUT (line 38)

2. Race Condition in Block Progress Monitoring (restart_stall_test.py:274-290)

if block is None:
    if time.time() - last_block_time > stall_timeout:
        print(f"  Node stopped responding for {fmt_time(stall_timeout)}")
        return False, last_block - initial_block
elif block > last_block:
    last_block = block
    last_block_time = time.time()
elif time.time() - last_block_time > stall_timeout:
    print(f"  Block stalled at {last_block} for {fmt_time(stall_timeout)}")
    return False, last_block - initial_block

Issue: When block is None but hasn't exceeded stall_timeout, the code continues without updating last_status_print or sleeping. This can lead to tight loops if RPC becomes intermittently unavailable.

Recommendation: Add consistent sleep interval even when block is None.

3. Incomplete Error Handling in configure_eth_docker (restart_stall_test.py:43-101)

with open(default_env) as f:
    lines = f.readlines()

Issue: No exception handling for file I/O operations. If default.env is unreadable (permissions), the script crashes without cleanup.

Recommendation: Add try-except around file operations with meaningful error messages.

Medium Priority Issues

4. Hardcoded Dockerfile Name (restart_stall_test.py:63)

"ETHREX_DOCKERFILE": "Dockerfile.binary",

Issue: This hardcodes a specific Dockerfile. If the project renames this file or uses a different build strategy, the test breaks silently.

Recommendation: Make this configurable or verify the file exists in ethrex repo.

5. Silent Slack Notification Failures (restart_stall_test.py:196)

try:
    requests.post(url, json={"blocks": blocks}, timeout=10)
except Exception:
    pass

Issue: All Slack failures are silently swallowed. Users won't know if notifications failed due to network issues, invalid webhook URLs, or API errors.

Recommendation: Log failures at minimum: except Exception as e: print(f" [Slack notification failed: {e}]")

6. Logs Collection Can Fail Silently (restart_stall_test.py:217)

except Exception as e:
    print(f"  Failed to save {service} logs: {e}")

Good: Error is printed. Concern: After a stall detection, logs are critical for debugging. Script continues even if log collection fails, potentially losing valuable diagnostic data.

Recommendation: Consider warning users more prominently or making log collection failures non-silent in CI environments.

7. Potential Docker Zombie Processes

The script doesn't ensure cleanup on Ctrl+C or exceptions. If the script crashes during Phase 1 or 2, eth-docker containers may be left running.

Recommendation: Add signal handlers or use try/finally blocks to ensure ethd down or at least print instructions for manual cleanup.

Minor Issues / Style

8. Type Safety with rpc_call Return (restart_stall_test.py:138-144)

def rpc_call(url: str, method: str, params=None):
    try:
        # ...
        return resp.json().get("result")
    except Exception:
        return None

Issue: Returns None on any exception (network, JSON parsing, missing "result" key). Callers can't distinguish between "sync complete" (False) and "error" (None).

Impact: In wait_for_sync (line 243), the logic treats None (error) differently than False (synced), which is good. But the broad except Exception masks all errors (DNS, timeout, invalid JSON).

Recommendation: Consider more specific exception handling or logging errors for debugging.

9. Inconsistent Time Units in Comments

  • Line 36: 8 * 60)) * 60 # default 8h - confusing mix of multiplication
  • Line 36: 22 * 60)) # default 22m - units in seconds but comment says minutes

Recommendation: Standardize comments to match actual units after calculation.

10. Magic Number: 15 Minute Stall Detection (restart_stall_test.py:331)

elif time.time() - last_progress_time > RESTART_STALL_TIMEOUT:

But the timeout is defined as 15 minutes (line 37). For header download stalls in Ethereum, 15 minutes seems reasonable, but this should be justified or made more configurable based on network conditions.

11. Makefile: Missing Default Value Documentation

RESTART_TEST_NETWORK ?= hoodi
RESTART_TEST_COUNT ?= 3
ETH_DOCKER_DIR ?= $(HOME)/eth-docker

Good: Uses sensible defaults. Minor: The Makefile help text doesn't document these variables. Users might not know they can override them.

Security Considerations

12. Shell Injection Risk in subprocess.run (restart_stall_test.py:156-168)

def ethd(eth_docker_dir: str, *args) -> subprocess.CompletedProcess:
    cmd = ["./ethd"] + list(args)
    return subprocess.run(cmd, cwd=eth_docker_dir, capture_output=True, text=True)

Assessment: ✓ Safe - uses list form of subprocess (no shell=True), so no injection risk.

13. Environment Variable Injection

The script loads .env files and passes values to eth-docker config. If .env contains malicious content, it could be passed to Docker commands.

Assessment: Low risk - eth-docker is a trusted tool, and users control their own .env. But worth noting for defense-in-depth.

Positive Aspects

Well-structured phased approach - Clean separation of sync vs restart testing
Comprehensive logging - Saves logs after each phase
Good timeout configurability - All timeouts exposed as env vars
Proper use of subprocess - No shell injection vulnerabilities
Detailed progress reporting - Users can monitor test progress
Slack integration - Good for CI/CD notification
Git context in notifications - Includes branch/commit for debugging

Recommendations Summary

  1. CRITICAL: Fix timeout calculation inconsistency (issue docs: add milestones #1)
  2. HIGH: Add error handling for file operations (issue build: add Github actions boilerplate #3)
  3. MEDIUM: Log Slack notification failures (issue feat: add basic RPC api. #5)
  4. MEDIUM: Add cleanup handlers for Ctrl+C (issue Add Dockerfile #7)
  5. LOW: Improve error visibility in RPC calls (issue build: add Dockerfile. #8)
  6. LOW: Document Makefile variables (issue feat: add fmt, test and clippy to CI #11)

Ethereum-Specific Considerations

  • Header Download Stall: The test correctly targets a real issue in sync clients. 15-minute stall timeout is reasonable for detecting genuine problems vs. temporary network hiccups.
  • Consensus + Execution Split: Good test design - only restarts execution layer while keeping consensus running, which mirrors real restart scenarios.
  • RPC Monitoring: Uses standard eth_syncing and eth_blockNumber - appropriate for detecting stalls.

Overall Assessment

The code is functionally sound and addresses a real problem with automated testing. The main issues are around edge case handling (timeouts, errors) rather than core logic. With the critical timeout fix and improved error handling, this is a valuable addition to the tooling suite.

Recommendation: Request changes for issue #1 (timeout calculation), then approve after fix.


Automated review by Claude (Anthropic) · custom prompt

Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds an automated reproduction script to detect intermittent ethrex restart stalls when running inside eth-docker, plus Makefile targets to run it consistently from tooling/sync.

Changes:

  • Introduce tooling/sync/restart_stall_test.py to orchestrate a fresh sync, then perform repeated execution-client restarts while monitoring RPC block progress and saving logs.
  • Add restart-stall-test and restart-stall-test-skip-sync Makefile targets to run the script with standard defaults.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 10 comments.

File Description
tooling/sync/restart_stall_test.py New eth-docker-based restart-stall reproduction script with RPC polling, log capture, and optional Slack notifications.
tooling/sync/Makefile Adds convenience targets and variables for running the restart-stall test.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

…ompts

The ./ethd terminate command has an interactive Yes/No confirmation that
blocks the script when running non-interactively in tmux. Replace it with
docker compose down -v and docker compose up -d which work without prompts.
eth-docker doesn't publish the EL RPC port to the host by default.
Adding el-shared.yml maps port 8545 to localhost so the monitoring
script can poll eth_syncing and eth_blockNumber via RPC.
@greptile-apps
Copy link

greptile-apps bot commented Feb 6, 2026

Greptile Overview

Greptile Summary

  • Adds a new tooling/sync/restart_stall_test.py script to reproduce an intermittent “stall after restart” issue when running ethrex under eth-docker (ethrex EL + Prysm CL).
  • Extends tooling/sync/Makefile with targets to run the script end-to-end (configure eth-docker, fresh sync, then repeated restart cycles) or skip the initial sync.
  • Script drives eth-docker via ./ethd and docker compose, monitors ethrex via JSON-RPC (eth_syncing, eth_blockNumber), saves logs per phase/restart, and optionally posts Slack notifications.

Confidence Score: 3/5

  • This PR is mergeable after fixing a few correctness issues in the test script that can lead to misleading results.
  • The changes are isolated to tooling, but the new script currently has inconsistent timeout semantics and does not check subprocess failures, which can cause false stall/unresponsive reports and unreliable automation output.
  • tooling/sync/restart_stall_test.py

Important Files Changed

Filename Overview
tooling/sync/Makefile Adds restart-stall-test and restart-stall-test-skip-sync targets to run the new eth-docker-based reproduction script with configurable network/count/eth-docker directory.
tooling/sync/restart_stall_test.py Introduces a Python automation script for reproducing restart-related sync stalls in eth-docker; main issues are inconsistent timeout semantics and lack of failure checking for subprocess calls/log collection assumptions.

Sequence Diagram

sequenceDiagram
    participant User
    participant Make as Makefile (tooling/sync)
    participant Script as restart_stall_test.py
    participant EthD as eth-docker (./ethd)
    participant Docker as docker compose
    participant RPC as Ethrex JSON-RPC
    participant Slack as Slack Webhook
    participant FS as Local FS (logs)

    User->>Make: make restart-stall-test
    Make->>Script: python3 restart_stall_test.py --configure ...

    alt --configure
        Script->>EthD: read default.env
        Script->>EthD: write .env overrides (COMPOSE_FILE, NETWORK, ...)
    end

    alt Phase 1 (unless --skip-phase1)
        Script->>EthD: ./ethd terminate
        Script->>EthD: ./ethd up
        loop until NODE_STARTUP_TIMEOUT
            Script->>RPC: eth_blockNumber
            RPC-->>Script: blockNumber/timeout
        end
        loop until SYNC_TIMEOUT
            Script->>RPC: eth_syncing
            RPC-->>Script: false/obj/timeout
        end
        loop for BLOCK_PROCESSING_DURATION
            Script->>RPC: eth_blockNumber
            RPC-->>Script: increasing/stall
        end
        Script->>Docker: docker compose logs execution/consensus
        Docker-->>Script: logs
        Script->>FS: write *_phase1.log
        Script->>Slack: notify phase1 complete/fail
    end

    loop restart_count
        Script->>Docker: docker compose stop execution
        Script->>Docker: docker compose start execution
        loop until NODE_STARTUP_TIMEOUT
            Script->>RPC: eth_blockNumber
            RPC-->>Script: blockNumber/timeout
        end
        loop until restart monitor timeout
            Script->>RPC: eth_syncing
            Script->>RPC: eth_blockNumber
            RPC-->>Script: progress/stall/unresponsive
        end
        Script->>Docker: docker compose logs execution/consensus
        Docker-->>Script: logs
        Script->>FS: write *_restartN.log
        Script->>Slack: notify on stall/unresponsive
    end

    Script->>FS: write summary.txt
    Script->>Slack: final summary
Loading

Copy link

@greptile-apps greptile-apps bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

2 files reviewed, 3 comments

Edit Code Review Agent Settings | Greptile

Pass PYTHONUNBUFFERED=1 so output appears immediately when piped to tee
in tmux. Add RESTART_TEST_FEE_RECIPIENT variable for the Ethereum address.
- Fix docstring to say prysm.yml (not lighthouse.yml) matching actual config
- Anchor LOGS_DIR to script directory instead of cwd
- Standardize SYNC_TIMEOUT to seconds (was minutes * 60, now direct seconds)
- Extract BLOCK_STALL_TIMEOUT constant (was hard-coded 10*60)
- Add check parameter to docker_compose_in_ethd, fail fast on non-zero exit
- Retry initial block number fetch in wait_for_block_progress instead of or 0
- Pass stall_timeout as parameter to monitor_restart_for_stall instead of
  using global RESTART_STALL_TIMEOUT, fixing the mismatch between function
  timeout and stall detection threshold
- Distinguish "timeout" (still progressing) from "stall" (no progress) in
  monitor_restart_for_stall return values
- Log Slack notification failures instead of bare except pass
- Use docker_compose_in_ethd for save_ethd_logs to ensure correct cwd/context
The configure_eth_docker function was not setting CHECKPOINT_SYNC_URL, causing
Prysm to use the default.env value (hoodi) even when NETWORK=mainnet, resulting
in a fatal fork mismatch and crash loop.
Slack webhooks are loaded regardless of where the script is launched from.
can optionally wipe all data volumes (EL, consensus, validator) and
force a fresh snap sync from scratch. Includes wipe_data_volumes()
helper that removes containers and volumes while preserving JWT,
and a restart-stall-test-wipe Makefile target.
cycle now wipes all volumes and forces a fresh snap sync by default.
Add --keep-data flag for the old stop/start behavior without wipe.
cycle wipes volumes, snap syncs from scratch, and verifies block
progress. Use --restart-count N to limit cycles. Summary is printed
and saved on exit.
return "unresponsive", f"Node never responded after {fmt_time(elapsed)}"

# Monitor: is it syncing? Is it making progress?
last_block = rpc_block_number(rpc_url) or 0
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: rpc_block_number(rpc_url) or 0 conflates two states: node at block 0 (valid) and RPC unreachable (returns None). With the block-0 fix in rpc_block_number, this should use explicit None check:

last_block = rpc_block_number(rpc_url)
if last_block is None:
    last_block = 0

docker_compose_in_ethd(eth_docker_dir, "rm", "-f", "-s", "execution", "consensus", check=True)

project = os.path.basename(eth_docker_dir)
volumes = [
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Docker Compose normalizes the project name: it lowercases it, strips non-alphanumeric characters (except - and _), and replaces path separators. os.path.basename returns the raw directory name, so if it contains uppercase or special chars (e.g. Eth-Docker → project eth-docker), the volume names won't match and removal silently fails.

Suggestion: use docker compose config --format json to get the actual project name, or normalize to match Compose's behavior:

import re
project = re.sub(r'[^a-z0-9_-]', '', os.path.basename(eth_docker_dir).lower())

Or list/inspect the actual volumes:

result = subprocess.run(["docker", "volume", "ls", "--filter", f"name={project}_", "--format", "{{.Name}}"], capture_output=True, text=True)


def rpc_block_number(url: str):
result = rpc_call(url, "eth_blockNumber")
if result:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

int("0x0", 16) is 0, which is falsy — so when the node is at block 0 (e.g. right after a wipe+restart), this returns None as if the RPC call failed.

Suggestion:

def rpc_block_number(url: str):
    result = rpc_call(url, "eth_blockNumber")
    if result is not None:
        return int(result, 16)
    return None

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

L1 Ethereum client

Projects

Status: No status

Development

Successfully merging this pull request may close these issues.

4 participants