fix(batcher): stop listening to blocks when one of the rpcs disconnects #1961

Merged
merged 4 commits into from
Jun 4, 2025

Conversation

MarcosNicolau
Member

@MarcosNicolau MarcosNicolau commented Jun 3, 2025

Description

This PR improves how the batcher handles its WebSocket (ws) connections. The batcher maintains two ws connections: a primary and a fallback. Previously, if either connection failed during listen_to_new_blocks, the entire process would fail.

With this PR:

  • The batcher only returns an error if both connections fail. If at least one succeeds, the process continues.
  • Previously, a select call returned on the first event from either connection. When one connection dropped, its disconnect was the first event to return, so the whole listener failed. The new logic listens to both connections and only exits if both fail.
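As a rough illustration of the "only exit when both fail" rule, here is a minimal synchronous sketch using only the standard library. It is not the batcher's actual code: the `Source`, `Event`, and `BothDisconnected` types and the deduplication by block number are hypothetical, introduced only to model two connections feeding one listener.

```rust
// Hypothetical model, not the batcher's real types.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Source {
    Primary,
    Fallback,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Event {
    NewBlock(Source, u64),
    Disconnected(Source),
}

/// Returned only once BOTH connections have dropped.
#[derive(Debug, PartialEq, Eq)]
struct BothDisconnected;

/// Consumes events from both connections and calls `on_block` for each new
/// block number. A single disconnect is tolerated; the function errors only
/// when both sources are gone.
fn listen_to_new_blocks(
    events: impl IntoIterator<Item = Event>,
    mut on_block: impl FnMut(u64),
) -> Result<(), BothDisconnected> {
    let mut primary_alive = true;
    let mut fallback_alive = true;
    let mut last_seen: Option<u64> = None;

    for event in events {
        match event {
            Event::NewBlock(_, number) => {
                // Assumption: both nodes report the same chain, so a block
                // number that was already handled is skipped.
                if last_seen.map_or(true, |last| number > last) {
                    last_seen = Some(number);
                    on_block(number);
                }
            }
            Event::Disconnected(Source::Primary) => primary_alive = false,
            Event::Disconnected(Source::Fallback) => fallback_alive = false,
        }
        if !primary_alive && !fallback_alive {
            return Err(BothDisconnected);
        }
    }
    Ok(())
}
```

In this model, a `Disconnected(Source::Primary)` event alone does not end the loop; blocks from the fallback keep being handled until it disconnects too.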

The process is now as follows:

  1. Attempts to connect to both nodes.
  2. Listens for new blocks from both connections.
  3. If one connection fails, it continues listening on the other.
  4. If both fail, it retries the connection process from step 1.
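The four steps above can be sketched as a small, std-only loop. This is a simplified synchronous model under stated assumptions, not the batcher's async implementation: `Node`, `Stream`, `run_batcher_listener`, and the `connect` callback are hypothetical names, a stream ending stands in for a ws disconnect, and `max_rounds` exists only so the example terminates.

```rust
// Hypothetical model: an iterator of block numbers stands in for a ws
// subscription; the iterator ending stands in for a disconnect.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Node {
    Primary,
    Fallback,
}

type Stream = Box<dyn Iterator<Item = u64>>;

fn run_batcher_listener(
    mut connect: impl FnMut(Node) -> Option<Stream>,
    max_rounds: usize,
) -> Vec<u64> {
    let mut handled: Vec<u64> = Vec::new();

    for _round in 0..max_rounds {
        // Step 1: attempt to connect to both nodes.
        let mut primary = connect(Node::Primary);
        let mut fallback = connect(Node::Fallback);

        // Steps 2 and 3: keep listening while at least one connection is up.
        while primary.is_some() || fallback.is_some() {
            for slot in [&mut primary, &mut fallback] {
                let next = match slot.as_mut() {
                    Some(stream) => stream.next(),
                    None => continue, // already failed; no reconnect yet
                };
                match next {
                    // Assumption: skip block numbers that were already handled.
                    Some(block) => {
                        if handled.last().map_or(true, |&last| block > last) {
                            handled.push(block);
                        }
                    }
                    // This connection dropped; keep listening on the other.
                    None => *slot = None,
                }
            }
        }
        // Step 4: both connections failed, so the next round reconnects.
        // A real implementation would wait between attempts.
    }
    handled
}
```

Note that a failed connection is left as `None` until the other one also fails, which mirrors the behavior described in the PR: reconnection only happens once both are down.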

Note: when one connection fails, we don't try to reconnect it until both have failed. Adding that logic is not trivial, as we would need to create a new process that handles it in the background and deal with mutexes, etc.

How to test

  1. Start ethereum-package:
make ethereum_package_start
  2. Start the batcher:
make batcher_start_ethereum_package
  3. Locate the container IDs of the RPC nodes managed by ethereum-package in Docker:
docker ps

# You should be looking for:
5ae8707b5dbf   ghcr.io/paradigmxyz/reth:latest                       "/usr/local/bin/reth…"   24 minutes ago   Up 15 minutes   0.0.0.0:8549->8549/tcp, 0.0.0.0:8549->8549/udp, 30303/tcp, 30303/udp, 0.0.0.0:8552->8545/tcp, 0.0.0.0:8553->8546/tcp, 0.0.0.0:8550->8551/tcp, 0.0.0.0:8551->9001/tcp   el-2-reth-lighthouse--3e21610d756d4ef588441314a9733ca1
180a0bcae8c2   ghcr.io/paradigmxyz/reth:latest                       "/usr/local/bin/reth…"   24 minutes ago   Up 19 minutes   0.0.0.0:8542->8542/tcp, 0.0.0.0:8545-8546->8545-8546/tcp, 0.0.0.0:8542->8542/udp, 30303/tcp, 30303/udp, 0.0.0.0:8543->8551/tcp, 0.0.0.0:8544->9001/tcp                 el-1-reth-lighthouse--9b08863b6524476092e0770690968857
  4. Start and stop the containers in different combinations; you should see the behavior explained above.

Type of change

  • Bug fix

Checklist

  • “Hotfix” to testnet, everything else to staging
  • Linked to GitHub Issue
  • This change depends on code or research by an external entity
    • Acknowledgements were updated to give credit
  • Unit tests added
  • This change requires new documentation.
    • Documentation has been added/updated.
  • This change is an Optimization
    • Benchmarks added/run
  • Has a known issue
  • If your PR changes the Operator compatibility (Ex: Upgrade prover versions)
    • This PR adds compatibility for operators on both versions and does not change batcher/docs/examples
    • This PR updates batcher and docs/examples to the newer version. This requires that operators are already updated to be compatible

@MarcosNicolau MarcosNicolau self-assigned this Jun 3, 2025
@JuArce JuArce requested a review from Copilot June 3, 2025 18:05

@Copilot Copilot AI left a comment

Pull Request Overview

This PR fixes the batcher’s websocket behavior so that it only errors out when both primary and fallback connections fail. It revises the retry constants and reworks the block subscription logic by replacing tokio::select! with a join! based approach to concurrently await both streams.

  • Updated ETHEREUM_CALL_MAX_RETRY_DELAY from 3600 to 60 seconds.
  • Modified block subscription logic to use join! for awaiting responses from both primary and fallback providers.
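To make the effect of a maximum retry delay concrete, here is a small sketch that assumes capped exponential backoff. The PR page does not show how the batcher computes retry delays, so the `retry_delay` helper, the initial delay, and the growth factor are all hypothetical; only the constant name and the values 3600 and 60 come from the PR.

```rust
use std::time::Duration;

// Value introduced by this PR (previously 3600).
const ETHEREUM_CALL_MAX_RETRY_DELAY: u64 = 60; // seconds

/// Hypothetical helper: delay before retry number `attempt` (0-based),
/// growing by `factor` each time and never exceeding `max_secs`.
fn retry_delay(attempt: u32, initial_secs: u64, factor: u64, max_secs: u64) -> Duration {
    // Saturating arithmetic avoids overflow for large attempt counts.
    let uncapped = initial_secs.saturating_mul(factor.saturating_pow(attempt));
    Duration::from_secs(uncapped.min(max_secs))
}
```

Under these assumed parameters (1 second initial delay, factor 2), a 60-second cap is reached at the seventh attempt, whereas a 3600-second cap would let the wait between attempts keep growing up to an hour.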

Reviewed Changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.

File | Description
batcher/aligned-sdk/src/common/constants.rs | Adjusted constant values supporting Ethereum call retry logic.
batcher/aligned-batcher/src/lib.rs | Changed the mechanism for listening to new blocks with a join! based approach.
Comments suppressed due to low confidence (2)

batcher/aligned-sdk/src/common/constants.rs:44

  • The reduction of the retry delay from 3600 to 60 seconds is a significant change. Please add a comment or documentation explaining the rationale behind this new value to help maintainers understand its impact.
pub const ETHEREUM_CALL_MAX_RETRY_DELAY: u64 = 60; // seconds

batcher/aligned-batcher/src/lib.rs:374

  • Using join! here waits for both streams to respond, which might cause delays if one stream is slow or unresponsive. Consider using tokio::select! so that the code can process a block as soon as either stream provides one.
let (block_main, block_fallback) = join!( ... );

@MauroToscano
Contributor

Code seems fine

@MauroToscano MauroToscano added this pull request to the merge queue Jun 4, 2025
Merged via the queue into staging with commit 6ec1dc6 Jun 4, 2025
3 checks passed
@MauroToscano MauroToscano deleted the fix/batcher-rpc-fallback branch June 4, 2025 14:39
3 participants