Skip to content

Conversation

@gacevicljubisa
Copy link
Member

@gacevicljubisa gacevicljubisa commented Jan 30, 2026

Checklist

  • I have read the coding guide.
  • My change requires a documentation update, and I have done it.
  • I have added tests to cover my changes.
  • I have filled out the description and linked the related issues.

Description

The bee node was experiencing intermittent connection failures when communicating with blockchain RPC endpoints (especially HAProxy-backed services), resulting in errors like:

"could not get block number" "error"="Post \"http://rpc-sepolia-haproxy...\": EOF"

These EOF errors occur due to short timeout configurations on HAProxy which can prematurely close connections during RPC communication. Without retry logic, these transient network errors (EOF, connection resets) would cause operations to fail immediately, impacting node stability and reliability.

Implemented a custom retryRoundTripper that wraps the HTTP transport with retry logic

Key Features:

  • Automatic retries: Up to 3 retry attempts for transient errors
  • Exponential backoff: 100ms → 200ms → 400ms between retries (capped at 5s)
  • Error detection: Retries on EOF, connection reset, connection refused, and broken pipe errors
  • Request body handling: Rewinds request bodies between retries using GetBody() to avoid content-length mismatches
  • Uses http.DefaultTransport wrapped with retry logic

Open API Spec Version Changes (if applicable)

Motivation and Context (Optional)

Related Issue (Optional)

Screenshots (if appropriate):

for attempt := 0; attempt <= r.maxRetries; attempt++ {
if attempt > 0 {
backoffDuration := min(r.backoff*time.Duration(1<<uint(attempt-1)), 5*time.Second)
time.Sleep(backoffDuration)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of sleep, use time.After and req.Context in select to return as soon as possible if context is done.

// Reset request body for retry if possible
if req.Body != nil && req.GetBody != nil {
var err error
req.Body, err = req.GetBody()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This may be problematic if the Body is already read or closed. There is a check below for non-rewritable body, so it should be ok, I suppose.

In general, only idempotent requests should be retried, but in case of rpc, identifying idempotent requests are not easy. Getting information is ok, but posting information may present a problem by duplicating them.

Copy link
Member Author

@gacevicljubisa gacevicljubisa Jan 30, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe it is the simpliest to increase the values for the timeouts on the RPC side?
These align with HAProxy timeouts to prevent errors, but they look low?

func newHTTPClient() *http.Client {
    return &http.Client{
        Transport: &http.Transport{
            DialContext: (&net.Dialer{
                Timeout:   9 * time.Second, // < HAProxy timeout connect (10s)
                KeepAlive: 5 * time.Second, // Keep connections alive but close before idle timeout
            }).DialContext,
            IdleConnTimeout:       8 * time.Second, // < HAProxy timeout client (10s)
            TLSHandshakeTimeout:   9 * time.Second, // < HAProxy timeout connect (10s)
            ResponseHeaderTimeout: 9 * time.Second, // < HAProxy timeout server (10s)
        },
    }
}

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree, even reties would make the rpc endpoint do more work in this case. Increasing the timeout would be better in this case.

}

// isRetryableError determines if an error should trigger a retry.
func isRetryableError(err error) bool {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are these checks for error codes compatible with windows ?
AFAIK that windows have different error codes, maybe some unit tests can reveal it ?

}
}

return nil, fmt.Errorf("request failed after %d retries: %w", r.maxRetries, lastErr)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The loop above runs for attempt in 0 to maxRetries, for example with maxRetries = 3, there are 4 attempts, but the error says 3 retries.

@akrem-chabchoub
Copy link
Contributor

akrem-chabchoub commented Feb 2, 2026

Take a look on this package, it may help simplifying the implementation:
https://github.com/hashicorp/go-retryablehttp

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants