fix(chain): add rpc retry logic #5341
base: master
Conversation
for attempt := 0; attempt <= r.maxRetries; attempt++ {
	if attempt > 0 {
		backoffDuration := min(r.backoff*time.Duration(1<<uint(attempt-1)), 5*time.Second)
		time.Sleep(backoffDuration)
Instead of sleeping, use time.After and req.Context() in a select so we return as soon as possible if the context is done.
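A minimal sketch of that suggestion (the helper name is hypothetical; only the select pattern matters):

package rpcretry

import (
	"net/http"
	"time"
)

// waitBackoff waits for either the backoff timer or cancellation of the
// request context, instead of sleeping unconditionally.
func waitBackoff(req *http.Request, backoffDuration time.Duration) error {
	select {
	case <-req.Context().Done():
		// Caller cancelled or the deadline passed: stop retrying right away.
		return req.Context().Err()
	case <-time.After(backoffDuration):
		// Backoff elapsed: proceed with the next attempt.
		return nil
	}
}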
// Reset request body for retry if possible
if req.Body != nil && req.GetBody != nil {
	var err error
	req.Body, err = req.GetBody()
This may be problematic if the Body has already been read or closed. There is a check below for a non-rewindable body, so it should be fine, I suppose.
In general, only idempotent requests should be retried, but with RPC, identifying idempotent requests is not easy. Reading information is fine, but posting information may cause problems by duplicating it.
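For context, net/http populates GetBody automatically when the request body is a *bytes.Buffer, *bytes.Reader, or *strings.Reader, which is the typical case for JSON-RPC payloads. A hedged sketch (the helper name is illustrative, not from this PR):

package rpcretry

import (
	"bytes"
	"context"
	"net/http"
)

// newRPCRequest builds a request whose body can be replayed on retry.
// Because the body is a *bytes.Reader, net/http fills in req.GetBody,
// so the retry loop's req.Body, err = req.GetBody() can rewind it safely.
func newRPCRequest(ctx context.Context, url string, payload []byte) (*http.Request, error) {
	return http.NewRequestWithContext(ctx, http.MethodPost, url, bytes.NewReader(payload))
}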
Maybe it is simplest to increase the timeout values on the RPC side?
These align with the HAProxy timeouts to prevent errors, but they look low:
func newHTTPClient() *http.Client {
return &http.Client{
Transport: &http.Transport{
DialContext: (&net.Dialer{
Timeout: 9 * time.Second, // < HAProxy timeout connect (10s)
KeepAlive: 5 * time.Second, // Keep connections alive but close before idle timeout
}).DialContext,
IdleConnTimeout: 8 * time.Second, // < HAProxy timeout client (10s)
TLSHandshakeTimeout: 9 * time.Second, // < HAProxy timeout connect (10s)
ResponseHeaderTimeout: 9 * time.Second, // < HAProxy timeout server (10s)
},
}
}
I agree; even retries would make the RPC endpoint do more work here. Increasing the timeouts would be the better option in this case.
}

// isRetryableError determines if an error should trigger a retry.
func isRetryableError(err error) bool {
Are these error-code checks compatible with Windows?
AFAIK Windows uses different error codes; maybe some unit tests can reveal it?
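One way to keep the checks portable is to classify with errors.Is against sentinel errors and errno values instead of matching error strings; a sketch (the exact error set is an assumption, not what the PR implements):

package rpcretry

import (
	"errors"
	"io"
	"net"
	"syscall"
)

// isRetryableErrSketch classifies transient network failures without
// inspecting error strings, which differ between platforms. Whether
// errors.Is maps the Windows WSA error codes onto these errnos is exactly
// what the suggested unit tests should verify.
func isRetryableErrSketch(err error) bool {
	if err == nil {
		return false
	}
	var netErr net.Error
	if errors.As(err, &netErr) && netErr.Timeout() {
		return true
	}
	return errors.Is(err, io.EOF) ||
		errors.Is(err, io.ErrUnexpectedEOF) ||
		errors.Is(err, syscall.ECONNRESET) ||
		errors.Is(err, syscall.EPIPE)
}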
	}
}

return nil, fmt.Errorf("request failed after %d retries: %w", r.maxRetries, lastErr)
The loop above runs attempt from 0 to maxRetries inclusive, so with maxRetries = 3, for example, there are 4 attempts, but the error message says 3 retries.
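A possible wording fix, as a sketch (names mirror the diff; reporting attempts sidesteps the retries-vs-attempts ambiguity):

// attempt runs from 0 to maxRetries inclusive, so maxRetries+1 requests are sent.
return nil, fmt.Errorf("request failed after %d attempts: %w", r.maxRetries+1, lastErr)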
Take a look at this package; it may help simplify the implementation:
Checklist
Description
The bee node was experiencing intermittent connection failures when communicating with blockchain RPC endpoints (especially HAProxy-backed services), resulting in errors like:
These EOF errors occur because of short timeout configurations on HAProxy, which can prematurely close connections during RPC communication. Without retry logic, these transient network errors (EOF, connection resets) cause operations to fail immediately, impacting node stability and reliability.
Implemented a custom retryRoundTripper that wraps the HTTP transport with retry logic (a rough sketch of the general shape follows).
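For illustration only, not the PR's actual code: a minimal sketch of what wrapping an http.RoundTripper with retries looks like, assuming the field names from the diff (maxRetries, backoff) and omitting the error classification and backoff cap discussed above:

package rpcretry

import (
	"net/http"
	"time"
)

// retrySketch shows the general shape of a retrying http.RoundTripper.
type retrySketch struct {
	next       http.RoundTripper
	maxRetries int
	backoff    time.Duration
}

func (r *retrySketch) RoundTrip(req *http.Request) (*http.Response, error) {
	var lastErr error
	for attempt := 0; attempt <= r.maxRetries; attempt++ {
		if attempt > 0 {
			// Context-aware exponential backoff (the real diff also caps it at 5s).
			select {
			case <-req.Context().Done():
				return nil, req.Context().Err()
			case <-time.After(r.backoff * time.Duration(1<<uint(attempt-1))):
			}
			// Rewind the body if it is replayable; otherwise give up.
			if req.Body != nil {
				if req.GetBody == nil {
					return nil, lastErr
				}
				body, err := req.GetBody()
				if err != nil {
					return nil, err
				}
				req.Body = body
			}
		}
		resp, err := r.next.RoundTrip(req)
		if err == nil {
			return resp, nil
		}
		lastErr = err
	}
	return nil, lastErr
}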
Key Features:
Open API Spec Version Changes (if applicable)
Motivation and Context (Optional)
Related Issue (Optional)
Screenshots (if appropriate):