Add exponential retry mechanism for RPC requests using utility function #593
base: main
Conversation
I left a handful of comments, but overall the backoff utility looks good.
There are a couple of other places where exponential backoff would make sense. Can you please integrate the utility in these places as well?
- P2P signature aggregation. If this proves to be non-trivial, we can defer it to a later ticket.
- The existing call-with-retry utility and its call sites.
utils/backoff.go
Outdated
// WithMaxRetriesLog runs the operation until it succeeds or max retries has been reached.
// It uses exponential back off.
// It optionally logs information if logger is set.
func WithMaxRetriesLog(
Rather than separate with/without log functions, I think we should instead have a single method that takes a `logging.Logger` and emits warning logs on retries. I can't think of any use cases where we'd want retry attempts to be logged as Warn for some operations, but not logged at all for others.
utils/backoff.go
Outdated
msg string,
fields ...zapcore.Field,
I think logging `msg` and `fields` on each attempt is not worth the added complexity of having to pass them in as arguments. Rather, `WithMaxRetries` should emit a generic warning log on each attempt failure, and we can leave it to the caller to construct a cohesive error log in the failure case.
utils/backoff.go
Outdated
fields ...zapcore.Field,
) error {
	attempt := uint(1)
	expBackOff := backoff.WithMaxRetries(backoff.NewExponentialBackOff(), max)
Let's use `backoff.WithMaxElapsedTime` instead. At present, we choose the number of retries and the delay between each so that they resolve to a set elapsed time before we emit an error.
If I understand correctly, you want to use `backoff.WithMaxElapsedTime` instead of `backoff.WithMaxRetries`. Instead of giving a `maxRetry` to the utility function, I will give a `maxElapsedTime` derived from the existing values (`numberOfRetries * delayBetweenEach`).
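Putting the suggestions above together (a single helper that takes a `logging.Logger`, emits a generic warning on each failed attempt, and bounds retries by elapsed time), a minimal sketch could look like the following, assuming `github.com/cenkalti/backoff/v4` and avalanchego's `logging` package; the function name and exact signature here are illustrative, not the final API:

```go
package utils

import (
	"time"

	"github.com/ava-labs/avalanchego/utils/logging"
	"github.com/cenkalti/backoff/v4"
	"go.uber.org/zap"
)

// WithMaxElapsedTime retries the operation with exponential backoff until it
// succeeds or maxElapsedTime has elapsed, emitting a generic Warn log on each
// failed attempt.
func WithMaxElapsedTime(
	logger logging.Logger,
	maxElapsedTime time.Duration,
	operation backoff.Operation,
) error {
	expBackOff := backoff.NewExponentialBackOff()
	expBackOff.MaxElapsedTime = maxElapsedTime

	attempt := 1
	notify := func(err error, delay time.Duration) {
		logger.Warn(
			"operation failed, retrying with exponential backoff",
			zap.Int("attempt", attempt),
			zap.Duration("retryIn", delay),
			zap.Error(err),
		)
		attempt++
	}
	return backoff.RetryNotify(operation, expBackOff, notify)
}
```

`backoff.RetryNotify` invokes the notify callback with the error and the upcoming delay before each retry, which provides the generic per-attempt warning without threading `msg`/`fields` through the API.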
relayer/application_relayer.go
Outdated
r.logger.Warn(
	"Failed to get aggregate signature from node endpoint",
	zap.Int("attempts", maxRelayerQueryAttempts),
err = utils.WithMaxRetriesLog(
Just a heads-up: the Warp API integration is planned to be deprecated in the near future. It's still reasonable to integrate exponential backoff here for the time being.
@@ -309,10 +311,10 @@ func (s *SignatureAggregator) CreateSignedMessage(
	if err != nil {
		// don't increase node failures metric here, because we did
		// it in handleResponse
-		return nil, fmt.Errorf(
+		return backoff.Permanent(fmt.Errorf(
`backoff.Permanent` will prevent the backoff from retrying.
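For illustration, a small sketch of this behavior with `cenkalti/backoff/v4` (the HTTP call and helper name are hypothetical, not code from this PR): wrapping an error in `backoff.Permanent` stops the retry loop immediately and returns that error, while other errors keep being retried.

```go
package example

import (
	"fmt"
	"net/http"

	"github.com/cenkalti/backoff/v4"
)

// fetch retries a request with exponential backoff, but treats a 404 as a
// permanent error so the backoff loop stops immediately instead of retrying.
func fetch(url string) error {
	operation := func() error {
		resp, err := http.Get(url)
		if err != nil {
			return err // transient network error: retried with backoff
		}
		defer resp.Body.Close()
		if resp.StatusCode == http.StatusNotFound {
			// Retrying will never succeed, so short-circuit the backoff loop.
			return backoff.Permanent(fmt.Errorf("resource not found: %s", url))
		}
		return nil
	}
	return backoff.Retry(operation, backoff.NewExponentialBackOff())
}
```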
@@ -15,7 +17,9 @@ func TestWithMaxRetries(t *testing.T) {
		_, err = retryable.Run()
		return err
	},
-	2,
+	// using default values: we want to run max 2 tries.
+	624*time.Millisecond,
The value is coming from the table provided in the library: https://github.com/cenkalti/backoff/blob/720b78985a65c0452fd37bb155c7cac4157a7c45/exponential.go#L39-L50
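For reference, a sketch showing the defaults of `backoff.NewExponentialBackOff()` in `cenkalti/backoff/v4`, from which the linked table is derived (illustrative only, not code from this PR):

```go
package example

import (
	"fmt"

	"github.com/cenkalti/backoff/v4"
)

// printDefaults prints the default exponential backoff parameters:
// InitialInterval 500ms, RandomizationFactor 0.5, Multiplier 1.5,
// MaxInterval 60s, MaxElapsedTime 15m. Each retry interval is the previous
// interval times Multiplier, randomized within ±RandomizationFactor.
func printDefaults() {
	b := backoff.NewExponentialBackOff()
	fmt.Println(b.InitialInterval, b.RandomizationFactor, b.Multiplier, b.MaxInterval, b.MaxElapsedTime)
}
```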
types/types.go
Outdated
})
	return err
}
err = utils.WithMaxRetries(operation, 5*time.Second, logger)
I could use `utils.DefaultRPCRetryTimeout`, but the variable name is confusing, as it is not related to the backoff strategy.
It's still a good idea to use a variable rather than a hardcoded time. Let's rename `DefaultRPCRetryTimeout` to `DefaultRPCTimeout` or similar and use that here.
I left a handful of minor comments, but overall this looks great!
receipt, err = destinationClient.Client().(ethclient.Client).TransactionReceipt(callCtx, txHash)
	return err
}
err := utils.WithMaxRetries(operation, 30*time.Second, m.logger)
Let's replace `30*time.Second` with a constant default timeout variable named `DefaultBlockAcceptanceTimeout` or similar. I think it makes the most sense to define this const here in the `teleporter` package, since it's specific to this use case.
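A minimal sketch of that suggestion (the const name follows the review comment and the value comes from the diff above; the exact placement is an assumption):

```go
package teleporter

import "time"

const (
	// DefaultBlockAcceptanceTimeout bounds how long we retry fetching a
	// transaction receipt before giving up.
	DefaultBlockAcceptanceTimeout = 30 * time.Second
)
```

The call site above would then read `utils.WithMaxRetries(operation, DefaultBlockAcceptanceTimeout, m.logger)`.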
utils/backoff.go
Outdated
operation backoff.Operation,
maxElapsedTime time.Duration,
logger logging.Logger,
nit: We typically pass the `logging.Logger` as the first argument to functions that accept one.
utils/backoff.go
Outdated
|
// WithMaxRetries uses an exponential backoff to run the operation until it
// succeeds or max elapsed time has been reached.
func WithMaxRetries(
Let's rename this to reflect that we're retrying over a specified time interval, rather than a fixed number of retry attempts.
vms/evm/subscriber.go
Outdated
MaxBlocksPerRequest = 200
rpcMaxRetries = 5
retryMaxElapsedTime = 5 * time.Second
Let's rename this to `resubscribeMaxElapsedTime` to reflect its usage.
vms/evm/subscriber.go
Outdated
s.logger.Warn(
	return err
}
err = utils.WithMaxRetries(operation, retryMaxElapsedTime, s.logger)
Let's use a default RPC retry timeout specified in the utils package, which I suggested adding in another comment.
vms/evm/subscriber.go
Outdated
func (s *subscriber) subscribe() error {
	sub, err := s.wsClient.SubscribeNewHead(context.Background(), s.headers)
// subscribe until it succeeds or reached maxSubscribeAttempts.
func (s *subscriber) subscribe(maxSubscribeAttempts uint64) error {
Looks like `maxSubscribeAttempts` is unused.
relayer/application_relayer.go
Outdated
// Maximum amount of time to spend waiting (in addition to network round trip time per attempt)
// during relayer signature query routine
signatureRequestRetryWaitPeriodMs = 10_000
retryMaxElapsedTime = 10 * time.Second
I think we can rename the consts that use `MaxElapsedTime` (e.g. `retryMaxElapsedTime` -> `retryTimeout`), since we are now using `utils.DefaultRPCTimeout`. Otherwise we are mixing the use of `Timeout` and `MaxElapsedTime` when calling the function.
What do you think @cam-schultz?
Sounds good to me! My main concern is that we have distinct `const`s for the various timeout scenarios.
I have renamed the consts to use `Timeout` and kept them distinct 👍
// Maximum amount of time to spend waiting (in addition to network round trip time per attempt)
// during relayer signature query routine
signatureRequestRetryWaitPeriodMs = 10_000
retryTimeout = 10 * time.Second
We were making a max of 5 attempts with 10s in between each. Can we make the timeout 60s to match that more closely?
I think we were waiting `time.Duration(signatureRequestRetryWaitPeriodMs/maxRelayerQueryAttempts) * time.Millisecond` -> (10_000 / 5) milliseconds -> 2_000 milliseconds each round, so the total across 5 attempts was roughly 10 seconds, which the new 10-second timeout already matches.
// Maximum amount of time to spend waiting (in addition to network round trip time per attempt)
// during relayer signature query routine
signatureRequestRetryWaitPeriodMs = 20_000
signatureRequestTimeout = 20 * time.Second
Same here, we were waiting 20s between checks, which we did 10 times, so let's make this ~200s.
It is the same calculation here: 20_000 ms / 10 attempts = 2_000 ms per check, so the total was roughly 20 seconds.
Why this should be merged
Relates to #453
How this works
Functions that need to be retried are wrapped in an operation and passed to the utility function, which manages the retry mechanism (see the sketch below).
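For illustration, a hypothetical call site might look like the following sketch (`utils.WithMaxRetries` and `utils.DefaultRPCTimeout` follow this PR and its review comments; the import paths and the go-ethereum client are assumptions made to keep the example self-contained):

```go
package example

import (
	"context"

	"github.com/ava-labs/avalanchego/utils/logging"
	"github.com/ethereum/go-ethereum/common"
	"github.com/ethereum/go-ethereum/core/types"
	"github.com/ethereum/go-ethereum/ethclient"

	// The import path for the retry utility is an assumption for this sketch.
	"github.com/ava-labs/awm-relayer/utils"
)

// fetchReceiptWithRetry wraps the RPC call in a closure and hands it to the
// retry utility, which applies exponential backoff until the call succeeds or
// the timeout elapses.
func fetchReceiptWithRetry(
	ctx context.Context,
	logger logging.Logger,
	client *ethclient.Client,
	txHash common.Hash,
) (*types.Receipt, error) {
	var receipt *types.Receipt
	operation := func() (err error) {
		receipt, err = client.TransactionReceipt(ctx, txHash)
		return err
	}
	if err := utils.WithMaxRetries(operation, utils.DefaultRPCTimeout, logger); err != nil {
		return nil, err
	}
	return receipt, nil
}
```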
How this was tested
A unit test has been added for WithMaxRetries(), embedding WithMaxRetriesLog().
How is this documented