
Add exponential retry mechanism for RPC requests using utility function #593

Open · wants to merge 8 commits into base: main

Conversation

@najeal (Contributor) commented Dec 12, 2024

Why this should be merged

Relates to #453

How this works

Functions needing retry are passed to the utility function managing the retry mechanism.

How this was tested

A unit test has been added for WithMaxRetries(), which embeds WithMaxRetriesLog().

How is this documented

@cam-schultz (Collaborator) left a comment

I left a handful of comments, but overall the backoff utility looks good.

There are a couple of other places that exponential backoff would make sense. Can you please integrate the utility in these places as well?

utils/backoff.go Outdated
// WithMaxRetriesLog runs the operation until it succeeds or max retries has been reached.
// It uses exponential back off.
// It optionally logs information if logger is set.
func WithMaxRetriesLog(
Collaborator:

Rather than separate with/without log functions, I think we should instead have a single method that takes a logging.Logger and emits warning logs on retries. I can't think of any use cases where we'd want retry attempts to be logged as Warn for some operations, but not logged at all for others.
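
A minimal sketch of that consolidated helper, assuming github.com/cenkalti/backoff/v4 and avalanchego's logging.Logger (the names, argument order, and log message are illustrative, not the final code):

```go
package utils

import (
	"time"

	"github.com/ava-labs/avalanchego/utils/logging"
	"github.com/cenkalti/backoff/v4"
	"go.uber.org/zap"
)

// WithMaxRetries retries the operation with exponential backoff until it
// succeeds or maxAttempts is reached, emitting a Warn log on every failed attempt.
func WithMaxRetries(
	operation backoff.Operation,
	maxAttempts uint64,
	logger logging.Logger,
) error {
	expBackOff := backoff.WithMaxRetries(backoff.NewExponentialBackOff(), maxAttempts)
	notify := func(err error, wait time.Duration) {
		// Generic warning on each failed attempt; the caller constructs its
		// own error log if every attempt ultimately fails.
		logger.Warn("operation failed, retrying", zap.Error(err), zap.Duration("retryAfter", wait))
	}
	return backoff.RetryNotify(operation, expBackOff, notify)
}
```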

utils/backoff.go Outdated
Comment on lines 19 to 20
msg string,
fields ...zapcore.Field,
Collaborator:

I think logging msg and fields on each attempt is not worth the added complexity of having to pass them in as arguments. Rather, WithMaxRetries should emit a generic warning log on each attempt failure, and we can leave it to the caller to construct a cohesive error log in the failure case.

utils/backoff.go Outdated
fields ...zapcore.Field,
) error {
attempt := uint(1)
expBackOff := backoff.WithMaxRetries(backoff.NewExponentialBackOff(), max)
Collaborator:

Let's use backoff.WithMaxElapsedTime instead. At present, we choose the number of retries and the delay between each to resolve to a set elapsed time before we emit an error.

Contributor Author:

If I understand correctly, you want backoff.WithMaxElapsedTime used instead of backoff.WithMaxRetries.

Instead of passing a maxRetry count to the utility function, I will pass a maxElapsedTime derived from the existing values (numberOfRetries * delayBetweenEach).
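
For example, a sketch of that derivation assuming cenkalti/backoff/v4, where the elapsed-time budget is the MaxElapsedTime field on ExponentialBackOff (the constant values below are only illustrative):

```go
package example

import (
	"time"

	"github.com/cenkalti/backoff/v4"
)

// Previously: numberOfRetries attempts separated by delayBetweenEach.
// The equivalent time budget for the new utility is their product.
const (
	numberOfRetries  = 5
	delayBetweenEach = 2 * time.Second
)

func retryWithTimeBudget(operation backoff.Operation) error {
	expBackOff := backoff.NewExponentialBackOff()
	// 5 * 2s = 10s: stop scheduling new attempts once this much time has elapsed.
	expBackOff.MaxElapsedTime = numberOfRetries * delayBetweenEach
	return backoff.Retry(operation, expBackOff)
}
```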

r.logger.Warn(
"Failed to get aggregate signature from node endpoint",
zap.Int("attempts", maxRelayerQueryAttempts),
err = utils.WithMaxRetriesLog(
Collaborator:

Just a heads-up: the Warp API integration is planned to be deprecated in the near future. It's still reasonable to integrate exponential backoff here for the time being.

@@ -309,10 +311,10 @@ func (s *SignatureAggregator) CreateSignedMessage(
if err != nil {
// don't increase node failures metric here, because we did
// it in handleResponse
return nil, fmt.Errorf(
return backoff.Permanent(fmt.Errorf(
Contributor Author:

backoff.Permanent will prevent the backoff from retrying.
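
For illustration, a small self-contained sketch of the pattern (the error and function names here are made up): wrapping an error with backoff.Permanent makes backoff.Retry return immediately instead of scheduling another attempt.

```go
package example

import (
	"errors"
	"fmt"

	"github.com/cenkalti/backoff/v4"
)

var errBadRequest = errors.New("bad request")

func callWithRetry(doRPC func() error) error {
	operation := func() error {
		err := doRPC()
		if errors.Is(err, errBadRequest) {
			// Non-retryable failure: stop the retry loop right away.
			return backoff.Permanent(fmt.Errorf("giving up: %w", err))
		}
		return err // transient errors are retried with exponential backoff
	}
	return backoff.Retry(operation, backoff.NewExponentialBackOff())
}
```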

@@ -15,7 +17,9 @@ func TestWithMaxRetries(t *testing.T) {
_, err = retryable.Run()
return err
},
2,
// using default values: we want to run max 2 tries.
624*time.Millisecond,

types/types.go Outdated
})
return err
}
err = utils.WithMaxRetries(operation, 5*time.Second, logger)
Contributor Author:

I could use utils.DefaultRPCRetryTimeout, but the variable name is confusing since it is not related to the backoff strategy.

Collaborator:

It's still a good idea to use a variable rather than a hardcoded time. Let's rename DefaultRPCRetryTimeout to DefaultRPCTimeout or similar and use that here.
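
A hypothetical sketch of that rename in the utils package (the 5s value only mirrors the hardcoded duration above):

```go
package utils

import "time"

// DefaultRPCTimeout is the default time budget for a single logical RPC
// interaction, including its retries (renamed from DefaultRPCRetryTimeout).
const DefaultRPCTimeout = 5 * time.Second
```

The call above would then read err = utils.WithMaxRetries(operation, utils.DefaultRPCTimeout, logger).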

@najeal requested a review from cam-schultz December 20, 2024 16:43
@cam-schultz (Collaborator) left a comment

I left a handful of minor comments, but overall this looks great!

receipt, err = destinationClient.Client().(ethclient.Client).TransactionReceipt(callCtx, txHash)
return err
}
err := utils.WithMaxRetries(operation, 30*time.Second, m.logger)
Collaborator:

Let's replace 30*time.Second with a constant default timeout variable named DefaultBlockAcceptanceTimeout or similar. I think it makes the most sense to define this const here in the teleporter package, since it's specific to this use case.
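
Something along these lines in the teleporter package (name per the suggestion; 30s only mirrors the hardcoded value above):

```go
package teleporter

import "time"

// DefaultBlockAcceptanceTimeout bounds how long we keep retrying the
// transaction receipt query after submitting a transaction.
const DefaultBlockAcceptanceTimeout = 30 * time.Second
```

The call above would then become err := utils.WithMaxRetries(operation, DefaultBlockAcceptanceTimeout, m.logger).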

utils/backoff.go Outdated
Comment on lines 13 to 15
operation backoff.Operation,
maxElapsedTime time.Duration,
logger logging.Logger,
Collaborator:

nit: We typically pass the logging.Logger as the first argument to functions that accept one.

utils/backoff.go Outdated

// WithMaxRetries uses an exponential backoff to run the operation until it
// succeeds or max elapsed time has been reached.
func WithMaxRetries(
Collaborator:

Let's rename this to reflect that we're retrying over a specified time interval, rather than a fixed number of retry attempts.
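
For example, a hypothetical renamed signature that also folds in the logger-first nit from above (the name and log message are placeholders, not a final proposal):

```go
package utils

import (
	"time"

	"github.com/ava-labs/avalanchego/utils/logging"
	"github.com/cenkalti/backoff/v4"
	"go.uber.org/zap"
)

// CallWithRetryTimeout retries the operation with exponential backoff until it
// succeeds or maxElapsedTime has passed.
func CallWithRetryTimeout(
	logger logging.Logger,
	operation backoff.Operation,
	maxElapsedTime time.Duration,
) error {
	expBackOff := backoff.NewExponentialBackOff()
	expBackOff.MaxElapsedTime = maxElapsedTime
	return backoff.RetryNotify(operation, expBackOff, func(err error, wait time.Duration) {
		logger.Warn("operation failed, retrying", zap.Error(err), zap.Duration("retryAfter", wait))
	})
}
```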

MaxBlocksPerRequest = 200
rpcMaxRetries = 5
retryMaxElapsedTime = 5 * time.Second
Collaborator:

Let's rename this resubscribeMaxElapsedTime to reflect its usage.

s.logger.Warn(
return err
}
err = utils.WithMaxRetries(operation, retryMaxElapsedTime, s.logger)
Collaborator:

Let's use the default RPC retry timeout defined in the utils package, which I suggested adding in another comment.

func (s *subscriber) subscribe() error {
sub, err := s.wsClient.SubscribeNewHead(context.Background(), s.headers)
// subscribe until it succeeds or reached maxSubscribeAttempts.
func (s *subscriber) subscribe(maxSubscribeAttempts uint64) error {
Collaborator:

Looks like maxSubscribeAttempts is unused.

// Maximum amount of time to spend waiting (in addition to network round trip time per attempt)
// during relayer signature query routine
signatureRequestRetryWaitPeriodMs = 10_000
retryMaxElapsedTime = 10 * time.Second
@najeal (Contributor Author), Dec 23, 2024:

I think we can rename the consts that currently use MaxElapsedTime (e.g. retryMaxElapsedTime -> retryTimeout), since we now use utils.DefaultRPCTimeout. Otherwise we are mixing Timeout and MaxElapsedTime naming when calling the function.
What do you think @cam-schultz?

Collaborator:

Sounds good to me! My main concern is that we have distinct consts for the various timeout scenarios.

Contributor Author:

I have renamed the consts to use Timeout and kept them distinct 👍

@najeal requested a review from cam-schultz December 27, 2024 12:55
// Maximum amount of time to spend waiting (in addition to network round trip time per attempt)
// during relayer signature query routine
signatureRequestRetryWaitPeriodMs = 10_000
retryTimeout = 10 * time.Second
Contributor:

We were making a maximum of 5 attempts with 10s in between each. Can we make the timeout 60s to match that more closely?

Contributor Author:

I think we were waiting time.Duration(signatureRequestRetryWaitPeriodMs/maxRelayerQueryAttempts) * time.Millisecond between attempts
-> (10_000 / 5) milliseconds -> 2_000 milliseconds each round, so 5 attempts of roughly 2s each comes to about the 10s timeout.

// Maximum amount of time to spend waiting (in addition to network round trip time per attempt)
// during relayer signature query routine
signatureRequestRetryWaitPeriodMs = 20_000
signatureRequestTimeout = 20 * time.Second
Contributor:

Same here: we were waiting 20s between checks, which we did 10 times, so let's make this ~200s.

Contributor Author:

It is the same calculation here: the wait period is split across the attempts, so the total is roughly 20s rather than 200s.
