Add exponential retry mechanism for RPC requests using utility function #593
base: main
Conversation
I left a handful of comments, but overall the backoff utility looks good.
There are a couple of other places where exponential backoff would make sense. Can you please integrate the utility in these places as well?
- P2P signature aggregation. If this proves to be non-trivial, we can defer it to a later ticket.
- The existing call-with-retry utility and its call sites.
utils/backoff.go
Outdated
// WithMaxRetriesLog runs the operation until it succeeds or max retries has been reached.
// It uses exponential back off.
// It optionally logs information if logger is set.
func WithMaxRetriesLog(
Rather than separate with/without log functions, I think we should instead have a single method that takes a `logging.Logger` and emits warning logs on retries. I can't think of any use cases where we'd want retry attempts to be logged as Warn for some operations, but not logged at all for others.
utils/backoff.go
Outdated
msg string,
fields ...zapcore.Field,
I think logging `msg` and `fields` on each attempt is not worth the added complexity of having to pass them in as arguments. Rather, `WithMaxRetries` should emit a generic warning log on each attempt failure, and we can leave it to the caller to construct a cohesive error log in the failure case.
utils/backoff.go
Outdated
fields ...zapcore.Field,
) error {
	attempt := uint(1)
	expBackOff := backoff.WithMaxRetries(backoff.NewExponentialBackOff(), max)
Let's use `backoff.WithMaxElapsedTime` instead. At present, we choose the number of retries and the delay between each so that they resolve to a set elapsed time before we emit an error.
If I understand correctly, you want to use `backoff.WithMaxElapsedTime` instead of `backoff.WithMaxRetries`. Instead of giving a `maxRetry` to the utility function, I will give a `maxElapsedTime` derived from the existing values (`numberOfRetries * delayBetweenEach`).
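Putting the suggestions above together (a single helper that takes a `logging.Logger`, emits a generic warning on each failed attempt, and bounds retries by elapsed time), a minimal sketch could look like the following, assuming `github.com/cenkalti/backoff/v4` and avalanchego's `logging` package; the function name and exact signature here are illustrative, not the final API:

```go
package utils

import (
	"time"

	"github.com/ava-labs/avalanchego/utils/logging"
	"github.com/cenkalti/backoff/v4"
	"go.uber.org/zap"
)

// WithMaxElapsedTime retries the operation with exponential backoff until it
// succeeds or maxElapsedTime has elapsed, emitting a generic Warn log on each
// failed attempt.
func WithMaxElapsedTime(
	logger logging.Logger,
	maxElapsedTime time.Duration,
	operation backoff.Operation,
) error {
	expBackOff := backoff.NewExponentialBackOff()
	expBackOff.MaxElapsedTime = maxElapsedTime

	attempt := 1
	notify := func(err error, delay time.Duration) {
		logger.Warn(
			"operation failed, retrying with exponential backoff",
			zap.Int("attempt", attempt),
			zap.Duration("retryIn", delay),
			zap.Error(err),
		)
		attempt++
	}
	return backoff.RetryNotify(operation, expBackOff, notify)
}
```

`backoff.RetryNotify` invokes the notify callback with the error and the upcoming delay before each retry, which provides the generic per-attempt warning without threading `msg`/`fields` through the API.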
relayer/application_relayer.go
Outdated
r.logger.Warn(
	"Failed to get aggregate signature from node endpoint",
	zap.Int("attempts", maxRelayerQueryAttempts),
err = utils.WithMaxRetriesLog(
Just a heads-up: the Warp API integration is planned to be deprecated in the near future. It's still reasonable to integrate exponential backoff here for the time being.
@@ -309,10 +311,10 @@ func (s *SignatureAggregator) CreateSignedMessage(
	if err != nil {
		// don't increase node failures metric here, because we did
		// it in handleResponse
-		return nil, fmt.Errorf(
+		return backoff.Permanent(fmt.Errorf(
`backoff.Permanent` will prevent the backoff from retrying.
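For illustration, a small sketch of this behavior with `cenkalti/backoff/v4` (the HTTP call and helper name are hypothetical, not code from this PR): wrapping an error in `backoff.Permanent` stops the retry loop immediately and returns that error, while other errors keep being retried.

```go
package example

import (
	"fmt"
	"net/http"

	"github.com/cenkalti/backoff/v4"
)

// fetch retries a request with exponential backoff, but treats a 404 as a
// permanent error so the backoff loop stops immediately instead of retrying.
func fetch(url string) error {
	operation := func() error {
		resp, err := http.Get(url)
		if err != nil {
			return err // transient network error: retried with backoff
		}
		defer resp.Body.Close()
		if resp.StatusCode == http.StatusNotFound {
			// Retrying will never succeed, so short-circuit the backoff loop.
			return backoff.Permanent(fmt.Errorf("resource not found: %s", url))
		}
		return nil
	}
	return backoff.Retry(operation, backoff.NewExponentialBackOff())
}
```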
@@ -15,7 +17,9 @@ func TestWithMaxRetries(t *testing.T) {
		_, err = retryable.Run()
		return err
	},
-	2,
+	// using default values: we want to run max 2 tries.
+	624*time.Millisecond,
The value is coming from the table provided in the library: https://github.com/cenkalti/backoff/blob/720b78985a65c0452fd37bb155c7cac4157a7c45/exponential.go#L39-L50
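For reference, a sketch showing the defaults of `backoff.NewExponentialBackOff()` in `cenkalti/backoff/v4`, from which the linked table is derived (illustrative only, not code from this PR):

```go
package example

import (
	"fmt"

	"github.com/cenkalti/backoff/v4"
)

// printDefaults prints the default exponential backoff parameters:
// InitialInterval 500ms, RandomizationFactor 0.5, Multiplier 1.5,
// MaxInterval 60s, MaxElapsedTime 15m. Each retry interval is the previous
// interval times Multiplier, randomized within ±RandomizationFactor.
func printDefaults() {
	b := backoff.NewExponentialBackOff()
	fmt.Println(b.InitialInterval, b.RandomizationFactor, b.Multiplier, b.MaxInterval, b.MaxElapsedTime)
}
```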
types/types.go
Outdated
})
	return err
}
err = utils.WithMaxRetries(operation, 5*time.Second, logger)
I could use `utils.DefaultRPCRetryTimeout`, but the variable name is confusing, as it is not related to the backoff strategy.
It's still a good idea to use a variable rather than a hardcoded time. Let's rename `DefaultRPCRetryTimeout` to `DefaultRPCTimeout` or similar and use that here.
I left a handful of minor comments, but overall this looks great!
receipt, err = destinationClient.Client().(ethclient.Client).TransactionReceipt(callCtx, txHash)
	return err
}
err := utils.WithMaxRetries(operation, 30*time.Second, m.logger)
Let's replace `30*time.Second` with a constant default timeout variable named `DefaultBlockAcceptanceTimeout` or similar. I think it makes the most sense to define this const here in the `teleporter` package, since it's specific to this use case.
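A minimal sketch of that suggestion (the const name follows the review comment and the value comes from the diff above; the exact placement is an assumption):

```go
package teleporter

import "time"

const (
	// DefaultBlockAcceptanceTimeout bounds how long we retry fetching a
	// transaction receipt before giving up.
	DefaultBlockAcceptanceTimeout = 30 * time.Second
)
```

The call site above would then read `utils.WithMaxRetries(operation, DefaultBlockAcceptanceTimeout, m.logger)`.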
utils/backoff.go
Outdated
operation backoff.Operation,
maxElapsedTime time.Duration,
logger logging.Logger,
nit: We typically pass the `logging.Logger` as the first argument to functions that accept one.
utils/backoff.go
Outdated
|
// WithMaxRetries uses an exponential backoff to run the operation until it
// succeeds or max elapsed time has been reached.
func WithMaxRetries(
Let's rename this to reflect that we're retrying over a specified time interval, rather than a fixed number of retry attempts.
vms/evm/subscriber.go
Outdated
MaxBlocksPerRequest = 200
rpcMaxRetries = 5
retryMaxElapsedTime = 5 * time.Second
Let's rename this to `resubscribeMaxElapsedTime` to reflect its usage.
vms/evm/subscriber.go
Outdated
s.logger.Warn(
	return err
}
err = utils.WithMaxRetries(operation, retryMaxElapsedTime, s.logger)
Let's use a default RPC retry timeout specified in the utils package, which I suggested adding in another comment.
vms/evm/subscriber.go
Outdated
func (s *subscriber) subscribe() error {
	sub, err := s.wsClient.SubscribeNewHead(context.Background(), s.headers)
// subscribe until it succeeds or reached maxSubscribeAttempts.
func (s *subscriber) subscribe(maxSubscribeAttempts uint64) error {
Looks like `maxSubscribeAttempts` is unused.
relayer/application_relayer.go
Outdated
// Maximum amount of time to spend waiting (in addition to network round trip time per attempt)
// during relayer signature query routine
signatureRequestRetryWaitPeriodMs = 10_000
retryMaxElapsedTime = 10 * time.Second
I think we can rename the consts that use `MaxElapsedTime` (e.g. `retryMaxElapsedTime` -> `retryTimeout`), since we are now using `utils.DefaultRPCTimeout`. Otherwise we are mixing the use of `Timeout` and `MaxElapsedTime` when calling the function.
What do you think @cam-schultz?
Sounds good to me! My main concern is that we have distinct `const`s for the various timeout scenarios.
I have renamed the consts to use `Timeout` and kept them distinct 👍
// Maximum amount of time to spend waiting (in addition to network round trip time per attempt)
// during relayer signature query routine
signatureRequestRetryWaitPeriodMs = 10_000
retryTimeout = 10 * time.Second
We were making a max of 5 attempts with 10s in between each. Can we make the timeout 60s to match that more closely?
I think we were waiting `time.Duration(signatureRequestRetryWaitPeriodMs/maxRelayerQueryAttempts) * time.Millisecond` -> (10_000 / 5) milliseconds -> 2_000 milliseconds each round, so the total across 5 attempts was roughly 10 seconds, which the new 10-second timeout already matches.
// Maximum amount of time to spend waiting (in addition to network round trip time per attempt)
// during relayer signature query routine
signatureRequestRetryWaitPeriodMs = 20_000
signatureRequestTimeout = 20 * time.Second
Same here, we were waiting 20s between checks, which we did 10 times, so let's make this ~200s.
It is the same calculation here: 20_000 ms / 10 attempts = 2_000 ms per check, so the total was roughly 20 seconds.
Why this should be merged
Relates to #453
How this works
Functions that need to be retried are wrapped in an operation and passed to the utility function, which manages the retry mechanism (see the sketch below).
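For illustration, a hypothetical call site might look like the following sketch (`utils.WithMaxRetries` and `utils.DefaultRPCTimeout` follow this PR and its review comments; the import paths and the go-ethereum client are assumptions made to keep the example self-contained):

```go
package example

import (
	"context"

	"github.com/ava-labs/avalanchego/utils/logging"
	"github.com/ethereum/go-ethereum/common"
	"github.com/ethereum/go-ethereum/core/types"
	"github.com/ethereum/go-ethereum/ethclient"

	// The import path for the retry utility is an assumption for this sketch.
	"github.com/ava-labs/awm-relayer/utils"
)

// fetchReceiptWithRetry wraps the RPC call in a closure and hands it to the
// retry utility, which applies exponential backoff until the call succeeds or
// the timeout elapses.
func fetchReceiptWithRetry(
	ctx context.Context,
	logger logging.Logger,
	client *ethclient.Client,
	txHash common.Hash,
) (*types.Receipt, error) {
	var receipt *types.Receipt
	operation := func() (err error) {
		receipt, err = client.TransactionReceipt(ctx, txHash)
		return err
	}
	if err := utils.WithMaxRetries(operation, utils.DefaultRPCTimeout, logger); err != nil {
		return nil, err
	}
	return receipt, nil
}
```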
How this was tested
A unit test has been added for WithMaxRetries(), embedding WithMaxRetriesLog().
How is this documented