Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Extend Nethermind error messages #7999

Merged
merged 2 commits into from
Feb 9, 2023
Merged

Conversation

dimriou
Copy link
Collaborator

@dimriou dimriou commented Nov 25, 2022

On this PR(NethermindEth/nethermind#3626) Nethermind extended its error messages. This PR adds the new regexes and test cases.

@github-actions
Copy link
Contributor

I see that you haven't updated any CHANGELOG files. Would it make sense to do so?

@dimriou dimriou force-pushed the sc-58348/extend_nethermind_errors branch 2 times, most recently from 79aba0f to 902c7e9 Compare November 28, 2022 17:25
@dimriou dimriou marked this pull request as ready for review November 28, 2022 17:58
jmank88
jmank88 previously approved these changes Nov 29, 2022
jmank88
jmank88 previously approved these changes Nov 30, 2022
@@ -167,13 +167,16 @@ var klaytn = ClientErrors{
// Nethermind
// All errors: https://github.com/NethermindEth/nethermind/blob/master/src/Nethermind/Nethermind.TxPool/AcceptTxResult.cs
// All filters: https://github.com/NethermindEth/nethermind/tree/9b68ec048c65f4b44fb863164c0dec3f7780d820/src/Nethermind/Nethermind.TxPool/Filters
var nethermindFatal = regexp.MustCompile(`(: |^)(SenderIsContract|Invalid|Int256Overflow|FailedToResolveSender|GasLimitExceeded)$`)
var nethermindFatal = regexp.MustCompile(`(: |^)(SenderIsContract|Invalid(, transaction Hash is null)?|Int256Overflow|FailedToResolveSender|GasLimitExceeded(, Gas limit: \d+, gas limit of rejected tx: \d+)?|NonceGap(, Future nonce. Expected nonce: \d+)?)$`)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NonceGap seems very interesting!
I haven't seen such an error being returned before.

I think we need special handling for this. If we treat this as fatal, bad things can happen in our txm.
For example, we send multiple Txs for the same FromAddresses one after another, so lets say we send the following:
Tx1: nonce 1
Tx2: nonce 2

It is possible that a node sees Tx2 before Tx1. In that case, this is not a fatal error, and we should just retry.

For now, I think we should treat NonceGap as a retryable error.

But this is a very critical issue we need to handle here, or else can cause major problems in some nodes.

@samsondav @jmank88 for a second look

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For context, other eth clients, when they see a nonce gap, just accept the Tx, and keep it in their mempool, and wait till they see the missing nonce too.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah yeah I guess we normally rely on the queue/pending node and RPC config to be aligned so this is avoided all together... 🤔

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you mean you could configure an RPC endpoint to not return such NonceGap errors?

In any case, we cannot always control RPCs, so should also handle it in code correctly.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I mean that when properly configured, we should never get so far ahead of ourselves that this happens, but yes it would be better to support more cases.
Our docs describe how to align ETH_MAX_QUEUED_TRANSACTIONS/ETH_MAX_IN_FLIGHT_TRANSACTIONS with the RPC node's AccountSlots/GlobalSlots/AccountQueue/GlobaQueue:
https://docs.chain.link/chainlink-nodes/evm-performance-configuration

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the misconfiguration case still possible too? What error is returned for the first tx after the pool is full?

I am thinking now that some of the edge cases we are worried about here would likely only occur on batch (re)broadcasts, since we send to active primaries serially and should halt as soon as the pool is full... 🤔

Copy link
Contributor

@jmank88 jmank88 Dec 1, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah so I think that fatal is probably correct given that this is only applied to the code doing the serial send to the active primary. IIUC the non-fatal, out-of-order case can only happen on other nodes, where we don't react to errors the same way.

Copy link
Collaborator

@samsondav samsondav Dec 1, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it should not be possible for Chainlink to send txes with a nonce gap. I can't say for certain though that there isn't some combination of re-orgs, resending etc where it might not be possible. If it does happen, we woul dexpect to be able to send the higher nonce then "fill in" the gap with lower nonces.

Copy link
Collaborator

@samsondav samsondav Dec 1, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Example here:

https://github.com/smartcontractkit/chainlink/blob/develop/core/chains/evm/txmgr/eth_resender.go#L148

This orders by eth_tx.id, then nonce. Is it possible you could send nonces out of order? I think so. We would need to review all such cases (perhaps this is the only one?).

It should be safe to reverse those ordering terms btw. Might be worth including that along with this PR.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMO it's a bug in Nethermind. go-ethereum supports out-of-order transmissions

@@ -167,13 +167,16 @@ var klaytn = ClientErrors{
// Nethermind
// All errors: https://github.com/NethermindEth/nethermind/blob/master/src/Nethermind/Nethermind.TxPool/AcceptTxResult.cs
// All filters: https://github.com/NethermindEth/nethermind/tree/9b68ec048c65f4b44fb863164c0dec3f7780d820/src/Nethermind/Nethermind.TxPool/Filters
var nethermindFatal = regexp.MustCompile(`(: |^)(SenderIsContract|Invalid|Int256Overflow|FailedToResolveSender|GasLimitExceeded)$`)
var nethermindFatal = regexp.MustCompile(`(: |^)(SenderIsContract|Invalid(, transaction Hash is null)?|Int256Overflow|FailedToResolveSender|GasLimitExceeded(, Gas limit: \d+, gas limit of rejected tx: \d+)?|NonceGap(, Future nonce. Expected nonce: \d+)?)$`)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of: SenderIsContract|Invalid(, transaction Hash is null)?
Does it make sense to just have: SenderIsContract|Invalid?

By trying to match the bracket part, are we getting anything extra?

Happy to have you hear from others too on their opinion.

Same comment on all cases below which have the bracketed() part added.

Copy link
Contributor

@jmank88 jmank88 Dec 1, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We have hesitated to make the matches too open ended in the past, specifically prefixes that match any suffix. So we usually enumerate the possibilities and use the trailing $, like here. That internal one looks a little more complicated though.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In this case, if you look at the Nethermind PR, and search for this string "transaction Hash is null", (this file: src/Nethermind/Nethermind.TxPool/Filters/NullHashTxFilter.cs), it is clear that the way they return all fatal errors, is to use the prefix, and add (, reason text xyz).
So seems redundant and confusing to add the specific message in our matching string.

Note that even with this internal string, we still are keeping that open ended match. So adding this detailed string doesn't give us any added protection against side effects.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is not redundant. It is a deliberate design choice to match a limited set instead of an open ended one. I can't say whether it is the correct/best/optimal choice, but it is significant and shouldn't be dismissed hastily.

Note that even with this internal string, we still are keeping that open ended match. So adding this detailed string doesn't give us any added protection against side effects.

I don't know what you mean. The internal case of Invalid(, transaction Hash is null)? is actually not as interesting as I first thought - just another instance of being explicit about the match. We only accept the two literals Invalid or Invalid, transaction Hash is null.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor

@prashantkumar1982 prashantkumar1982 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Specially the NonceGap issue needs further discussion, since this. is not an error type which any eth client previously returned. I think right behavior is to retry in those cases.

Copy link
Collaborator

@samsondav samsondav left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See comments

TerminallyUnderpriced: regexp.MustCompile(`(: |^)(FeeTooLow|FeeTooLowToCompete)$`),
TerminallyUnderpriced: regexp.MustCompile(`(: |^)(FeeTooLow(, MaxFeePerGas too low. MaxFeePerGas: \d+, BaseFee: \d+, MaxPriorityFeePerGas:\d+, Block number: \d+|` +
`, EffectivePriorityFeePerGas too low \d+ < \d+, BaseFee: \d+|` +
`, FeePerGas needs to be higher than \d+ to be added to the TxPool. Affordable FeePerGas of rejected tx: \d+.)?|` +
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know this is how we are handling all errors, but I just don't like this level of hard matching. Any small string changes to this detailed error message on nethermind will again break us.
Perhaps sometime later we can think of a refactor.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The problem is making it looser can accidentally match stuff it isn't supposed to.

e.g. "known transaction" also matches "unknown transaction". This has bitten us in the past hence the extreme conservatism now.

@@ -36,6 +36,9 @@ func (s *SendError) CauseStr() string {

const (
NonceTooLow = iota
// Nethermind specific error. Nethermind throws a NonceGap error when the tx nonce is greater than current_nonce + tx_count_in_mempool, instead of keeping the tx in mempool.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually seems like we already have a bunch of places in this file where "nonce too high" was treated previously as a fatal error.
Just search for the string "nonce too high" in this file.
I see gethFatal, arbitrumFatal, and erigonFatal being added in the last 6 months in different PRs.

Maybe not in this PR, but let us fix those in a follow-up PR.
https://app.shortcut.com/chainlinklabs/story/59049/nonce-too-high-this-error-is-incorrectly-being-treated-as-fatal-in-evm-code

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But first we would have to verify that the other clients work the same way as Nethermind does. IIRC Geth doesn't use the nonce gap mechanism Nethermind does, so it might mean something else in this case.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good. @simsonraj is working on the other chains/clients as part of a separate ticket.

@dimriou dimriou force-pushed the sc-58348/extend_nethermind_errors branch from 2ce6edd to 284680b Compare December 13, 2022 13:40
@dimriou dimriou force-pushed the sc-58348/extend_nethermind_errors branch 4 times, most recently from c718613 to f7593f9 Compare January 10, 2023 13:06
@dimriou dimriou force-pushed the sc-58348/extend_nethermind_errors branch from f7593f9 to 5b40be4 Compare January 20, 2023 12:57
// This can happen if previous transactions haven't reached the client yet.
// The correct thing to do is assume success for now and let the eth_confirmer retry until
// the nonce gap gets filled by the previous transactions.
lgr.Criticalw("Transaction has a nonce gap.", "err", sendError.Error())
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The log level should be Warn here, and not critical, since this error could just show up due to network latencies.

@@ -556,6 +556,16 @@ func (eb *EthBroadcaster) handleInProgressEthTx(ctx context.Context, etx EthTx,
return errors.Wrap(sendError, "this error type only handled for L2s"), false
}

if sendError.IsNonceTooHighError() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you please also add a unit-test for this case, as part of the test "TestEthBroadcaster_ProcessUnstartedEthTxs_Errors"?

I see there are already other cases which have tests in that function.

@dimriou dimriou force-pushed the sc-58348/extend_nethermind_errors branch from 5b40be4 to cd91bf6 Compare February 7, 2023 11:03
@cl-sonarqube-production
Copy link

SonarQube Quality Gate

Quality Gate failed

Failed condition 9 Lines to Cover on New Code (is greater than 5)

See analysis details on SonarQube

TerminallyUnderpriced: regexp.MustCompile(`(: |^)(FeeTooLow|FeeTooLowToCompete)$`),
TerminallyUnderpriced: regexp.MustCompile(`(: |^)(FeeTooLow(, MaxFeePerGas too low. MaxFeePerGas: \d+, BaseFee: \d+, MaxPriorityFeePerGas:\d+, Block number: \d+|` +
`, EffectivePriorityFeePerGas too low \d+ < \d+, BaseFee: \d+|` +
`, FeePerGas needs to be higher than \d+ to be added to the TxPool. Affordable FeePerGas of rejected tx: \d+.)?|` +
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The problem is making it looser can accidentally match stuff it isn't supposed to.

e.g. "known transaction" also matches "unknown transaction". This has bitten us in the past hence the extreme conservatism now.

@dimriou dimriou merged commit f4294af into develop Feb 9, 2023
@dimriou dimriou deleted the sc-58348/extend_nethermind_errors branch February 9, 2023 15:03
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants