Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add feature to TXM to detect and purge stuck transactions #12881

Merged
Merged
Show file tree
Hide file tree
Changes from 1 commit
Commits
Show all changes
52 commits
Select commit Hold shift + click to select a range
246b04d
Added the EVM stuck tx detector component
amit-momin Apr 18, 2024
d47d88f
Added stuck tx handling in Confirmer
amit-momin Apr 19, 2024
8c713dc
Fixed linting
amit-momin Apr 19, 2024
05c5a11
Fixed toml config decoding
amit-momin Apr 19, 2024
360d9fa
Fixed config tests
amit-momin Apr 19, 2024
0a04fff
Fixed web resolver config tests
amit-momin Apr 19, 2024
d26473c
Fixed config docs test
amit-momin Apr 19, 2024
1541c5a
Added zkEVM overflow detection and added unit tests for the detector
amit-momin Apr 23, 2024
1d4113d
Merge branch 'develop' into BCI-2941-Implement-auto-purge-stuck-trans…
amit-momin Apr 23, 2024
c71ac14
Fixed broken tests after merge
amit-momin Apr 23, 2024
2e46dde
Reverted AutoPurgeConfig validation changes and fixed config tests
amit-momin Apr 23, 2024
8185dc8
Added changeset and fixed config validation logic
amit-momin Apr 23, 2024
20bfb9d
Fixed linting
amit-momin Apr 23, 2024
82e3a5d
Fixed linting
amit-momin Apr 23, 2024
db5264e
Fixed purge attempt builder and added tests
amit-momin Apr 24, 2024
0d5b318
Updated evm.txes contraint to allow non-null nonce if fatal_error
amit-momin Apr 24, 2024
25105a1
Added confirmer test
amit-momin Apr 24, 2024
491f9e3
Adjusted confirmer test to better reflect actual process
amit-momin Apr 24, 2024
338ab26
Fixed linting
amit-momin Apr 24, 2024
dc17b98
Merge branch 'develop' into BCI-2941-Implement-auto-purge-stuck-trans…
amit-momin Apr 25, 2024
468b73b
Added purge block num loading on Confirmer startup
amit-momin Apr 25, 2024
736eb89
Updated EVM tx store mock
amit-momin Apr 25, 2024
b2902c6
Fixed linting and testdata
amit-momin Apr 25, 2024
6eb5dc5
Updated stuck tx fatal error messages
amit-momin Apr 25, 2024
a5663ee
Merge branch 'develop' into BCI-2941-Implement-auto-purge-stuck-trans…
amit-momin Apr 25, 2024
0cea37b
Updated sql migration file sequence
amit-momin Apr 25, 2024
a3cd9b7
Skipped loading purge block num if auto-purge feature disabled and fi…
amit-momin Apr 26, 2024
02306fe
Fixed linting
amit-momin Apr 26, 2024
a13605b
Renamed function and moved log
amit-momin Apr 26, 2024
c2deae5
Merge branch 'develop' into BCI-2941-Implement-auto-purge-stuck-trans…
amit-momin Apr 26, 2024
a5ea4aa
Added stricter config validation
amit-momin Apr 26, 2024
1cfb5cc
Fixed linting
amit-momin Apr 27, 2024
335033a
Merge branch 'develop' into BCI-2941-Implement-auto-purge-stuck-trans…
amit-momin Apr 29, 2024
a5ba3a0
Updated auto-purge feature configs to adhere to config naming standards
amit-momin Apr 29, 2024
7513d36
Fixed config doc test and updated changeset
amit-momin Apr 29, 2024
4b41b4b
Updated Scroll and zkEVM config defaults and linted common config
amit-momin Apr 29, 2024
cc5764d
Updated config doc
amit-momin Apr 29, 2024
b1faa07
Merge branch 'develop' into BCI-2941-Implement-auto-purge-stuck-trans…
amit-momin Apr 30, 2024
d67d79b
Generated config doc and fixed linting
amit-momin Apr 30, 2024
c4062f5
Updated config description for AutoPurge.MinAttempts
amit-momin May 9, 2024
54b9c7c
Merge branch 'develop' into BCI-2941-Implement-auto-purge-stuck-trans…
amit-momin May 9, 2024
d244012
Fixed sql migration conflict
amit-momin May 9, 2024
0915400
Fixed linting
amit-momin May 9, 2024
c95d261
Updated stuck tx detector to use PriceMax config and added comments
amit-momin May 13, 2024
db5e5d9
Merge branch 'develop' into BCI-2941-Implement-auto-purge-stuck-trans…
amit-momin May 13, 2024
31d69ba
Fixed linting
amit-momin May 13, 2024
90ec4a3
Merge branch 'develop' into BCI-2941-Implement-auto-purge-stuck-trans…
amit-momin May 22, 2024
d81da52
Updated DetectionApiUrl config example
amit-momin May 22, 2024
a33460b
Merge branch 'develop' into BCI-2941-Implement-auto-purge-stuck-trans…
amit-momin May 23, 2024
c08504b
Fixed issues from latest merge
amit-momin May 23, 2024
88826a7
Renumbered sql migration file
amit-momin May 23, 2024
e400a36
Fixed testdata
amit-momin May 24, 2024
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
Prev Previous commit
Next Next commit
Added confirmer test
  • Loading branch information
amit-momin committed Apr 24, 2024
commit 25105a12bcce428d4704b46af92ef7ea72dd48fe
7 changes: 3 additions & 4 deletions common/txmgr/confirmer.go
prashantkumar1982 marked this conversation as resolved.
Show resolved Hide resolved
Original file line number Diff line number Diff line change
Expand Up @@ -469,22 +469,21 @@ func (ec *Confirmer[CHAIN_ID, HEAD, ADDR, TX_HASH, BLOCK_HASH, R, SEQ, FEE]) Pro
purgeAttempt, err := ec.TxAttemptBuilder.NewPurgeTxAttempt(ctx, tx, lggr)
prashantkumar1982 marked this conversation as resolved.
Show resolved Hide resolved
prashantkumar1982 marked this conversation as resolved.
Show resolved Hide resolved
Copy link
Collaborator

@dimriou dimriou May 8, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What happens if gas bumping fails due to reaching the max limit? I think this logic assumes there is always room for gas bumping. A tx that gets stuck isn't already likely to hit those limits due to gas bumping?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we could consider this as a misconfiguration since our attempt threshold config would have to be higher than the amount of bumped attempts it takes to hit the configured max price. I don't think we can really get around this in code since our hands are tied with configs. I can add a note to the AutoPurge.MinAttempts config to call this out though

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@dimriou : I assume you mean the max_price limit on our node?
the max_limit set by the caller should be ignored for Purging, in my opinion.

If we are stuck at max_price, then yes, there's no way to recover, except NOP changing the max_price limit on the node. However if we fix the gas bumping bug in BlockHistoryEstimator, I don't think we will ever hit this situation.
Thoughts?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I feel like ignoring configs could look like unexpected behavior and maybe cause confusion. Since NOPs wouldn't know the inner workings of this auto-purge feature, I could foresee them noticing a tx post with a higher price than their configs and falsely raise a flag only for us to debug and find it was expected.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree about the configs, we shouldn't override them. I guess the NOPs will see the repeated error message and figure out they would have to manually send the txs. I don't see a better alternative.

if err != nil {
errMu.Lock()
errorList = append(errorList, err)
errorList = append(errorList, fmt.Errorf("failed to create a purge attempt: %w", err))
errMu.Unlock()
return
}
// Save purge attempt
if err := ec.txStore.SaveInProgressAttempt(ctx, &purgeAttempt); err != nil {
errMu.Lock()
errorList = append(errorList, fmt.Errorf("saveInProgressAttempt failed for purge attempt: %w", err))
errorList = append(errorList, fmt.Errorf("failed to save purge attempt: %w", err))
errMu.Unlock()
return
}

// Send purge attempt
if err := ec.handleInProgressAttempt(ctx, lggr, tx, purgeAttempt, blockNum); err != nil {
errMu.Lock()
errorList = append(errorList, fmt.Errorf("handleInProgressAttempt failed for purge attempt: %w", err))
errorList = append(errorList, fmt.Errorf("failed to send purge attempt: %w", err))
errMu.Unlock()
return
}
Expand Down
1 change: 0 additions & 1 deletion core/chains/evm/txmgr/attempts.go
Original file line number Diff line number Diff line change
Expand Up @@ -96,7 +96,6 @@ func (c *evmTxAttemptBuilder) NewPurgeTxAttempt(ctx context.Context, etx Tx, lgg
// Transactions being purged will always have a previous attempt since it had to have been broadcasted before at least once
previousAttempt := etx.TxAttempts[0]
keySpecificMaxGasPriceWei := c.feeConfig.PriceMaxKey(etx.FromAddress)
// TODO: Confirm if bump is needed. Either we could have a separate config or a hardcoded 10% to minimize the replacement fee
bumpedFee, _, err := c.EvmFeeEstimator.BumpFee(ctx, previousAttempt.TxFee, etx.FeeLimit, keySpecificMaxGasPriceWei, newEvmPriorAttempts(etx.TxAttempts))
prashantkumar1982 marked this conversation as resolved.
Show resolved Hide resolved
if err != nil {
return attempt, fmt.Errorf("failed to bump previous fee to use for the purge attempt: %w", err)
Expand Down
86 changes: 86 additions & 0 deletions core/chains/evm/txmgr/confirmer_test.go
Original file line number Diff line number Diff line change
Expand Up @@ -3128,6 +3128,92 @@ func TestEthConfirmer_ResumePendingRuns(t *testing.T) {
})
}

func TestEthConfirmer_ProcessStuckTransactions(t *testing.T) {
t.Parallel()

db := pgtest.NewSqlxDB(t)
txStore := cltest.NewTestTxStore(t, db)
ethKeyStore := cltest.NewKeyStore(t, db).Eth()
_, fromAddress := cltest.MustInsertRandomKey(t, ethKeyStore)
ethClient := evmtest.NewEthClientMockWithDefaultChain(t)
ethClient.On("SendTransactionReturnCode", mock.Anything, mock.Anything, fromAddress).Return(commonclient.Successful, nil).Once()
lggr := logger.Test(t)
feeEstimator := gasmocks.NewEvmFeeEstimator(t)

// Return 10 gwei as market gas price
marketGasPrice := tenGwei
fee := gas.EvmFee{Legacy: marketGasPrice}
bumpedLegacy := assets.GWei(30)
bumpedFee := gas.EvmFee{Legacy: bumpedLegacy}
feeEstimator.On("GetFee", mock.Anything, []byte{}, uint64(0), mock.Anything).Return(fee, uint64(0), nil)
feeEstimator.On("BumpFee", mock.Anything, mock.Anything, mock.Anything, mock.Anything, mock.Anything).Return(bumpedFee, uint64(10_000), nil)
autoPurgeThreshold := uint32(5)
autoPurgeMinAttempts := uint32(3)
limitDefault := uint64(100)
cfg := configtest.NewGeneralConfig(t, func(c *chainlink.Config, s *chainlink.Secrets) {
c.EVM[0].GasEstimator.LimitDefault = ptr(limitDefault)
c.EVM[0].Transactions.AutoPurge.AutoPurgeStuckTxs = ptr(true)
c.EVM[0].Transactions.AutoPurge.AutoPurgeThreshold = ptr(autoPurgeThreshold)
c.EVM[0].Transactions.AutoPurge.AutoPurgeMinAttempts = ptr(autoPurgeMinAttempts)
})
evmcfg := evmtest.NewChainScopedConfig(t, cfg)
ge := evmcfg.EVM().GasEstimator()
txBuilder := txmgr.NewEvmTxAttemptBuilder(*ethClient.ConfiguredChainID(), ge, ethKeyStore, feeEstimator)
stuckTxDetector := txmgr.NewStuckTxDetector(lggr, testutils.FixtureChainID, "", evmcfg.EVM().Transactions().AutoPurge(), feeEstimator, txStore, ethClient)
ec := txmgr.NewEvmConfirmer(txStore, txmgr.NewEvmTxmClient(ethClient, nil), txmgr.NewEvmTxmConfig(evmcfg.EVM()), txmgr.NewEvmTxmFeeConfig(ge), evmcfg.EVM().Transactions(), cfg.Database(), ethKeyStore, txBuilder, lggr, stuckTxDetector)
servicetest.Run(t, ec)

ctx := testutils.Context(t)
blockNum := int64(100)

t.Run("detects and processes stuck transactions", func(t *testing.T) {
nonce := int64(0)
// Create attempts so that the oldest broadcast attempt's block num is what meets the threshold check
// Create autoPurgeMinAttempts number of attempts to ensure the broadcast attempt count check is not being triggered
// Create attempts broadcasted autoPurgeThreshold block ago to ensure broadcast block num check is not being triggered
tx := mustInsertUnconfirmedTxWithXBroadcastAttempts(t, txStore, nonce, fromAddress, autoPurgeMinAttempts, blockNum-int64(autoPurgeThreshold), marketGasPrice.Add(oneGwei))

// ProcessStuckTransactions should:
// 1. Detect a stuck transaction, confirmed above
// 2. Create a purge attempt for it
// 3. Save the purge attempt to the DB
// 4. Send the purge attempt
err := ec.ProcessStuckTransactions(ctx, blockNum)
require.NoError(t, err)

// Check if the purge attempt was saved to the DB properly
dbTx, err := txStore.FindTxWithAttempts(ctx, tx.ID)
require.NoError(t, err)
require.NotNil(t, dbTx)
latestAttempt := dbTx.TxAttempts[0]
require.Equal(t, true, latestAttempt.IsPurgeAttempt)
require.Equal(t, limitDefault, latestAttempt.ChainSpecificFeeLimit)
require.Equal(t, bumpedFee.Legacy, latestAttempt.TxFee.Legacy)

ethClient.On("SequenceAt", mock.Anything, mock.Anything, mock.Anything).Return(evmtypes.Nonce(1), nil)
ethClient.On("BatchCallContext", mock.Anything, mock.MatchedBy(func(b []rpc.BatchElem) bool {
return len(b) == 4 && cltest.BatchElemMatchesParams(b[0], latestAttempt.Hash, "eth_getTransactionReceipt")
})).Return(nil).Run(func(args mock.Arguments) {
elems := args.Get(1).([]rpc.BatchElem)
// First transaction confirmed
*(elems[0].Result.(*evmtypes.Receipt)) = evmtypes.Receipt{
TxHash: latestAttempt.Hash,
BlockHash: utils.NewHash(),
BlockNumber: big.NewInt(101),
TransactionIndex: uint(1),
Status: uint64(1),
}
}).Once()
// Check if transaction marked as fatal after purge attempt is confirmed
err = ec.CheckForReceipts(ctx, blockNum + 1)
require.NoError(t, err)
dbTx, err = txStore.FindTxWithAttempts(ctx, tx.ID)
require.NoError(t, err)
require.NotNil(t, dbTx)
require.Equal(t, txmgrcommon.TxFatalError, dbTx.State)
})
}

func ptr[T any](t T) *T { return &t }

func newEthConfirmer(t testing.TB, txStore txmgr.EvmTxStore, ethClient client.Client, gconfig chainlink.GeneralConfig, config evmconfig.ChainScopedConfig, ks keystore.Eth, fn txmgrcommon.ResumeCallback) *txmgr.Confirmer {
Expand Down
6 changes: 4 additions & 2 deletions core/chains/evm/txmgr/stuck_tx_detector.go
Original file line number Diff line number Diff line change
Expand Up @@ -56,7 +56,9 @@ type stuckTxDetector struct {
}

func NewStuckTxDetector(lggr logger.Logger, chainID *big.Int, chainType config.ChainType, cfg stuckTxDetectorConfig, gasEstimator stuckTxDetectorGasEstimator, txStore stuckTxDetectorTxStore, chainClient stuckTxDetectorClient) *stuckTxDetector {
// TODO: ensure to initialize client with the usual security standards
t := http.DefaultTransport.(*http.Transport).Clone()
t.DisableCompression = true
httpClient := &http.Client{Transport: t}
// TODO: Load purgeBlockNumMap with some DB state or confirm rate limit is not needed on first purge after restart
return &stuckTxDetector{
lggr: lggr,
Expand All @@ -66,7 +68,7 @@ func NewStuckTxDetector(lggr logger.Logger, chainID *big.Int, chainType config.C
gasEstimator: gasEstimator,
txStore: txStore,
chainClient: chainClient,
httpClient: &http.Client{},
httpClient: httpClient,
purgeBlockNumMap: make(map[common.Address]int64),
}
}
Expand Down