Account sequence mismatch causes unnecessary delay in relaying #2249
Description
Summary of Bug
Sometimes Tx simulation failures due to account sequence mismatch causes hermes to unnecessary delay the relaying.
Background:
- hermes simulates every transaction by calling the
simulateTx
(application gRPC). If successfully simulated, it uses the gas to determine the Tx fee and then - it
broadcast_tx_sync
-s the transaction (tendermint RPC), i.e. sent to tendermint mempool, which calls the applicationcheckTx
. - In both cases the verification happens on a copy of the application state, aka checkTx state.
- the checkTx state is also updated by the application after a block is being committed (
deliverTx
is run for all Tx-es in a block)
Here are the details of the issue:
- hermes simulates and
broadcast_tx_sync
-sN
transactions with following sequence numbers:[1,.., K, K+1, ...,N]
- check Tx state last account sequence is
N
.
- check Tx state last account sequence is
- tendermint creates a block with
[1,..,K]
. This involves callingCommit()
which:- updates the application by calling
deliverTx
for[1,..,K]
and then recreating the check Tx state with latest application state:- check Tx state last account sequence is
K
.
- check Tx state last account sequence is
- then it updates the mempool. Part of the mempool update is
recheckTxs()
which callscheckTx
for[K+1,..,N]
- check Tx state last account sequence is
K+x
.
- check Tx state last account sequence is
- updates the application by calling
- hermes tries to simulate a transaction with sequence number
N+1
sometime beforerecheckTxs()
above finishes. It gets the error:"account sequence mismatch, expected K+x, got N+1: incorrect account sequence"
, with x being the number of Tx-es that have been rechecked
- hermes recaches to
K+x
and tries to simulate again a bit later only to get:"account sequence mismatch, expected N+1, got K+x: incorrect account sequence"
- then recaches again to
N+1
and the simulation is usually successful as therecheckTxs()
above is finished already.
I changed the code locally to use default gas on simulation failure and things look much better. This is because the mempool lock is grabbed during Commit()
so broadcast_tx_sync
for N+1
in the middle of recheckTxs()
will wait for checkTx
on [K+1, .., N]
In summary the difference is:
simulateTx
gRPC works on a checkTx state that can be concurrently updated by therecheckTxs()
broadcast_tx_sync
RPC will be executed and the call tocheckTx
done only afterrecheckTxs()
has finished.
Also checked the Go relayer and the simulation is retried a few times in the hope of catching and getting successful at the end of recheckTxs()
.
Version
all
Steps to Reproduce
- create two chains (used gaia v7.0.0 that uses tendermint
v0.34.14
) - create channel
- send 1000 ft-transfer-s
- start hermes and check logs
Acceptance Criteria
Packet relaying should continue in this case, i.e. deal differently with the account mismatch error when got > expected
. Specifically in this case, either use the default gas or retry the simulation (probably the latter).
(TBD - asking SDK for a simulateTx flag that skips the account sequence check. But if this is made available it will probably be in later releases)
Note that for got < expected
we still need to recache as this is an indication that some other entity has used the same account/ wallet.
For Admin Use
- Not duplicate issue
- Appropriate labels applied
- Appropriate milestone (priority) applied
- Appropriate contributors tagged
- Contributor assigned/self-assigned