Skip to content

Account sequence mismatch causes unnecessary delay in relaying #2249

Closed
@ancazamfir

Description

Summary of Bug

Sometimes Tx simulation failures due to account sequence mismatch causes hermes to unnecessary delay the relaying.
Background:

  • hermes simulates every transaction by calling the simulateTx (application gRPC). If successfully simulated, it uses the gas to determine the Tx fee and then
  • it broadcast_tx_sync-s the transaction (tendermint RPC), i.e. sent to tendermint mempool, which calls the application checkTx.
  • In both cases the verification happens on a copy of the application state, aka checkTx state.
  • the checkTx state is also updated by the application after a block is being committed (deliverTx is run for all Tx-es in a block)

Here are the details of the issue:

  • hermes simulates and broadcast_tx_sync-s N transactions with following sequence numbers: [1,.., K, K+1, ...,N]
    • check Tx state last account sequence is N.
  • tendermint creates a block with [1,..,K]. This involves calling Commit() which:
    • updates the application by calling deliverTx for [1,..,K] and then recreating the check Tx state with latest application state:
      • check Tx state last account sequence is K.
    • then it updates the mempool. Part of the mempool update is recheckTxs() which calls checkTx for [K+1,..,N]
      • check Tx state last account sequence is K+x.
  • hermes tries to simulate a transaction with sequence number N+1 sometime before recheckTxs() above finishes. It gets the error:
    • "account sequence mismatch, expected K+x, got N+1: incorrect account sequence", with x being the number of Tx-es that have been rechecked
  • hermes recaches to K+x and tries to simulate again a bit later only to get: "account sequence mismatch, expected N+1, got K+x: incorrect account sequence"
  • then recaches again to N+1 and the simulation is usually successful as the recheckTxs() above is finished already.

I changed the code locally to use default gas on simulation failure and things look much better. This is because the mempool lock is grabbed during Commit() so broadcast_tx_sync for N+1 in the middle of recheckTxs() will wait for checkTx on [K+1, .., N]

In summary the difference is:

  • simulateTx gRPC works on a checkTx state that can be concurrently updated by the recheckTxs()
  • broadcast_tx_sync RPC will be executed and the call to checkTx done only after recheckTxs() has finished.

Also checked the Go relayer and the simulation is retried a few times in the hope of catching and getting successful at the end of recheckTxs().

Version

all

Steps to Reproduce

  • create two chains (used gaia v7.0.0 that uses tendermint v0.34.14)
  • create channel
  • send 1000 ft-transfer-s
  • start hermes and check logs

Acceptance Criteria

Packet relaying should continue in this case, i.e. deal differently with the account mismatch error when got > expected. Specifically in this case, either use the default gas or retry the simulation (probably the latter).
(TBD - asking SDK for a simulateTx flag that skips the account sequence check. But if this is made available it will probably be in later releases)
Note that for got < expected we still need to recache as this is an indication that some other entity has used the same account/ wallet.


For Admin Use

  • Not duplicate issue
  • Appropriate labels applied
  • Appropriate milestone (priority) applied
  • Appropriate contributors tagged
  • Contributor assigned/self-assigned

Metadata

Assignees

Labels

A: bugAdmin: something isn't working

Type

No type

Projects

Milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions