Description
Background
We were investigating why some space race deals, such as 55685 for this miner, might have failed. Based on the provided info, the dashboard showed the deal as failed even though the deal was in PreCommit1 on the miner's side.
Presumably this happened because the bot's Lotus node updated its status to failed, with the error:
ClientGetDealInfo: {"jsonrpc":"2.0","result":{"ProposalCid":{"/":"bafyreibc64zanfzw2ltdcee7fvpjh2ay5hwvznti5ilpotwrb3pjx2r4mq"},"State":26,"Message":"error in deal activation: failed to set up called handler: called check error (h: 22758): client: failed to look up deal on chain: deal 55685 not found","Provider":"t010078","DataRef":null,"PieceCID":{"/":"baga6ea4seaqnjtan44uopfnh7jmy7fafgogejexp75llew55m5hu72mo2mydahy"},"Size":133169152,"PricePerEpoch":"62500000","Duration":701069,"DealID":55685,"CreationTime":"2020-09-01T19:39:35.692171291Z"},"id":0}
--
We also noticed that deals 55686, 55687, 55688, 55689, 55690, 55692, 55693, and 55694 failed similarly. Deal 55691 did NOT fail this way, but it was made by client t0113, unlike all the failed ones, which were made by t0112.
Hypothesis
The conjecture is that client t0112 had a reorg occur around height 22762. The failed deals were all published in block X, triggering ClientEventDealPublished, and bumping the client FSM state to StorageDealSealing, which "waits" in OnDealSectorCommitted. If block X was then reorged, these deals would all error in the first checkFunc over here.
Potential solution
When checkFunc can't find the deal on chain, we could avoid failing if we also can't find the publish message on chain. This requires also providing the publish message CID to OnDealSectorCommitted, which should be fine (we could just give it the entire deal object).
#3472 demonstrates this (incompletely).
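To make the proposed decision concrete, here is a minimal standalone sketch of the check. It is independent of #3472 and does not use the real Lotus or go-fil-markets APIs: chainView, deal, outcome, and checkDeal are hypothetical names invented for illustration, and the publish CID is a placeholder string.

```go
package main

import "fmt"

// outcome is what the check decides for a deal that is waiting on sector commit.
type outcome int

const (
	proceed        outcome = iota // deal is on chain: keep waiting for the sector commit
	awaitRepublish                // deal and publish message both missing: likely reorg, do not fail
	fail                          // publish message is on chain but the deal is not: genuine error
)

// String makes outcomes readable when printed.
func (o outcome) String() string {
	return [...]string{"proceed", "awaitRepublish", "fail"}[o]
}

// chainView is a hypothetical stand-in for the client's current view of the chain.
type chainView struct {
	deals    map[uint64]bool // deal IDs present in market state
	messages map[string]bool // CIDs of messages currently on chain
}

// deal holds only the fields the check needs. PublishCid is the extra
// information the proposal suggests handing to the check.
type deal struct {
	ID         uint64
	PublishCid string
}

// checkDeal fails only when the publish message is on chain but the deal is not.
func checkDeal(c chainView, d deal) outcome {
	if c.deals[d.ID] {
		return proceed
	}
	if !c.messages[d.PublishCid] {
		// The block carrying the publish message is gone from our chain,
		// so the missing deal is expected rather than an error.
		return awaitRepublish
	}
	return fail
}

func main() {
	d := deal{ID: 55685, PublishCid: "publish-msg-cid"} // placeholder CID

	published := chainView{
		deals:    map[uint64]bool{55685: true},
		messages: map[string]bool{"publish-msg-cid": true},
	}
	reorged := chainView{deals: map[uint64]bool{}, messages: map[string]bool{}}
	inconsistent := chainView{
		deals:    map[uint64]bool{},
		messages: map[string]bool{"publish-msg-cid": true},
	}

	fmt.Println(checkDeal(published, d))
	fmt.Println(checkDeal(reorged, d))
	fmt.Println(checkDeal(inconsistent, d))
}
```

In this sketch the reorged view corresponds to the hypothesis above: block X is no longer on the client's chain, so neither the deal nor its publish message can be found, and the deal is left waiting instead of being moved to an error state. What the client should do while waiting (for example, how long to wait for the message to land again) is not covered here.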