enable targeting to continue in the event of a crash #309
base: master
Conversation
Force-pushed from 1946a7d to f9d9b60
Force-pushed from 691460e to e5d7992
…oc_restarts counter not being written to state file
pass updated Data to handle targetting
Force-pushed from e5d7992 to 9d3396f
sorry this is taking me so long. I'm like 95% on this but I am having trouble thinking through all the failure scenarios.
src/poc/miner_poc_statem.erl
Outdated
handle_challenging({Entropy, ignored}, TargetPubkeyBin, ignored, Height, Ledger, Vars, Data#data{challengees=[]});
{ok, V} ->
    {ok, {TargetPubkeyBin, TargetRandState}} = blockchain_poc_target_v3:target(ChallengerAddr, Entropy, Ledger, Vars),
    {ok, {TargetPubkeyBin, TargetRandState}} = blockchain_poc_target_v3:target(ChallengerAddr, Entropy, Ledger, Vars),
extra space here
you have eyes like a hawk :) fixed
  retry := Retry,
  receipts_timeout := ReceiptTimeout,
  poc_restarts := PoCRestarts,
  targeting_data := TargetingData} ->
when we upgrade here, targeting_data won't be present. Is the intention that we're just going to crash until we give up, just this once? It's a fine strategy, I just want to be explicit about it.
Correct, the load_data call will crash and subsequently in init we will default to requesting state. If you think this is too blunt I can pull each field individually from the loaded term, avoiding the crash and supplying default values for missing fields, but that would add additional maintenance overhead that probably is not warranted.
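For illustration, a minimal sketch of that per-field alternative, assuming the saved state deserializes to a map; load_data_tolerant is a hypothetical name, and the record fields and macros mirror the ones visible in the diff further down:

    %% Hypothetical sketch: a strict map match like the one above crashes with
    %% badmatch when an old state file lacks the targeting_data key; maps:get/3
    %% trades that crash for per-field defaults that must be kept in sync with
    %% the record definition.
    load_data_tolerant(Bin) ->
        Saved = binary_to_term(Bin),
        #data{retry            = maps:get(retry, Saved, ?CHALLENGE_RETRY),
              receipts_timeout = maps:get(receipts_timeout, Saved, ?RECEIPTS_TIMEOUT),
              poc_restarts     = maps:get(poc_restarts, Saved, ?POC_RESTARTS),
              targeting_data   = maps:get(targeting_data, Saved, undefined)}.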
Height = blockchain_block:height(Block),
%% TODO discuss making this delay verifiable so you can't punish a hotspot by
%% intentionally fast-challenging them before they have the block
%% Is this timer necessary ??
Yes. We intentionally delay a bit because if the block has not propagated (i.e. we're on the rising edge of the gossip wave), the remote node might not have seen the block yet, won't be able to decode the challenge, and will fail. That failure is always possible, even with the delay, but we do this to give the other end a chance.
gotcha
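For context, a minimal sketch of the delayed-challenge pattern described above; schedule_challenge, the message shape, and the 20-second figure are all illustrative assumptions, not the module's real names or constants:

    %% Hypothetical sketch: rather than challenging immediately on add_block,
    %% schedule the challenge slightly later so the block can finish gossiping.
    schedule_challenge(BlockHash) ->
        Delay = timer:seconds(20),   %% assumed figure, not the real constant
        erlang:send_after(Delay, self(), {challenge, BlockHash}).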
mining(enter, _State, #data{targeting_data = {targeting, BlockHash, PinnedLedger}} = Data) ->
    %% sorry, have to send msg via self here as state enters cannot insert events..bah...
    %% so I either send it here or during init..feels better here
    self() ! {retry_targeting, BlockHash, PinnedLedger},
I'm unclear about the logic here. Can you spell out where this gets cleared, and maybe what the flow looks like for failure? It looks like we might lose a tick of the timer at each state transition?
So this state enter will only be hit if we transition to mining state AND the state field targeting_data is set, which will be the case if we crashed out whilst handling targeting. As such, this state enter is only going to be hit if we restarted the PoC statem, reloaded our state, and that state had targeting data set.
The targeting data will then be cleared if we subsequently move to receiving state as part of handle_challenging, or if we move to requesting state from either handle_challenging or from handle_targeting itself. So basically it will be cleared if we succeed in targeting or challenging, or if either of those fails.
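To make that flow concrete, a hypothetical sketch of the lifecycle just described; do_targeting and the transition calls are illustrative stand-ins for the real functions, not their actual arities:

    %% Hypothetical sketch of the targeting_data lifecycle.
    handle_targeting(BlockHash, PinnedLedger, Data) ->
        %% keep enough context to replay targeting after a crash
        Data1 = Data#data{targeting_data = {targeting, BlockHash, PinnedLedger}},
        case do_targeting(BlockHash, PinnedLedger) of
            {ok, Target} ->
                %% still set here; handle_challenging clears it once we reach
                %% receiving, or clears it itself when falling back to requesting
                handle_challenging(Target, Data1);
            {error, _Reason} ->
                %% giving up: back to requesting with the context cleared
                requesting(Data1#data{targeting_data = undefined})
        end.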
… unsupported state and a bad term
…plane noproc errors from polluting test output
src/poc/miner_poc_statem.erl
Outdated
@@ -74,7 +74,8 @@
     mining_timeout = ?MINING_TIMEOUT :: non_neg_integer(),
     retry = ?CHALLENGE_RETRY :: non_neg_integer(),
     receipts_timeout = ?RECEIPTS_TIMEOUT :: non_neg_integer(),
-    poc_restarts = ?POC_RESTARTS :: non_neg_integer()
+    poc_restarts = ?POC_RESTARTS :: non_neg_integer(),
+    targeting_data :: {atom(), blockchain_block:hash(), binary()} | undefined
Isn't the third parameter here a rocksdb snapshot? This isn't safe to serialize to disk as it will not point to anything after a node restart...?
I think you need to store the block height of the ledger here, so you can use ledger_at to obtain a ledger snapshot when deserializing the stored state.
Well spotted! Now fixed...
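For illustration, a minimal sketch of the suggested rehydration, assuming ledger_at lives on the blockchain module with the usual {ok, Ledger} | {error, _} shape; restore_pinned_ledger is a hypothetical name:

    %% Hypothetical sketch: persist the ledger height instead of the rocksdb
    %% snapshot handle, and rebuild the snapshot on load via ledger_at.
    restore_pinned_ledger(Height, Chain) ->
        case blockchain:ledger_at(Height, Chain) of
            {ok, PinnedLedger} -> {ok, PinnedLedger};
            {error, _} = Err   -> Err   %% e.g. the height has been pruned
        end.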
If the PoC state machine crashes during handle_targeting or handle_challenging, the PoC is effectively lost: the add-block event carrying the PoC request txn will have passed and will not be seen again after restart, so the PoC reverts to requesting state.
This PR addresses the problem by saving to state and to the state file the blockhash (along with the PinnedLedger value) in which the PoC request was identified. Upon a crash and the subsequent restart of the statem, when it enters mining state the presence of the blockhash is checked for in state and, if present, is used to rerun targeting.
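As a rough sketch of the save side (save_state/2, the file handling, and the map layout are illustrative assumptions, not the statem's actual persistence code):

    %% Hypothetical sketch: serialize the replay context alongside the counter
    %% so a restarted statem can pick targeting back up.
    save_state(StateFile, #data{poc_restarts   = Restarts,
                                targeting_data = TargetingData}) ->
        Saved = #{poc_restarts   => Restarts,
                  targeting_data => TargetingData},  %% {targeting, BlockHash, ...} | undefined
        ok = file:write_file(StateFile, term_to_binary(Saved)).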
The PR also fixes a couple of bugs with the poc_restarts counter, among them the counter not being written to the state file. Together these bugs meant the kill switch would never be triggered, irrespective of how many times the same PoC restarted.