Skip to content
This repository has been archived by the owner on Jun 12, 2023. It is now read-only.

enable targetting to continue in the event of a crash #309

Open
wants to merge 7 commits into
base: master
Choose a base branch
from

Conversation

andymck
Copy link
Contributor

@andymck andymck commented Jan 31, 2020

If the poc state machine crashes during handle_targetting or handle_challenging the PoC is effectively lost as the add block event with the poc request txn will have passed and will not be seen again after restart. In this case the PoC reverts to requesting state.

This PR addresses this problem by saving the blockhash (along with the PinnedLedger value ) in which the poc request was identified to state and file. If upon a crash and the subsequent restart of the statem and it entering mining state, the presence of the blockhash will be checked for in state and if present will be used to rerun targettng.

The PR also fixes a couple of bugs with the poc_restarts counter:

  1. The counter was not being saved to the statem file as part of save_data/1
  2. The counter was not being loaded from the statem file as part of load_data/1
  3. After a crash/restart off the statem, the counter was being decremented but would not be saved to file until the next state transition. As such if there was another crash prior to that state transition the counter would revert to the previous value and would never reach zero.

Together these bugs meant the kill switch would never be triggered, irrespective of how many times the same PoC restarted.

@andymck andymck marked this pull request as ready for review January 31, 2020 18:52
@andymck andymck force-pushed the andymck/poc-enable-targeting-restarts branch from 1946a7d to f9d9b60 Compare February 3, 2020 10:40
@andymck andymck force-pushed the andymck/poc-enable-targeting-restarts branch from 691460e to e5d7992 Compare February 9, 2020 17:18
…oc_restarts counter not being written to state file

pass updated Data to handle targetting
@andymck andymck force-pushed the andymck/poc-enable-targeting-restarts branch from e5d7992 to 9d3396f Compare February 25, 2020 17:17
Copy link
Contributor

@evanmcc evanmcc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sorry this is taking me so long. I'm like 95% on this but I am having trouble thinking through all the failure scenarios.

handle_challenging({Entropy, ignored}, TargetPubkeyBin, ignored, Height, Ledger, Vars, Data#data{challengees=[]});
{ok, V} ->
{ok, {TargetPubkeyBin, TargetRandState}} = blockchain_poc_target_v3:target(ChallengerAddr, Entropy, Ledger, Vars),
{ok, {TargetPubkeyBin, TargetRandState}} = blockchain_poc_target_v3:target(ChallengerAddr, Entropy, Ledger, Vars),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

extra space here

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you have eyes like a hawk :) fixed

retry := Retry,
receipts_timeout := ReceiptTimeout,
poc_restarts := PoCRestarts,
targeting_data := TargetingData} ->
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

when we upgrade here, targeting_data won't be present, is the intention that we're just going to crash until we give up just this once? it's a fine strategy, I just want to be explicit about it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

correct, the load_data call will crash and subsequently in the init we will default too requesting state. If you think this is too blunt I can pull each field individually from the loaded term, avoid the crash and default values for missing fields but it would add additional maintenance overhead that prob is not warranted.

Height = blockchain_block:height(Block),
%% TODO discuss making this delay verifiable so you can't punish a hotspot by
%% intentionally fast-challenging them before they have the block
%% Is this timer necessary ??
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes. we intentionally delay a bit because if the block has not propagated (i.e. we're on the rising edge of the gossip wave), the remote node might not have seen the block yet, and won't be able to decode the challenge, and will fail. It's always possible, even with the delay, but we do this to give the other end a chance.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

gotcha

mining(enter, _State, #data{targeting_data = {targeting, BlockHash, PinnedLedger}} = Data)->
%% sorry, have to send msg via self here as state enters cannot insert events..bah...
%% so I either send it here or during init..feels better here
self ! {retry_targeting, BlockHash, PinnedLedger},
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm unclear about the logic here. can you spell out where this gets cleared, and maybe what the flow looks like for failure? it looks like we might lose a tick of the timer at each state transition?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So this state enter will only be hit if we transition to mining state AND if the state field targeting_data is set. This will be the case if we crash out whilst handling targeting. As such this state enter is only going to be hit if we restarted the POC statem, reloaded our state and that state had targeting data set.

The targeting data will then be cleared if we subsequently move to receiving state as part of handle_challenging or if we move too requesting state from either handle_challenging or from handle_targeting itself. So basically it will be cleared if we succeed in targeting or challenging or if either of those fail.

@@ -74,7 +74,8 @@
mining_timeout = ?MINING_TIMEOUT :: non_neg_integer(),
retry = ?CHALLENGE_RETRY :: non_neg_integer(),
receipts_timeout = ?RECEIPTS_TIMEOUT :: non_neg_integer(),
poc_restarts = ?POC_RESTARTS :: non_neg_integer()
poc_restarts = ?POC_RESTARTS :: non_neg_integer(),
targeting_data :: {atom(), blockchain_block:hash(), binary()} | undefined
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't the third parameter here a rocksdb snapshot? This isn't safe to serialize to disk as it will not point to anything after a node restart...?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you need to store the block height of the ledger here, so you can use ledger_at to obtain a ledger snapshot when deserializing the stored state

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well spotted! now fixed...

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants