enable targeting to continue in the event of a crash #309
base: master
Conversation
Force-pushed from 1946a7d to f9d9b60
Force-pushed from 691460e to e5d7992
…oc_restarts counter not being written to state file
pass updated Data to handle targetting
Force-pushed from e5d7992 to 9d3396f
sorry this is taking me so long. I'm like 95% on this but I am having trouble thinking through all the failure scenarios.
src/poc/miner_poc_statem.erl
Outdated
handle_challenging({Entropy, ignored}, TargetPubkeyBin, ignored, Height, Ledger, Vars, Data#data{challengees=[]});
{ok, V} ->
    {ok, {TargetPubkeyBin, TargetRandState}} = blockchain_poc_target_v3:target(ChallengerAddr, Entropy, Ledger, Vars),
    {ok, {TargetPubkeyBin, TargetRandState}} = blockchain_poc_target_v3:target(ChallengerAddr, Entropy, Ledger, Vars),
extra space here
you have eyes like a hawk :) fixed
  retry := Retry,
  receipts_timeout := ReceiptTimeout,
  poc_restarts := PoCRestarts,
  targeting_data := TargetingData} ->
when we upgrade here, targeting_data won't be present. Is the intention that we're just going to crash until we give up, just this once? It's a fine strategy, I just want to be explicit about it.
Correct, the load_data call will crash and subsequently in init we will default to requesting state. If you think this is too blunt I can pull each field individually from the loaded term, avoiding the crash and supplying default values for missing fields, but that would add additional maintenance overhead that probably is not warranted.
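For illustration, a minimal sketch of that per-field alternative, assuming the saved state deserializes to a map; load_data_tolerant is a hypothetical name, and the record fields and macros mirror the ones visible in the diff further down:

    %% Hypothetical sketch: a strict map match like the one above crashes with
    %% badmatch when an old state file lacks the targeting_data key; maps:get/3
    %% trades that crash for per-field defaults that must be kept in sync with
    %% the record definition.
    load_data_tolerant(Bin) ->
        Saved = binary_to_term(Bin),
        #data{retry            = maps:get(retry, Saved, ?CHALLENGE_RETRY),
              receipts_timeout = maps:get(receipts_timeout, Saved, ?RECEIPTS_TIMEOUT),
              poc_restarts     = maps:get(poc_restarts, Saved, ?POC_RESTARTS),
              targeting_data   = maps:get(targeting_data, Saved, undefined)}.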
Height = blockchain_block:height(Block),
%% TODO discuss making this delay verifiable so you can't punish a hotspot by
%% intentionally fast-challenging them before they have the block
%% Is this timer necessary ??
Yes. We intentionally delay a bit because if the block has not propagated (i.e. we're on the rising edge of the gossip wave), the remote node might not have seen the block yet, won't be able to decode the challenge, and will fail. That failure is always possible, even with the delay, but we do this to give the other end a chance.
gotcha
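For context, a minimal sketch of the delayed-challenge pattern described above; schedule_challenge, the message shape, and the 20-second figure are all illustrative assumptions, not the module's real names or constants:

    %% Hypothetical sketch: rather than challenging immediately on add_block,
    %% schedule the challenge slightly later so the block can finish gossiping.
    schedule_challenge(BlockHash) ->
        Delay = timer:seconds(20),   %% assumed figure, not the real constant
        erlang:send_after(Delay, self(), {challenge, BlockHash}).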
mining(enter, _State, #data{targeting_data = {targeting, BlockHash, PinnedLedger}} = Data) ->
    %% sorry, have to send msg via self here as state enters cannot insert events..bah...
    %% so I either send it here or during init..feels better here
    self() ! {retry_targeting, BlockHash, PinnedLedger},
I'm unclear about the logic here. Can you spell out where this gets cleared, and maybe what the flow looks like for failure? It looks like we might lose a tick of the timer at each state transition?
So this state enter will only be hit if we transition to mining state AND the state field targeting_data is set, which will be the case if we crashed out whilst handling targeting. As such, this state enter is only going to be hit if we restarted the PoC statem, reloaded our state, and that state had targeting data set.
The targeting data will then be cleared if we subsequently move to receiving state as part of handle_challenging, or if we move to requesting state from either handle_challenging or from handle_targeting itself. So basically it will be cleared if we succeed in targeting or challenging, or if either of those fails.
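To make that flow concrete, a hypothetical sketch of the lifecycle just described; do_targeting and the transition calls are illustrative stand-ins for the real functions, not their actual arities:

    %% Hypothetical sketch of the targeting_data lifecycle.
    handle_targeting(BlockHash, PinnedLedger, Data) ->
        %% keep enough context to replay targeting after a crash
        Data1 = Data#data{targeting_data = {targeting, BlockHash, PinnedLedger}},
        case do_targeting(BlockHash, PinnedLedger) of
            {ok, Target} ->
                %% still set here; handle_challenging clears it once we reach
                %% receiving, or clears it itself when falling back to requesting
                handle_challenging(Target, Data1);
            {error, _Reason} ->
                %% giving up: back to requesting with the context cleared
                requesting(Data1#data{targeting_data = undefined})
        end.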
… unsupported state and a bad term
…plane noproc errors from polluting test output
src/poc/miner_poc_statem.erl
Outdated
@@ -74,7 +74,8 @@
     mining_timeout = ?MINING_TIMEOUT :: non_neg_integer(),
     retry = ?CHALLENGE_RETRY :: non_neg_integer(),
     receipts_timeout = ?RECEIPTS_TIMEOUT :: non_neg_integer(),
-    poc_restarts = ?POC_RESTARTS :: non_neg_integer()
+    poc_restarts = ?POC_RESTARTS :: non_neg_integer(),
+    targeting_data :: {atom(), blockchain_block:hash(), binary()} | undefined
Isn't the third parameter here a rocksdb snapshot? This isn't safe to serialize to disk as it will not point to anything after a node restart...?
I think you need to store the block height of the ledger here, so you can use ledger_at to obtain a ledger snapshot when deserializing the stored state.
Well spotted! Now fixed...
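For illustration, a minimal sketch of the suggested rehydration, assuming ledger_at lives on the blockchain module with the usual {ok, Ledger} | {error, _} shape; restore_pinned_ledger is a hypothetical name:

    %% Hypothetical sketch: persist the ledger height instead of the rocksdb
    %% snapshot handle, and rebuild the snapshot on load via ledger_at.
    restore_pinned_ledger(Height, Chain) ->
        case blockchain:ledger_at(Height, Chain) of
            {ok, PinnedLedger} -> {ok, PinnedLedger};
            {error, _} = Err   -> Err   %% e.g. the height has been pruned
        end.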
If the PoC state machine crashes during handle_targeting or handle_challenging, the PoC is effectively lost: the add-block event carrying the PoC request txn will have passed and will not be seen again after restart, so the PoC reverts to requesting state.
This PR addresses the problem by saving to state and to the state file the blockhash (along with the PinnedLedger value) in which the PoC request was identified. Upon a crash and the subsequent restart of the statem, when it enters mining state the presence of the blockhash is checked for in state and, if present, is used to rerun targeting.
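As a rough sketch of the save side (save_state/2, the file handling, and the map layout are illustrative assumptions, not the statem's actual persistence code):

    %% Hypothetical sketch: serialize the replay context alongside the counter
    %% so a restarted statem can pick targeting back up.
    save_state(StateFile, #data{poc_restarts   = Restarts,
                                targeting_data = TargetingData}) ->
        Saved = #{poc_restarts   => Restarts,
                  targeting_data => TargetingData},  %% {targeting, BlockHash, ...} | undefined
        ok = file:write_file(StateFile, term_to_binary(Saved)).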
The PR also fixes a couple of bugs with the poc_restarts counter, among them the counter not being written to the state file. Together these bugs meant the kill switch would never be triggered, irrespective of how many times the same PoC restarted.