Description
High priority
-
If the protocol state `protocol.StateUpdater` reports that it encountered an invalid state transition (`InvalidServiceEventError`), we currently just set the `StateUpdater.InvalidStateTransitionAttempted` flag to true and proceed as usual:
flow-go/state/protocol/badger/mutator.go, lines 1124 to 1128 in 0172dd1
However, we don't know that the `updater`'s internal state is a valid Protocol State. As far as I can tell, there is no atomicity requirement on the `StateUpdater`, i.e. that it either applies the update entirely or not at all. For all we know, its interim state could be complete garbage. Nevertheless, we just build whatever (potentially broken) interim state we aborted the update at.
Suggestion:
- I think it is OK to treat all protocol state mutations in a block as one atomic state transition: either it succeeds in its entirety, or, in case of a failure, there is no update to the protocol state (except setting `InvalidStateTransitionAttempted` to `true`).
- I think it is straightforward to extend the protocol state `Updater` to "drop all modifications if `InvalidStateTransitionAttempted`" and to document this properly throughout the code base. In a nutshell, within the `Build` method I would check `state.InvalidStateTransitionAttempted` first; in that case we could just return a copy of the `parentState` with only `state.InvalidStateTransitionAttempted` set to true.
- While the code changes are relatively small, we need to properly test this and add detailed documentation to the `Updater`, the corresponding interface, and ideally also the places where we call the `Updater`.
Further thoughts:
- If the updater signals that it encountered an invalid state transition via an `InvalidServiceEventError`, should the updater maybe already set its own `InvalidStateTransitionAttempted` flag? We would still raise the error to signal the failure to the caller. However, the caller would no longer be required to explicitly call `updater.SetInvalidStateTransitionAttempted()` (I'd keep that method, to allow logic external to the `Updater` to set this flag for other reasons).
(for further details, see this PR comment)
☝️ implemented in [Dynamic Protocol State] Dedicated protocol state machine for epoch fallback #4931
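The suggested all-or-nothing behavior of `Build` could be sketched roughly as follows. This is a minimal illustration, not the actual flow-go implementation: `ProtocolState` and `Updater` here are hypothetical, simplified stand-ins with only the fields needed to show the rule.

```go
package main

import "fmt"

// Hypothetical, simplified stand-ins for the real protocol state types.
type ProtocolState struct {
	EpochCounter                    uint64
	InvalidStateTransitionAttempted bool
}

type Updater struct {
	parentState ProtocolState // protocol state of the parent block (read-only)
	state       ProtocolState // working copy that mutations are applied to
}

// Build returns the resulting protocol state. If an invalid state transition
// was attempted, all interim modifications are dropped: we return a copy of
// the parent state with only the InvalidStateTransitionAttempted flag set.
func (u *Updater) Build() ProtocolState {
	if u.state.InvalidStateTransitionAttempted {
		result := u.parentState // value copy; discards the interim state
		result.InvalidStateTransitionAttempted = true
		return result
	}
	return u.state
}

func main() {
	u := &Updater{
		parentState: ProtocolState{EpochCounter: 7},
		// the working copy was left in a potentially garbage interim state:
		state: ProtocolState{EpochCounter: 9999, InvalidStateTransitionAttempted: true},
	}
	built := u.Build()
	fmt.Println(built.EpochCounter, built.InvalidStateTransitionAttempted) // prints "7 true"
}
```

The point of the sketch: the potentially corrupted working copy never leaves the updater, so callers can safely persist whatever `Build` returns.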
-
Explicitly encode a node's epoch lifecycle status (e.g. joining, active, leaving) in the protocol state.
Currently, we detect that a node is authorized to be part of the network, but not to actively contribute to extending the chain, with the following conditional:
`registeredButNotAuthorizedForCurrentEpoch := identity.Weight == 0 && !identity.Ejected`
Context:
- There are two reasons for node X to have 0 weight. Either (1) X has been ejected, or (2) X is authorized for a past or upcoming epoch, but not the current one. We detect (2) implicitly at the moment.
- Currently, we maintain a dynamic weight but only differentiate between zero weight (not active) and positive (active). It is not clear whether the protocol would benefit from dynamically changing trust weight in the future. Reasoning:
-
For consensus participants (consensus nodes, collector nodes), we use their initial weight anyway. The practical benefits are massive, because this enables light clients that don't need to download and locally check every block header. Leader selection is also pre-determined for an entire epoch based on initial weight.
-
For Verification Nodes [VNs] and Execution Nodes [ENs], we currently don't have meaningful ways of weight-based load balancing: ENs need to execute every block anyway, and our chunk-assignment algorithm generates uniform load across all VNs. It is not clear whether weighted protocol variants even exist (and even if such algorithms exist [open research topic], a large amount of complicated software changes would likely be needed).
-
Therefore, we conclude that at the moment only a binary notion of weight (`zero` vs `positive`) is used, and we don't believe the protocol will benefit from a more fine-grained differentiation in the foreseeable future.
Further reasoning:
- For clarity, we will replace the dynamic weight by an enum representing the `participation status`, with possible values of `active` [positive weight] and `joining`, `leaving` [zero weight].
- Intuitively, it seems reasonable to include the `Ejected` status just as one possible value of this enum. If we kept the `participation status` and the `Ejected` flag separate, we could easily differentiate between (`joining`, `ejected`) and (`leaving`, `ejected`). However, I don't think this differentiation is conceptually useful. If a node is ejected, we should just drop it from the protocol state at the next possible occasion. Whatever status it had before (joining, active, leaving) is no longer relevant for the core protocol layer (for the system smart contracts, that might be different).
Suggested revision:
- replace the dynamic weight by an enum representing the `participation status`, with possible values of `active` [positive weight] and `joining`, `leaving`, `ejected`
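A sketch of what such an enum could look like in Go; the type and constant names are illustrative guesses, not necessarily what the codebase will use:

```go
package main

import "fmt"

// Illustrative enum replacing the dynamic weight; names are hypothetical.
type EpochParticipationStatus int

const (
	EpochParticipationStatusJoining EpochParticipationStatus = iota // authorized for an upcoming epoch [zero weight]
	EpochParticipationStatusActive                                  // actively participating [positive weight]
	EpochParticipationStatusLeaving                                 // authorized for a past epoch [zero weight]
	EpochParticipationStatusEjected                                 // permanently removed from the protocol
)

// IsActive replaces the previous binary weight check (Weight > 0).
func (s EpochParticipationStatus) IsActive() bool {
	return s == EpochParticipationStatusActive
}

func main() {
	fmt.Println(EpochParticipationStatusActive.IsActive())  // prints "true"
	fmt.Println(EpochParticipationStatusEjected.IsActive()) // prints "false"
}
```

With the enum, the implicit `Weight == 0 && !Ejected` detection above becomes an explicit status check.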
For further context, this todo is based on the PR comments:
- [Dynamic Protocol State] Split `flow.Identity` in `flow.IdentitySkeleton` and `flow.DynamicIdentity` #4545 (comment)
- [Dynamic Protocol State] `EpochStateContainer` stores epoch active identities #4834 (comment)
☝️ implemented in [Dynamic Protocol State] Replace dynamic weight and ejection flag with 'epoch participation status' #5039
-
At the moment, the `protocol_state.Updater` returns `InvalidServiceEventError` for obviously broken inputs. However, let's take a full pass over the implementation to make sure we correctly differentiate between internal exceptions and external errors in all cases.
Context
-
In a sense, all protocol-state-changing events should be considered external inputs. Those inputs can be broken (though they should not be, because they passed verification).
-
I think the updater should clearly distinguish between
- an input event representing an invalid state transition (sentinel error)
- a failed sanity check or consistency requirement indicating a corrupted protocol state
The node internally only maintains protocol state snapshots; should anything be wrong with them, it is an exception. In contrast, blocks essentially only represent state transitions. Should something not match up with a state transition, something is wrong with the block or with the process generating the state transitions recorded in the block (e.g. the epoch smart contract).
☝️ implemented in [Dynamic Protocol State] Dedicated protocol state machine for epoch fallback #4931
-
Take a fresh look at the `InvalidStateTransitionAttempted` flag and its relation to the node-internal `EpochEmergencyFallbackTriggered` event and the related flag in `InstanceParams`. See PR comment here. In the context of this discussion, we will probably also have to consider the option of leaving epoch fallback mode.
👉 moved to separate issue [Dynamic Protocol State] passing the epoch commitment deadline without an `EpochCommit` event #5104
Medium priority
-
Review the precise conditions under which an epoch transition occurs and how that interacts with epoch emergency fallback. See Jordan's comment.
☝️ @jordanschalm: This has changed again since the original comment. We do not explicitly check that `SetupID != nil`, but we do check in both cases that we are in the EpochCommitted phase before transitioning to the next epoch (which we should). I don't think there is a need to change anything here any more.
Update: we concluded that this sanity check might be helpful in case the node has corrupted its own state. However, there are multiple similar checks already in place, so we decided not to bloat the code with sanity checks that are most likely never needed.
-
following Jordan's comment on PR #4559, rename `ProtocolStateEntry` to `IdentityTableSnapshot` (🎏 does it contain more than the identity table?)
☝️ @jordanschalm: Now, it does contain more than the identity table.
-
This is based on Jordan's comment on PR #4559 on the data structure `ProtocolStateEntry`, which is mixing conceptually different approaches at the moment:
- links to state-change events, from which the core useful piece of data (the Identity Table) can be derived
- the Identity Table itself
In the following, we are going to refer to the `ProtocolStateEntry` as `IdentityTableSnapshot` (i.e. assume the previous ToDo is already implemented).
Concerns are:
- I'm wondering how this approach will scale when we have more different state-change events. Would we need to put them all in here? Ideally, after a state-change event is processed, we store only the resulting Identity Table.
- Conceptually, the current approach could become confusing and error-prone when we add slashing/ejection events, which modify one node and leave the rest unchanged (maintenance and extensibility concern). Furthermore, we need to separate the state transition logic anyway to implement construction and verification of Identity Table changes in consensus nodes (see issue https://github.com/dapperlabs/flow-go/issues/5517).
Proposal:
- Implement a separate state machine $\mathcal{S}$ for evolving the protocol state.
- Conceptually, each consensus node has knowledge of the parent block's $IdentityTableSnapshot_{N-1}$ (which takes effect after the QC in the child block confirms its validity, so specifically for the child block).
- The child block $N$ might contain state-changing operations $\Delta_N$ (example: $\Delta_N$ might be an `EpochSetup` event, an ejection request, a slashing, ...).
- The proposer of the child block will include the hash of the resulting $IdentityTableSnapshot_{N}$, which is a commitment to the resulting Identity Table after block $N$ is processed. Both the primary as well as the consensus replicas compute / validate $IdentityTableSnapshot_{N}$ by applying the state machine $\mathcal{S}$: $IdentityTableSnapshot_{N} = \mathcal{S}(IdentityTableSnapshot_{N-1}, \Delta_N, view)$, where $view$ is the view of the block containing $\Delta_N$ (I think this will be very helpful for consistency checks).
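A toy illustration of the proposed commitment scheme, with hypothetical stand-in types (the real snapshot and state machine are far richer): the proposer commits to the hash of the snapshot resulting from applying $\mathcal{S}$, and every replica recomputes it independently.

```go
package main

import (
	"crypto/sha256"
	"fmt"
)

// Toy stand-in for the identity table snapshot.
type IdentityTableSnapshot struct {
	EpochCounter uint64
	Members      []string
}

// ID is a toy content hash standing in for the real snapshot commitment.
func (s IdentityTableSnapshot) ID() [32]byte {
	return sha256.Sum256([]byte(fmt.Sprintf("%d|%v", s.EpochCounter, s.Members)))
}

// Delta represents the state-changing operations Δ_N in block N (here: at
// most one ejection, for illustration).
type Delta struct{ Eject string }

// S is the state machine: snapshot_N = S(snapshot_{N-1}, Δ_N, view).
func S(parent IdentityTableSnapshot, delta Delta, view uint64) IdentityTableSnapshot {
	next := IdentityTableSnapshot{EpochCounter: parent.EpochCounter}
	for _, m := range parent.Members {
		if m != delta.Eject { // drop the ejected member, keep the rest
			next.Members = append(next.Members, m)
		}
	}
	return next
}

func main() {
	parent := IdentityTableSnapshot{EpochCounter: 3, Members: []string{"a", "b", "c"}}
	delta := Delta{Eject: "b"}
	// The proposer includes this commitment in block N ...
	proposerCommitment := S(parent, delta, 100).ID()
	// ... and each replica recomputes the snapshot and validates the commitment.
	replicaResult := S(parent, delta, 100)
	fmt.Println(replicaResult.ID() == proposerCommitment) // prints "true"
}
```

A mismatch between the recomputed ID and the proposer's commitment would make the block invalid, which is exactly the verification property the proposal is after.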
-
Extend the unit tests for `RichProtocolStateEntry` by also testing these scenarios:
- constructing `RichProtocolStateEntry` for the epoch setup phase where no prior epoch exists (i.e. the first epoch setup phase after a spork) 👉 ToDo
- constructing `RichProtocolStateEntry` for the epoch commit phase where no prior epoch exists (i.e. the first epoch commit phase after a spork) 👉 ToDo
☝️ implemented in [Dynamic Protocol State] TODOs and refactoring, part 2 #5080
-
For the `ProtocolState` storage layer abstraction, I would suggest also including a cache for the secondary index (retrieving the Identity Table by block ID) 👉 PR comment
☝️ implemented in [Dynamic Protocol State] TODOs and refactoring, part 2 #5080
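Conceptually, the secondary-index cache is an in-memory map blockID → protocolStateID sitting in front of the primary cache. A simplified sketch with hypothetical names; the real storage layer backs both lookups with database reads and bounded caches:

```go
package main

import (
	"fmt"
	"sync"
)

// ProtocolStateStorage sketches the two-level lookup. Entries are elided to
// strings here; in reality they would be protocol state entries.
type ProtocolStateStorage struct {
	mu             sync.RWMutex
	byStateID      map[string]string // primary cache: stateID -> entry
	stateIDByBlock map[string]string // secondary index cache: blockID -> stateID
}

func NewProtocolStateStorage() *ProtocolStateStorage {
	return &ProtocolStateStorage{
		byStateID:      make(map[string]string),
		stateIDByBlock: make(map[string]string),
	}
}

// Index populates both the primary cache and the secondary index.
func (s *ProtocolStateStorage) Index(blockID, stateID, entry string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.byStateID[stateID] = entry
	s.stateIDByBlock[blockID] = stateID
}

// ByBlockID resolves the secondary index first, then the primary cache.
func (s *ProtocolStateStorage) ByBlockID(blockID string) (string, bool) {
	s.mu.RLock()
	defer s.mu.RUnlock()
	stateID, ok := s.stateIDByBlock[blockID]
	if !ok {
		return "", false // the real storage layer would fall back to a DB read here
	}
	entry, ok := s.byStateID[stateID]
	return entry, ok
}

func main() {
	s := NewProtocolStateStorage()
	s.Index("block-1", "state-A", "identity-table-A")
	entry, ok := s.ByBlockID("block-1")
	fmt.Println(entry, ok) // prints "identity-table-A true"
}
```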
Low priority
-
I feel the method `Members` of `protocol.Cluster` should return `IdentitySkeleton`s (👉 code). We are doing multiple conversions to `IdentitySkeleton` in code using `protocol.Cluster`, which I feel indicates a design inconsistency.
☝️ implemented in [Dynamic Protocol State] Changing structure of participants in `EpochSetup` #4726
-
For the `ProtocolState` storage layer abstraction, review the possibility of 'caching the `RichProtocolStateEntry` on store' 👉 PR comment
❌ Won't implement: On the happy path, the Protocol State changes rarely throughout the epoch; the protocol only requires 3 stores per epoch. While there is a non-negligible overhead for the database read on the first retrieval, this happens extremely rarely, and the latency cost is not great but reasonable.
-
Cache `GlobalParams` rather than reading them per request and panicking on unexpected errors. [Dynamic Protocol State] Read-only interfaces of protocol state #4579 (comment)
-
For `EpochSetup` and `EpochCommit` events, we have a long list of consistency checks in the `RichProtocolStateEntry` constructor. I originally proposed to introduce the convention of `nil ≡ ZeroID` for `EpochSetup` and `EpochCommit`, but as Yurii pointed out, this would violate the "safety by default" approach:
> some values should always be not nil, otherwise it will crash during the development; we lose this property if we implement ZeroID for nil epoch setup
If we have time, let's revisit the `RichProtocolStateEntry` constructor and try to add convenience functions to reduce the amount of boilerplate code that we have to write for such sanity checks. Presumably, we want to have analogous sanity checks in other parts of the code, and a more compact way of implementing them would benefit code readability.
For details, please see this PR discussion: [Dynamic Protocol State] `ProtocolStateEntry` refactoring #4721 (comment)
❌ Won't implement: sanity checks are reasonably consolidated. While they add quite a lot of lines, their explicitness and easy readability are benefits.
-
My gut feeling is that we should try to remove `EpochStatus`, as it is now superseded by `ProtocolStateEntry`. I think our understanding of the implementation backing `InitialProtocolState` and `DynamicProtocolState` has evolved. Specifically, the `RichProtocolStateEntry` is already available as part of the API, which contains the Epoch Setup and Commit events for the current, next and previous epochs (if available).
☝️ implemented in [Dynamic Protocol State] Remove `EpochStatus` #5089
I am not sure how much work the following refactoring would be. I think we should only invest a limited amount of time, maybe a week, to get this done (if it takes significantly more time, I think we should leave the code as is).
- I would be inclined to not expose the `RichProtocolStateEntry` struct through the interface. Instead, I think we should utilize an interface that exposes all relevant data through methods. We already have an interface for all the Epoch information: `EpochQuery`.
- The `InitialProtocolState` interface could be completely implemented by `RichProtocolStateEntry` (removing the need for the `InitialProtocolStateAdapter` wrapper struct).
- Similarly, `DynamicProtocolState` could also be completely implemented by `RichProtocolStateEntry` (in most cases I think we have `GlobalParams` available anyway). Thereby we would remove the need for `DynamicProtocolStateAdapter`, which would further simplify the code.
- I feel this would be one step towards consolidating `protocol.State` and `protocol.ProtocolState`.
Code structure and clarity
-
Epoch setup event should only provide `IdentitySkeleton`s.
☝️ implemented in [Dynamic Protocol State] Changing structure of participants in `EpochSetup` #4726
-
I feel it would be most consistent if all initial identities were represented only by `IdentitySkeleton`s. Thereby, we have a clear type distinction between the full `Identity`, which can change from block to block, and the `IdentitySkeleton`, which is fixed for the epoch.
- In particular for the `Cluster`, I would suggest having `Members()` return an `IdentitySkeletonList`. I hope this wouldn't be a huge refactor. Though, we probably would need to take a closer look at `Cluster.initialClusterMembers`, because `committees.Cluster` constructs the cluster members at the root block from `initialClusterMembers`:
flow-go/consensus/hotstuff/committees/cluster_committee.go, lines 83 to 89 in c9087fc
`Cluster` is a good place to implement the convention of how we translate an `IdentitySkeleton` into a full `Identity` at the very beginning of an epoch.
☝️ implemented in [Dynamic Protocol State] Changing structure of participants in `EpochSetup` #4726
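The conversion convention could look roughly like this; the types are simplified stand-ins for the flow model types, and the field set is illustrative:

```go
package main

import "fmt"

// IdentitySkeleton holds the epoch-static part of a node's identity.
type IdentitySkeleton struct {
	NodeID        string
	InitialWeight uint64
}

// DynamicIdentity holds the fields that may change from block to block.
type DynamicIdentity struct {
	Weight  uint64
	Ejected bool
}

// Identity is the full identity: skeleton plus dynamic part.
type Identity struct {
	IdentitySkeleton
	DynamicIdentity
}

// NewEpochIdentities sketches the convention: at the very beginning of an
// epoch, every full Identity is derived from its skeleton, with the dynamic
// fields set to their epoch-initial defaults.
func NewEpochIdentities(skeletons []IdentitySkeleton) []Identity {
	identities := make([]Identity, 0, len(skeletons))
	for _, s := range skeletons {
		identities = append(identities, Identity{
			IdentitySkeleton: s,
			DynamicIdentity:  DynamicIdentity{Weight: s.InitialWeight, Ejected: false},
		})
	}
	return identities
}

func main() {
	ids := NewEpochIdentities([]IdentitySkeleton{{NodeID: "n1", InitialWeight: 100}})
	fmt.Println(ids[0].NodeID, ids[0].Weight) // prints "n1 100"
}
```

Having a single constructor own this translation removes the scattered skeleton-to-identity conversions the bullet above complains about.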
-
From my perspective, we are approaching a complexity / maintainability challenge w.r.t. the protocol state. The protocol state has been complex for a long time and we have iteratively refactored it to manage its growing set of functionality. With the addition of the dynamic protocol state, I feel we are at the point again where we should be thinking about further modularization and separation of concerns.
Areas I would suggest considering:
- At the moment, `committee.Cluster` and `committee.Consensus` live on top of `protocol.State` (👉 code). As a result, some logic for managing the Identity Table is in `protocol.State` while other logic is in the committees, and we convert back and forth between full `Identity`s and `IdentitySkeleton`s (👉 code), which I feel is prone to inconsistent code modifications.
- I think it would be beneficial to strongly separate the logic at a lower level. We introduce largely independent components for:
  - the Identity Table, including `committee.Cluster` and `committee.Consensus`. They manage their own portion of the state in the DB and provide mutator operations that return the necessary database operations
  - the Epoch information
  - everything related to indexing blocks and their processing status (finality, sealing, etc.)
- `protocol.State` would then become the high-level orchestration layer. On the retrieval end, the protocol state would compose the interfaces from the lower-level components listed above. When adding new blocks, `protocol.State` would delegate to the respective logic of the lower-level components, collect the resulting database changes, and apply them all in one single transaction.
Not really sure if this is tractable; there will probably be a lot of challenges when we try to implement it. Nevertheless, I feel it would still be beneficial to explore this further, to at least get an idea of how complex it is and whether we can maybe work on it incrementally over time.
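The orchestration pattern sketched here, where lower-level components return deferred database operations that the high-level `protocol.State` applies in one transaction, could look roughly as follows. All names and the toy `Transaction` type are hypothetical:

```go
package main

import "fmt"

// Transaction is a toy stand-in for a database transaction.
type Transaction struct{ ops []string }

// DeferredDBUpdate is a database operation to be applied later, inside the
// caller's transaction.
type DeferredDBUpdate func(tx *Transaction)

// Hypothetical lower-level components, each managing its own portion of state
// and returning its database changes as deferred operations.
func extendIdentityTable(blockID string) DeferredDBUpdate {
	return func(tx *Transaction) { tx.ops = append(tx.ops, "identity-table:"+blockID) }
}
func indexEpochInfo(blockID string) DeferredDBUpdate {
	return func(tx *Transaction) { tx.ops = append(tx.ops, "epoch-info:"+blockID) }
}
func indexBlock(blockID string) DeferredDBUpdate {
	return func(tx *Transaction) { tx.ops = append(tx.ops, "block-index:"+blockID) }
}

// ExtendState is the orchestration layer: it delegates to the components,
// collects the resulting deferred updates, and applies them atomically.
func ExtendState(blockID string) Transaction {
	updates := []DeferredDBUpdate{
		extendIdentityTable(blockID),
		indexEpochInfo(blockID),
		indexBlock(blockID),
	}
	var tx Transaction
	for _, apply := range updates {
		apply(&tx) // all-or-nothing: one transaction for the whole block
	}
	return tx
}

func main() {
	tx := ExtendState("0xabc")
	fmt.Println(len(tx.ops)) // prints "3"
}
```

The key property is that no component commits on its own: each only describes its writes, so the orchestration layer can guarantee block-level atomicity.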
-
Currently, there is a lot of ambiguity in the naming of `protocol.State`, `protocol.Snapshot`, `protocol.ProtocolState`, `protocol.InitialProtocolState`, `protocol.DynamicProtocolState` (see Jordan's PR comment).
- Rename `DynamicProtocolState` to `DynamicProtocolSnapshot`.
- Consolidate `protocol.InitialProtocolState` with the `protocol.Epoch` APIs. Currently there is a lot of overlap.
- ~~As mentioned here, consider implementing interfaces for accessing the dynamic protocol state using `RichProtocolStateEntry`.~~
-
We are exposing `Entry` in the protocol state API to access low-level data. This is handy in a few cases, such as bootstrapping, but also dangerous, as mentioned here. Consider removing it from higher-level APIs. This should be handled after we have a working solution, since it's hard to say how to perform certain actions that depend on the existence of `Entry`.
Potentially obsolete comments and suggestions (for completeness)
-
This is based on (👉 Jordan's PR comment): `ProtocolStateEntry.NextEpochProtocolState` contains a lot of duplicated data from `ProtocolStateEntry`. What if we replaced the field `NextEpochProtocolState` with:
- a field `NextEpochEventIDs`
- a field `NextEpochIdentities`
- a method `NextEpochProtocolState()`, which aggregates the next epoch protocol state
Separately, if we use an explicit state enum (suggestion) rather than using zero weight, we could always track all current identities together.
Comment:
-
I think the duplication issue will be addressed in the ToDo further above:
ProtocolStateEntry
would then be an ordered pair for two adjacent epochs (N, N+1), where the earlier Epoch can benil
if and only if N+1 is the first epoch of the spork / genesis. -
regarding the suggestion to 'track all current identities together':
As discussed in [Proposal] Detailed specification for framework functionality to support future slashing, I believe is is fundamentally necessary to track the identity table for the next and the current epoch separately (see section 'necessity of deferred operations'), which in turn necessitates data duplication.
- a field,
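The field/method replacement proposed in this item could look roughly like this; all types and fields are illustrative stand-ins:

```go
package main

import "fmt"

// EventIDs is a toy container for the epoch service event identifiers.
type EventIDs struct{ SetupID, CommitID string }

// ProtocolStateEntry sketches the proposed shape: instead of embedding a full
// nested entry for the next epoch, store only the data that actually differs.
type ProtocolStateEntry struct {
	CurrentEpochEventIDs EventIDs
	Identities           []string

	NextEpochEventIDs   EventIDs
	NextEpochIdentities []string
}

// NextEpochProtocolState aggregates the next epoch's protocol state on
// demand, rather than storing a duplicated copy.
func (e ProtocolStateEntry) NextEpochProtocolState() ProtocolStateEntry {
	return ProtocolStateEntry{
		CurrentEpochEventIDs: e.NextEpochEventIDs,
		Identities:           e.NextEpochIdentities,
	}
}

func main() {
	e := ProtocolStateEntry{
		NextEpochEventIDs:   EventIDs{SetupID: "setup-8", CommitID: "commit-8"},
		NextEpochIdentities: []string{"a", "b"},
	}
	next := e.NextEpochProtocolState()
	fmt.Println(next.CurrentEpochEventIDs.SetupID, len(next.Identities)) // prints "setup-8 2"
}
```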
-
The concepts of `PreviousEpoch`, `CurrentEpoch` and `NextEpoch` in the `ProtocolStateEntry` could potentially lead to misunderstandings, resulting in subtle security vulnerabilities. Specifically, they need to be updated at the epoch transition, despite the information technically still being the same. Furthermore, they are imprecise compared to identifying an epoch uniquely via its counter, because `previous`, `current`, `next` are defined with respect to a changing reference frame.
Todo:
- identify Epochs solely by their epoch counter
- a `ProtocolStateEntry` would then be an ordered pair for two adjacent epochs (N, N+1), where the earlier epoch can be `nil` if and only if N+1 is the first epoch of the spork / genesis
- similarly, `RichProtocolStateEntry` would also be an ordered pair for two adjacent epochs (N, N+1)
Notes: With this change, we are removing the previous epoch from the `ProtocolStateEntry`. This is intended, for the following reason: at the beginning of an epoch (specifically, all the way throughout the staking phase), the pair (N-1, N) will contain information about the last epoch (N-1). With the epoch setup event, the protocol state changes to carrying the pair (N, N+1). Therefore, any light client that trusts a block of epoch N can trustlessly retrieve information about the prior epoch and the next epoch by querying two different block headers (plus a short sequence of descendants to prove finality). I think the additional networking and CPU consumption is negligible, while the software simplification is notable.
We decided not to do this, for the following reason: the conceptual status of a node changes at epoch boundaries for nodes joining or leaving, hence the conceptual complexity remains. We chose to work with a conceptually more intuitive model at the code level, where the identity table changes at the epoch switchover. While this complicates the safety and consistency proofs, it moves this complexity to a formal (mathematical) level outside of the code, where we have the drastically more powerful tool of formal proofs at our disposal to guarantee correctness.