Conversation
Mirko-von-Leipzig
left a comment
I've left quite a few comments on what I think next steps can/should be. @sergerad if they make sense, could you collate them into the tracking issue? If you disagree with them, we can litigate in that issue or here.
In terms of this PR, I'm happy to merge as is. There are a few potential paper-cuts but they should be rare.
Something else that bothers me is that the control flow and responsibilities of each part aren't super clear to me. I don't have an exact complaint, just that it feels a bit fuzzy and that the interface between each isn't clear.
```rust
/// Cache of events received from the mempool that predate corresponding network accounts.
/// Grouped by account prefix to allow targeted event delivery to actors upon creation.
predating_events: HashMap<NetworkAccountPrefix, IndexMap<TransactionId, Arc<MempoolEvent>>>,
```
This isn't strictly accurate. These are meant to be events that are still in flight from transactions that have not yet been committed (nor reverted).
These are replayed to an actor when it starts up so that it doesn't "miss" any events. To this end they need to be stored and replayed in chronological order, as if the actor had been present when they arrived.
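To make the intended semantics concrete, here's a minimal sketch of buffering in-flight events in arrival order and replaying them when an actor spawns. The `Prefix`/`Event` aliases are placeholders for the real `NetworkAccountPrefix`/`MempoolEvent` types, and a plain `Vec` stands in for `IndexMap`:

```rust
use std::collections::HashMap;

// Hypothetical stand-ins for the real NetworkAccountPrefix / MempoolEvent types.
type Prefix = u32;
type Event = String;

/// Buffers in-flight events per account prefix, preserving arrival order so
/// they can be replayed to a newly spawned actor as if it had been present.
#[derive(Default)]
struct InflightEvents {
    by_prefix: HashMap<Prefix, Vec<Event>>,
}

impl InflightEvents {
    /// Record an event for an account whose actor does not exist yet.
    fn push(&mut self, prefix: Prefix, event: Event) {
        self.by_prefix.entry(prefix).or_default().push(event);
    }

    /// Drain the buffered events for `prefix` in chronological order,
    /// ready to be replayed into the new actor's channel.
    fn replay(&mut self, prefix: Prefix) -> Vec<Event> {
        self.by_prefix.remove(&prefix).unwrap_or_default()
    }
}
```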
```rust
#[derive(Clone)]
struct ActorHandle {
    event_tx: mpsc::Sender<Arc<MempoolEvent>>,
    cancel_token: CancellationToken,
```
I think this is redundant. The actor should already cancel if the channel gets closed so that's essentially the same thing.
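As a sketch of why the token can be redundant (using std threads and channels rather than tokio, purely to keep the example self-contained): when the coordinator drops its sender, the actor's receive loop observes the closed channel and exits on its own, so channel closure already doubles as the shutdown signal.

```rust
use std::sync::mpsc;
use std::thread;

/// Spawn a minimal "actor" whose only shutdown mechanism is its channel:
/// `recv` returns Err once every sender is dropped, which ends the loop.
fn spawn_actor(rx: mpsc::Receiver<String>) -> thread::JoinHandle<usize> {
    thread::spawn(move || {
        let mut processed = 0;
        // Loop exits when the coordinator drops the sender; no token needed.
        while let Ok(_event) = rx.recv() {
            processed += 1;
        }
        processed
    })
}
```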
```rust
let mut interval = tokio::time::interval(self.ticker_interval);
interval.set_missed_tick_behavior(time::MissedTickBehavior::Skip);
```
I think this is no longer needed?
```rust
/// Address of the remote prover. If `None`, transactions will be proven locally, which is
/// undesirable due to the performance impact.
tx_prover_url: Option<Url>,
```
Unrelated to this PR, but I wonder if we shouldn't move this remote/local division by always taking a URL in the components themselves, and then having the node's CLI figure out whether to spawn a prover locally or not.
This would require quite a bit of work though because the prover worker cannot queue requests. So nevermind :)
```rust
// Handle actor result.
result = self.coordinator.next() => {
    result?;
},
```
I think the co-ordinator should probably never return an error? It seems like it's only on semaphore failure or channel failure, but I don't think those are valid reasons to crash the ntx builder.
```rust
let (chain_tip_header, chain_mmr) = store
    .get_latest_blockchain_data_with_retry()
    .await?
    .expect("store should contain a latest block");
let mut mempool_events = block_producer
    .subscribe_to_mempool_with_retry(chain_tip_header.block_num())
    .await
    .context("failed to subscribe to mempool events")?;

// Unlock the block-producer's block production. The block-producer is prevented from
// producing blocks until we have subscribed to mempool events.
//
// This is a temporary work-around until the ntx-builder can resync on the fly.
self.bp_checkpoint.wait().await;
```
A medium term goal is to decouple the block-producer and ntx builder. Currently they are coupled because the ntx builder must first sync to the latest committed state from the store and we need that to align with the block-producer. This is enforced by using the bp_checkpoint barrier.
I've been thinking about how we can decouple these so that the ntx builder can start up independently at any time as a separate component.
There are effectively two things that need to be sync'd:
- The chain state (latest header & MMR) which is used to align the mempool subscription.
- Existing network accounts loaded from store.
(1) is trivially solvable by simply repeating it in a loop until it succeeds, i.e. (a) read state from store, (b) subscribe; if desync'd then repeat.
(2) we can avoid entirely by creating a separate task whose goal is to fetch network accounts from the store in chronological order until it reaches the chain tip. The task sends these as a new kind of event via a channel. In this way the main loop is eventually informed of all pre-existing accounts and it can decide to spin them up or not, or maybe they're already running.
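For (1), the retry loop could be as simple as the following generic sketch; the closure stands in for the hypothetical "read state from store, then subscribe" attempt, and in a real implementation each failed attempt would back off before retrying:

```rust
/// Repeat `attempt` until it succeeds or `max_tries` is exhausted.
/// Generic sketch only: the real store / block-producer APIs differ.
fn sync_with_retry<T, E>(
    mut attempt: impl FnMut() -> Result<T, E>,
    max_tries: usize,
) -> Option<T> {
    for _ in 0..max_tries {
        if let Ok(value) = attempt() {
            return Some(value);
        }
        // A real implementation would sleep / back off here before retrying.
    }
    None
}
```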
```rust
},
// Broadcast to all actors.
MempoolEvent::TransactionsReverted(txs) => {
    self.coordinator.broadcast(event.clone()).await;
```
Ideally this wouldn't be a broadcast as this wakes up all sleeping actors.
We technically know which actors these transactions interact with. We could figure that out and be more specific here.
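One possible shape for that targeted delivery (with hypothetical `TxId`/`Prefix` stand-ins for the real types): keep an index of which account prefixes each transaction touched, and resolve a revert to just those actors instead of waking all of them.

```rust
use std::collections::{HashMap, HashSet};

// Hypothetical stand-ins for TransactionId / NetworkAccountPrefix.
type TxId = u64;
type Prefix = u32;

/// Tracks which account prefixes each transaction touched so that a revert
/// can be delivered only to the affected actors rather than broadcast.
#[derive(Default)]
struct TxIndex {
    touched: HashMap<TxId, HashSet<Prefix>>,
}

impl TxIndex {
    fn record(&mut self, tx: TxId, prefix: Prefix) {
        self.touched.entry(tx).or_default().insert(prefix);
    }

    /// The set of prefixes that need to hear about a revert of `txs`.
    fn targets(&self, txs: &[TxId]) -> HashSet<Prefix> {
        txs.iter()
            .filter_map(|tx| self.touched.get(tx))
            .flatten()
            .copied()
            .collect()
    }
}
```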
```rust
let pruned_block_height =
    (chain_state.chain_mmr.chain_length().as_usize().saturating_sub(MAX_BLOCK_COUNT))
        as u32;
chain_state.chain_mmr.prune_to(..pruned_block_height.into());
```
Remind me why chain state isn't part of self? Is it a borrowing issue?
```rust
None => {
    // There are no actors to wait for. Wait indefinitely until actors are spawned.
    std::future::pending().await
},
```
Nit: since this is so different from the rest, consider moving it out using

```rust
let Some(result) = self.actor_join_set.join_next().await else {
    return std::future::pending().await;
};
```
I've written about this a bit before #1314 (comment), but I'll repeat it here for visibility.
Some thoughts on how we can reduce the number of active account actors. This is largely orthogonal to whether we use an actor framework or just keep it simple with our current approach.
Some of these thoughts might be 💩
Self shut down
Let's allow accounts to shut themselves down. They can do this by simply exiting with `return Ok(())`. The coordinator then interprets this as a deliberate decision and does not restart the actor automatically.
Accounts could shut themselves down if they have no useful notes and haven't received an event in e.g. 5 minutes.
This technically introduces a race condition, where the actor shuts down and then an event arrives which should have kept it alive. In order to detect this, we could have the actor return the event receiver on exit, i.e. `return Ok(rx_events)`. The coordinator can then check if the receiver is empty and, if it isn't, restart the actor, which would mean it would reprocess things.
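A minimal sketch of that race-detection check (std channel instead of tokio for self-containment; note that in a real implementation the event consumed by the probe would need to be re-queued before the restart so it isn't lost):

```rust
use std::sync::mpsc;

// Hypothetical stand-in for the real MempoolEvent type.
type Event = String;

/// After an actor exits deliberately and hands its receiver back, decide
/// whether an event raced the shutdown and the actor must be restarted.
fn should_restart(rx: &mpsc::Receiver<Event>) -> bool {
    // A non-empty receiver means an event arrived after the shutdown
    // decision. (The probed event would be re-queued in a real version.)
    match rx.try_recv() {
        Ok(_) => true,
        Err(mpsc::TryRecvError::Empty) | Err(mpsc::TryRecvError::Disconnected) => false,
    }
}
```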
Blacklisting accounts
The co-ordinator can keep a hashmap of accounts that crash or return an error. Once the crash count reaches some limit they get added to a blacklist and are no longer considered. This should be fairly straightforward to add.
Standardizing account startup
I think at the moment we have separate flows for starting an account from an event, or starting an account from the store. I think we can connect these by always first getting the account state from the store. If there is no account, then the actor waits until it receives a new-account event. We can similarly use the timeout strategy here so that the actor shuts down if the new-account event doesn't arrive within a few minutes of start-up.
This means we have the same flow for all actors.
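A sketch of that unified flow, with hypothetical types, the store lookup reduced to an `Option`, and std's `recv_timeout` standing in for the start-up timeout:

```rust
use std::sync::mpsc;
use std::time::Duration;

// Hypothetical stand-in for the real account state type.
type AccountState = String;

/// Unified startup: prefer committed state from the store; otherwise wait
/// (bounded by `timeout`) for the new-account event. `None` means the
/// actor should shut itself down because neither source produced state.
fn startup_state(
    from_store: Option<AccountState>,
    rx: &mpsc::Receiver<AccountState>,
    timeout: Duration,
) -> Option<AccountState> {
    match from_store {
        Some(state) => Some(state),
        // No committed state yet: the account must arrive as an event.
        None => rx.recv_timeout(timeout).ok(),
    }
}
```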
7a61e3f to 9acd64e
Context
The Network Transaction Builder is being refactored so that each network account's responsibilities are executed as dedicated tasks.
Relates to #1056.
Some of the additions are due to file location changes (e.g. `account_state.rs`).
Changes
- `Actor` struct added to the NTX Builder stack for housing state and logic pertaining to individual network accounts.
- `Coordinator` added to the NTX Builder for coordinating `Actor`s and feeding through mempool events.