
feat: NTX Builder Actor Refactor#1435

Merged
sergerad merged 4 commits into next from ntx-refactor-2 on Dec 11, 2025

Conversation

@sergerad (Collaborator) commented Dec 8, 2025

Context

The Network Transaction Builder is being refactored so that each network account's responsibilities are executed as dedicated tasks.

Relates to #1056.

Some of the additions are due to file location changes (e.g. account_state.rs).

Changes

  • Add Actor struct to NTX Builder stack for housing state and logic pertaining to individual network accounts.
  • Add Coordinator to NTX Builder for coordinating Actors and feeding through mempool events.
  • Address edge case to handle mempool events that predate network account creation.
  • Stop using the NTB acronym in favor of NTX Builder, to avoid confusion.

@sergerad sergerad requested review from Mirko-von-Leipzig, bobbinth and drahnr and removed request for Mirko-von-Leipzig December 9, 2025 00:16
@sergerad sergerad changed the title feat: NTB Actor Refactor feat: NTX Actor Refactor Dec 9, 2025
@sergerad sergerad changed the title feat: NTX Actor Refactor feat: NTX Builder Actor Refactor Dec 9, 2025
@Mirko-von-Leipzig (Collaborator) left a comment

I've left quite a few comments on what I think next steps can/should be. @sergerad if they make sense, could you collate them into the tracking issue? If you disagree with them, we can litigate in that issue or here.

In terms of this PR, I'm happy to merge as is. There are a few potential paper-cuts but they should be rare.

Something else that bothers me is that the control flow and responsibilities of each part aren't super clear to me. I don't have an exact complaint, just that it feels a bit fuzzy and that the interface between each isn't clear.

Comment on lines +89 to +91
/// Cache of events received from the mempool that predate corresponding network accounts.
/// Grouped by account prefix to allow targeted event delivery to actors upon creation.
predating_events: HashMap<NetworkAccountPrefix, IndexMap<TransactionId, Arc<MempoolEvent>>>,

This isn't strictly accurate. These are meant to be events that are still inflight from transactions that have not yet been committed (nor reverted).

These are replayed to an actor when it starts up so that it doesn't "miss" any events. To this end they need to be stored and replayed in chronological order, as if the actor had been present when they arrived.
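A minimal sketch of the semantics described above, using a plain Vec in place of the PR's IndexMap (the PendingBuffer name and the String event type are illustrative, not from the PR): inflight events for a not-yet-started account are buffered in arrival order, dropped once their transaction commits or reverts, and drained in that same order when the actor spawns.

```rust
use std::collections::HashMap;

type TxId = u64;
type Prefix = u32;

// Illustrative buffer: events grouped per account prefix, kept in arrival
// order so a newly spawned actor can replay them as if it had been running
// when they originally arrived.
#[derive(Default)]
struct PendingBuffer {
    events: HashMap<Prefix, Vec<(TxId, String)>>,
}

impl PendingBuffer {
    // Record an inflight event for an account that has no actor yet.
    fn push(&mut self, prefix: Prefix, tx: TxId, event: String) {
        self.events.entry(prefix).or_default().push((tx, event));
    }

    // A transaction committed or reverted: its events are no longer inflight.
    fn remove_tx(&mut self, tx: TxId) {
        for evs in self.events.values_mut() {
            evs.retain(|(id, _)| *id != tx);
        }
    }

    // Actor for `prefix` is starting: hand back its events in chronological order.
    fn drain(&mut self, prefix: Prefix) -> Vec<(TxId, String)> {
        self.events.remove(&prefix).unwrap_or_default()
    }
}
```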

#[derive(Clone)]
struct ActorHandle {
    event_tx: mpsc::Sender<Arc<MempoolEvent>>,
    cancel_token: CancellationToken,

I think this is redundant. The actor should already cancel if the channel gets closed so that's essentially the same thing.
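The point can be sketched with a synchronous stand-in (std::sync::mpsc in place of tokio::sync::mpsc): the actor's receive loop already terminates when every sender is dropped, which carries the same shutdown signal a separate cancel token would.

```rust
use std::sync::mpsc;
use std::thread;

// Sketch, not the PR's code: an actor loop that shuts down purely via channel
// closure, making an extra CancellationToken redundant.
fn spawn_actor(rx: mpsc::Receiver<String>) -> thread::JoinHandle<usize> {
    thread::spawn(move || {
        let mut processed = 0;
        // `recv` returns Err once all senders are gone; the loop then ends,
        // which is exactly the shutdown the cancel token would have signaled.
        while let Ok(_event) = rx.recv() {
            processed += 1;
        }
        processed
    })
}
```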

Comment on lines +141 to +142
let mut interval = tokio::time::interval(self.ticker_interval);
interval.set_missed_tick_behavior(time::MissedTickBehavior::Skip);

I think this is no longer needed?

Comment on lines +76 to +78
/// Address of the remote prover. If `None`, transactions will be proven locally, which is
/// undesirable due to the performance impact.
tx_prover_url: Option<Url>,

Unrelated to this PR, but I wonder if we shouldn't move this remote/local division by always taking a URL in the components themselves, and then having the node's CLI figure out whether to spawn a prover locally or not.

This would require quite a bit of work though because the prover worker cannot queue requests. So nevermind :)

Comment on lines +168 to +171
// Handle actor result.
result = self.coordinator.next() => {
    result?;
},

I think the coordinator should probably never return an error? It seems like it's only on semaphore failure or channel failure, but I don't think those are valid reasons to crash the NTX builder.

Comment on lines +126 to +139
let (chain_tip_header, chain_mmr) = store
    .get_latest_blockchain_data_with_retry()
    .await?
    .expect("store should contain a latest block");
let mut mempool_events = block_producer
    .subscribe_to_mempool_with_retry(chain_tip_header.block_num())
    .await
    .context("failed to subscribe to mempool events")?;

// Unlock the block-producer's block production. The block-producer is prevented from
// producing blocks until we have subscribed to mempool events.
//
// This is a temporary work-around until the ntx-builder can resync on the fly.
self.bp_checkpoint.wait().await;

A medium term goal is to decouple the block-producer and ntx builder. Currently they are coupled because the ntx builder must first sync to the latest committed state from the store and we need that to align with the block-producer. This is enforced by using the bp_checkpoint barrier.

I've been thinking about how we can decouple these so that the ntx builder can start up independently at any time as a separate component.

There are effectively two things that need to be sync'd:

  1. The chain state (latest header & MMR) which is used to align the mempool subscription.
  2. Existing network accounts loaded from store.

(1) is trivially solvable by simply repeating it in a loop until it succeeds, i.e. (a) read state from the store, (b) subscribe; if desync'd, repeat.

(2) we can avoid entirely by creating a separate task whose goal is to fetch network accounts from the store in chronological order until it reaches the chain tip. The task sends these as a new kind of event via a channel. In this way the main loop is eventually informed of all pre-existing accounts and can decide whether to spin them up, or maybe they're already running.
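The retry loop in (1) can be sketched abstractly; the closures here are stand-ins for the store read and mempool subscription, not the node's real APIs:

```rust
// Sketch of the "repeat until aligned" startup: re-read the chain tip from
// the store and retry the subscription whenever the two have desynced.
fn sync_until_aligned(
    mut read_chain_tip: impl FnMut() -> u64,
    mut subscribe_at: impl FnMut(u64) -> Result<u64, ()>, // Err = desynced
) -> u64 {
    loop {
        let tip = read_chain_tip();
        match subscribe_at(tip) {
            Ok(sub) => return sub, // subscription aligned with store state
            Err(()) => continue,   // chain advanced meanwhile; re-read and retry
        }
    }
}
```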

},
// Broadcast to all actors.
MempoolEvent::TransactionsReverted(txs) => {
    self.coordinator.broadcast(event.clone()).await;

Ideally this wouldn't be a broadcast as this wakes up all sleeping actors.

We technically know which actors these transactions interact with. We could figure that out and be more specific here.
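One way to sketch that targeting (the TxIndex name and types are illustrative, not from the PR): record which account prefixes each inflight transaction touched, then on a revert notify only those actors instead of waking every sleeping one.

```rust
use std::collections::{HashMap, HashSet};

type TxId = u64;
type Prefix = u32;

// Illustrative index from inflight transaction to the account prefixes it
// touches, enabling targeted delivery of TransactionsReverted events.
#[derive(Default)]
struct TxIndex {
    touched: HashMap<TxId, HashSet<Prefix>>,
}

impl TxIndex {
    // Record the prefixes a transaction interacts with when it is added.
    fn record(&mut self, tx: TxId, prefixes: impl IntoIterator<Item = Prefix>) {
        self.touched.entry(tx).or_default().extend(prefixes);
    }

    // On TransactionsReverted, compute the minimal set of actors to notify.
    fn targets_for_revert(&mut self, txs: &[TxId]) -> HashSet<Prefix> {
        txs.iter()
            .filter_map(|tx| self.touched.remove(tx))
            .flatten()
            .collect()
    }
}
```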

let pruned_block_height =
    (chain_state.chain_mmr.chain_length().as_usize().saturating_sub(MAX_BLOCK_COUNT))
        as u32;
chain_state.chain_mmr.prune_to(..pruned_block_height.into());

Remind me why chain state isn't part of self? Is it a borrowing issue?

Comment on lines +211 to +214
None => {
    // There are no actors to wait for. Wait indefinitely until actors are spawned.
    std::future::pending().await
},

Nit: since this is so different from the rest, consider moving it out using

let Some(result) = self.actor_join_set.join_next().await else {
    return std::future::pending().await
};


I've written about this a bit before #1314 (comment), but I'll repeat it here for visibility.


Some thoughts on how we can reduce the number of active account actors. This is largely orthogonal to whether we use an actor framework or just keep it simple with our current approach.

Some of these thoughts might be 💩

Self shut down

Let's allow accounts to shut themselves down. They can do this by simply exiting with return Ok(()). The coordinator then interprets this as a deliberate decision and does not restart the actor automatically.

Accounts could shut themselves down if they have no useful notes and haven't received an event in e.g. 5 minutes.

This technically introduces a race condition, where the actor shuts down and then an event arrives which should have kept it alive. In order to detect this, we could have the actor return the event receiver on exit, i.e. return Ok(rx_events). The coordinator can then check whether the receiver is empty and, if it isn't, restart the actor so that it reprocesses those events.
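That check can be sketched with std::sync::mpsc standing in for tokio's channel (the function name is illustrative): the coordinator peeks the returned receiver non-blockingly and only restarts the actor if an event raced in.

```rust
use std::sync::mpsc;

// Sketch of the race-detection idea: an idle actor exits and hands its event
// receiver back; the coordinator restarts it only if events slipped in after
// the shutdown decision. Returns Some(rx) when a restart is warranted.
fn coordinator_handle_exit(rx: mpsc::Receiver<String>) -> Option<mpsc::Receiver<String>> {
    match rx.try_recv() {
        // An event raced in: hand the receiver back so the actor can be
        // restarted. (A real implementation would also re-queue the event
        // consumed by this peek; omitted for brevity.)
        Ok(_event) => Some(rx),
        // Truly idle (or channel closed): accept the shutdown.
        Err(mpsc::TryRecvError::Empty) | Err(mpsc::TryRecvError::Disconnected) => None,
    }
}
```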

Blacklisting accounts

The coordinator can keep a hashmap of accounts that crash or return an error. Once the crash count reaches some limit, they get added to a blacklist and are no longer considered. This should be fairly straightforward to add.
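A minimal sketch of that bookkeeping (the CrashTracker name and the threshold of 3 are illustrative, not from the PR):

```rust
use std::collections::HashMap;

type Prefix = u32;

// Illustrative crash-count blacklist: after MAX_CRASHES failures an account's
// actor is no longer restarted by the coordinator.
const MAX_CRASHES: u32 = 3;

#[derive(Default)]
struct CrashTracker {
    crashes: HashMap<Prefix, u32>,
}

impl CrashTracker {
    // Record a crash; returns true once the account should be blacklisted.
    fn record_crash(&mut self, prefix: Prefix) -> bool {
        let count = self.crashes.entry(prefix).or_insert(0);
        *count += 1;
        *count >= MAX_CRASHES
    }

    fn is_blacklisted(&self, prefix: Prefix) -> bool {
        self.crashes.get(&prefix).map_or(false, |c| *c >= MAX_CRASHES)
    }
}
```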

Standardizing account startup

I think at the moment we have separate flows for starting an account from an event and starting an account from the store. I think we can connect these by always first getting the account state from the store. If there is no account, then the actor waits until it receives a new-account event. We can similarly use the timeout strategy here so that the actor shuts down if the new-account event doesn't arrive within a few minutes after startup.

This means we have the same flow for all actors.

@sergerad sergerad merged commit 0428fa9 into next Dec 11, 2025
6 checks passed
@sergerad sergerad deleted the ntx-refactor-2 branch December 11, 2025 19:22
This was referenced Dec 12, 2025
