[RFC] Simplify checkpoint executor (MystenLabs#7646)

This PR does a few simplifications to the checkpoint executor: 1. We don't need `cold_start` as a member field of the executor. The logic of `run_epoch` should be always the same regardless of how/when we run it because it's always focusing on the current epoch specified by the epoch_store. Specifically, we always enter the schedule loop, and if we have executed the last checkpoint of the epoch, we terminate the loop and return the next epoch committee. 2. We don't need `highest_scheduled_seq_num` as a member field. Sepcifically, `highest_scheduled_seq_num` can be updated each time we finish scheduling a series of checkpoints, and passed around as a variable. There is no need for a global field member. 3. We don't need `latest_synced_checkpoint`. This is part of a larger simplification: a lot of the exiting code complexity comes from the desire to optimize db reads to get the latest synced checkpoint. The overhead of one read per checkpoint is negligible at the grand scheme. We don't need to rely on the channel between state_sync and checkpoint_executor to get the latest synced checkpoint. Instead we could use the channel purely as a notification channel, and when we wake up, we read from the db to get the latest synced checkpoint. Now, if we really really want to optimize this, a cleaner way would be that we read every synced checkpoint into memory, and schedule them as we go. Let me know if this simplification makes sense.
thedave12345 · Jan 25, 2023 · d515c54 · d515c54
1 parent 5f46e5a
commit d515c54
Show file tree

Hide file tree

Showing 4 changed files with 139 additions and 275 deletions.
diff --git a/crates/sui-core/src/checkpoints/checkpoint_executor/metrics.rs b/crates/sui-core/src/checkpoints/checkpoint_executor/metrics.rs
@@ -11,8 +11,7 @@ use std::sync::Arc;
 pub struct CheckpointExecutorMetrics {
     pub last_executed_checkpoint: IntGauge,
     pub checkpoint_exec_errors: IntCounter,
-    pub checkpoint_exec_recv_channel_overflow: IntCounter,
-    pub current_local_epoch: IntGauge,
+    pub checkpoint_exec_epoch: IntGauge,
     pub checkpoint_transaction_count: Histogram,
 }
 
@@ -31,23 +30,17 @@ impl CheckpointExecutorMetrics {
                 registry
             )
             .unwrap(),
-            checkpoint_exec_recv_channel_overflow: register_int_counter_with_registry!(
-                "checkpoint_exec_recv_channel_overflow",
-                "Count of the number of times the recv channel from StateSync to CheckpointExecutor has been overflowed",
-                registry
-            )
-            .unwrap(),
-            current_local_epoch: register_int_gauge_with_registry!(
+            checkpoint_exec_epoch: register_int_gauge_with_registry!(
                 "current_local_epoch",
-                "Current local epoch sequence number",
+                "Current epoch number in the checkpoint executor",
                 registry
             )
             .unwrap(),
             checkpoint_transaction_count: Histogram::new_in_registry(
                 "checkpoint_transaction_count",
                 "Number of transactions in the checkpoint",
-                registry
-            )
+                registry,
+            ),
         };
         Arc::new(this)
     }