Commit 8c69d57

Pause sync when EE is offline (#3428)
## Issue Addressed

#3032

## Proposed Changes

Pause sync when ee is offline. Changes include three main parts:
- Online/offline notification system
- Pause sync
- Resume sync

#### Online/offline notification system
- The engine state is now guarded behind a new struct `State` that ensures every change is correctly notified. Notifications are only sent if the state changes. The new `State` is behind a `RwLock` (as before) as the synchronization mechanism.
- The actual notification channel is a [tokio::sync::watch](https://docs.rs/tokio/latest/tokio/sync/watch/index.html) which ensures only the last value is in the receiver channel. This way we don't need to worry about message order etc.
- Sync waits for state changes concurrently with normal messages.

#### Pause Sync
Sync has four components, pausing is done differently in each:
- **Block lookups**: Disabled while in this state. We drop current requests and don't search for new blocks. Block lookups are infrequent and I don't think it's worth the extra logic of keeping these and delaying processing. If we later see that this is required, we can add it.
- **Parent lookups**: Disabled while in this state. We drop current requests and don't search for new parents. Parent lookups are even less frequent and I don't think it's worth the extra logic of keeping these and delaying processing. If we later see that this is required, we can add it.
- **Range**: Chains don't send batches for processing to the beacon processor. This is easily done by guarding the channel to the beacon processor and giving it access only if the ee is responsive. I find this the simplest and most powerful approach since we don't need to deal with new sync states, and chain segments that are added while the ee is offline will follow the same logic without needing to synchronize a shared state among those.
  Another advantage of passive pause vs active pause is that we can still keep track of active advertised chain segments so that on resume we don't need to re-evaluate all our peers.
- **Backfill**: Not affected by ee states, we don't pause.

#### Resume Sync
- **Block lookups**: Enabled again.
- **Parent lookups**: Enabled again.
- **Range**: Active resume. Since the only real pause range does is not sending batches for processing, resume makes all chains that are holding ready-for-processing batches send them.
- **Backfill**: Not affected by ee states, no need to resume.

## Additional Info

**QUESTION**: Originally I made this to notify and change on synced state, but @pawanjay176 in talks with @paulhauner concluded we only need to check online/offline states. The upcheck function mentions extra checks to have a very up to date sync status to aid the networking stack. However, the only need the networking stack would have is this one. I added a TODO to review if the extra check can be removed.

Next gen of #3094

Will work best with #3439

Co-authored-by: Pawan Dhananjay <pawandhananjay@gmail.com>
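The passive pause described for range sync can be illustrated with a small sketch. This is not the code from this commit: it uses only the standard library, and `Batch`, `GuardedSender`, `Chain` and `pause_then_resume` are hypothetical names chosen for illustration. The idea it tries to show is that the chain never learns about a "paused" state; it simply cannot obtain the channel while the engine is offline, so its ready batches stay where they are until a resume.

```rust
use std::sync::mpsc::{channel, Sender};

/// Hypothetical work item standing in for a batch that is ready for processing.
#[derive(Debug, PartialEq)]
struct Batch(u64);

/// Hypothetical guard around the channel to the processor.
struct GuardedSender {
    sender: Sender<Batch>,
    engine_online: bool,
}

impl GuardedSender {
    /// Hands out the sender only while the engine is online.
    fn processor(&self) -> Option<&Sender<Batch>> {
        if self.engine_online {
            Some(&self.sender)
        } else {
            None
        }
    }
}

/// Hypothetical chain that keeps its ready batches while sync is paused.
struct Chain {
    ready: Vec<Batch>,
}

impl Chain {
    /// Sends all ready batches if the guard allows it and returns how many were sent.
    fn process_ready(&mut self, guard: &GuardedSender) -> usize {
        let tx = match guard.processor() {
            Some(tx) => tx,
            None => return 0, // paused: keep the batches for later
        };
        let mut sent = 0;
        for batch in self.ready.drain(..) {
            // A send error means the receiver is gone; this sketch just skips the count.
            if tx.send(batch).is_ok() {
                sent += 1;
            }
        }
        sent
    }
}

/// Runs a pause followed by a resume.
/// Returns (sent while offline, sent after resume, batches received).
fn pause_then_resume() -> (usize, usize, Vec<Batch>) {
    let (sender, receiver) = channel();
    let mut guard = GuardedSender { sender, engine_online: false };
    let mut chain = Chain { ready: vec![Batch(1), Batch(2)] };

    let while_offline = chain.process_ready(&guard);
    guard.engine_online = true; // the engine came back online
    let after_resume = chain.process_ready(&guard);

    (while_offline, after_resume, receiver.try_iter().collect())
}
```

In this sketch the active resume is just a second call to `process_ready` once the guard reports the engine as online, which matches the description of chains sending the batches they were holding.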
1 parent aab4a8d commit 8c69d57

14 files changed: +575 -329 lines changed

Cargo.lock

Lines changed: 1 addition & 0 deletions
Some generated files are not rendered by default.

beacon_node/beacon_chain/src/lib.rs

Lines changed: 1 addition & 0 deletions
@@ -59,6 +59,7 @@ pub use block_verification::{BlockError, ExecutionPayloadError, GossipVerifiedBl
 pub use canonical_head::{CachedHead, CanonicalHead, CanonicalHeadRwLock};
 pub use eth1_chain::{Eth1Chain, Eth1ChainBackend};
 pub use events::ServerSentEventHandler;
+pub use execution_layer::EngineState;
 pub use fork_choice::{ExecutionStatus, ForkchoiceUpdateParameters};
 pub use metrics::scrape_for_metrics;
 pub use parking_lot;

beacon_node/execution_layer/Cargo.toml

Lines changed: 1 addition & 0 deletions
@@ -43,4 +43,5 @@ fork_choice = { path = "../../consensus/fork_choice" }
 mev-build-rs = {git = "https://github.com/ralexstokes/mev-rs", tag = "v0.2.0"}
 ethereum-consensus = {git = "https://github.com/ralexstokes/ethereum-consensus"}
 ssz-rs = {git = "https://github.com/ralexstokes/ssz-rs"}
+tokio-stream = { version = "0.1.9", features = [ "sync" ] }
 strum = "0.24.0"

beacon_node/execution_layer/src/engines.rs

Lines changed: 109 additions & 26 deletions
@@ -9,7 +9,8 @@ use slog::{debug, error, info, Logger};
 use std::future::Future;
 use std::sync::Arc;
 use task_executor::TaskExecutor;
-use tokio::sync::{Mutex, RwLock};
+use tokio::sync::{watch, Mutex, RwLock};
+use tokio_stream::wrappers::WatchStream;
 use types::{Address, ExecutionBlockHash, Hash256};
 
 /// The number of payload IDs that will be stored for each `Engine`.
@@ -18,14 +19,74 @@ use types::{Address, ExecutionBlockHash, Hash256};
 const PAYLOAD_ID_LRU_CACHE_SIZE: usize = 512;
 
 /// Stores the remembered state of a engine.
-#[derive(Copy, Clone, PartialEq, Debug)]
-enum EngineState {
+#[derive(Copy, Clone, PartialEq, Debug, Eq, Default)]
+enum EngineStateInternal {
     Synced,
+    #[default]
     Offline,
     Syncing,
     AuthFailed,
 }
 
+/// A subset of the engine state to inform other services if the engine is online or offline.
+#[derive(Debug, Clone, PartialEq, Eq, Copy)]
+pub enum EngineState {
+    Online,
+    Offline,
+}
+
+impl From<EngineStateInternal> for EngineState {
+    fn from(state: EngineStateInternal) -> Self {
+        match state {
+            EngineStateInternal::Synced | EngineStateInternal::Syncing => EngineState::Online,
+            EngineStateInternal::Offline | EngineStateInternal::AuthFailed => EngineState::Offline,
+        }
+    }
+}
+
+/// Wrapper structure that ensures changes to the engine state are correctly reported to watchers.
+struct State {
+    /// The actual engine state.
+    state: EngineStateInternal,
+    /// Notifier to watch the engine state.
+    notifier: watch::Sender<EngineState>,
+}
+
+impl std::ops::Deref for State {
+    type Target = EngineStateInternal;
+
+    fn deref(&self) -> &Self::Target {
+        &self.state
+    }
+}
+
+impl Default for State {
+    fn default() -> Self {
+        let state = EngineStateInternal::default();
+        let (notifier, _receiver) = watch::channel(state.into());
+        State { state, notifier }
+    }
+}
+
+impl State {
+    // Updates the state and notifies all watchers if the state has changed.
+    pub fn update(&mut self, new_state: EngineStateInternal) {
+        self.state = new_state;
+        self.notifier.send_if_modified(|last_state| {
+            let changed = *last_state != new_state.into(); // notify conditionally
+            *last_state = new_state.into(); // update the state unconditionally
+            changed
+        });
+    }
+
+    /// Gives access to a channel containing whether the last state is online.
+    ///
+    /// This can be called several times.
+    pub fn watch(&self) -> WatchStream<EngineState> {
+        self.notifier.subscribe().into()
+    }
+}
+
 #[derive(Copy, Clone, PartialEq, Debug)]
 pub struct ForkChoiceState {
     pub head_block_hash: ExecutionBlockHash,
@@ -53,10 +114,10 @@ pub enum EngineError {
 pub struct Engine {
     pub api: HttpJsonRpc,
     payload_id_cache: Mutex<LruCache<PayloadIdCacheKey, PayloadId>>,
-    state: RwLock<EngineState>,
-    pub latest_forkchoice_state: RwLock<Option<ForkChoiceState>>,
-    pub executor: TaskExecutor,
-    pub log: Logger,
+    state: RwLock<State>,
+    latest_forkchoice_state: RwLock<Option<ForkChoiceState>>,
+    executor: TaskExecutor,
+    log: Logger,
 }
 
 impl Engine {
@@ -65,13 +126,20 @@ impl Engine {
         Self {
             api,
             payload_id_cache: Mutex::new(LruCache::new(PAYLOAD_ID_LRU_CACHE_SIZE)),
-            state: RwLock::new(EngineState::Offline),
+            state: Default::default(),
             latest_forkchoice_state: Default::default(),
             executor,
             log: log.clone(),
         }
     }
 
+    /// Gives access to a channel containing the last engine state.
+    ///
+    /// This can be called several times.
+    pub async fn watch_state(&self) -> WatchStream<EngineState> {
+        self.state.read().await.watch()
+    }
+
     pub async fn get_payload_id(
         &self,
         head_block_hash: ExecutionBlockHash,
@@ -165,17 +233,16 @@ impl Engine {
 
     /// Returns `true` if the engine has a "synced" status.
     pub async fn is_synced(&self) -> bool {
-        *self.state.read().await == EngineState::Synced
+        **self.state.read().await == EngineStateInternal::Synced
     }
 
     /// Run the `EngineApi::upcheck` function if the node's last known state is not synced. This
     /// might be used to recover the node if offline.
     pub async fn upcheck(&self) {
-        let state: EngineState = match self.api.upcheck().await {
+        let state: EngineStateInternal = match self.api.upcheck().await {
             Ok(()) => {
                 let mut state = self.state.write().await;
-
-                if *state != EngineState::Synced {
+                if **state != EngineStateInternal::Synced {
                     info!(
                         self.log,
                         "Execution engine online";
@@ -189,14 +256,13 @@ impl Engine {
                         "Execution engine online";
                     );
                 }
-
-                *state = EngineState::Synced;
-                *state
+                state.update(EngineStateInternal::Synced);
+                **state
             }
             Err(EngineApiError::IsSyncing) => {
                 let mut state = self.state.write().await;
-                *state = EngineState::Syncing;
-                *state
+                state.update(EngineStateInternal::Syncing);
+                **state
             }
             Err(EngineApiError::Auth(err)) => {
                 error!(
@@ -206,8 +272,8 @@ impl Engine {
                 );
 
                 let mut state = self.state.write().await;
-                *state = EngineState::AuthFailed;
-                *state
+                state.update(EngineStateInternal::AuthFailed);
+                **state
             }
             Err(e) => {
                 error!(
@@ -217,8 +283,8 @@ impl Engine {
                 );
 
                 let mut state = self.state.write().await;
-                *state = EngineState::Offline;
-                *state
+                state.update(EngineStateInternal::Offline);
+                **state
             }
         };
 
@@ -244,12 +310,10 @@ impl Engine {
             Ok(result) => {
                 // Take a clone *without* holding the read-lock since the `upcheck` function will
                 // take a write-lock.
-                let state: EngineState = *self.state.read().await;
+                let state: EngineStateInternal = **self.state.read().await;
 
-                // If this request just returned successfully but we don't think this node is
-                // synced, check to see if it just became synced. This helps to ensure that the
-                // networking stack can get fast feedback about a synced engine.
-                if state != EngineState::Synced {
+                // Keep an up to date engine state.
+                if state != EngineStateInternal::Synced {
                     // Spawn the upcheck in another task to avoid slowing down this request.
                     let inner_self = self.clone();
                     self.executor.spawn(
@@ -293,3 +357,22 @@ impl PayloadIdCacheKey {
         }
     }
 }
+
+#[cfg(test)]
+mod tests {
+    use super::*;
+    use tokio_stream::StreamExt;
+
+    #[tokio::test]
+    async fn test_state_notifier() {
+        let mut state = State::default();
+        let initial_state: EngineState = state.state.into();
+        assert_eq!(initial_state, EngineState::Offline);
+        state.update(EngineStateInternal::Synced);
+
+        // a watcher that arrives after the first update.
+        let mut watcher = state.watch();
+        let new_state = watcher.next().await.expect("Last state is always present");
+        assert_eq!(new_state, EngineState::Online);
+    }
+}
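The `State::update` method in this file writes the internal state on every call but only notifies watchers when the coarser online/offline view changes. The sketch below models that rule with the standard library only, so it can be read without tokio. It is an approximation under stated assumptions: `StateModel` and `run_updates` are hypothetical names, and a plain counter stands in for the `watch` channel, so it does not reproduce the "only the last value is kept" behaviour of the real channel.

```rust
/// Same four internal states as in the diff.
#[derive(Copy, Clone, Debug, PartialEq, Eq)]
enum EngineStateInternal {
    Synced,
    Offline,
    Syncing,
    AuthFailed,
}

/// Coarse view exposed to other services.
#[derive(Copy, Clone, Debug, PartialEq, Eq)]
enum EngineState {
    Online,
    Offline,
}

impl From<EngineStateInternal> for EngineState {
    fn from(state: EngineStateInternal) -> Self {
        match state {
            EngineStateInternal::Synced | EngineStateInternal::Syncing => EngineState::Online,
            EngineStateInternal::Offline | EngineStateInternal::AuthFailed => EngineState::Offline,
        }
    }
}

/// Hypothetical std-only stand-in for `State`: a counter replaces the watch channel.
struct StateModel {
    state: EngineStateInternal,
    published: EngineState,
    notifications: u32,
}

impl StateModel {
    fn new() -> Self {
        // Start offline, like the default in the diff.
        let state = EngineStateInternal::Offline;
        Self { state, published: state.into(), notifications: 0 }
    }

    fn update(&mut self, new_state: EngineStateInternal) {
        // The internal state is always written.
        self.state = new_state;
        // Watchers are only notified when the online/offline view changes.
        let new_public: EngineState = new_state.into();
        if self.published != new_public {
            self.published = new_public;
            self.notifications += 1;
        }
    }
}

/// Applies the updates in order and returns (notification count, final internal state).
fn run_updates(updates: &[EngineStateInternal]) -> (u32, EngineStateInternal) {
    let mut model = StateModel::new();
    for &update in updates {
        model.update(update);
    }
    (model.notifications, model.state)
}
```

Under this model a move from `Synced` to `Syncing` changes the internal state without waking any watcher, since both map to `EngineState::Online`.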

beacon_node/execution_layer/src/lib.rs

Lines changed: 9 additions & 1 deletion
@@ -10,8 +10,8 @@ use builder_client::BuilderHttpClient;
 use engine_api::Error as ApiError;
 pub use engine_api::*;
 pub use engine_api::{http, http::deposit_methods, http::HttpJsonRpc};
-pub use engines::ForkChoiceState;
 use engines::{Engine, EngineError};
+pub use engines::{EngineState, ForkChoiceState};
 use fork_choice::ForkchoiceUpdateParameters;
 use lru::LruCache;
 use payload_status::process_payload_status;
@@ -31,6 +31,7 @@ use tokio::{
     sync::{Mutex, MutexGuard, RwLock},
     time::sleep,
 };
+use tokio_stream::wrappers::WatchStream;
 use types::{
     BlindedPayload, BlockType, ChainSpec, Epoch, ExecPayload, ExecutionBlockHash, ForkName,
     ProposerPreparationData, PublicKeyBytes, SignedBeaconBlock, Slot,
@@ -286,6 +287,13 @@ impl<T: EthSpec> ExecutionLayer<T> {
         self.inner.execution_blocks.lock().await
     }
 
+    /// Gives access to a channel containing if the last engine state is online or not.
+    ///
+    /// This can be called several times.
+    pub async fn get_responsiveness_watch(&self) -> WatchStream<EngineState> {
+        self.engine().watch_state().await
+    }
+
     /// Note: this function returns a mutex guard, be careful to avoid deadlocks.
     async fn proposer_preparation_data(
         &self,
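The responsiveness watch added here is what sync consumes. The sync-side files are not shown in this excerpt, so the following is only a sketch of the lookup behaviour described in the commit message (drop current requests when the engine goes offline, ignore new searches until it is online again). `Lookups`, `search_block` and `on_engine_state` are hypothetical names, block roots are simplified to integers, and the sketch assumes sync starts from the offline state.

```rust
use std::collections::HashSet;

#[derive(Copy, Clone, Debug, PartialEq, Eq)]
enum EngineState {
    Online,
    Offline,
}

/// Hypothetical stand-in for the block and parent lookup component of sync.
struct Lookups {
    engine: EngineState,
    /// Roots currently being looked up, simplified to integers.
    pending: HashSet<u64>,
}

impl Lookups {
    fn new() -> Self {
        // Assumption: start offline, matching the default engine state.
        Self { engine: EngineState::Offline, pending: HashSet::new() }
    }

    /// Returns `true` if a new lookup was started.
    fn search_block(&mut self, root: u64) -> bool {
        if self.engine == EngineState::Offline {
            return false; // lookups are disabled while the engine is offline
        }
        self.pending.insert(root)
    }

    /// Reacts to a value read from the engine state watcher.
    fn on_engine_state(&mut self, new_state: EngineState) {
        if new_state == EngineState::Offline {
            // Pause: drop current requests instead of keeping them for later.
            self.pending.clear();
        }
        self.engine = new_state;
    }
}
```

Dropping the pending set on `Offline` mirrors the trade-off stated in the commit message: lookups are infrequent, so the sketch does not keep them around to replay on resume.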
