Skip to content

Commit

Permalink
Migrate Narwhal failpoints to simtest, add more (MystenLabs#10458)
Browse files Browse the repository at this point in the history
## Description 

- Adds failpoints for all storage operations and inbound RPCs.
- Adds delays in Certifier.

## Test Plan 

Is tests.
  • Loading branch information
aschran authored and wlmyng committed Apr 6, 2023
1 parent 300051e commit 7376112
Show file tree
Hide file tree
Showing 18 changed files with 106 additions and 115 deletions.
25 changes: 6 additions & 19 deletions Cargo.lock

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

37 changes: 35 additions & 2 deletions crates/sui-benchmark/tests/simtest.rs
Original file line number Diff line number Diff line change
Expand Up @@ -4,7 +4,7 @@
#[cfg(msim)]
mod test {

use rand::{thread_rng, Rng};
use rand::{distributions::uniform::SampleRange, thread_rng, Rng};
use std::str::FromStr;
use std::sync::{Arc, Mutex};
use std::time::{Duration, Instant};
Expand Down Expand Up @@ -137,8 +137,26 @@ mod test {
}
}

async fn delay_failpoint<R>(range_ms: R, probability: f64)
where
R: SampleRange<u64>,
{
let duration = {
let mut rng = thread_rng();
if rng.gen_range(0.0..1.0) < probability {
info!("Matched probability threshold for delay failpoint. Delaying...");
Some(Duration::from_millis(rng.gen_range(range_ms)))
} else {
None
}
};
if let Some(duration) = duration {
tokio::time::sleep(duration).await;
}
}

#[sim_test(config = "test_config()")]
async fn test_simulated_load_reconfig_crashes() {
async fn test_simulated_load_reconfig_with_crashes_and_delays() {
sui_protocol_config::ProtocolConfig::poison_get_for_min_version();
let test_cluster = build_test_cluster(4, 1000).await;

Expand Down Expand Up @@ -169,6 +187,21 @@ mod test {
handle_failpoint(dead_validator.clone(), client_node, 0.01);
}
});

// Narwhal fail points.
let dead_validator = dead_validator_orig.clone();
register_fail_points(
&[
"narwhal-rpc-response",
"narwhal-store-before-write",
"narwhal-store-after-write",
],
move || {
handle_failpoint(dead_validator.clone(), client_node, 0.001);
},
);
register_fail_point_async("narwhal-delay", || delay_failpoint(10..20, 0.001));

test_simulated_load(test_cluster, 120).await;
}

Expand Down
6 changes: 2 additions & 4 deletions crates/workspace-hack/Cargo.toml
Original file line number Diff line number Diff line change
Expand Up @@ -204,8 +204,7 @@ ethnum = { version = "1", default-features = false }
event-listener = { version = "2", default-features = false }
expect-test = { version = "1", default-features = false }
eyre = { version = "0.6" }
fail-9fbad63c4bcf4a8f = { package = "fail", version = "0.4", default-features = false }
fail-d8f496e17d97b5cb = { package = "fail", version = "0.5", default-features = false }
fail = { version = "0.4", default-features = false }
fallible-iterator = { version = "0.2" }
fastcrypto = { git = "https://github.com/MystenLabs/fastcrypto", rev = "84fd7c7428c5f59185aecc56a2e0a006e8e07de1", features = ["copy_key"] }
fastcrypto-zkp = { git = "https://github.com/MystenLabs/fastcrypto", rev = "84fd7c7428c5f59185aecc56a2e0a006e8e07de1", default-features = false }
Expand Down Expand Up @@ -896,8 +895,7 @@ ethnum = { version = "1", default-features = false }
event-listener = { version = "2", default-features = false }
expect-test = { version = "1", default-features = false }
eyre = { version = "0.6" }
fail-9fbad63c4bcf4a8f = { package = "fail", version = "0.4", default-features = false }
fail-d8f496e17d97b5cb = { package = "fail", version = "0.5", default-features = false }
fail = { version = "0.4", default-features = false }
fallible-iterator = { version = "0.2" }
fastcrypto = { git = "https://github.com/MystenLabs/fastcrypto", rev = "84fd7c7428c5f59185aecc56a2e0a006e8e07de1", features = ["copy_key"] }
fastcrypto-derive = { git = "https://github.com/MystenLabs/fastcrypto", rev = "84fd7c7428c5f59185aecc56a2e0a006e8e07de1", default-features = false }
Expand Down
13 changes: 0 additions & 13 deletions narwhal/benchmark/README.md
Original file line number Diff line number Diff line change
Expand Up @@ -348,16 +348,3 @@ plot_params = {
The first graph ('latency') plots the latency versus the throughput. It shows that the latency is low until a fairly neat threshold after which it drastically increases. Determining this threshold is crucial to understanding the limits of the system.

Another challenge is comparing apples-to-apples between different deployments of the system. The challenge here is again that latency and throughput are interdependent, as a result a throughput/number of nodes chart could be tricky to produce fairly. The way to do it is to define a maximum latency and measure the throughput at this point instead of simply pushing every system to its peak throughput (where latency is meaningless). The second graph ('tps') plots the maximum achievable throughput under a maximum latency for different numbers of nodes.

## Benchmarks with failpoints

[Fail points](https://docs.rs/failpoints/latest/failpoints/) are code instrumentations that allow errors and other behavior to be injected dynamically at runtime, primarily for testing purposes. Fail points are flexible and can be configured to exhibit a variety of behavior, including panics, early returns, and sleeping. They can be controlled both programmatically and via the environment, and can be triggered conditionally and probabilistically.

Search the code base for `fail_point!()` macro to see the list instrumented failpoints.

Example usage

```
$ FAILPOINTS='rpc-response-delay=5%sleep(10000);request-batch=5%return;report-our-batch=5%return;request-vote=5%return;certificate-store-panic=.01%return;certificate-store=5%return' \
fab failpoints
```
2 changes: 1 addition & 1 deletion narwhal/network/Cargo.toml
Original file line number Diff line number Diff line change
Expand Up @@ -16,6 +16,7 @@ parking_lot = "0.12.1"
quinn-proto = "^0.9.2"
prometheus = "0.13.3"
rand = { version = "0.8.5", features = ["small_rng"] }
sui-macros = { path = "../../crates/sui-macros" }
tokio = { workspace = true, features = ["rt", "net", "sync", "macros", "time"] }
tracing = "0.1.36"
types = { path = "../types", package = "narwhal-types" }
Expand All @@ -31,7 +32,6 @@ anyhow = "1.0.65"
axum.workspace = true
axum-server = "0.4.2"
tower = "0.4.13"
fail = "0.5.1"

[dev-dependencies]
bincode = "1.3.3"
Expand Down
8 changes: 2 additions & 6 deletions narwhal/network/src/failpoints.rs
Original file line number Diff line number Diff line change
@@ -1,7 +1,7 @@
// Copyright (c) Mysten Labs, Inc.
// SPDX-License-Identifier: Apache-2.0
use anemo_tower::callback::{MakeCallbackHandler, ResponseHandler};
use fail::fail_point;
use sui_macros::fail_point;

#[derive(Clone, Default)]
pub struct FailpointsMakeCallbackHandler {}
Expand All @@ -24,11 +24,7 @@ pub struct FailpointsResponseHandler {}

impl ResponseHandler for FailpointsResponseHandler {
fn on_response(self, _response: &anemo::Response<bytes::Bytes>) {
// TODO: Use tokio::sleep() instead of built in sleep()/delay()
// Warning: if this failpoint is used with the default sleep()
// or delay() it could end up blocking the system and causing other
// unintended effects.
fail_point!("rpc-response-delay");
fail_point!("narwhal-rpc-response");
}

fn on_error<E>(self, _error: &E) {}
Expand Down
6 changes: 0 additions & 6 deletions narwhal/network/src/p2p.rs
Original file line number Diff line number Diff line change
Expand Up @@ -216,12 +216,6 @@ impl WorkerRpc for anemo::Network {

let peer_id = PeerId(peer.0.to_bytes());

fail::fail_point!("request-batch", |_| {
Err(format_err!(
"Injected error in request batch from peer {peer_id}"
))
});

let peer = self
.peer(peer_id)
.ok_or_else(|| format_err!("Network has no connection with peer {peer_id}"))?;
Expand Down
1 change: 0 additions & 1 deletion narwhal/node/Cargo.toml
Original file line number Diff line number Diff line change
Expand Up @@ -44,7 +44,6 @@ workspace-hack = { version = "0.1", path = "../../crates/workspace-hack" }

anemo.workspace = true
reqwest = { version = "0.11.13", default_features= false, features = ["json", "rustls-tls"] }
fail = "0.5.1"

[dev-dependencies]
pretty_assertions = "1.3.0"
Expand Down
9 changes: 0 additions & 9 deletions narwhal/node/src/main.rs
Original file line number Diff line number Diff line change
Expand Up @@ -248,15 +248,6 @@ async fn run(
worker_keypair: NetworkKeyPair,
registry: Registry,
) -> Result<(), eyre::Report> {
// Only enabled if failpoints feature flag is set
let _failpoints_scenario: fail::FailScenario<'_>;
if fail::has_failpoints() {
warn!("Failpoints are enabled");
_failpoints_scenario = fail::FailScenario::setup();
} else {
info!("Failpoints are not enabled");
}

let workers_file = matches.value_of("workers").unwrap();
let parameters_file = matches.value_of("parameters");
let store_path = matches.value_of("store").unwrap();
Expand Down
2 changes: 1 addition & 1 deletion narwhal/primary/Cargo.toml
Original file line number Diff line number Diff line change
Expand Up @@ -38,6 +38,7 @@ network = { path = "../network", package = "narwhal-network" }
types = { path = "../types", package = "narwhal-types" }
storage = { path = "../storage", package = "narwhal-storage" }
store = { path = "../../crates/typed-store", package = "typed-store" }
sui-macros = { path = "../../crates/sui-macros" }
mysten-network.workspace = true
workspace-hack = { version = "0.1", path = "../../crates/workspace-hack" }

Expand All @@ -46,7 +47,6 @@ mysten-metrics = { path = "../../crates/mysten-metrics" }

anemo.workspace = true
anemo-tower.workspace = true
fail = "0.5.1"

[dev-dependencies]
dashmap = "5.4.0"
Expand Down
9 changes: 3 additions & 6 deletions narwhal/primary/src/certifier.rs
Original file line number Diff line number Diff line change
Expand Up @@ -13,6 +13,7 @@ use network::anemo_ext::NetworkExt;
use std::sync::Arc;
use std::time::Duration;
use storage::{CertificateStore, HeaderStore};
use sui_macros::fail_point_async;
use tokio::{
sync::oneshot,
task::{JoinHandle, JoinSet},
Expand Down Expand Up @@ -129,12 +130,6 @@ impl Certifier {
let peer_id = anemo::PeerId(target.0.to_bytes());
let peer = network.waiting_peer(peer_id);

fail::fail_point!("request-vote", |_| {
Err(DagError::NetworkError(format!(
"Injected error in request vote for {header}"
)))
});

let mut client = PrimaryToPrimaryClient::new(peer);

let mut missing_parents: Vec<CertificateDigest> = Vec::new();
Expand Down Expand Up @@ -388,6 +383,7 @@ impl Certifier {
let signature_service = self.signature_service.clone();
let metrics = self.metrics.clone();
let network = self.network.clone();
fail_point_async!("narwhal-delay");
self.propose_header_tasks.spawn(monitored_future!(Self::propose_header(
name,
committee,
Expand All @@ -407,6 +403,7 @@ impl Certifier {
Some(result) = self.propose_header_tasks.join_next() => {
match result {
Ok(Ok(certificate)) => {
fail_point_async!("narwhal-delay");
self.synchronizer.accept_own_certificate(certificate).await
},
Ok(Err(e)) => Err(e),
Expand Down
7 changes: 0 additions & 7 deletions narwhal/primary/src/primary.rs
Original file line number Diff line number Diff line change
Expand Up @@ -1122,13 +1122,6 @@ impl WorkerToPrimary for WorkerReceiverHandler {
) -> Result<anemo::Response<()>, anemo::rpc::Status> {
let message = request.into_body();

fail::fail_point!("report-our-batch", |_| {
Err(anemo::rpc::Status::internal(format!(
"Injected error in report our batch from worker_id {}",
message.worker_id
)))
});

let (tx_ack, rx_ack) = oneshot::channel();
let response = self
.tx_our_digests
Expand Down
2 changes: 1 addition & 1 deletion narwhal/storage/Cargo.toml
Original file line number Diff line number Diff line change
Expand Up @@ -18,10 +18,10 @@ tracing = "0.1.36"
crypto = { path = "../crypto", package = "narwhal-crypto" }
types = { path = "../types", package = "narwhal-types" }
store = { path = "../../crates/typed-store", package = "typed-store" }
sui-macros = { path = "../../crates/sui-macros" }
config = { path = "../config", package = "narwhal-config" }
workspace-hack = { version = "0.1", path = "../../crates/workspace-hack" }
prometheus = "0.13.3"
fail = "0.5.1"
lru = "0.10"
parking_lot = "0.12.1"
tap = "1.0.1"
Expand Down
Loading

0 comments on commit 7376112

Please sign in to comment.