Commit 54a4ab1

[fix] small refactoring to fix the usage of Network client in cluster tests (#13386)

## Description 

It seems that after the merge of https://github.com/MystenLabs/sui/pull/13328/files the Narwhal nightly tests started failing:
https://github.com/MystenLabs/sui/actions/runs/5828472064/job/15806247384

This PR refactors the cluster struct so that a fresh network client is always created and injected into the primary/worker nodes on each start. This ensures we no longer depend on internal state that was mutated during shutdown, which previously prevented the client from being reused on a new start.
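
Below is a condensed, hypothetical sketch of the pattern this change introduces (the `NetworkClient`, `Authority`, and `Internal` types here are simplified stand-ins for the real ones in `narwhal/test-utils/src/cluster.rs`): the client lives inside the authority's internal state as an `Option`, is created lazily under the write lock whenever a primary or worker starts, and is dropped on `stop_all` so a restart always gets a fresh instance.

```rust
use std::sync::Arc;
use tokio::sync::{RwLock, RwLockWriteGuard};

// Hypothetical stand-in for the real narwhal NetworkClient.
#[derive(Clone)]
struct NetworkClient;

impl NetworkClient {
    fn new() -> Self {
        NetworkClient
    }
    fn shutdown(&self) {}
}

// Mirrors AuthorityDetailsInternal: the client is now optional internal state.
struct Internal {
    client: Option<NetworkClient>,
}

struct Authority {
    internal: Arc<RwLock<Internal>>,
}

impl Authority {
    // Create the client on demand while holding the write lock, so each
    // (re)start either reuses the running client or creates exactly one new one.
    async fn create_client(
        &self,
        internal: &mut RwLockWriteGuard<'_, Internal>,
    ) -> NetworkClient {
        if internal.client.is_none() {
            internal.client = Some(NetworkClient::new());
        }
        internal.client.as_ref().unwrap().clone()
    }

    // A start path: inject the (possibly freshly created) client into the node.
    async fn start_primary(&self) {
        let mut internal = self.internal.write().await;
        let _client = self.create_client(&mut internal).await;
        // ... hand `_client` to the primary node here ...
    }

    // Shutdown drops the client entirely; the next start creates a new one
    // instead of reusing a client whose state was mutated by shutdown.
    async fn stop_all(&self) {
        let mut internal = self.internal.write().await;
        if let Some(client) = internal.client.as_ref() {
            client.shutdown();
        }
        internal.client = None;
    }
}

#[tokio::main]
async fn main() {
    let authority = Authority {
        internal: Arc::new(RwLock::new(Internal { client: None })),
    };
    authority.start_primary().await; // creates the client
    authority.stop_all().await; // shuts it down and drops it
    authority.start_primary().await; // gets a brand-new client
}
```

Passing the write guard into `create_client` keeps the check-and-create step under the same lock as the rest of the start sequence, so concurrent starts cannot race to create two clients.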

## Test Plan 

How did you test the new or updated feature?

---
If your changes are not user-facing and not a breaking change, you can
skip the following section. Otherwise, please indicate what changed, and
then add to the Release Notes section as highlighted during the release
process.

### Type of Change (Check all that apply)

- [ ] protocol change
- [ ] user-visible impact
- [ ] breaking change for client SDKs
- [ ] breaking change for FNs (FN binary must upgrade)
- [ ] breaking change for validators or node operators (must upgrade
binaries)
- [ ] breaking change for on-chain data layout
- [ ] necessitate either a data wipe or data migration

### Release notes
akichidis authored Aug 14, 2023
1 parent 65ecb84 commit 54a4ab1
Showing 3 changed files with 40 additions and 18 deletions.
2 changes: 1 addition & 1 deletion narwhal/primary/src/tests/rpc_tests.rs
@@ -17,7 +17,7 @@ async fn test_server_authorizations() {
tokio::time::sleep(Duration::from_secs(3)).await;

let test_authority = test_cluster.authority(0);
- let test_client = test_authority.client.clone();
+ let test_client = test_authority.client().await;
let test_committee = test_cluster.committee.clone();
let test_worker_cache = test_cluster.worker_cache.clone();

2 changes: 2 additions & 0 deletions narwhal/primary/tests/nodes_bootstrapping_tests.rs
@@ -104,6 +104,8 @@ async fn test_node_staggered_starts() {
#[ignore]
#[tokio::test]
async fn test_full_outage_and_recovery() {
+ let _guard = setup_tracing();

let stop_and_start_delay = Duration::from_secs(12);
let node_advance_delay = Duration::from_secs(60);

54 changes: 37 additions & 17 deletions narwhal/test-utils/src/cluster.rs
@@ -16,6 +16,7 @@ use prometheus::{proto::Metric, Registry};
use std::{cell::RefCell, collections::HashMap, path::PathBuf, rc::Rc, sync::Arc, time::Duration};
use storage::NodeStorage;
use telemetry_subscribers::TelemetryGuards;
+ use tokio::sync::RwLockWriteGuard;
use tokio::{
sync::{broadcast::Sender, mpsc::channel, RwLock},
task::JoinHandle,
@@ -502,11 +503,11 @@ pub struct AuthorityDetails {
pub id: usize,
pub name: AuthorityIdentifier,
pub public_key: PublicKey,
- pub client: NetworkClient,
internal: Arc<RwLock<AuthorityDetailsInternal>>,
}

struct AuthorityDetailsInternal {
+ client: Option<NetworkClient>,
primary: PrimaryNodeDetails,
worker_keypairs: Vec<NetworkKeyPair>,
workers: HashMap<WorkerId, WorkerNodeDetails>,
@@ -523,9 +524,6 @@ impl AuthorityDetails {
committee: Committee,
worker_cache: WorkerCache,
) -> Self {
- // Create network client.
- let client = NetworkClient::new_from_keypair(&network_key_pair);

// Create all the nodes we have in the committee
let public_key = key_pair.public().clone();
let primary = PrimaryNodeDetails::new(
@@ -556,6 +554,7 @@
}

let internal = AuthorityDetailsInternal {
+ client: None,
primary,
worker_keypairs,
workers,
@@ -565,11 +564,19 @@
id,
public_key,
name,
- client,
internal: Arc::new(RwLock::new(internal)),
}
}

+ pub async fn client(&self) -> NetworkClient {
+     let internal = self.internal.read().await;
+     internal
+         .client
+         .as_ref()
+         .expect("Requested network client which has not been initialised yet")
+         .clone()
+ }

/// Starts the node's primary and workers. If the num_of_workers is provided
/// then only those ones will be started. Otherwise all the available workers
/// will be started instead.
@@ -595,11 +602,9 @@ impl AuthorityDetails {
/// start with a fresh (empty) storage.
pub async fn start_primary(&self, preserve_store: bool) {
let mut internal = self.internal.write().await;
+ let client = self.create_client(&mut internal).await;

- internal
-     .primary
-     .start(self.client.clone(), preserve_store)
-     .await;
+ internal.primary.start(client, preserve_store).await;
}

pub async fn stop_primary(&self) {
@@ -610,6 +615,8 @@

pub async fn start_all_workers(&self, preserve_store: bool) {
let mut internal = self.internal.write().await;
+ let client = self.create_client(&mut internal).await;

let worker_keypairs = internal
.worker_keypairs
.iter()
@@ -618,9 +625,7 @@

for (id, worker) in internal.workers.iter_mut() {
let keypair = worker_keypairs.get(*id as usize).unwrap().copy();
- worker
-     .start(keypair, self.client.clone(), preserve_store)
-     .await;
+ worker.start(keypair, client.clone(), preserve_store).await;
}
}

@@ -630,15 +635,15 @@
/// start with a fresh (empty) storage.
pub async fn start_worker(&self, id: WorkerId, preserve_store: bool) {
let mut internal = self.internal.write().await;
+ let client = self.create_client(&mut internal).await;

let keypair = internal.worker_keypairs.get(id as usize).unwrap().copy();
let worker = internal
.workers
.get_mut(&id)
.unwrap_or_else(|| panic!("Worker with id {} not found ", id));

- worker
-     .start(keypair, self.client.clone(), preserve_store)
-     .await;
+ worker.start(keypair, client, preserve_store).await;
}

pub async fn stop_worker(&self, id: WorkerId) {
@@ -654,9 +659,12 @@

/// Stops all the nodes (primary & workers).
pub async fn stop_all(&self) {
- self.client.shutdown();
+ let mut internal = self.internal.write().await;
+ if let Some(client) = internal.client.as_ref() {
+     client.shutdown();
+ }
+ internal.client = None;

- let internal = self.internal.read().await;
internal.primary.stop().await;
for (_, worker) in internal.workers.iter() {
worker.stop().await;
@@ -770,6 +778,18 @@ impl AuthorityDetails {
}
false
}

+ // Creates a new network client if there isn't one yet initialised.
+ async fn create_client(
+     &self,
+     internal: &mut RwLockWriteGuard<'_, AuthorityDetailsInternal>,
+ ) -> NetworkClient {
+     if internal.client.is_none() {
+         let client = NetworkClient::new_from_keypair(&internal.primary.network_key_pair);
+         internal.client = Some(client);
+     }
+     internal.client.as_ref().unwrap().clone()
+ }
}

pub fn setup_tracing() -> TelemetryGuards {

2 comments on commit 54a4ab1

@vercel vercel bot commented on 54a4ab1 Aug 14, 2023

Successfully deployed to the following URLs:

sui-typescript-docs – ./sdk/docs

sui-typescript-docs.vercel.app
sui-typescript-docs-git-main-mysten-labs.vercel.app
sui-typescript-docs-mysten-labs.vercel.app

@github-actions (Contributor) commented on 54a4ab1

4 Validators 500/s Owned Transactions Benchmark Results

Benchmark Report:
+-------------+-----+-----+--------+---------------+---------------+---------------+-----------------------+----------------------------+
| duration(s) | tps | cps | error% | latency (min) | latency (p50) | latency (p99) | gas used (MIST total) | gas used/hr (MIST approx.) |
+=======================================================================================================================================+
| 60          | 601 | 601 | 0      | 35            | 8159          | 9335          | 456,952,896,000       | 27,417,173,760,000         |
Stress Performance Report:
+-----------+-----+-----+
| metric    | p50 | p99 |
+=======================+
| cpu usage | 13  | 100 |

4 Validators 500/s Shared Transactions Benchmark Results

Benchmark Report:
+-------------+-----+-----+--------+---------------+---------------+---------------+-----------------------+----------------------------+
| duration(s) | tps | cps | error% | latency (min) | latency (p50) | latency (p99) | gas used (MIST total) | gas used/hr (MIST approx.) |
+=======================================================================================================================================+
| 60          | 481 | 481 | 0      | 19            | 8887          | 14663         | 387,960,681,600       | 23,277,640,896,000         |
Stress Performance Report:
+-----------+-----+-----+
| metric    | p50 | p99 |
+=======================+
| cpu usage | 12  | 100 |

20 Validators 50/s Owned Transactions Benchmark Results

Benchmark Report:
+-------------+-----+-----+--------+---------------+---------------+---------------+-----------------------+----------------------------+
| duration(s) | tps | cps | error% | latency (min) | latency (p50) | latency (p99) | gas used (MIST total) | gas used/hr (MIST approx.) |
+=======================================================================================================================================+
| 60          | 199 | 199 | 0      | 22            | 65            | 93            | 149,731,488,000       | 8,983,889,280,000          |
Stress Performance Report:
+-----------+-----+-----+
| metric    | p50 | p99 |
+=======================+
| cpu usage | 26  | 49  |

20 Validators 50/s Shared Transactions Benchmark Results

Benchmark Report:
+-------------+-----+-----+--------+---------------+---------------+---------------+-----------------------+----------------------------+
| duration(s) | tps | cps | error% | latency (min) | latency (p50) | latency (p99) | gas used (MIST total) | gas used/hr (MIST approx.) |
+=======================================================================================================================================+
| 60          | 194 | 194 | 0      | 36            | 1408          | 2073          | 162,220,798,800       | 9,733,247,928,000          |
Stress Performance Report:
+-----------+-----+-----+
| metric    | p50 | p99 |
+=======================+
| cpu usage | 25  | 55  |

Narwhal Benchmark Results

 SUMMARY:
-----------------------------------------
 + CONFIG:
 Faults: 0 node(s)
 Committee size: 4 node(s)
 Worker(s) per node: 1 worker(s)
 Collocate primary and workers: True
 Input rate: 50,000 tx/s
 Transaction size: 512 B
 Execution time: 0 s

 Header number of batches threshold: 32 digests
 Header maximum number of batches: 1,000 digests
 Max header delay: 2,000 ms
 GC depth: 50 round(s)
 Sync retry delay: 10,000 ms
 Sync retry nodes: 3 node(s)
 batch size: 500,000 B
 Max batch delay: 200 ms
 Max concurrent requests: 500,000 

 + RESULTS:
 Batch creation avg latency: 202 ms
 Header creation avg latency: -1 ms
 	Batch to header avg latency: -1 ms
 Header to certificate avg latency: 1 ms
 	Request vote outbound avg latency: 0 ms
 Certificate commit avg latency: 707 ms

 Consensus TPS: 0 tx/s
 Consensus BPS: 0 B/s
 Consensus latency: 0 ms

 End-to-end TPS: 0 tx/s
 End-to-end BPS: 0 B/s
 End-to-end latency: 0 ms
-----------------------------------------
