Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Distributed bench #1661

Merged
merged 24 commits into from
May 3, 2022
Merged

Distributed bench #1661

merged 24 commits into from
May 3, 2022

Conversation

oxade
Copy link
Contributor

@oxade oxade commented Apr 28, 2022

This PR allows us to provision load generators running anywhere.
These load gens can then be used to benchmark validators.
There are many intertwined parts of the PR, but at the core, here's how it works:

  1. Generate bench configs for a bunch of IPs and a bunch of load gens
./bench_configure --host-port-stake-triplets "0.0.0.1:5000:1" "1.1.1.1:5000:1" "2.2.2.2:5000:1" "3.3.3.3:5000:1" --number-of-generators 12

This generates config files for the validators (genesis.conf) and for load gens.

ubuntu@ip-172-31-82-122:~/sui/target/release$ ls *.conf
distributed_bench_genesis_0.conf  distributed_bench_genesis_2.conf  load_gen_0.conf  load_gen_10.conf  load_gen_2.conf  load_gen_4.conf  load_gen_6.conf  load_gen_8.conf
distributed_bench_genesis_1.conf  distributed_bench_genesis_3.conf  load_gen_1.conf  load_gen_11.conf  load_gen_3.conf  load_gen_5.conf  load_gen_7.conf  load_gen_9.conf
  1. Copy the files to the machines
    load_gen_x.conf goes to load gen machine x
    distributed_bench_genesis_y.conf goes to validator machine y

  2. Start the validators
    Example

./validator --genesis-config-path distributed_bench_genesis_0.conf --force-genesis --address d22fbff52bccd733067cf8304cb4bfde8355588c  --listen-address 0.0.0.0:5000
  1. Run the load gens
    Example
./remote_load_generator --remote-config load_gen_0.conf --chunk-size 800 --period-us 100000 --num-chunks 500

I typically run one load gen as a probe when measuring latency

  1. Once the load gens exhaust their load, they will dump the results.

Few other minor features:

  1. Bulk load objects into authorities in genesis?
  2. Genesis config objects by object id range
  3. Optionally only provision one validator in genesis
  4. use-move is now default

Next steps:
0. Finalize script to automate this process. Due to recent breaking config changes, the old script needs to be improved

  1. Improve quorum logic measurement
  2. Add utility for generating plots
  3. Improve usage of gRPC
  4. Add benchmarks for reads

@oxade oxade requested a review from longbowlu April 28, 2022 21:42
@oxade oxade marked this pull request as draft April 28, 2022 21:43
@oxade oxade changed the title WIP: Distributed bench Distributed bench May 3, 2022
@oxade oxade marked this pull request as ready for review May 3, 2022 04:06
@oxade oxade requested review from patrickkuo and velvia May 3, 2022 07:09
@oxade oxade marked this pull request as draft May 3, 2022 07:13
@oxade oxade marked this pull request as ready for review May 3, 2022 07:52
@oxade oxade requested review from bmwill and gdanezis May 3, 2022 07:58
Comment on lines +83 to +95
impl AuthorityPrivateInfo {
pub fn copy(&self) -> Self {
Self {
address: self.address,
host: self.host.clone(),
port: self.port,
db_path: self.db_path.clone(),
stake: self.stake,
consensus_address: self.consensus_address,
public_key: self.public_key,
}
}
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure i understand the need for this over using clone

Copy link
Contributor Author

@oxade oxade May 3, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah this is a vestige from when AuthorityPrivateInfo had keypairs, which were not clone-able.
I forgot to remove this when I merged with main, which removes keypairs recently.
Will delete this.

Copy link
Collaborator

@gdanezis gdanezis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice work. You can simplify quite a bit the sending and receiving logic as suggested if you want.

@@ -553,6 +553,66 @@ impl ObjectID {
.map_err(|_| ObjectIDParseError::TryFromSliceError)
.map(ObjectID::from)
}

/// Incremenent the ObjectID by usize IDs, assuming the ObjectID hex is a number represented as an array of bytes
pub fn advance(&self, step: usize) -> Result<ObjectID, anyhow::Error> {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is probably not something we want to expose anywhere away to where it is used, and even there -- why are we doing this? Could we: (1) generate IDs randomly or (2) generate IDs using a u64 counter and https://doc.rust-lang.org/beta/std/primitive.u64.html#method.to_be_bytes

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Random ID generation will lead to huge genesis config files. Ranges allow me compress and partition load gens to only work on specific objects.

So an entry of 1 million object ids looks like this instead of 1M lines of text

    {
      "address": "57ce23f7b759259f2831500f637baa59a15a7023",
      "gas_objects": [],
      "gas_object_ranges": [
        {
          "offset": "00000000000000000000000000100000007a1200",
          "count": 1000000,
          "gas_value": 18446744073709551615
        }
      ]
    },

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But you're right. I should try to localize it to the files where its used

@@ -664,6 +665,12 @@ impl AuthorityState {
.expect("TODO: propagate the error")
}

pub async fn insert_genesis_objects_bulk_unsafe(&self, objects: &[&Object]) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Lets add a comment this must not be used away from bench / test code. I would even feel better if we use something to not make it visible outside this context.

tick_period_us,
latencies,
} => {
let tracer_avg = latencies.iter().sum::<u128>() as f64 / latencies.len() as f64;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could report here the p10 p90 or std?

notif.notified().await;
let r = send_tx_chunks(tx_chunk, net_client.clone(), conn).await;

match result_chann_tx.send((r.0, stake)).await {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What about if let Err(...) = ...


let _: Vec<_> =
r.1.par_iter()
.map(|q| check_transaction_response(deserialize_message(&(q.as_ref().unwrap())[..])))
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we make things fail or print errors is there are errors here? (althrough I think the check_transaction_response does that?)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

check_transaction_response already does that

// Objects for payment
let next_offset = objects[objects.len() - 1].id();

ObjectID::in_range(next_offset.next_increment().unwrap(), tx_count as u64)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we not just create random object IDs? Or is the issue that they need to exist over all authorities so we need a deterministic generation process? This arithmetic on object IDs is very error prone and misleading.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The reason I have the range arithmetic is so that load gen TXes do not clash. I want them to operate in specific object id ranges. This makes debugging much easier

account_private_info.push((account_keypair, obj_id_offset));

// Ensure no overlap
obj_id_offset = obj_id_offset
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here again, can we avoid doing arithmetic on object IDs and just generate them at random? I assume this config happens centrally?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we generate a bunch of random object ids, we end up with a genesis config file millions of lines long which then has to be shipped around and parsed by each validator. It slows down startup time.
My solution keeps this simply by using clearly defined ranges

@@ -69,7 +69,7 @@ pub struct AuthorityInfo {
pub base_port: u16,
}

#[derive(Serialize, Debug)]
#[derive(Serialize, Debug, Clone)]
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems this is no more the authority private info, since it has no private key? Traditionally we have tried to keep stuff with private keys non-Clone, but this is not an issue any more?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed. Will rename

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we should further clean "private" info out such as the db_path

@@ -80,6 +80,19 @@ pub struct AuthorityPrivateInfo {
pub consensus_address: SocketAddr,
}

impl AuthorityPrivateInfo {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If it is clone, do we need an explicit copy?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a vestige from when AuthorityPrivateInfo had keypairs, which were not clone-able.
I forgot to remove this when I merged with main, which removes keypairs recently.
Will delete this.

@@ -380,11 +383,15 @@ impl SuiNetwork {

pub async fn genesis(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we separate the bench genesis from a clean genesis? It seems we are overreaching into production code quite a bit here to support bench.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I pondered this too, but this means we will have a special validator flow for benchmarking and a different one for prod.

Is this a road we want to go down?

Copy link
Contributor

@longbowlu longbowlu May 3, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we actually don't need this new parameter single_address for genesis, the GenesisConfig has a keypair struct that is supposed to represent the single_address

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

created an issue: #1758

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

after some thoughts, I think we could have different handling for normal genesis and benchmark genesis. They are all the same with the only exception of populating pre-generated objects. We do this by giving different genesis config.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@oxade
Copy link
Contributor Author

oxade commented May 3, 2022

Will have a fast follow enhancement PR to address the issues raised in this PR
Thanks guys

@oxade oxade merged commit 5c44cd2 into main May 3, 2022
@oxade oxade deleted the distributed_bench branch May 3, 2022 16:21
@@ -380,11 +383,15 @@ impl SuiNetwork {

pub async fn genesis(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

created an issue: #1758

Comment on lines +295 to +297
// Confirmation step
let (conf_chann_tx, mut conf_chann_rx) = MpscChannel(net_clients.len() * 2);

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

there is an opportunity to extract the common logic (between intent & conf) into a function

let db_path = format!("DB_{}", validator_address);
let path = Path::new(&db_path);

let host_bytes: Vec<u8> = host
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let host = tokens[0].clone();

#[allow(clippy::needless_collect)]
let host_bytes = host
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same thing for from_str

@@ -69,7 +69,7 @@ pub struct AuthorityInfo {
pub base_port: u16,
}

#[derive(Serialize, Debug)]
#[derive(Serialize, Debug, Clone)]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we should further clean "private" info out such as the db_path

@@ -380,11 +383,15 @@ impl SuiNetwork {

pub async fn genesis(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

after some thoughts, I think we could have different handling for normal genesis and benchmark genesis. They are all the same with the only exception of populating pre-generated objects. We do this by giving different genesis config.

@longbowlu
Copy link
Contributor

Looking good! Echoing with @gdanezis , we can split out benchmarking-specific stuff out of prod. Created a couple of issues to follow up

longbowlu pushed a commit that referenced this pull request May 12, 2022
* Remote benchmarking
punwai pushed a commit that referenced this pull request Jul 27, 2022
* Remote benchmarking
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants