-
Notifications
You must be signed in to change notification settings - Fork 11.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Distributed bench #1661
Distributed bench #1661
Conversation
impl AuthorityPrivateInfo { | ||
pub fn copy(&self) -> Self { | ||
Self { | ||
address: self.address, | ||
host: self.host.clone(), | ||
port: self.port, | ||
db_path: self.db_path.clone(), | ||
stake: self.stake, | ||
consensus_address: self.consensus_address, | ||
public_key: self.public_key, | ||
} | ||
} | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure i understand the need for this over using clone
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah this is a vestige from when AuthorityPrivateInfo
had keypairs, which were not clone-able.
I forgot to remove this when I merged with main, which removes keypairs recently.
Will delete this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice work. You can simplify quite a bit the sending and receiving logic as suggested if you want.
@@ -553,6 +553,66 @@ impl ObjectID { | |||
.map_err(|_| ObjectIDParseError::TryFromSliceError) | |||
.map(ObjectID::from) | |||
} | |||
|
|||
/// Incremenent the ObjectID by usize IDs, assuming the ObjectID hex is a number represented as an array of bytes | |||
pub fn advance(&self, step: usize) -> Result<ObjectID, anyhow::Error> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is probably not something we want to expose anywhere away to where it is used, and even there -- why are we doing this? Could we: (1) generate IDs randomly or (2) generate IDs using a u64 counter and https://doc.rust-lang.org/beta/std/primitive.u64.html#method.to_be_bytes
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Random ID generation will lead to huge genesis config files. Ranges allow me compress and partition load gens to only work on specific objects.
So an entry of 1 million object ids looks like this instead of 1M lines of text
{
"address": "57ce23f7b759259f2831500f637baa59a15a7023",
"gas_objects": [],
"gas_object_ranges": [
{
"offset": "00000000000000000000000000100000007a1200",
"count": 1000000,
"gas_value": 18446744073709551615
}
]
},
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But you're right. I should try to localize it to the files where its used
@@ -664,6 +665,12 @@ impl AuthorityState { | |||
.expect("TODO: propagate the error") | |||
} | |||
|
|||
pub async fn insert_genesis_objects_bulk_unsafe(&self, objects: &[&Object]) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Lets add a comment this must not be used away from bench / test code. I would even feel better if we use something to not make it visible outside this context.
tick_period_us, | ||
latencies, | ||
} => { | ||
let tracer_avg = latencies.iter().sum::<u128>() as f64 / latencies.len() as f64; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We could report here the p10 p90 or std?
notif.notified().await; | ||
let r = send_tx_chunks(tx_chunk, net_client.clone(), conn).await; | ||
|
||
match result_chann_tx.send((r.0, stake)).await { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What about if let Err(...) = ...
|
||
let _: Vec<_> = | ||
r.1.par_iter() | ||
.map(|q| check_transaction_response(deserialize_message(&(q.as_ref().unwrap())[..]))) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we make things fail or print errors is there are errors here? (althrough I think the check_transaction_response
does that?)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
check_transaction_response
already does that
// Objects for payment | ||
let next_offset = objects[objects.len() - 1].id(); | ||
|
||
ObjectID::in_range(next_offset.next_increment().unwrap(), tx_count as u64) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we not just create random object IDs? Or is the issue that they need to exist over all authorities so we need a deterministic generation process? This arithmetic on object IDs is very error prone and misleading.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The reason I have the range arithmetic is so that load gen TXes do not clash. I want them to operate in specific object id ranges. This makes debugging much easier
account_private_info.push((account_keypair, obj_id_offset)); | ||
|
||
// Ensure no overlap | ||
obj_id_offset = obj_id_offset |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here again, can we avoid doing arithmetic on object IDs and just generate them at random? I assume this config happens centrally?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we generate a bunch of random object ids, we end up with a genesis config file millions of lines long which then has to be shipped around and parsed by each validator. It slows down startup time.
My solution keeps this simply by using clearly defined ranges
@@ -69,7 +69,7 @@ pub struct AuthorityInfo { | |||
pub base_port: u16, | |||
} | |||
|
|||
#[derive(Serialize, Debug)] | |||
#[derive(Serialize, Debug, Clone)] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems this is no more the authority private info, since it has no private key? Traditionally we have tried to keep stuff with private keys non-Clone, but this is not an issue any more?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed. Will rename
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we should further clean "private" info out such as the db_path
@@ -80,6 +80,19 @@ pub struct AuthorityPrivateInfo { | |||
pub consensus_address: SocketAddr, | |||
} | |||
|
|||
impl AuthorityPrivateInfo { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If it is clone, do we need an explicit copy?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a vestige from when AuthorityPrivateInfo had keypairs, which were not clone-able.
I forgot to remove this when I merged with main, which removes keypairs recently.
Will delete this.
@@ -380,11 +383,15 @@ impl SuiNetwork { | |||
|
|||
pub async fn genesis( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we separate the bench genesis from a clean genesis? It seems we are overreaching into production code quite a bit here to support bench.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I pondered this too, but this means we will have a special validator flow for benchmarking and a different one for prod.
Is this a road we want to go down?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we actually don't need this new parameter single_address
for genesis
, the GenesisConfig
has a keypair
struct that is supposed to represent the single_address
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
created an issue: #1758
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
after some thoughts, I think we could have different handling for normal genesis and benchmark genesis. They are all the same with the only exception of populating pre-generated objects. We do this by giving different genesis config.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will have a fast follow enhancement PR to address the issues raised in this PR |
@@ -380,11 +383,15 @@ impl SuiNetwork { | |||
|
|||
pub async fn genesis( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
created an issue: #1758
// Confirmation step | ||
let (conf_chann_tx, mut conf_chann_rx) = MpscChannel(net_clients.len() * 2); | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
there is an opportunity to extract the common logic (between intent & conf) into a function
let db_path = format!("DB_{}", validator_address); | ||
let path = Path::new(&db_path); | ||
|
||
let host_bytes: Vec<u8> = host |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
let host = tokens[0].clone(); | ||
|
||
#[allow(clippy::needless_collect)] | ||
let host_bytes = host |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same thing for from_str
@@ -69,7 +69,7 @@ pub struct AuthorityInfo { | |||
pub base_port: u16, | |||
} | |||
|
|||
#[derive(Serialize, Debug)] | |||
#[derive(Serialize, Debug, Clone)] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we should further clean "private" info out such as the db_path
@@ -380,11 +383,15 @@ impl SuiNetwork { | |||
|
|||
pub async fn genesis( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
after some thoughts, I think we could have different handling for normal genesis and benchmark genesis. They are all the same with the only exception of populating pre-generated objects. We do this by giving different genesis config.
Looking good! Echoing with @gdanezis , we can split out benchmarking-specific stuff out of prod. Created a couple of issues to follow up |
This PR allows us to provision load generators running anywhere.
These load gens can then be used to benchmark validators.
There are many intertwined parts of the PR, but at the core, here's how it works:
This generates config files for the validators (genesis.conf) and for load gens.
Copy the files to the machines
load_gen_
x
.conf goes to load gen machinex
distributed_bench_genesis_
y
.conf goes to validator machiney
Start the validators
Example
Example
I typically run one load gen as a probe when measuring latency
Few other minor features:
use-move
is now defaultNext steps:
0. Finalize script to automate this process. Due to recent breaking config changes, the old script needs to be improved