Introduce ledger-tool simulate-block-production #2733
Conversation
force-pushed from f1dbbd9 to f965399
gossip/src/cluster_info.rs
Outdated
// `self.keypair.read().unwrap().pubkey()` is more straightforward to use here.
// However, self.keypair could be dummy in some very odd situation
// (i.e. ledger-tool's simulate-block-production). So, use `self.my_contact_info` here.
// Other than that edge case, both are equivalent.
*self.my_contact_info.read().unwrap().pubkey()
@behzadnouri please let me know if you disagree with these changes in gossip/src/cluster_info.rs, as justified by the source code comment.
I pretty much would rather avoid introducing scenarios where `cluster_info.keypair.pubkey() != contact_info.pubkey`.

Can you please provide more context on why we need a `ClusterInfo` with a `contact_info` whose `keypair` we do not own? To me this seems pretty error-prone, and I would much prefer we try an alternative.
Agree with @behzadnouri here. It seems the reason we are adding this new inconsistency is because `BankingStage` takes it as an argument. It'd be better for us to refactor `BankingStage` to not use `ClusterInfo`, imo.

We use `ClusterInfo` for 2 things:

- Creating the `Forwarder`
  - could pass an `Option<Forwarder>` as an arg to `BankingStage` instead (see the sketch after this list)
  - ^ would need some changes to rip out mandatory forwarding in tlmi / voting threads
  - alternatively could make the forwarder an enum w/ a disabled variant, or even a trait instead
- Getting the validator id for checking if we are leader
  - could easily pass a `Pubkey` instead
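For illustration, a rough sketch of the proposed shape (all names and signatures here are hypothetical, not the actual `BankingStage` API; as the reply below notes, the plain `Pubkey` part turns out to be problematic because the identity can be hot-swapped):

```rust
use solana_sdk::pubkey::Pubkey;

// Hypothetical stand-in for the real forwarder type.
struct Forwarder;

// Sketch: BankingStage takes only what it needs instead of a full
// ClusterInfo. `None` disables forwarding (e.g. in the simulator).
struct BankingStage {
    forwarder: Option<Forwarder>,
    identity: Pubkey, // validator id for leader checks
}

impl BankingStage {
    fn new(forwarder: Option<Forwarder>, identity: Pubkey) -> Self {
        Self { forwarder, identity }
    }

    fn is_my_leader_slot(&self, leader: &Pubkey) -> bool {
        &self.identity == leader
    }
}
```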
hmm, my dcou-based hack is unpopular.. ;) I did my part of the little hassle. How about this: 2b33131?

> Can you please provide more context why we need a `ClusterInfo` with a `contact_info` which we do not own the `keypair`?

> It seems the reason we are adding this new inconsistency is because `BankingStage` takes it as an argument.

@apfitzge's understanding is correct. Note that such a broken `ClusterInfo` is only ever created under the dcou code-path, though.

> Getting validator id for checking if we are leader
> * easily can pass `Pubkey` instead

Sadly, this isn't easy, because the identity `Pubkey` can be hot-swapped inside `ClusterInfo`. That led me to take this trait direction...:

> - alternatively could make forwarder ... a trait instead

Also, note that `BroadcastStageType` also needs `ClusterInfo`. Fortunately, it seems the new hacky trait `LikeClusterInfo` plumbing isn't needed for it.

That said, I wonder whether this additional production code is worth maintaining, only to support some obscure development `ledger-tool` subcommand. Anyway, I'm not that opinionated. I just want to merge this pr.
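For context, the trait direction boils down to something like this (a hedged sketch; the real `LikeClusterInfo` in this pr may differ in its exact method set):

```rust
use solana_sdk::pubkey::Pubkey;

// Abstract over the part of ClusterInfo that BankingStage actually uses,
// so the simulator can substitute a dummy implementation.
pub trait LikeClusterInfo: Send + Sync + Clone + 'static {
    // Kept as a method (not a captured value) because the identity
    // keypair can be hot-swapped inside the real ClusterInfo.
    fn id(&self) -> Pubkey;
}
```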
force-pushed from f965399 to 6f75c2f
fyi, not included in this pr, but I now have some fancy charts (salvaged from solana-labs#28119) on the development branch. Namely, we can now display the individual tx timings for each scheduler (quick legend: the x axis is walltime; the y axis is lined up by thread; green arced arrows are read-lock dependencies, pink arced arrows are write-lock dependencies).

thread-local-multi-iterator: each banking thread is working as hard as an animal. You can indirectly see the batch boundaries.

central scheduler: much like thread-local-multi-iterator, batched transactions show almost no gap (no overhead). While clipped, an overall much less chaotic dep graph is observed. Lastly, because of the stickiness of a write lock to a particular thread, the 2nd batch is rather large and the other threads are idle (see the 2nd pic).

unified scheduler: read locks are well parallelized. Each task execution incurs a large overhead, but dep graph resolution is rather timely.

Note that the unified scheduler is wired to block production as well at #2325. That's why I have the charts from all 3 impls...
core/src/banking_simulation.rs
Outdated
info!(
    "jitter(parent_slot: {}): {}{:?} (sim: {:?} event: {:?})",
    old_slot,
    if elapsed_simulation_time > elapsed_event_time {
        "+"
    } else {
        "-"
    },
    if elapsed_simulation_time > elapsed_event_time {
        elapsed_simulation_time - elapsed_event_time
    } else {
        elapsed_event_time - elapsed_simulation_time
    },
    elapsed_simulation_time,
    elapsed_event_time,
);
this particular log output looks like this:
$ grep -E 'jitter' simulate-mb.2024y08m26d10h44m43s933116486ns
[2024-08-26T10:58:57.989324285Z INFO solana_core::banking_simulation] jitter(parent_slot: 282254383): +360.27µs (sim: 12.00036027s event: 12s)
[2024-08-26T10:58:58.344810251Z INFO solana_core::banking_simulation] jitter(parent_slot: 282254384): -19.615829ms (sim: 12.355846357s event: 12.375462186s)
[2024-08-26T10:58:58.695797971Z INFO solana_core::banking_simulation] jitter(parent_slot: 282254385): -71.903503ms (sim: 12.706834067s event: 12.77873757s)
[2024-08-26T10:58:59.047995512Z INFO solana_core::banking_simulation] jitter(parent_slot: 282254386): -135.656223ms (sim: 13.059031477s event: 13.1946877s)
In short, poh in the sim is rather more timely than the actual traced poh recordings. Maybe this is due to much-reduced sysload.
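Incidentally, the two `if` branches in the snippet above exist because `std::time::Duration` subtraction panics on underflow; the same sign-plus-magnitude computation can be sketched with `checked_sub` (a hypothetical helper, not code from this pr):

```rust
use std::time::Duration;

// Returns the jitter's sign and absolute value, equivalent to the
// branchy formatting in the info!() call above.
fn signed_jitter(sim: Duration, event: Duration) -> (char, Duration) {
    match sim.checked_sub(event) {
        Some(excess) => ('+', excess), // sim took longer than the traced events
        None => ('-', event - sim),    // sim is ahead of the trace
    }
}

fn main() {
    let (sign, jitter) =
        signed_jitter(Duration::from_micros(12_000_360), Duration::from_secs(12));
    assert_eq!((sign, jitter), ('+', Duration::from_micros(360)));
}
```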
Looks good for the most part.
I tried to fix up some grammar in the documenting comments, and left some suggestions to split up some of the larger functions so it's easier for me to read.
core/src/banking_simulation.rs
Outdated
if let Some(event_time) =
    self.banking_trace_events.freeze_time_by_slot.get(&old_slot)
{
    if log_enabled!(log::Level::Info) {
The logging functionality here should really be separated out; it is quite long and distracts from the behavior of the loop.
core/src/banking_simulation.rs
Outdated
"{} isn't leader anymore at slot {}; new leader: {}", | ||
simulated_leader, new_slot, new_leader | ||
); | ||
break; |
So when we're no longer leader, the process will end.

In the future, could we extend this capability so that we "fast-forward" through our non-leader periods? That probably adds considerable complexity, but I think it would make simming significantly more useful.

AFAICT, as is, we must load from a snapshot every time we want to do a sim of 4 slots - which will be very time-consuming if I have hundred(s) of leader slots in my trace data that I'd like to simulate.
> In the future, could we extend this capability so that we "fast-forward" through our non-leader periods?
> That probably adds considerable complexity

Indeed, it's possible, but with considerable complexity. Note that such "fast-forward"-ing needs to reload from a snapshot... Just doing it without snapshot reloading would make most txes fail, invalidating the simulation itself.

> I have hundred(s) of leader slots in my trace data that I'd like to simulate.

I know this is ideal, but a dozen simulation ledgers, each with a single round of the 4 leader slots, is good enough..
Think there's some complexity in how we'd need to "fast-forward" through non-leader periods, but I don't think it'd require loading from a snapshot if done correctly.

It would probably add even more complexity, but could we not treat the simmed blocks as some sort of duplicate block (or a fork)? After each 4-slot sim, we drop the simmed blocks (duplicates/fork) for the actual blocks, which we then replay as normal until we get close to the next leader period. That's what I had in mind for the fast-forward, since I definitely agree that we can't just continue on from the simmed blocks and act like things will just work from there haha.

If we were to do this, we would probably want to collaborate with ashwin or stevecz on how we could handle the "sim-swap" to remove simmed blocks and insert real blocks.
oh, that idea sounds nice.
> how we could handle the "sim-swap" to remove simmed blocks and insert real blocks.

I think this can just be done with a read-write side blockstore.
> oh, that idea sounds nice.

Cool! I think that'd really improve the usability, but we should definitely leave it for a follow-up PR; this one is big enough as is 😄
core/src/banking_simulation.rs
Outdated
    ChannelLabel::GossipVote => &gossip_vote_sender,
    ChannelLabel::Dummy => unreachable!(),
};
sender.send(batches_with_stats.clone()).unwrap();
Why clone instead of letting the sender thread take ownership of these batches?

Just reviewing on github, so I can't see the type - is this Arc-ed, or just a clone of the actual packet batch?
The useful `BTreeMap::range()` doesn't allow that, because it yields references and there's no `range_into` or something like that. However, I noticed that I can use `BTreeMap::split_off()`: 3a3131e

> just reviewing on github, so can't see the type - is this Arc-ed, or just a clone of actual packet batch?

fyi, this is Arc-ed.
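For reference, a minimal standalone sketch of the `split_off()` trick (illustrative types; the real map holds `(ChannelLabel, BankingPacketBatch)` values keyed by `SystemTime`):

```rust
use std::collections::BTreeMap;

fn main() {
    // BTreeMap::range() only yields references, so values can't be moved
    // out of the map without cloning. split_off() detaches every entry at
    // or after the given key into an owned map, whose values can then be
    // consumed (e.g. sent over a channel) with no clone at all.
    let mut events: BTreeMap<u64, String> =
        (0..6u64).map(|t| (t, format!("event-{t}"))).collect();
    let from_t3 = events.split_off(&3); // `events` keeps keys 0..=2
    for (time, event) in from_t3 {
        println!("{time}: {event}"); // owned String, not &String
    }
}
```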
core/src/banking_simulation.rs
Outdated
}

pub struct BankingTraceEvents {
    packet_batches_by_time: BTreeMap<SystemTime, (ChannelLabel, BankingPacketBatch)>,
Why not use a `VecDeque` here? These are, afaik, always in order in the files (assuming we read the files in the correct order, which is easy enough to do).
added comments: d3bf0d9
btw, related to this a bit, I noticed we should rather stop using `BTreeMap::into_iter()`: 5e77bd7
@@ -504,6 +504,8 @@ impl PartialEq for Bank {
         if std::ptr::eq(self, other) {
             return true;
         }
+        // Suppress rustfmt until https://github.com/rust-lang/rustfmt/issues/5920 is fixed ...
@ryoqun The figures in this comment #2733 (comment) raised a few questions for me. I think these questions are outside the scope of this PR, so maybe we should not focus on them here. Instead, I would like to ask about inspection of the simulated blocks: Are simulated blocks saved to blockstore? Is it possible for us to save them to a "separate" blockstore, or parameterize it in some way? Ideally we could run the simulation for several scheduler implementations, configurations, etc, and then run some analysis on the blocks produced after the fact so that we can compare them all. Basically, it'd be really nice to add some block analysis stuff to ledger-tool, and then run that command in a bash loop to get some metrics about block "quality":
force-pushed from 902dd5f to 2b33131
core/Cargo.toml
Outdated
@@ -104,7 +104,7 @@ solana-ledger = { workspace = true, features = ["dev-context-only-utils"] }
 solana-logger = { workspace = true }
 solana-poh = { workspace = true, features = ["dev-context-only-utils"] }
 solana-program-runtime = { workspace = true }
-solana-runtime = { workspace = true, features = ["dev-context-only-utils"] }
+solana-runtime = { workspace = true }
remove this line completely...
done: 6d012c2
core/src/banking_simulation.rs
Outdated
impl LikeClusterInfo for DummyClusterInfo {
    fn id(&self) -> Pubkey {
        self.id
Well, intentionally wrap this with a `RwLock` to mimic the real `ClusterInfo`?
done: 535f4da
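For illustration, the requested shape would be along these lines (a hedged sketch; field names are assumptions, and `LikeClusterInfo` is the trait sketched earlier):

```rust
use std::sync::{Arc, RwLock};
use solana_sdk::pubkey::Pubkey;

// Wrap the identity in Arc<RwLock<..>> so the dummy behaves like the real
// ClusterInfo, whose identity keypair can be hot-swapped at runtime.
#[derive(Clone)]
pub struct DummyClusterInfo {
    id: Arc<RwLock<Pubkey>>,
}

impl LikeClusterInfo for DummyClusterInfo {
    fn id(&self) -> Pubkey {
        *self.id.read().unwrap()
    }
}
```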
yeah, I forgot to align the thread counts... I'll do an in-depth comparison later. The unified scheduler takes longer to clear the buffer when the thread count is very low, like 4. That said, it scales well enough to saturate all of the worker threads. Here's a sample. Also, now that the unified scheduler is enabled for block verification, I think we should increase the banking thread count to like 12-16.
indeed, the unified scheduler can't get rid of the curse of unbatched overhead, but I think it can be optimized for the greedy leader-maximizing per-cu rewards. Currently, all non-contentious transactions are directly buffered into crossbeam channels with unbounded buffer depth in the unified scheduler, as you know. But I'm planning to place a priority queue for the freely-reorderable transactions in front of them on the scheduler thread side, while maintaining a max of 1.5 * handler_thread_count tasks buffered by the crossbeam channels. In this way, the higher-paying task reprioritization latency is about 1.5 * the avg execution time of a single transaction.
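A rough sketch of that plan, under illustrative types and names (this is not the unified scheduler's actual code): a max-heap on the scheduler side feeding a bounded crossbeam channel capped at 1.5 * handler threads:

```rust
use std::{collections::BinaryHeap, thread, time::Duration};
use crossbeam_channel::{bounded, TrySendError};

// Illustrative stand-in for a scheduled task; `priority` plays the role
// of per-CU reward. Deriving Ord with `priority` first makes the
// BinaryHeap a max-heap on priority.
#[derive(PartialEq, Eq, PartialOrd, Ord, Debug)]
struct Task {
    priority: u64,
    id: u64,
}

fn main() {
    let handler_thread_count = 4;
    // Cap in-flight tasks at ~1.5 * handler threads, so a newly arriving,
    // higher-paying task waits at most ~1.5 avg execution times before it
    // can jump the queue.
    let (to_handlers, for_handlers) = bounded::<Task>(handler_thread_count * 3 / 2);

    let handlers: Vec<_> = (0..handler_thread_count)
        .map(|_| {
            let rx = for_handlers.clone();
            thread::spawn(move || {
                for task in rx.iter() {
                    thread::sleep(Duration::from_millis(1)); // pretend to execute
                    println!("executed {task:?}");
                }
            })
        })
        .collect();
    drop(for_handlers);

    // Scheduler side: freely-reorderable tasks wait in the max-heap; only
    // the highest-priority ones trickle into the bounded channel.
    let mut reorderable = BinaryHeap::new();
    for id in 0..32u64 {
        reorderable.push(Task { priority: id * 7 % 13, id });
        while let Some(task) = reorderable.pop() {
            match to_handlers.try_send(task) {
                Ok(()) => {}
                Err(TrySendError::Full(task)) => {
                    reorderable.push(task); // channel full: stay reorderable
                    break;
                }
                Err(TrySendError::Disconnected(_)) => return,
            }
        }
    }
    // No more arrivals: drain the heap in descending priority, blocking.
    for task in reorderable.into_sorted_vec().into_iter().rev() {
        if to_handlers.send(task).is_err() {
            break;
        }
    }
    drop(to_handlers);
    handlers.into_iter().for_each(|h| h.join().unwrap());
}
```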
👍 anyway, I put some thoughts above.

yes.

yeah, we can do this easily.
@apfitzge thanks for all the code-reviewing effort. sans the general clean-up of ...
force-pushed from dee964a to 284a1b6
the fate of a 1.4k-line pr. ;) I was forced to rebase this pr onto #2172
LGTM and given #2733 (review), think we can push this!
Problem
Even though solana-labs#29196 has landed and has been enabled for a while, there's no code to actually simulate block production.
Summary of Changes
Finally, introduce the functionality called `agave-ledger-tool simulate-block-production`, with the following flags:

On top of that, this pr also makes it possible to replay the simulated blocks later by persisting the shreds into the blockstore and adjusting the replay code-path a bit. Namely, the following flags are added:
(bike-shedding is welcome, btw...)
This pr is extracted from #2325
sample output (with the mainnet-beta ledger)
Notice that the (bank) `hash`es and `last_blockhash`es are overridden, while the `account_delta` (hashes) are different. While tx counts largely differ from the actual log, both runs of the simulation exhibit similar numbers, indicating the rather good stability of the simulation.
The difference between simulation and reality is probably due to other general system load.
simulated log (1):
simulated log (2):
actual log: