Conversation
Current slot stats are removed when the slot is full, or every 30 seconds if the slot is before the root: https://github.com/solana-labs/solana/blob/493a8e234/ledger/src/blockstore.rs#L2017-L2027 In order to track whether the slot is ultimately marked as dead or rooted, and to emit more metrics, this commit expands the lifetime of SlotStats while bounding the total size of the cache using an LRU eviction policy.
struct SlotFlags: u8 {
    const DEAD = 0b00000001;
    const FULL = 0b00000010;
    const ROOTED = 0b00000100;
}
@jbiseda flags to track if the slot is ultimately marked as dead or rooted, ...
Currently DEAD and ROOTED are unused.
#[derive(Default)]
struct SlotStats {
    flags: SlotFlags,
    num_repaired: usize,
    num_recovered: usize,
}
@jbiseda This struct can be expanded with stats for each erasure batch.
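A rough sketch of how such an extension might look, assuming per-erasure-batch shred counts keyed by the FEC set index (the field name is hypothetical and not part of this PR):

use std::collections::HashMap;

#[derive(Default)]
struct SlotStats {
    flags: SlotFlags,
    num_repaired: usize,
    num_recovered: usize,
    // Hypothetical: number of shreds observed per erasure batch, keyed by FEC set index.
    num_shreds_per_fec_set: HashMap</*fec_set_index:*/ u32, /*count:*/ usize>,
}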
let (_slot, _entry) = self.0.pop_lru().unwrap();
// TODO: submit metrics for (slot, entry).
}
@jbiseda Here we can emit metrics with respect to erasure batches in the slot.
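For illustration only, a minimal sketch of what filling in that TODO could look like, assuming the solana_metrics datapoint_info! macro and the existing counters (the datapoint name and fields are assumptions, not code from this PR):

// use solana_metrics::datapoint_info;
// Inside impl SlotsStats:
fn maybe_evict_cache(&mut self) {
    while self.0.len() > SLOTS_STATS_CACHE_CAPACITY {
        let (slot, entry) = self.0.pop_lru().unwrap();
        // Hypothetical: report the evicted slot's stats before dropping them.
        datapoint_info!(
            "slot_stats_evicted",
            ("slot", slot, i64),
            ("num_repaired", entry.num_repaired, i64),
            ("num_recovered", entry.num_recovered, i64)
        );
    }
}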
    self.maybe_evict_cache();
}

pub(crate) fn set_full(&mut self, slot_meta: &SlotMeta) {
@jbiseda need similar functions for set_dead or set_rooted.
Don't need to emit metrics in those methods until maybe_evict_cache.
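A minimal sketch of what those setters might look like under this PR's structure, assuming the get_mut_entry! macro and deferring metric emission to maybe_evict_cache (names are modeled on the existing set_full, but the bodies are only an assumption):

impl SlotsStats {
    pub(crate) fn set_dead(&mut self, slot: Slot) {
        let entry = get_mut_entry!(self.0, slot);
        entry.flags |= SlotFlags::DEAD;
        // Metrics are deferred until the entry leaves the cache.
        self.maybe_evict_cache();
    }

    pub(crate) fn set_rooted(&mut self, slot: Slot) {
        let entry = get_mut_entry!(self.0, slot);
        entry.flags |= SlotFlags::ROOTED;
        self.maybe_evict_cache();
    }
}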
I've done something similar for changes to #23725, merging the FEC set index counts with the existing stats.
This commit is only changing the lifetime of SlotStats.
macro_rules! get_mut_entry (
    ($cache:expr, $key:expr) => (
        match $cache.get_mut(&$key) {
            Some(entry) => entry,
            None => {
                $cache.put($key, SlotStats::default());
                $cache.get_mut(&$key).unwrap()
            }
        }
    );
);
Any benefit of making this a macro vs implementing it as a method in SlotsStats?
I actually do prefer to ditch the macro.
But I couldn't get it to compile as a function, because of lifetime errors.
Feel free to give it a shot, and let me know.
this worked for me:
fn get_or_default_with_eviction_check<'a>(
    stats: &'a mut MutexGuard<LruCache<Slot, SlotStats>>,
    slot: Slot,
) -> (&'a mut SlotStats, Option<(Slot, SlotStats)>) {
    let mut evicted = None;
    if stats.len() == stats.cap() {
        match stats.peek_lru() {
            Some((s, _)) if *s == slot => (),
            _ => {
                evicted = Some(stats.pop_lru().expect("cache capacity should not be 0"));
            }
        }
    }
    stats.get_or_insert(slot, SlotStats::default);
    (stats.get_mut(&slot).unwrap(), evicted)
}
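For context, a hypothetical call site for that helper could look roughly like this; the surrounding Mutex, the function name, and the counter update are assumptions for illustration only:

use lru::LruCache;
use std::sync::Mutex;

// Hypothetical caller, assuming the same module as Slot, SlotStats and the helper above.
fn record_repaired_shred(stats: &Mutex<LruCache<Slot, SlotStats>>, slot: Slot) {
    let mut stats = stats.lock().unwrap();
    let (entry, evicted) = get_or_default_with_eviction_check(&mut stats, slot);
    entry.num_repaired += 1;
    if let Some((_evicted_slot, _evicted_entry)) = evicted {
        // Hypothetical: submit metrics for the evicted (slot, entry) pair here.
    }
}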
bitflags! {
    #[derive(Default)]
    struct SlotFlags: u8 {
nit: rename to SlotState
There are already SlotStats and SlotsStats in this file, so I guess SlotState would be a bit confusing and easy to misread here.
Also we are using "flags" elsewhere in the code; e.g.
https://github.com/solana-labs/solana/blob/91c272985/sdk/src/packet.rs#L17-L36
https://github.com/solana-labs/solana/blob/91c272985/ledger/src/shred.rs#L215
What about SlotStats -> SlotMetrics and then SlotFlags -> SlotState?
Looks like this was an issue even before this PR, but it applies to Blockstore::slots_stats as well (solana/ledger/src/blockstore.rs, line 1427 in 81d88f0).
With this new code, why the ...
("num_repaired", entry.num_repaired, i64), | ||
("num_recovered", entry.num_recovered, i64), |
If this slot was evicted from the cache and re-inserted later, it's possible these two metrics could be incorrect, right? In these cases should we mark the entry so that we report some default value like -1 for these entries, to distinguish them from the other entries?
Well, at some point we need to evict -1 values as well.
Turbine is a one-shot attempt, and with a capacity of 300 slots (= 2 minutes) I am expecting that we do not receive shreds from turbine after the slot is evicted from the cache here.
If that scenario still happens, then in the post-processing of metrics, if there are multiple rows of data for a slot, we can pick the one with the earliest timestamp and discard the rest.
I am expecting we do not receive shreds from turbine after the slot is evicted from the cache here.
So this will happen if we have >300 slots that received a shred but aren't full, right? Referencing the discussion here: #23872 (comment)
It is only a problem if the lru cache capacity is not enough to cover all the slots that we are actively receiving shreds for. i.e. it should not matter how far behind the node is. The node receives shreds from turbine for the slots close to the tip + shreds from repair for the slots it chooses to repair.
I think this is more likely the further you are behind the tip, because there are more slots near the tip that you will not choose to repair fully (repair always tries to fully repair slots connected to the root first).
If still that scenario happens, in the post-processing of metrics, if there are multiple rows of data for a slot, we can pick the one with the earliest timestamp and discard the rest.
Hmmm, we don't currently report on eviction, right? Is that this TODO here (solana/ledger/src/slot_stats.rs, line 87 in 81d88f0)?
// TODO: submit metrics for (slot, entry).
We can iterate and revise this if lru eviction policy and the capacity of 300 was not enough
Currently what would be the best way to tell from the metrics that this is happening? It seems we would report a datapoint whether or not the slots were being evicted from the cache.
What if we add a "num_turbine" here (solana/ledger/src/slot_stats.rs, line 55 in 81d88f0)?
ShredSource::Turbine => (),
Then we know if num_turbine + num_repaired + num_recovered != last_index + 1, it must have been evicted.
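A rough sketch of that idea; num_turbine and the completeness check are hypothetical additions, not code from this PR:

// Hypothetical counters; num_turbine does not exist in this PR.
match source {
    ShredSource::Turbine => entry.num_turbine += 1,
    ShredSource::Repaired => entry.num_repaired += 1,
    ShredSource::Recovered => entry.num_recovered += 1,
}

// When the slot is full, every data shred should be counted exactly once;
// a mismatch suggests the entry was evicted and re-created part-way through.
// last_index is the index of the slot's last data shred.
let possibly_evicted =
    entry.num_turbine + entry.num_repaired + entry.num_recovered != last_index as usize + 1;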
I was logging whenever an entry is removed from the cache, and included bool values for rooted/dead/evicted/pruned so these entries could be filtered by a WHERE clause.
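For example, the removal-time datapoint could carry the final state as bool fields so rows can be filtered in queries; the datapoint name and fields below are illustrative only, not the exact code from #23725:

// Hypothetical datapoint submitted when an entry leaves the cache.
// was_evicted would be set by the caller (false when removed because rooted/dead/full).
datapoint_info!(
    "slot_stats_tracking_complete",
    ("slot", slot, i64),
    ("is_rooted", entry.flags.contains(SlotFlags::ROOTED), bool),
    ("is_dead", entry.flags.contains(SlotFlags::DEAD), bool),
    ("is_full", entry.flags.contains(SlotFlags::FULL), bool),
    ("is_evicted", was_evicted, bool),
    ("num_repaired", entry.num_repaired, i64),
    ("num_recovered", entry.num_recovered, i64)
);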
    crate::blockstore_meta::SlotMeta, bitflags::bitflags, lru::LruCache, solana_sdk::clock::Slot,
};

const SLOTS_STATS_CACHE_CAPACITY: usize = 300;
Could probably support this being a lot larger; the state size is pretty small, so we could extend this to a couple thousand, which is about how long we go without making roots before OOMing.
On a side note, one case to consider about this lru design is when the validator is running catchup. For instance, if we start at snapshot X and the tip is at X + 2000 slots, we might get a lot of partially completed slots at the tip (we actively repair closer to the root so the tip will be ignored), causing the cache to hit the max case more often than intended.
This is a case where keeping a window from [root, root + SLOTS_STATS_CACHE_CAPACITY] might be easier to manage.
Could probably support this being a lot larger, the state size is pretty small, so we could extend this to a couple thousand, which is about how long we go without making roots before ooming
I am expecting that the struct will get bigger if @jbiseda adds hashmaps to track erasure set indices to it.
On a side note, one case to consider about this lru design is when the validator is running catchup. For instance if we start at snapshot X and the tip is at X + 2000 slots, we might get a lot of partially completed slots at the tip (we actively repair closer to the root so the tip will be ignored), causing the cache to hit the max case more often then intended.
That is a fair point; but it is only a problem if the lru cache capacity is not enough to cover all the slots that we are actively receiving shreds for. i.e. it should not matter how far behind the node is. The node receives shreds from turbine for the slots close to the tip + shreds from repair for the slots it chooses to repair. The question is if the cache capacity of 300 slots is enough to cover the active set of these slots, or we need a larger capacity.
This is a case where keeping a window from [root,root + SLOTS_STATS_CACHE_CAPACITY] might be easier to manage
We can iterate and revise this if the lru eviction policy and the capacity of 300 were not enough and we could not increase the cache capacity. Another alternative is that when the cache is at capacity, we discard shreds which are far beyond the min (or maybe max) slot in the current cache.
But also for our purposes, if the node is far behind and repairing heavily, probably it should not fire OE signal anyways, and so these metrics are not relevant anymore.
My changes in #23725 are essentially a superset of these changes. I was seeing evictions from the cache with my changes during startup and changed the cache size to 1000. I don't think the amount of memory consumed is too concerning.
Moved discussion here: #23872 (comment)
I agree that this does indeed need to be tracked and addressed. What is the criterion for deciding which version of a duplicate slot to stick to?
Fine by me, opened an issue: #23915
This is a bit complex, but essentially if ReplayStage has called
We never "give up", more like if another fork not including the duplicate slot is rooted, then we naturally deprecate all activity on the fork with the duplicate slot. However, if the cluster rooted fork is built on a duplicate slot, and we have yet to root that fork, then we will continually attempt to determine/repair the correct version |
per our discussion I'll rebase changes from #23725 once this has been merged
Discussed with @jbiseda how to move this metrics work forward:
The remaining pieces are:
shred_counts: HashMap<
    Slot,
    HashMap</*fec set index:*/ u32, /*count:*/ usize>
>,

fn maybe_evict_cache(&mut self) {
    while self.0.len() > SLOTS_STATS_CACHE_CAPACITY {
        let (_slot, _entry) = self.0.pop_lru().unwrap();
        // TODO: submit metrics for (slot, entry).
    }
}

cc @carllin
Problem
Current slot stats are removed when the slot is full or every 30 seconds
if the slot is before root:
https://github.com/solana-labs/solana/blob/493a8e234/ledger/src/blockstore.rs#L2017-L2027
We need to track if the slot is ultimately rooted or marked dead,
and emit more metrics accordingly.
Summary of Changes
In order to track whether the slot is ultimately marked as dead or rooted and
emit more metrics, this commit expands the lifetime of SlotStats while
bounding the total size of the cache using an LRU eviction policy.