This repository has been archived by the owner on Jan 22, 2025. It is now read-only.

expands lifetime of SlotStats #23872

Merged
behzadnouri merged 1 commit into solana-labs:master on Mar 25, 2022

Conversation

behzadnouri
Contributor

Problem

Currently, slot stats are removed when the slot becomes full, or every 30 seconds if the slot is before the root:
https://github.com/solana-labs/solana/blob/493a8e234/ledger/src/blockstore.rs#L2017-L2027

We need to track if the slot is ultimately rooted or marked dead,
and emit more metrics accordingly.

Summary of Changes

In order to track whether the slot is ultimately marked as dead or rooted and to emit more metrics, this commit expands the lifetime of SlotStats while bounding the total size of the cache using an LRU eviction policy.

Comment on lines +28 to +32
struct SlotFlags: u8 {
const DEAD = 0b00000001;
const FULL = 0b00000010;
const ROOTED = 0b00000100;
}
@behzadnouri (Contributor Author) commented Mar 23, 2022

@jbiseda These flags track whether the slot is ultimately marked as dead or rooted, ...
Currently DEAD and ROOTED are unused.

Comment on lines +35 to +40
#[derive(Default)]
struct SlotStats {
flags: SlotFlags,
num_repaired: usize,
num_recovered: usize,
}
Contributor Author

@jbiseda This struct can be expanded with stats for each erasure batch.

Comment on lines +86 to +88
let (_slot, _entry) = self.0.pop_lru().unwrap();
// TODO: submit metrics for (slot, entry).
}
Contributor Author

@jbiseda Here we can emit metrics with respect to erasure batches in the slot.
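For example, that could be a single datapoint built from the evicted entry. A rough sketch of the TODO, assuming solana_metrics::datapoint_info! and the SlotFlags bits above; the metric name and the exact field set here are placeholders, not the final design:

    fn maybe_evict_cache(&mut self) {
        while self.0.len() > SLOTS_STATS_CACHE_CAPACITY {
            let (slot, entry) = self.0.pop_lru().unwrap();
            // Sketch: report whatever we know about the slot when it leaves the cache.
            datapoint_info!(
                "slot_stats_tracking_complete",
                ("slot", slot, i64),
                ("is_full", entry.flags.contains(SlotFlags::FULL), bool),
                ("is_rooted", entry.flags.contains(SlotFlags::ROOTED), bool),
                ("is_dead", entry.flags.contains(SlotFlags::DEAD), bool),
                ("num_repaired", entry.num_repaired, i64),
                ("num_recovered", entry.num_recovered, i64)
            );
        }
    }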

self.maybe_evict_cache();
}

pub(crate) fn set_full(&mut self, slot_meta: &SlotMeta) {
Contributor Author

@jbiseda We need similar functions for set_dead and set_rooted.

We don't need to emit metrics in those methods; metrics are only submitted later, in maybe_evict_cache.
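A minimal sketch of what those could look like, assuming the get_mut_entry! macro, SlotFlags, and maybe_evict_cache from this file; the exact signatures are up to the follow-up change:

    impl SlotsStats {
        // Sketch: flip the flag now; metrics are only submitted later, on eviction.
        pub(crate) fn set_dead(&mut self, slot: Slot) {
            let entry = get_mut_entry!(self.0, slot);
            entry.flags.insert(SlotFlags::DEAD);
            self.maybe_evict_cache();
        }

        pub(crate) fn set_rooted(&mut self, slot: Slot) {
            let entry = get_mut_entry!(self.0, slot);
            entry.flags.insert(SlotFlags::ROOTED);
            self.maybe_evict_cache();
        }
    }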

@jbiseda
Contributor

jbiseda commented Mar 23, 2022

I've done something similar for changes to #23725 merging the FEC set index counts with the existing SlotStats. I'll have an updated review shortly.

@behzadnouri
Contributor Author

I've done something similar for changes to #23725 merging the FEC set index counts with the existing SlotStats. I'll have an updated review shortly.

This commit only changes the lifetime of SlotStats; it does not add anything related to erasure batches.

Comment on lines +7 to +17
macro_rules! get_mut_entry (
($cache:expr, $key:expr) => (
match $cache.get_mut(&$key) {
Some(entry) => entry,
None => {
$cache.put($key, SlotStats::default());
$cache.get_mut(&$key).unwrap()
}
}
);
);
Contributor

Any benefit of making this a macro vs. implementing it as a method on SlotsStats?

Contributor Author

I actually would prefer to ditch the macro, but I couldn't get it to compile as a function because of lifetime errors.
Feel free to give it a shot and let me know.

Contributor

this worked for me:

    fn get_or_default_with_eviction_check<'a>(
        stats: &'a mut MutexGuard<LruCache<Slot, SlotStats>>,
        slot: Slot,
    ) -> (&'a mut SlotStats, Option<(Slot, SlotStats)>) {
        let mut evicted = None;
        if stats.len() == stats.cap() {
            match stats.peek_lru() {
                Some((s, _)) if *s == slot => (),
                _ => {
                    evicted = Some(stats.pop_lru().expect("cache capacity should not be 0"));
                }
            }
        }
        stats.get_or_insert(slot, SlotStats::default);
        (stats.get_mut(&slot).unwrap(), evicted)
    }


bitflags! {
#[derive(Default)]
struct SlotFlags: u8 {
Contributor

nit: rename to SlotState

Contributor Author

There are already SlotStats and SlotsStats in this file, so I guess SlotState would be a bit confusing and easy to misread here.

Also, we are using "flags" elsewhere in the code, e.g.:
https://github.com/solana-labs/solana/blob/91c272985/sdk/src/packet.rs#L17-L36
https://github.com/solana-labs/solana/blob/91c272985/ledger/src/shred.rs#L215

Contributor

What about SlotStats -> SlotMetrics and then SlotFlags -> SlotState?

@carllin
Contributor

carllin commented Mar 23, 2022

Looks like this was an issue even before this PR, but clear_unconfirmed_slot():

pub fn clear_unconfirmed_slot(&self, slot: Slot) {
should probably remove the entry from the Blockstore::slots_stats as well

@behzadnouri
Contributor Author

Looks like this was an issue even before this PR, but clear_unconfirmed_slot():

pub fn clear_unconfirmed_slot(&self, slot: Slot) {

should probably remove the entry from the Blockstore::slots_stats as well

With this new code, why wouldn't the LRU eviction be enough?
Also, wouldn't it be better to instead flag these slots as either DEAD or something else?

@carllin
Contributor

carllin commented Mar 24, 2022

With this new code, why wouldn't the LRU eviction be enough?
Also, wouldn't it be better to instead flag these slots as either DEAD or something else?

clear_unconfirmed_slot() gets called when an old version of a duplicate slot gets dumped in anticipation of a new version of the slot arriving. We don't want to mark it as dead, because we are repairing a new version of this slot, so we need to attempt to replay it again. Thus, if we don't reset the state/counters in the LRU cache, we could mix metrics from two separate versions of the same slot.
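For reference, that cleanup could be as small as dropping the slot's entry when the old version is dumped. A minimal sketch, assuming a hypothetical SlotsStats::remove helper and that Blockstore::slots_stats wraps the LRU cache behind a mutex; the actual field layout and locking may differ:

    impl SlotsStats {
        // Hypothetical helper: forget everything tracked for `slot`, so stats from a
        // new version of a duplicate slot are not mixed with the old version's.
        pub(crate) fn remove(&mut self, slot: &Slot) {
            self.0.pop(slot);
        }
    }

    // Sketch of the call site, at the end of Blockstore::clear_unconfirmed_slot:
    // self.slots_stats.lock().unwrap().remove(&slot);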

Comment on lines +76 to +77
("num_repaired", entry.num_repaired, i64),
("num_recovered", entry.num_recovered, i64),
Contributor

If this slot was evicted from the cache and re-inserted later, it's possible these two metrics could be incorrect, right? In these cases, should we mark the entry so that we report some default value like -1 for these fields, to distinguish them from the other entries?

Contributor Author

Well, at some point we need to evict the -1 values as well.
Turbine is a one-shot attempt, and with a capacity of 300 slots (= 2 minutes) I am expecting we do not receive shreds from turbine after the slot is evicted from the cache here.

If that scenario still happens, then in post-processing of the metrics, if there are multiple rows of data for a slot, we can pick the one with the earliest timestamp and discard the rest.

Contributor

I am expecting we do not receive shreds from turbine after the slot is evicted from the cache here.

So this will happen if we have >300 slots that received a shred but aren't full right? Referencing the discussion here: #23872 (comment)

It is only a problem if the LRU cache capacity is not enough to cover all the slots that we are actively receiving shreds for, i.e. it should not matter how far behind the node is. The node receives shreds from turbine for the slots close to the tip + shreds from repair for the slots it chooses to repair.

I think this is more likely the further you are behind the tip because there are more slots near the tip that you will not choose to repair fully (repair always tries to fully repair slots connected to the root first)

If that scenario still happens, then in post-processing of the metrics, if there are multiple rows of data for a slot, we can pick the one with the earliest timestamp and discard the rest.

Hmm, we don't currently report on eviction, right? Is that this TODO here:

// TODO: submit metrics for (slot, entry).

We can iterate and revise this if the LRU eviction policy and the capacity of 300 are not enough

Currently what would be the best way to tell from the metrics that this is happening? It seems we would report a datapoint whether or not the slots were being evicted from the cache.

What if we add a "num_turbine" here:

ShredSource::Turbine => (),

Then we know that if num_turbine + num_repaired + num_recovered != last_index + 1, the slot must have been evicted.
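A rough sketch of that counter, assuming a hypothetical num_turbine field on SlotStats; the variant and variable names here are illustrative:

    // Hypothetical additional field on SlotStats:
    //     num_turbine: usize,

    // Inside SlotsStats::add_shred, count turbine shreds instead of ignoring them:
    match source {
        ShredSource::Turbine => entry.num_turbine += 1,
        ShredSource::Repaired => entry.num_repaired += 1,
        ShredSource::Recovered => entry.num_recovered += 1,
    }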

Contributor

I was logging whenever an entry is removed from the cache and included bool values for rooted/dead/evicted/pruned so these entries could be filtered by a WHERE clause.

Contributor Author

@carllin @jbiseda

  • I think at this point we need some very basic and simple metrics.
  • They also have to be very lightweight so that we do not slow down shred ingestion just to emit metrics.
  • Once we have those metrics, we can do some post-analysis on the data, check what is missing, and add more metrics if needed.

crate::blockstore_meta::SlotMeta, bitflags::bitflags, lru::LruCache, solana_sdk::clock::Slot,
};

const SLOTS_STATS_CACHE_CAPACITY: usize = 300;
@carllin (Contributor) commented Mar 24, 2022

We could probably support this being a lot larger; the state size is pretty small, so we could extend this to a couple thousand slots, which is about how long we go without making roots before OOMing.

On a side note, one case to consider about this LRU design is when the validator is running catchup. For instance, if we start at snapshot X and the tip is at X + 2000 slots, we might get a lot of partially completed slots at the tip (we actively repair closer to the root, so the tip will be ignored), causing the cache to hit the max case more often than intended.

This is a case where keeping a window from [root, root + SLOTS_STATS_CACHE_CAPACITY] might be easier to manage.

Contributor Author

We could probably support this being a lot larger; the state size is pretty small, so we could extend this to a couple thousand slots, which is about how long we go without making roots before OOMing.

I am expecting that the struct will get bigger if @jbiseda adds hashmaps to it to track erasure set indices.

On a side note, one case to consider about this LRU design is when the validator is running catchup. For instance, if we start at snapshot X and the tip is at X + 2000 slots, we might get a lot of partially completed slots at the tip (we actively repair closer to the root, so the tip will be ignored), causing the cache to hit the max case more often than intended.

That is a fair point, but it is only a problem if the LRU cache capacity is not enough to cover all the slots that we are actively receiving shreds for, i.e. it should not matter how far behind the node is. The node receives shreds from turbine for the slots close to the tip + shreds from repair for the slots it chooses to repair. The question is whether the cache capacity of 300 slots is enough to cover the active set of these slots, or whether we need a larger capacity.

This is a case where keeping a window from [root, root + SLOTS_STATS_CACHE_CAPACITY] might be easier to manage.

We can iterate and revise this if the LRU eviction policy and the capacity of 300 are not enough and we cannot increase the cache capacity. The other alternative is that when the cache is at capacity, we discard shreds which are far beyond the min (or maybe max) slot currently in the cache.

But also, for our purposes, if the node is far behind and repairing heavily, it probably should not fire the OE signal anyway, so these metrics are not relevant anymore.

Contributor

My changes in #23725 are essentially a superset of these changes. I was seeing evictions from the cache with my changes during startup and changed the cache size to 1000. I don't think the amount of memory consumed is too concerning.

Contributor

Moved discussion here: #23872 (comment)

@behzadnouri
Contributor Author

With this new code, why wouldn't the LRU eviction be enough?
Also, wouldn't it be better to instead flag these slots as either DEAD or something else?

clear_unconfirmed_slot() gets called when an old version of a duplicate slot gets dumped in anticipation of a new version of the slot arriving. We don't want to mark it as dead, because we are repairing a new version of this slot, so we need to attempt to replay it again. Thus, if we don't reset the state/counters in the LRU cache, we could mix metrics from two separate versions of the same slot.

I agree that this does indeed need to be tracked and addressed.
But perhaps it could be a follow-up commit, as it is not an issue introduced in this commit.

What are the criteria for which version of a duplicate slot to stick to?
And in which case do we give up and discard all variants of a duplicate slot?

@carllin
Contributor

carllin commented Mar 24, 2022

I agree that this does indeed need to be tracked and addressed.
But perhaps it could be a follow-up commit, as it is not an issue introduced in this commit.

Fine by me, opened an issue: #23915

What are the criteria for which version of a duplicate slot to stick to

This is a bit complex, but essentially, if ReplayStage has called clear_unconfirmed_slot(), it has determined that it has the wrong version of the slot based on votes it has observed for another version in the network. Thus, at this point we should discard all data/metadata about the current version of the slot and open the path for receiving a new version.

In which case do we give up and discard all variants of a duplicate slot

We never "give up"; rather, if another fork not including the duplicate slot is rooted, then we naturally deprecate all activity on the fork with the duplicate slot. However, if the cluster-rooted fork is built on a duplicate slot and we have yet to root that fork, then we will continually attempt to determine/repair the correct version.

@jbiseda (Contributor) left a comment

Per our discussion, I'll rebase the changes from #23725 once this has been merged.

@behzadnouri
Contributor Author

Discussed with @jbiseda how to move this metrics work forward:

  • For now, we take a very simple, lightweight approach just to get the initial set of metrics from mainnet, without slowing down shred ingestion or replay.
  • Once we have the initial set of data from mainnet, we can revise if something is missing or does not look right.
  • I will merge this commit and Jeff will add erasure batch metrics.

The remaining pieces are:

  • SlotsStats::set_dead and SlotsStats::set_rooted methods to flip the respective SlotStats.flags, invoked from replay code. No metrics submission here.
  • An additional field in the SlotStats struct to count the number of shreds received from turbine in each erasure batch for each slot, recorded in SlotsStats::add_shred, which blockstore will invoke (see the sketch after this list):
shred_counts: HashMap<
    Slot,
    HashMap</*fec set index:*/ u32, /*count:*/ usize>
>,
  • Submit metrics in maybe_evict_cache for the slot which is evicted, i.e. in this part of the code:
fn maybe_evict_cache(&mut self) {
    while self.0.len() > SLOTS_STATS_CACHE_CAPACITY {
        let (_slot, _entry) = self.0.pop_lru().unwrap();
        // TODO: submit metrics for (slot, entry).
    }
}
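Referenced from the second bullet above: a minimal, self-contained sketch of that bookkeeping. It keeps the per-erasure-batch counts inside each SlotStats entry, keyed only by FEC set index; whether the map ultimately lives in SlotStats or SlotsStats, and the exact names, are up to the follow-up PR:

    use lru::LruCache;
    use std::collections::HashMap;

    // Illustrative stand-ins for the types discussed above.
    type Slot = u64;

    enum ShredSource {
        Turbine,
        Repaired,
        Recovered,
    }

    #[derive(Default)]
    struct SlotStats {
        num_repaired: usize,
        num_recovered: usize,
        // Shreds received from turbine, keyed by FEC (erasure) set index.
        turbine_fec_set_counts: HashMap<u32, usize>,
    }

    const SLOTS_STATS_CACHE_CAPACITY: usize = 300;

    struct SlotsStats(LruCache<Slot, SlotStats>);

    impl SlotsStats {
        fn add_shred(&mut self, slot: Slot, fec_set_index: u32, source: ShredSource) {
            // Get or insert the per-slot entry.
            if self.0.get(&slot).is_none() {
                self.0.put(slot, SlotStats::default());
            }
            let entry = self.0.get_mut(&slot).unwrap();
            match source {
                ShredSource::Turbine => {
                    *entry.turbine_fec_set_counts.entry(fec_set_index).or_default() += 1
                }
                ShredSource::Repaired => entry.num_repaired += 1,
                ShredSource::Recovered => entry.num_recovered += 1,
            }
            self.maybe_evict_cache();
        }

        fn maybe_evict_cache(&mut self) {
            while self.0.len() > SLOTS_STATS_CACHE_CAPACITY {
                let (_slot, _entry) = self.0.pop_lru().unwrap();
                // Metrics for the evicted (slot, entry) would be submitted here.
            }
        }
    }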

cc @carllin

@behzadnouri behzadnouri merged commit 1f9c89c into solana-labs:master Mar 25, 2022
@behzadnouri behzadnouri deleted the slots-stats branch March 25, 2022 19:32
mergify bot pushed a commit that referenced this pull request Mar 30, 2022

(cherry picked from commit 1f9c89c)
behzadnouri added a commit that referenced this pull request Apr 1, 2022

(cherry picked from commit 1f9c89c)

Co-authored-by: behzad nouri <behzadnouri@gmail.com>