This repository has been archived by the owner on Jan 22, 2025. It is now read-only.

expands lifetime of SlotStats #23872

Merged
behzadnouri merged 1 commit into solana-labs:master on Mar 25, 2022

Conversation

behzadnouri
Contributor

Problem

Currently, slot stats are removed when the slot becomes full, or every 30 seconds if the slot is before the root:
https://github.com/solana-labs/solana/blob/493a8e234/ledger/src/blockstore.rs#L2017-L2027

We need to track if the slot is ultimately rooted or marked dead,
and emit more metrics accordingly.

Summary of Changes

In order to track whether the slot is ultimately marked as dead or rooted and to emit more metrics, this commit expands the lifetime of SlotStats while bounding the total size of the cache using an LRU eviction policy.

Comment on lines +28 to +32
struct SlotFlags: u8 {
const DEAD = 0b00000001;
const FULL = 0b00000010;
const ROOTED = 0b00000100;
}
@behzadnouri (Contributor Author) commented Mar 23, 2022

@jbiseda These flags track whether the slot is ultimately marked as dead or rooted, ...
Currently DEAD and ROOTED are unused.

Comment on lines +35 to +40
#[derive(Default)]
struct SlotStats {
flags: SlotFlags,
num_repaired: usize,
num_recovered: usize,
}
Contributor Author

@jbiseda This struct can be expanded with stats for each erasure batch.

Comment on lines +86 to +88
let (_slot, _entry) = self.0.pop_lru().unwrap();
// TODO: submit metrics for (slot, entry).
}
Contributor Author

@jbiseda Here we can emit metrics with respect to erasure batches in the slot.
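For example, that could be a single datapoint built from the evicted entry. A rough sketch of the TODO, assuming solana_metrics::datapoint_info! and the SlotFlags bits above; the metric name and the exact field set here are placeholders, not the final design:

    fn maybe_evict_cache(&mut self) {
        while self.0.len() > SLOTS_STATS_CACHE_CAPACITY {
            let (slot, entry) = self.0.pop_lru().unwrap();
            // Sketch: report whatever we know about the slot when it leaves the cache.
            datapoint_info!(
                "slot_stats_tracking_complete",
                ("slot", slot, i64),
                ("is_full", entry.flags.contains(SlotFlags::FULL), bool),
                ("is_rooted", entry.flags.contains(SlotFlags::ROOTED), bool),
                ("is_dead", entry.flags.contains(SlotFlags::DEAD), bool),
                ("num_repaired", entry.num_repaired, i64),
                ("num_recovered", entry.num_recovered, i64)
            );
        }
    }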

self.maybe_evict_cache();
}

pub(crate) fn set_full(&mut self, slot_meta: &SlotMeta) {
Contributor Author

@jbiseda We need similar functions for set_dead and set_rooted.

We don't need to emit metrics in those methods; metrics are only submitted later, in maybe_evict_cache.
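A minimal sketch of what those could look like, assuming the get_mut_entry! macro, SlotFlags, and maybe_evict_cache from this file; the exact signatures are up to the follow-up change:

    impl SlotsStats {
        // Sketch: flip the flag now; metrics are only submitted later, on eviction.
        pub(crate) fn set_dead(&mut self, slot: Slot) {
            let entry = get_mut_entry!(self.0, slot);
            entry.flags.insert(SlotFlags::DEAD);
            self.maybe_evict_cache();
        }

        pub(crate) fn set_rooted(&mut self, slot: Slot) {
            let entry = get_mut_entry!(self.0, slot);
            entry.flags.insert(SlotFlags::ROOTED);
            self.maybe_evict_cache();
        }
    }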

@jbiseda
Contributor

jbiseda commented Mar 23, 2022

I've done something similar for changes to #23725 merging the FEC set index counts with the existing SlotStats. I'll have an updated review shortly.

@behzadnouri
Contributor Author

I've done something similar for changes to #23725 merging the FEC set index counts with the existing SlotStats. I'll have an updated review shortly.

This commit only changes the lifetime of SlotStats; it does not add anything related to erasure batches.

Comment on lines +7 to +17
macro_rules! get_mut_entry (
($cache:expr, $key:expr) => (
match $cache.get_mut(&$key) {
Some(entry) => entry,
None => {
$cache.put($key, SlotStats::default());
$cache.get_mut(&$key).unwrap()
}
}
);
);
Contributor

Any benefit of making this a macro vs. implementing it as a method on SlotsStats?

Contributor Author

I actually would prefer to ditch the macro, but I couldn't get it to compile as a function because of lifetime errors.
Feel free to give it a shot and let me know.

Contributor

this worked for me:

    fn get_or_default_with_eviction_check<'a>(
        stats: &'a mut MutexGuard<LruCache<Slot, SlotStats>>,
        slot: Slot,
    ) -> (&'a mut SlotStats, Option<(Slot, SlotStats)>) {
        let mut evicted = None;
        if stats.len() == stats.cap() {
            match stats.peek_lru() {
                Some((s, _)) if *s == slot => (),
                _ => {
                    evicted = Some(stats.pop_lru().expect("cache capacity should not be 0"));
                }
            }
        }
        stats.get_or_insert(slot, SlotStats::default);
        (stats.get_mut(&slot).unwrap(), evicted)
    }


bitflags! {
#[derive(Default)]
struct SlotFlags: u8 {
Contributor

nit: rename to SlotState

Contributor Author

There are already SlotStats and SlotsStats in this file, so I guess SlotState would be a bit confusing and easy to misread here.

Also, we are using "flags" elsewhere in the code, e.g.:
https://github.com/solana-labs/solana/blob/91c272985/sdk/src/packet.rs#L17-L36
https://github.com/solana-labs/solana/blob/91c272985/ledger/src/shred.rs#L215

Contributor

What about SlotStats -> SlotMetrics and then SlotFlags -> SlotState?

@carllin
Contributor

carllin commented Mar 23, 2022

Looks like this was an issue even before this PR, but clear_unconfirmed_slot():

pub fn clear_unconfirmed_slot(&self, slot: Slot) {
should probably remove the entry from the Blockstore::slots_stats as well

@behzadnouri
Contributor Author

Looks like this was an issue even before this PR, but clear_unconfirmed_slot():

pub fn clear_unconfirmed_slot(&self, slot: Slot) {

should probably remove the entry from the Blockstore::slots_stats as well

With this new code, why wouldn't the LRU eviction be enough?
Also, wouldn't it be better to instead flag these slots as either DEAD or something else?

@carllin
Contributor

carllin commented Mar 24, 2022

With this new code, why wouldn't the LRU eviction be enough?
Also, wouldn't it be better to instead flag these slots as either DEAD or something else?

clear_unconfirmed_slot() gets called when an old version of a duplicate slot gets dumped in anticipation of a new version of the slot arriving. We don't want to mark it as dead, because we are repairing a new version of this slot, so we need to attempt to replay it again. Thus, if we don't reset the state/counters in the LRU cache, we could mix metrics from two separate versions of the same slot.
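For reference, that cleanup could be as small as dropping the slot's entry when the old version is dumped. A minimal sketch, assuming a hypothetical SlotsStats::remove helper and that Blockstore::slots_stats wraps the LRU cache behind a mutex; the actual field layout and locking may differ:

    impl SlotsStats {
        // Hypothetical helper: forget everything tracked for `slot`, so stats from a
        // new version of a duplicate slot are not mixed with the old version's.
        pub(crate) fn remove(&mut self, slot: &Slot) {
            self.0.pop(slot);
        }
    }

    // Sketch of the call site, at the end of Blockstore::clear_unconfirmed_slot:
    // self.slots_stats.lock().unwrap().remove(&slot);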

Comment on lines +76 to +77
("num_repaired", entry.num_repaired, i64),
("num_recovered", entry.num_recovered, i64),
Contributor

If this slot was evicted from the cache and re-inserted later, it's possible these two metrics could be incorrect, right? In these cases, should we mark the entry so that we report some default value like -1 for these fields, to distinguish them from the other entries?

Contributor Author

Well, at some point we need to evict the -1 values as well.
Turbine is a one-shot attempt, and with a capacity of 300 slots (= 2 minutes) I am expecting we do not receive shreds from turbine after the slot is evicted from the cache here.

If that scenario still happens, then in post-processing of the metrics, if there are multiple rows of data for a slot, we can pick the one with the earliest timestamp and discard the rest.

Contributor

I am expecting we do not receive shreds from turbine after the slot is evicted from the cache here.

So this will happen if we have >300 slots that received a shred but aren't full right? Referencing the discussion here: #23872 (comment)

It is only a problem if the LRU cache capacity is not enough to cover all the slots that we are actively receiving shreds for, i.e. it should not matter how far behind the node is. The node receives shreds from turbine for the slots close to the tip + shreds from repair for the slots it chooses to repair.

I think this is more likely the further you are behind the tip because there are more slots near the tip that you will not choose to repair fully (repair always tries to fully repair slots connected to the root first)

If that scenario still happens, then in post-processing of the metrics, if there are multiple rows of data for a slot, we can pick the one with the earliest timestamp and discard the rest.

Hmm, we don't currently report on eviction, right? Is that this TODO here:

// TODO: submit metrics for (slot, entry).

We can iterate and revise this if the LRU eviction policy and the capacity of 300 are not enough

Currently what would be the best way to tell from the metrics that this is happening? It seems we would report a datapoint whether or not the slots were being evicted from the cache.

What if we add a "num_turbine" here:

ShredSource::Turbine => (),

Then we know that if num_turbine + num_repaired + num_recovered != last_index + 1, the slot must have been evicted.
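A rough sketch of that counter, assuming a hypothetical num_turbine field on SlotStats; the variant and variable names here are illustrative:

    // Hypothetical additional field on SlotStats:
    //     num_turbine: usize,

    // Inside SlotsStats::add_shred, count turbine shreds instead of ignoring them:
    match source {
        ShredSource::Turbine => entry.num_turbine += 1,
        ShredSource::Repaired => entry.num_repaired += 1,
        ShredSource::Recovered => entry.num_recovered += 1,
    }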

Contributor

I was logging whenever an entry is removed from the cache and included bool values for rooted/dead/evicted/pruned so these entries could be filtered by a WHERE clause.

Contributor Author

@carllin @jbiseda

  • I think at this point we need some very basic and simple metrics.
  • They also have to be very lightweight so that we do not slow down shred ingestion just to emit metrics.
  • Once we have those metrics, we can do some post-analysis on the data, check what is missing, and add more metrics if needed.

crate::blockstore_meta::SlotMeta, bitflags::bitflags, lru::LruCache, solana_sdk::clock::Slot,
};

const SLOTS_STATS_CACHE_CAPACITY: usize = 300;
@carllin (Contributor) commented Mar 24, 2022

We could probably support this being a lot larger; the state size is pretty small, so we could extend this to a couple thousand slots, which is about how long we go without making roots before OOMing.

On a side note, one case to consider about this LRU design is when the validator is running catchup. For instance, if we start at snapshot X and the tip is at X + 2000 slots, we might get a lot of partially completed slots at the tip (we actively repair closer to the root, so the tip will be ignored), causing the cache to hit the max case more often than intended.

This is a case where keeping a window from [root, root + SLOTS_STATS_CACHE_CAPACITY] might be easier to manage.

Contributor Author

We could probably support this being a lot larger; the state size is pretty small, so we could extend this to a couple thousand slots, which is about how long we go without making roots before OOMing.

I am expecting that the struct will get bigger if @jbiseda adds hashmaps to it to track erasure set indices.

On a side note, one case to consider about this LRU design is when the validator is running catchup. For instance, if we start at snapshot X and the tip is at X + 2000 slots, we might get a lot of partially completed slots at the tip (we actively repair closer to the root, so the tip will be ignored), causing the cache to hit the max case more often than intended.

That is a fair point, but it is only a problem if the LRU cache capacity is not enough to cover all the slots that we are actively receiving shreds for, i.e. it should not matter how far behind the node is. The node receives shreds from turbine for the slots close to the tip + shreds from repair for the slots it chooses to repair. The question is whether the cache capacity of 300 slots is enough to cover the active set of these slots, or whether we need a larger capacity.

This is a case where keeping a window from [root, root + SLOTS_STATS_CACHE_CAPACITY] might be easier to manage.

We can iterate and revise this if the LRU eviction policy and the capacity of 300 are not enough and we cannot increase the cache capacity. The other alternative is that when the cache is at capacity, we discard shreds which are far beyond the min (or maybe max) slot currently in the cache.

But also, for our purposes, if the node is far behind and repairing heavily, it probably should not fire the OE signal anyway, so these metrics are not relevant anymore.

Contributor

My changes in #23725 are essentially a superset of these changes. I was seeing evictions from the cache with my changes during startup and changed the cache size to 1000. I don't think the amount of memory consumed is too concerning.

Contributor

Moved discussion here: #23872 (comment)

@behzadnouri
Contributor Author

With this new code, why wouldn't the LRU eviction be enough?
Also, wouldn't it be better to instead flag these slots as either DEAD or something else?

clear_unconfirmed_slot() gets called when an old version of a duplicate slot gets dumped in anticipation of a new version of the slot arriving. We don't want to mark it as dead, because we are repairing a new version of this slot, so we need to attempt to replay it again. Thus, if we don't reset the state/counters in the LRU cache, we could mix metrics from two separate versions of the same slot.

I agree that this does indeed need to be tracked and addressed.
But perhaps it could be a follow-up commit, as it is not an issue introduced in this commit.

What are the criteria for which version of a duplicate slot to stick to?
And in which case do we give up and discard all variants of a duplicate slot?

@carllin
Contributor

carllin commented Mar 24, 2022

I agree that this does indeed need to be tracked and addressed.
But perhaps it could be a follow-up commit, as it is not an issue introduced in this commit.

Fine by me, opened an issue: #23915

What are the criteria for which version of a duplicate slot to stick to

This is a bit complex, but essentially, if ReplayStage has called clear_unconfirmed_slot(), it has determined that it has the wrong version of the slot based on votes it has observed for another version in the network. Thus, at this point we should discard all data/metadata about the current version of the slot and open the path for receiving a new version.

In which case do we give up and discard all variants of a duplicate slot

We never "give up"; rather, if another fork not including the duplicate slot is rooted, then we naturally deprecate all activity on the fork with the duplicate slot. However, if the cluster-rooted fork is built on a duplicate slot and we have yet to root that fork, then we will continually attempt to determine/repair the correct version.

@jbiseda (Contributor) left a comment

Per our discussion, I'll rebase the changes from #23725 once this has been merged.

@behzadnouri
Contributor Author

Discussed with @jbiseda how to move this metrics work forward:

  • For now, we take a very simple, lightweight approach just to get the initial set of metrics from mainnet, without slowing down shred ingestion or replay.
  • Once we have the initial set of data from mainnet, we can revise if something is missing or does not look right.
  • I will merge this commit and Jeff will add erasure batch metrics.

The remaining pieces are:

  • SlotsStats::set_dead and SlotsStats::set_rooted methods to flip the respective SlotStats.flags, invoked from replay code. No metrics submission here.
  • An additional field in the SlotStats struct to count the number of shreds received from turbine in each erasure batch for each slot, recorded in SlotsStats::add_shred, which blockstore will invoke (see the sketch after this list):
shred_counts: HashMap<
    Slot,
    HashMap</*fec set index:*/ u32, /*count:*/ usize>
>,
  • Submit metrics in maybe_evict_cache for the slot which is evicted, i.e. in this part of the code:
fn maybe_evict_cache(&mut self) {
    while self.0.len() > SLOTS_STATS_CACHE_CAPACITY {
        let (_slot, _entry) = self.0.pop_lru().unwrap();
        // TODO: submit metrics for (slot, entry).
    }
}
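Referenced from the second bullet above: a minimal, self-contained sketch of that bookkeeping. It keeps the per-erasure-batch counts inside each SlotStats entry, keyed only by FEC set index; whether the map ultimately lives in SlotStats or SlotsStats, and the exact names, are up to the follow-up PR:

    use lru::LruCache;
    use std::collections::HashMap;

    // Illustrative stand-ins for the types discussed above.
    type Slot = u64;

    enum ShredSource {
        Turbine,
        Repaired,
        Recovered,
    }

    #[derive(Default)]
    struct SlotStats {
        num_repaired: usize,
        num_recovered: usize,
        // Shreds received from turbine, keyed by FEC (erasure) set index.
        turbine_fec_set_counts: HashMap<u32, usize>,
    }

    const SLOTS_STATS_CACHE_CAPACITY: usize = 300;

    struct SlotsStats(LruCache<Slot, SlotStats>);

    impl SlotsStats {
        fn add_shred(&mut self, slot: Slot, fec_set_index: u32, source: ShredSource) {
            // Get or insert the per-slot entry.
            if self.0.get(&slot).is_none() {
                self.0.put(slot, SlotStats::default());
            }
            let entry = self.0.get_mut(&slot).unwrap();
            match source {
                ShredSource::Turbine => {
                    *entry.turbine_fec_set_counts.entry(fec_set_index).or_default() += 1
                }
                ShredSource::Repaired => entry.num_repaired += 1,
                ShredSource::Recovered => entry.num_recovered += 1,
            }
            self.maybe_evict_cache();
        }

        fn maybe_evict_cache(&mut self) {
            while self.0.len() > SLOTS_STATS_CACHE_CAPACITY {
                let (_slot, _entry) = self.0.pop_lru().unwrap();
                // Metrics for the evicted (slot, entry) would be submitted here.
            }
        }
    }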

cc @carllin

@behzadnouri behzadnouri merged commit 1f9c89c into solana-labs:master Mar 25, 2022
@behzadnouri behzadnouri deleted the slots-stats branch March 25, 2022 19:32
mergify bot pushed a commit that referenced this pull request Mar 30, 2022

(cherry picked from commit 1f9c89c)
behzadnouri added a commit that referenced this pull request Apr 1, 2022

(cherry picked from commit 1f9c89c)

Co-authored-by: behzad nouri <behzadnouri@gmail.com>