Improve cache eviction policy for LoadedPrograms #34391

pgarg66 · 2023-12-10T16:40:16Z

Problem

The current eviction policy for the LoadedPrograms cache relies on a program's usage_counter to decide which entry to evict. The usage_counter increases monotonically, so an old program that was frequently used in the past but is no longer used has a higher chance of staying in the cache. This can make it harder for new programs to stay in the cache once the cache depth limits are reached.

Summary of Changes

Decay the usage_counter of a program if the program does not get used.
Use 2's random selection to pick a program entry to be evicted from the cache.
Add unit tests for the new code
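
The two changes listed above can be sketched roughly as follows. This is a minimal illustration, not the code from the PR: `Entry`, `XorShift`, and `evict_two_random_choices` are hypothetical names, the entry type is a simplified stand-in for the real cached program, and the decay is modeled as one right shift per slot since the last access (the review below mentions a bitshift, but the exact formula here is an assumption).

```rust
/// Hypothetical, simplified cache entry (a stand-in for the real `LoadedProgram`).
#[derive(Debug, Clone, PartialEq)]
pub struct Entry {
    pub id: u32,
    pub usage_counter: u64,
    pub latest_access_slot: u64,
}

impl Entry {
    /// Halves the usage counter once per slot elapsed since the last access.
    pub fn decayed_usage_counter(&self, now: u64) -> u64 {
        // Clamp the shift so it stays below the 64-bit width of `u64`.
        let decaying_for = now.saturating_sub(self.latest_access_slot).min(63);
        self.usage_counter >> decaying_for
    }
}

/// Tiny xorshift generator so the sketch needs no external crate.
/// The seed must be non-zero.
pub struct XorShift(pub u64);

impl XorShift {
    pub fn next_index(&mut self, len: usize) -> usize {
        self.0 ^= self.0 << 13;
        self.0 ^= self.0 >> 7;
        self.0 ^= self.0 << 17;
        (self.0 % len as u64) as usize
    }
}

/// Draws two random entries and evicts the less used one, repeating until
/// at most `target_len` entries remain.
pub fn evict_two_random_choices(
    entries: &mut Vec<Entry>,
    target_len: usize,
    now: u64,
    rng: &mut XorShift,
) {
    while entries.len() > target_len {
        let a = rng.next_index(entries.len());
        let b = rng.next_index(entries.len());
        let victim = if entries[a].decayed_usage_counter(now)
            <= entries[b].decayed_usage_counter(now)
        {
            a
        } else {
            b
        };
        // `swap_remove` is O(1); the order of the remaining entries is irrelevant.
        entries.swap_remove(victim);
    }
}
```

Because only two entries are compared per draw, no full sort of the cache is needed, and an entry that was popular long ago loses its advantage as its counter decays.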

Fixes #

codecov · 2023-12-10T17:59:07Z

Codecov Report

Merging #34391 (1d30f3b) into master (22bfcd9) will increase coverage by 0.0%.
Report is 8 commits behind head on master.
The diff coverage is 98.3%.

Additional details and impacted files

@@           Coverage Diff            @@
##           master   #34391    +/-   ##
========================================
  Coverage    81.8%    81.8%            
========================================
  Files         820      820            
  Lines      220865   221037   +172     
========================================
+ Hits       180736   180903   +167     
- Misses      40129    40134     +5

pgarg66 · 2023-12-10T18:18:23Z

Also, I am running a node with this change against MB. There is no perceptible difference in eviction rate compared to the rest of the cluster.

t-nelson · 2023-12-11T17:28:31Z

program-runtime/src/loaded_programs.rs

+    /// Evicts programs using 2's random selection, choosing the least used program out of the two entries.
+    /// The eviction is performed enough number of times to reduce the cache usage to the given percentage.
+    pub fn evict_using_2s_random_selection(&mut self, shrink_to: PercentageInteger, now: Slot) {
+        let mut candidates = self.get_flattened_entries();


could switch self.entries from HashMap to IndexMap. then the draw is just two random samplings from 0..self.entries.len(). no allocation necessary

Changing to IndexMap will impact other parts of the code in the cache, not just eviction logic. If we agree to do it, maybe best to do it in a separate PR (in case we want to backport this). Thoughts?

yeah would definitely want to do it in a preliminary pr. their APIs are pretty compatible and perf is par iirc.

then again, just calling .iter().nth() directly on the HashMap might be sufficient 🤔

mainly trying to kill the allocation

Each entry in entries is a Vec<> of cached programs (since a given program ID can potentially have different versions of the program loaded simultaneously in cache corresponding to different slot/fork). So we need to flatten the two dimensional list to actually compute the total cache usage, and count how much to evict. I don't think we can avoid allocation even if we were to use IndexMap here.
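
To illustrate the flattening described above, here is a rough sketch under simplifying assumptions: `ProgramId` stands in for `Pubkey`, `Program` is a hypothetical reduced entry with an `evictable` flag, and `flatten` is an illustrative name rather than the PR's `get_flattened_entries`.

```rust
use std::collections::HashMap;
use std::sync::Arc;

/// Stand-in for `Pubkey` (assumption for this sketch).
pub type ProgramId = u32;

/// Hypothetical reduced cache entry.
#[derive(Debug)]
pub struct Program {
    pub deployment_slot: u64,
    /// Whether this entry counts towards cache usage and may be evicted.
    pub evictable: bool,
}

/// Flattens the two-level map (program id -> versions) into one list of
/// evictable entries. The returned `Vec` is the allocation under discussion.
pub fn flatten(
    entries: &HashMap<ProgramId, Vec<Arc<Program>>>,
) -> Vec<(ProgramId, Arc<Program>)> {
    entries
        .iter()
        .flat_map(|(id, versions)| {
            versions
                .iter()
                .filter(|program| program.evictable)
                .map(move |program| (*id, Arc::clone(program)))
        })
        .collect()
}
```

The length of the flattened list gives the total number of evictable entries, from which the number of entries to evict can be derived.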

(I think GitHub is having issues with PRs: my comments keep appearing and disappearing. Reposting, since this one disappeared again.)

I think this sounds like we should use a BTreeMap with keys (pubkey, slot)? That way we can query key ranges and still have a total length. It's a tiny map, so perf-wise it's almost certainly better than juggling a two-level hash table + vec.

We do a reverse lookup for entries of a program during extract(). This helps find the latest version of the program applicable to the given slot. That was one of the main reasons for, and advantages of, using a two-level table.

Maybe this is not a big perf hit if we were to use BTreeMap. But, I think we should analyze it more carefully.

BTreeMap::range supports reverse iteration, I don't think it's going to be any slower than what we're currently doing?
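
A small sketch of the suggestion, assuming a map keyed by `(program id, slot)`: `ProgramId`, the `String` payload, and `latest_at_or_before` are hypothetical, and the sketch ignores fork relationships, which the real lookup would also have to consider.

```rust
use std::collections::BTreeMap;

/// Stand-ins for `Pubkey` and `Slot` (assumptions for this sketch).
pub type ProgramId = u32;
pub type Slot = u64;

/// Returns the newest entry of `id` deployed at or before `slot`,
/// walking the key range from the back.
pub fn latest_at_or_before(
    map: &BTreeMap<(ProgramId, Slot), String>,
    id: ProgramId,
    slot: Slot,
) -> Option<(Slot, &String)> {
    map.range((id, 0)..=(id, slot))
        .next_back()
        .map(|(&(_, entry_slot), value)| (entry_slot, value))
}
```

With a single-level map, `map.len()` directly gives the total number of cached entries, so no flattening pass is needed to count them.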

We can/should certainly explore it. I believe it'll intersect with the other PR that does cooperative loading of programs.

Filed this issue to track it: #34442

alessandrod · 2023-12-13T08:55:09Z

runtime/src/bank.rs

@@ -5259,7 +5260,10 @@ impl Bank {
        self.loaded_programs_cache
            .write()
            .unwrap()
-            .sort_and_unload(Percentage::from(SHRINK_LOADED_PROGRAMS_TO_PERCENTAGE));


sort_and_unload is now dead code and can probably be removed right?

Left it in as an escape hatch. We can remove it once the new eviction policy is live and working on MB.

alessandrod

lgtm

The flattening isn't great but I agree that it would make this diff considerably larger and it can be tried in a separate PR

pgarg66 · 2023-12-13T20:37:10Z

lgtm

The flattening isn't great but I agree that it would make this diff considerably larger and it can be tried in a separate PR

Thanks @alessandrod
Leaving it open till tomorrow to give others a chance to take another look.

t-nelson

we intend to let this ride 1.18 stabilization, right? no 1.17 bp

pgarg66 · 2023-12-15T13:02:56Z

we intend to let this ride 1.18 stabilization, right? no 1.17 bp

Yes, we are not backporting it to any of the branches. I am waiting to merge it to let other PRs go in, as it might create merge conflicts (or make it harder to backport the other PRs).

Lichtso · 2023-12-18T18:30:00Z

program-runtime/src/loaded_programs.rs

@@ -935,6 +956,26 @@ impl<FG: ForkGraph> LoadedPrograms<FG> {
        })
    }

+    /// Returns the list of loaded programs which are verified and compiled.
+    fn get_flattened_entries(&self) -> Vec<(Pubkey, Arc<LoadedProgram>)> {


This could take two filter parameters include_program_runtime_v1 and include_program_runtime_v2. Then get_entries_sorted_by_tx_usage(), which is only used by the recompilation phase could be inlined there.

Lichtso · 2023-12-18T18:31:19Z

program-runtime/src/loaded_programs.rs

+        );
+    }
+
+    fn decayed_usage_counter(&self, now: Slot) -> u64 {


This should be used for the sorting at the beginning of the recompilation phase as well.

Lichtso · 2023-12-18T18:32:52Z

program-runtime/src/loaded_programs.rs

+        let _ = self.latest_access_slot.fetch_update(
+            Ordering::Relaxed,
+            Ordering::Relaxed,
+            |last_access| (last_access < slot).then_some(slot),


Looks like a manual implementation of fetch_max.
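
For comparison, a short sketch of the two forms on a bare `AtomicU64`. The function names are illustrative; only the standard library calls `fetch_update` and `fetch_max` are real.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// The pattern from the diff: store `slot` only if it is newer.
pub fn update_with_fetch_update(latest_access_slot: &AtomicU64, slot: u64) {
    let _ = latest_access_slot.fetch_update(
        Ordering::Relaxed,
        Ordering::Relaxed,
        |last_access| (last_access < slot).then_some(slot),
    );
}

/// The same effect expressed with `fetch_max`.
pub fn update_with_fetch_max(latest_access_slot: &AtomicU64, slot: u64) {
    latest_access_slot.fetch_max(slot, Ordering::Relaxed);
}
```

Both leave the atomic holding the maximum of the old value and `slot`; `fetch_max` states that intent directly.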

Lichtso · 2023-12-18T18:33:53Z

program-runtime/src/loaded_programs.rs

@@ -862,7 +883,7 @@ impl<FG: ForkGraph> LoadedPrograms<FG> {
                            if let LoadedProgramType::Unloaded(_environment) = &entry.program {
                                break;
                            }
-
+                            entry.update_access_slot(loaded_programs_for_tx_batch.slot);


Why only update it in this case here and not below where we also increment the usage counter?

In other cases the entry is a tombstone. The slot value will likely be useless there, as we don't evict tombstones. We can set the slot value there if you think in future it could be useful.

Yes, I would prefer to have both updates (to the usage counter and the access slot) in the same place. Also, I think it can be inlined as update_access_slot() only has one call site.

The test test_usage_counter_decay uses it too. But it could be inlined there as well.

Since we are sticking with Atomics, it'll be good to have a wrapper for both the test and the code.

Lichtso · 2024-02-02T16:16:46Z

program-runtime/src/loaded_programs.rs

+
+    pub fn decayed_usage_counter(&self, now: Slot) -> u64 {
+        let last_access = self.latest_access_slot.load(Ordering::Relaxed);
+        let decaying_for = now.saturating_sub(last_access);


decaying_for must be limited by .min(64) otherwise the line below might overflow on the bitshift.
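
A sketch of the overflow concern, written as free functions with hypothetical names rather than the method from the diff. In Rust, shifting a `u64` by 64 or more is itself an overflow (it panics in debug builds), so this sketch clamps the shift to 63; that bound is my choice for the illustration, not something stated in the thread.

```rust
/// Clamped variant: the shift amount never reaches the bit width of `u64`.
pub fn decayed_usage_counter(usage_counter: u64, last_access: u64, now: u64) -> u64 {
    let decaying_for = now.saturating_sub(last_access).min(63);
    usage_counter >> decaying_for
}

/// Alternative: treat any gap of 64 slots or more as fully decayed.
pub fn decayed_usage_counter_zeroing(usage_counter: u64, last_access: u64, now: u64) -> u64 {
    let decaying_for = now.saturating_sub(last_access);
    if decaying_for >= 64 {
        0
    } else {
        usage_counter >> decaying_for
    }
}
```

The two variants differ only for very large counters: after a long gap the clamped version can still return 1 for a counter with its top bit set, while the zeroing version always returns 0.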

pgarg66 marked this pull request as ready for review December 10, 2023 18:13

pgarg66 requested review from alessandrod, Lichtso and t-nelson December 10, 2023 18:16

t-nelson reviewed Dec 11, 2023

pgarg66 force-pushed the cache-eviction branch from 4556a50 to c8bb69b Compare December 11, 2023 18:31

pgarg66 requested review from t-nelson and behzadnouri December 11, 2023 19:43

alessandrod reviewed Dec 12, 2023

pgarg66 force-pushed the cache-eviction branch from 5fc1e44 to 9aede9f Compare December 12, 2023 16:05

pgarg66 requested a review from alessandrod December 12, 2023 16:13

alessandrod reviewed Dec 13, 2023

alessandrod reviewed Dec 13, 2023

pgarg66 requested a review from alessandrod December 13, 2023 11:33

pgarg66 mentioned this pull request Dec 13, 2023

Explore using BTreeMap instead of two level HashMap/Vec in LoadedPrograms #34442

Closed

alessandrod previously approved these changes Dec 13, 2023

t-nelson previously approved these changes Dec 15, 2023

pgarg66 dismissed stale reviews from t-nelson and alessandrod via 21807c0 December 18, 2023 16:51

pgarg66 force-pushed the cache-eviction branch from 2259cbd to 21807c0 Compare December 18, 2023 16:51

pgarg66 added 6 commits December 18, 2023 09:01

Use 2's random selection to evict program cache

e21856c

implement decaying of usage counter

03f0179

fix test compilation

a865b42

replace RwLock with AtomicU64

09802df

address review comments

0678afc

address more review comments

9961580

remove -> swap_remove

b790e74

pgarg66 force-pushed the cache-eviction branch from 21807c0 to b790e74 Compare December 18, 2023 17:01

Lichtso reviewed Dec 18, 2023

more review comments

1d30f3b

pgarg66 requested a review from Lichtso December 18, 2023 21:16

Lichtso reviewed Dec 18, 2023

Lichtso approved these changes Dec 18, 2023

pgarg66 merged commit 6f0133b into solana-labs:master Dec 18, 2023
19 checks passed

pgarg66 deleted the cache-eviction branch December 18, 2023 22:51

ryoqun added a commit to ryoqun/solana that referenced this pull request Dec 21, 2023

Revert "Improve cache eviction policy for LoadedPrograms (solana-labs…

4488299

…#34391)" This reverts commit 6f0133b.

ryoqun added a commit to ryoqun/solana that referenced this pull request Dec 21, 2023

Revert "Revert "Improve cache eviction policy for LoadedPrograms (sol…

ac11b8c

…ana-labs#34391)"" This reverts commit 4488299.

Lichtso reviewed Feb 2, 2024

willhickey mentioned this pull request Mar 28, 2024

v1.18 commits - please ignore anza-xyz/agave#475

Closed

pgarg66 commented Dec 10, 2023

codecov bot commented Dec 10, 2023

pgarg66 commented Dec 10, 2023

t-nelson Dec 11, 2023

pgarg66 Dec 11, 2023

t-nelson Dec 12, 2023

pgarg66 Dec 12, 2023

alessandrod Dec 12, 2023

pgarg66 Dec 12, 2023

alessandrod Dec 13, 2023

pgarg66 Dec 13, 2023

alessandrod Dec 13, 2023

pgarg66 Dec 13, 2023

alessandrod left a comment

pgarg66 commented Dec 13, 2023

t-nelson left a comment

pgarg66 commented Dec 15, 2023

Lichtso Dec 18, 2023

Lichtso Dec 18, 2023

Lichtso Dec 18, 2023

Lichtso Dec 18, 2023

pgarg66 Dec 18, 2023

Lichtso Dec 18, 2023

pgarg66 Dec 18, 2023

pgarg66 Dec 18, 2023

Lichtso Feb 2, 2024
