
near_vm: reuse contract data/memory between contract executions #10851

Closed
Tracked by #10981
nagisa opened this issue Mar 21, 2024 · 3 comments
Labels
A-contract-runtime Area: contract compilation and execution, virtual machines, etc T-contract-runtime Team: issues relevant to the contract runtime team

Comments

@nagisa
Collaborator

nagisa commented Mar 21, 2024

Currently we reuse the memory maps that we allocate for storing the contracts' compiled machine code. This has been found to work well and to improve performance considerably, because the kernel needs to do quite a bit of accounting each time a memory map is set up.

Fortunately for us, contracts all have a fixed amount of memory available to them, so we could allocate the necessary number of contract memories up front and lock them so that they are not paged out (paging the memories back in incurs overhead).

The only thing we need to be careful about is that, unlike code memories (which are not observable by the contracts), the data memories must be zeroed out after use in order to prevent leaking data between contracts through them.
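The intended scheme can be sketched with a std-only simplification (the real experiment below pools `NearVmMemory` values in a `crossbeam::queue::ArrayQueue`; here a plain byte buffer and a `Mutex`-guarded `Vec` stand in for them, and the names `acquire_memory`/`release_memory` are hypothetical):

```rust
use std::sync::Mutex;

// Stand-ins for the real contract memory: one wasm page for the sketch.
const MEMORY_SIZE: usize = 64 * 1024;
const POOL_CAPACITY: usize = 16;

// A global pool of reusable memories (the patch below uses an ArrayQueue).
static POOL: Mutex<Vec<Vec<u8>>> = Mutex::new(Vec::new());

/// Take a memory from the pool, or allocate a fresh (already zeroed) one.
fn acquire_memory() -> Vec<u8> {
    POOL.lock().unwrap().pop().unwrap_or_else(|| vec![0u8; MEMORY_SIZE])
}

/// Zero the memory before returning it to the pool, so that no data can
/// leak from one contract execution to the next.
fn release_memory(mut mem: Vec<u8>) {
    mem.fill(0);
    let mut pool = POOL.lock().unwrap();
    if pool.len() < POOL_CAPACITY {
        pool.push(mem);
    } // otherwise just drop it and let the allocator reclaim it
}

fn main() {
    let mut mem = acquire_memory();
    mem[0] = 42; // a "contract execution" dirties the memory
    release_memory(mem);
    let reused = acquire_memory();
    // No leakage is observable across executions.
    assert!(reused.iter().all(|&b| b == 0));
    println!("reused memory is zeroed");
}
```

The explicit `mem.fill(0)` on release is exactly the cost that the experiment below ends up measuring: a fresh anonymous mapping gets its zeroing done lazily by the kernel, page by page, while a pooled buffer must be cleared in full by the process.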

This could reduce the total execute_function_call time by roughly 20% on average in the post-1.39 world (provided that the other relevant caches kick in).

@nagisa nagisa added A-contract-runtime Area: contract compilation and execution, virtual machines, etc T-contract-runtime Team: issues relevant to the contract runtime team labels Mar 21, 2024
@akhi3030
Collaborator

akhi3030 commented Apr 2, 2024

CC: @Ekleog-NEAR

@Ekleog-NEAR
Collaborator

Hackish experiment result

TL;DR

Reusing the same memories leads to us spending much more time zeroing out memory. As I originally feared, most of the time spent in mmap was likely the kernel zeroing out only the used parts of the memory; once we stop relying on that, we have to zero out much more ourselves.

So we most likely should not apply this change on its own as initially considered.

However, if we were able to reduce the initial size of the wasm memory from 64 MiB to 64 KiB (which would require a protocol change and might break contracts), then we could hope for some speedup on congested shards. An upper bound would be the 60% measured on shard 2, but that is almost certainly a large overestimate: I could not find an easy way to compare the hackish code against the non-hackish code, due to the protocol version change, and my guess is that my benchmarks actually caused the contracts under test to fail at startup and thus never really execute. Shard 4 exhibited a 15% speedup, which sounds like a more plausible upper bound.

My suggestion would be to drop this line of changes, because either:

  • it will actually make things 3x as slow as they currently are, or
  • it will likely break a significant number of contracts by not giving them enough memory at startup

Non-protocol-change attempt

Code used

On top of commit 4c0aa98, this patch:

diff --git a/Cargo.lock b/Cargo.lock
index 5b2906165..1e0355699 100644
--- a/Cargo.lock
+++ b/Cargo.lock
@@ -4931,11 +4931,13 @@ dependencies = [
  "bolero",
  "borsh 1.0.0",
  "cov-mark",
+ "crossbeam",
  "ed25519-dalek",
  "enum-map",
  "expect-test",
  "finite-wasm",
  "hex",
+ "lazy_static",
  "lru 0.12.3",
  "memoffset 0.8.0",
  "near-crypto",
diff --git a/runtime/near-vm-runner/Cargo.toml b/runtime/near-vm-runner/Cargo.toml
index 1e6be8d23..658296ec2 100644
--- a/runtime/near-vm-runner/Cargo.toml
+++ b/runtime/near-vm-runner/Cargo.toml
@@ -17,9 +17,11 @@ anyhow = { workspace = true, optional = true }
 base64.workspace = true
 bn.workspace = true
 borsh.workspace = true
+crossbeam.workspace = true
 ed25519-dalek.workspace = true
 enum-map.workspace = true
 finite-wasm = { workspace = true, features = ["instrument"], optional = true }
+lazy_static.workspace = true
 lru = "0.12.3"
 memoffset.workspace = true
 num-rational.workspace = true
diff --git a/runtime/near-vm-runner/src/near_vm_runner.rs b/runtime/near-vm-runner/src/near_vm_runner.rs
index 0940bd629..b10ffb331 100644
--- a/runtime/near-vm-runner/src/near_vm_runner.rs
+++ b/runtime/near-vm-runner/src/near_vm_runner.rs
@@ -438,11 +438,15 @@ impl NearVM {
             },
         )?;
 
-        let mut memory = NearVmMemory::new(
+        lazy_static::lazy_static! {
+            static ref MEMORIES: crossbeam::queue::ArrayQueue<NearVmMemory> = crossbeam::queue::ArrayQueue::new(16);
+        }
+        let mut memory = MEMORIES.pop()
+        .unwrap_or_else(|| NearVmMemory::new(
             self.config.limit_config.initial_memory_pages,
             self.config.limit_config.max_memory_pages,
         )
-        .expect("Cannot create memory for a contract call");
+        .expect("Cannot create memory for a contract call"));
         // FIXME: this mostly duplicates the `run_module` method.
         // Note that we don't clone the actual backing memory, just increase the RC.
         let vmmemory = memory.vm();
@@ -459,7 +463,18 @@ impl NearVM {
                 if let Err(e) = result {
                     return Ok(VMOutcome::abort(logic, e));
                 }
-                closure(vmmemory, logic, &artifact)
+                let res = closure(vmmemory, logic, &artifact);
+                let mut mem = memory.0.vmmemory();
+                unsafe {
+                    let mem = mem.as_mut();
+                    mem.base.write_bytes(0, mem.current_length);
+                    // Tried both with and without this line, with similar results.
+                    // My intuition is, this line should actually be present, so the detailed evaluation was performed
+                    // with it on.
+                    mem.current_length = self.config.limit_config.initial_memory_pages as usize * WASM_PAGE_SIZE;
+                }
+                let _ = MEMORIES.push(memory);
+                res
             }
             Err(e) => Ok(VMOutcome::abort(logic, FunctionCallError::CompilationError(e))),
         }

With the current_length reset

Benchmark 1: ./neard-base view-state --readwrite apply-range --shard-id 2 --use-flat-storage benchmarking
  Time (mean ± σ):      5.248 s ±  0.206 s    [User: 9.586 s, System: 3.468 s]
  Range (min … max):    5.108 s …  5.484 s    3 runs
 
Benchmark 2: ./neard-new view-state --readwrite apply-range --shard-id 2 --use-flat-storage benchmarking
  Time (mean ± σ):     14.585 s ±  0.107 s    [User: 26.067 s, System: 5.049 s]
  Range (min … max):   14.485 s … 14.698 s    3 runs
 
Summary
  ./neard-base view-state --readwrite apply-range --shard-id 2 --use-flat-storage benchmarking ran
    2.78 ± 0.11 times faster than ./neard-new view-state --readwrite apply-range --shard-id 2 --use-flat-storage benchmarking

Without the current_length reset

Benchmark 1: ./neard-base view-state --readwrite apply-range --shard-id 2 --use-flat-storage benchmarking
  Time (mean ± σ):      5.090 s ±  0.042 s    [User: 4.790 s, System: 2.703 s]
  Range (min … max):    5.061 s …  5.139 s    3 runs
 
Benchmark 2: ./neard-new view-state --readwrite apply-range --shard-id 2 --use-flat-storage benchmarking
  Time (mean ± σ):     14.134 s ±  0.495 s    [User: 13.849 s, System: 2.698 s]
  Range (min … max):   13.792 s … 14.702 s    3 runs
 
Summary
  ./neard-base view-state --readwrite apply-range --shard-id 2 --use-flat-storage benchmarking ran
    2.78 ± 0.10 times faster than ./neard-new view-state --readwrite apply-range --shard-id 2 --use-flat-storage benchmarking

Profiling results

I tried profiling the changed binary to confirm that the additional time is indeed spent zeroing out memory, rather than in other parts of the codebase due to a bug in my implementation. 66% of the time is spent in a libc function called just below with_compiled_and_loaded, which seems to confirm the TL;DR interpretation above.

Protocol change attempt

Changing initial_memory_pages to 1

Since we currently start our contracts with 64 MiB of memory, I wanted to try starting them with 64 KiB instead. This is a protocol change, so it will be harder to land, but there is a chance that it could significantly improve the performance of this change, so it seemed worth measuring.
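The sizes involved follow directly from the 64 KiB wasm page size; a quick sanity check of the arithmetic (the 1024-page figure is inferred from the 64 MiB quoted above, not taken from the config):

```rust
// Wasm linear memory is sized in 64 KiB pages.
const WASM_PAGE_SIZE: usize = 64 * 1024;

fn main() {
    // Current behavior: contracts start with 64 MiB, i.e. 1024 wasm pages.
    let current_initial_pages: usize = 1024;
    assert_eq!(current_initial_pages * WASM_PAGE_SIZE, 64 * 1024 * 1024);

    // Experiment: start with a single page, i.e. 64 KiB.
    let proposed_initial_pages: usize = 1;
    assert_eq!(proposed_initial_pages * WASM_PAGE_SIZE, 64 * 1024);

    // For a contract that never grows its memory, the amount that has to be
    // zeroed out between executions shrinks by a factor of 1024.
    println!("zeroing shrinks by {}x", current_initial_pages / proposed_initial_pages);
}
```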

Limitations

  • This is a protocol change
  • It could break contracts (we’d need to mirror the traffic at least and verify that the outcomes are relatively the same)
  • It does change behavior in our regular chain operations (unless this is a bug in my hackish test code; I had to patch the state viewer to ignore state changes)
  • The results listed below are likely inaccurate, because of all the above reasons

Code

diff --git a/Cargo.lock b/Cargo.lock
index 5b2906165..1e0355699 100644
--- a/Cargo.lock
+++ b/Cargo.lock
@@ -4931,11 +4931,13 @@ dependencies = [
  "bolero",
  "borsh 1.0.0",
  "cov-mark",
+ "crossbeam",
  "ed25519-dalek",
  "enum-map",
  "expect-test",
  "finite-wasm",
  "hex",
+ "lazy_static",
  "lru 0.12.3",
  "memoffset 0.8.0",
  "near-crypto",
diff --git a/runtime/near-vm-runner/Cargo.toml b/runtime/near-vm-runner/Cargo.toml
index 1e6be8d23..658296ec2 100644
--- a/runtime/near-vm-runner/Cargo.toml
+++ b/runtime/near-vm-runner/Cargo.toml
@@ -17,9 +17,11 @@ anyhow = { workspace = true, optional = true }
 base64.workspace = true
 bn.workspace = true
 borsh.workspace = true
+crossbeam.workspace = true
 ed25519-dalek.workspace = true
 enum-map.workspace = true
 finite-wasm = { workspace = true, features = ["instrument"], optional = true }
+lazy_static.workspace = true
 lru = "0.12.3"
 memoffset.workspace = true
 num-rational.workspace = true
diff --git a/runtime/near-vm-runner/src/near_vm_runner.rs b/runtime/near-vm-runner/src/near_vm_runner.rs
index 0940bd629..488f8cdfa 100644
--- a/runtime/near-vm-runner/src/near_vm_runner.rs
+++ b/runtime/near-vm-runner/src/near_vm_runner.rs
@@ -438,11 +438,15 @@ impl NearVM {
             },
         )?;
 
-        let mut memory = NearVmMemory::new(
-            self.config.limit_config.initial_memory_pages,
+        lazy_static::lazy_static! {
+            static ref MEMORIES: crossbeam::queue::ArrayQueue<NearVmMemory> = crossbeam::queue::ArrayQueue::new(16);
+        }
+        let mut memory = MEMORIES.pop()
+        .unwrap_or_else(|| NearVmMemory::new(
+            1,
             self.config.limit_config.max_memory_pages,
         )
-        .expect("Cannot create memory for a contract call");
+        .expect("Cannot create memory for a contract call"));
         // FIXME: this mostly duplicates the `run_module` method.
         // Note that we don't clone the actual backing memory, just increase the RC.
         let vmmemory = memory.vm();
@@ -459,7 +463,18 @@ impl NearVM {
                 if let Err(e) = result {
                     return Ok(VMOutcome::abort(logic, e));
                 }
-                closure(vmmemory, logic, &artifact)
+                let res = closure(vmmemory, logic, &artifact);
+                let mut mem = memory.0.vmmemory();
+                unsafe {
+                    let mem = mem.as_mut();
+                    mem.base.write_bytes(0, mem.current_length);
+                    // Tried both with and without this line, with similar results.
+                    // My intuition is, this line should actually be present, so the detailed evaluation was performed
+                    // with it on.
+                    mem.current_length = 1 as usize * WASM_PAGE_SIZE;
+                }
+                let _ = MEMORIES.push(memory);
+                res
             }
             Err(e) => Ok(VMOutcome::abort(logic, FunctionCallError::CompilationError(e))),
         }
diff --git a/runtime/near-vm-runner/src/prepare/prepare_v2.rs b/runtime/near-vm-runner/src/prepare/prepare_v2.rs
index 894dea6fd..71f8c9835 100644
--- a/runtime/near-vm-runner/src/prepare/prepare_v2.rs
+++ b/runtime/near-vm-runner/src/prepare/prepare_v2.rs
@@ -241,7 +241,7 @@ impl<'a> PrepareContext<'a> {
 
     fn memory_import(&self) -> wasm_encoder::EntityType {
         wasm_encoder::EntityType::Memory(wasm_encoder::MemoryType {
-            minimum: u64::from(self.config.limit_config.initial_memory_pages),
+            minimum: 1,
             maximum: Some(u64::from(self.config.limit_config.max_memory_pages)),
             memory64: false,
             shared: false,
diff --git a/tools/state-viewer/src/apply_chain_range.rs b/tools/state-viewer/src/apply_chain_range.rs
index e4ee3e5ac..bd0e93e86 100644
--- a/tools/state-viewer/src/apply_chain_range.rs
+++ b/tools/state-viewer/src/apply_chain_range.rs
@@ -295,8 +295,8 @@ fn apply_block_from_range(
             if verbose_output {
                 println!("block_height: {}, block_hash: {}\nchunk_extra: {:#?}\nexisting_chunk_extra: {:#?}\noutcomes: {:#?}", height, block_hash, chunk_extra, existing_chunk_extra, apply_result.outcomes);
             }
-            if !smart_equals(&existing_chunk_extra, &chunk_extra) {
-                panic!("Got a different ChunkExtra:\nblock_height: {}, block_hash: {}\nchunk_extra: {:#?}\nexisting_chunk_extra: {:#?}\nnew outcomes: {:#?}\n\nold outcomes: {:#?}\n", height, block_hash, chunk_extra, existing_chunk_extra, apply_result.outcomes, old_outcomes(store, &apply_result.outcomes));
+            std::hint::black_box(!smart_equals(&existing_chunk_extra, &chunk_extra)); {
+                // panic!("Got a different ChunkExtra:\nblock_height: {}, block_hash: {}\nchunk_extra: {:#?}\nexisting_chunk_extra: {:#?}\nnew outcomes: {:#?}\n\nold outcomes: {:#?}\n", height, block_hash, chunk_extra, existing_chunk_extra, apply_result.outcomes, old_outcomes(store, &apply_result.outcomes));
             }
         }
         None => {

Results

Interestingly, while my first tests completely broke the db (which is expected due to behavior changing, and flat state thus being different), subsequent ones seemed to work correctly.

Overall, the results show no change for the low-congestion shards. Shard 2 gets a ~60% speedup, which is more than should be possible given the changes, so most likely the contracts there simply broke. Shard 4 gets a 15% speedup, which sounds like a more reasonable upper bound, but it is likely that some contract there broke due to the protocol change too.

Shard 0
Benchmark 1: ./neard-base view-state --readwrite apply-range --shard-id 0 --use-flat-storage benchmarking
  Time (mean ± σ):      3.219 s ±  0.043 s    [User: 5.772 s, System: 2.094 s]
  Range (min … max):    3.184 s …  3.267 s    3 runs
 
Benchmark 2: ./neard-1-page view-state --readwrite apply-range --shard-id 0 --use-flat-storage benchmarking
  Time (mean ± σ):      3.358 s ±  0.048 s    [User: 6.102 s, System: 2.066 s]
  Range (min … max):    3.319 s …  3.411 s    3 runs
 
Summary
  ./neard-base view-state --readwrite apply-range --shard-id 0 --use-flat-storage benchmarking ran
    1.04 ± 0.02 times faster than ./neard-1-page view-state --readwrite apply-range --shard-id 0 --use-flat-storage benchmarking
Shard 1
Benchmark 1: ./neard-base view-state --readwrite apply-range --shard-id 1 --use-flat-storage benchmarking
  Time (mean ± σ):      1.884 s ±  0.012 s    [User: 1.663 s, System: 0.751 s]
  Range (min … max):    1.872 s …  1.895 s    3 runs
 
Benchmark 2: ./neard-1-page view-state --readwrite apply-range --shard-id 1 --use-flat-storage benchmarking
  Time (mean ± σ):      1.907 s ±  0.007 s    [User: 1.645 s, System: 0.699 s]
  Range (min … max):    1.903 s …  1.915 s    3 runs
 
  Warning: Statistical outliers were detected. Consider re-running this benchmark on a quiet system without any interferences from other programs.
 
Summary
  ./neard-base view-state --readwrite apply-range --shard-id 1 --use-flat-storage benchmarking ran
    1.01 ± 0.01 times faster than ./neard-1-page view-state --readwrite apply-range --shard-id 1 --use-flat-storage benchmarking
Shard 2
Benchmark 1: ./neard-base view-state --readwrite apply-range --shard-id 2 --use-flat-storage benchmarking
  Time (mean ± σ):      5.398 s ±  0.045 s    [User: 8.499 s, System: 3.128 s]
  Range (min … max):    5.346 s …  5.426 s    3 runs
 
  Warning: Statistical outliers were detected. Consider re-running this benchmark on a quiet system without any interferences from other programs.
 
Benchmark 2: ./neard-1-page view-state --readwrite apply-range --shard-id 2 --use-flat-storage benchmarking
  Time (mean ± σ):      3.404 s ±  0.003 s    [User: 5.306 s, System: 2.654 s]
  Range (min … max):    3.401 s …  3.408 s    3 runs
 
Summary
  ./neard-1-page view-state --readwrite apply-range --shard-id 2 --use-flat-storage benchmarking ran
    1.59 ± 0.01 times faster than ./neard-base view-state --readwrite apply-range --shard-id 2 --use-flat-storage benchmarking
Shard 3

No results: developing the patch completely broke my db and I have not recreated my mocknet node yet. The results from the other shards are likely enough data.

Shard 4
Benchmark 1: ./neard-base view-state --readwrite apply-range --shard-id 4 --use-flat-storage benchmarking
  Time (mean ± σ):      4.365 s ±  0.028 s    [User: 8.410 s, System: 2.878 s]
  Range (min … max):    4.339 s …  4.394 s    3 runs
 
Benchmark 2: ./neard-1-page view-state --readwrite apply-range --shard-id 4 --use-flat-storage benchmarking
  Time (mean ± σ):      3.837 s ±  0.466 s    [User: 8.054 s, System: 2.980 s]
  Range (min … max):    3.553 s …  4.375 s    3 runs
 
  Warning: The first benchmarking run for this command was significantly slower than the rest (4.375 s). This could be caused by (filesystem) caches that were not filled until after the first run. You are already using both the '--warmup' option as well as the '--prepare' option. Consider re-running the benchmark on a quiet system. Maybe it was a random outlier. Alternatively, consider increasing the warmup count.
 
Summary
  ./neard-1-page view-state --readwrite apply-range --shard-id 4 --use-flat-storage benchmarking ran
    1.14 ± 0.14 times faster than ./neard-base view-state --readwrite apply-range --shard-id 4 --use-flat-storage benchmarking

(Note that the filesystem cache was actually warmed up by 2 runs, so… this is likely the same as the statistical outliers listed above)

Shard 5

Like shard 3: I broke my db while developing the patch and have not recreated my mocknet node yet. The results from the other shards are likely enough data.

@Ekleog-NEAR
Collaborator

Closing this as a consequence of the above message. Future work can be tracked in #11033.
