1.3 -> 1.4 regression bug: Memory usage increases over time #1978

We got reports from multiple network operators that memory usage increases significantly over time after an upgrade to CosmWasm 1.4 or 1.5. This is clearly a bug in CosmWasm for which, at the time of writing, there is no fix. However, there are good mitigation strategies, which I'll describe here.

## What's happening

When you run a node with wasmvm 1.4 or 1.5, the memory usage of the process increases over time. The memory usage profile looks like this:

![mem_usage](https://github.com/CosmWasm/cosmwasm/assets/2603011/5a6b2f4c-7981-4705-b022-5b1cb3132e56)
<img width="667" alt="mem_usage2" src="https://github.com/CosmWasm/cosmwasm/assets/2603011/1e849e34-198f-4159-ae04-ba207f07fb92">
<img width="1084" alt="mem_usage3" src="https://github.com/CosmWasm/cosmwasm/assets/2603011/a5b8cfd7-cd4e-463f-bf31-b44fbc5b0120">

You might also experience consequences such as:
1. Node unable to stay in sync with the network because swap is used and operation becomes too slow
2. Node crashing because it cannot allocate memory. This might e.g. lead to crashes in the Go space or aborts in the Rust code like here:
    ```
    SIGABRT: abort
    PC=0x2b998f1 m=9 sigcode=18446744073709551610
    signal arrived during cgo execution
    
    goroutine 10416 [syscall]:
    runtime.cgocall(0x2121300, 0xc00a78ef58)
    	runtime/cgocall.go:157 +0x4b fp=0xc00a78ef30 sp=0xc00a78eef8 pc=0x456f0b
    github.com/CosmWasm/wasmvm/internal/api._C2func_save_wasm(0x7f2abb664810, {0x0, 0xc00a8d0000, 0x69ab6}, 0x0, 0xc005400ca0)
    	_cgo_gotypes.go:662 +0x65 fp=0xc00a78ef58 sp=0xc00a78ef30 pc=0x135e865
    github.com/CosmWasm/wasmvm/internal/api.StoreCode.func1({0x54c9da0?}, {0xd8?, 0xc00a8d0000?, 0x0?}, 0x0?)
    	github.com/CosmWasm/wasmvm@v1.5.0/internal/api/lib.go:65 +0x97 fp=0xc00a78eff0 sp=0xc00a78ef58 pc=0x13618f7
    github.com/CosmWasm/wasmvm/internal/api.StoreCode({0x1?}, {0xc00a8d0000?, 0x0?, 0x14?})
    ```

## Why it is happening

Every time you load a contract from the file system cache, the memory usage increases (this is the bug). If contracts kick each other out of the in-memory cache, this happens often. If the cache is large enough to hold the majority of actively used contracts, this happens very rarely.
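
The effect of the cache size on the number of file system loads can be illustrated with a small simulation. This is a minimal sketch and not the cosmwasm-vm implementation: it assumes a least-recently-used eviction policy and measures the capacity in number of contracts instead of MiB. The function name `count_fs_loads` and the contract IDs are hypothetical.

```sh
# count_fs_loads CAPACITY ID...
# Simulates an in-memory cache with LRU eviction (assumption) that can hold
# CAPACITY contracts (CAPACITY >= 1) and prints how often a contract had to
# be loaded from the file system cache for the given access sequence.
count_fs_loads() {
  capacity=$1
  shift
  cache=""   # space-separated IDs, least recently used first
  size=0
  loads=0
  for id in "$@"; do
    rest=""
    hit=0
    # Rebuild the cache content without the requested ID
    for c in $cache; do
      if [ "$c" = "$id" ]; then hit=1; else rest="$rest $c"; fi
    done
    if [ "$hit" -eq 0 ]; then
      # Miss: this is the load from the file system cache
      loads=$((loads + 1))
      size=$((size + 1))
      if [ "$size" -gt "$capacity" ]; then
        # Cache is full: evict the least recently used entry (the first one)
        rest=${rest# }
        case "$rest" in
          *" "*) rest=" ${rest#* }" ;;
          *) rest="" ;;
        esac
        size=$capacity
      fi
    fi
    # The requested ID becomes the most recently used entry
    cache="$rest $id"
  done
  echo "$loads"
}
```

With three contracts used in turn, `count_fs_loads 2 a b c a b c` reports a load for every single access, while `count_fs_loads 3 a b c a b c` only loads each contract once.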

## Workaround

To mitigate the problem, increase the `wasm.memory_cache_size` setting in app.toml from 100 MiB to a much larger value that depends on the network, e.g. 2000 MiB:

```toml
[wasm]
# other wasm config entries
memory_cache_size = 2000 # MiB
```

This is a per-node configuration and needs to be done on every node.
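
To double check which value a node is configured with, something like the following can help. It is a rough sketch that assumes a simple file layout (the key sits on its own line inside the `[wasm]` table) and is not a full TOML parser. The function name `wasm_cache_size` is hypothetical, and the path in the example assumes the config lives at `~/.myd/config/app.toml`.

```sh
# wasm_cache_size FILE
# Prints the value of memory_cache_size from the [wasm] table of an app.toml.
# Prints nothing if the key is not set in that table.
wasm_cache_size() {
  awk '
    # Every table header switches the "inside [wasm]" flag on or off
    /^[ \t]*\[/ { in_wasm = ($0 ~ /^[ \t]*\[wasm\][ \t]*$/) }
    in_wasm && /^[ \t]*memory_cache_size[ \t]*=/ {
      sub(/#.*/, "")              # drop a trailing comment such as "# MiB"
      split($0, kv, "=")
      gsub(/[ \t]/, "", kv[2])    # trim whitespace around the value
      print kv[2]
      exit
    }
  ' "$1"
}

# Example (path is an assumption):
# wasm_cache_size ~/.myd/config/app.toml
```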

### How large should the cache be?

This depends on the usage patterns of the network and the size of the compiled modules. Being able to store all contracts in memory would be one extreme that might make sense for permissioned CosmWasm chains. Permissionless chains are likely to have contracts that are almost never used.

To get a rough idea of the order of magnitude, you can check the size of the modules using something like this:
* CosmWasm 1.3: `du -hs ~/.myd/wasm/wasm/cache/modules/v6-*`
* CosmWasm 1.4: `du -hs ~/.myd/wasm/wasm/cache/modules/v7-*` 
* CosmWasm 1.5: `du -hs ~/.myd/wasm/wasm/cache/modules/v8-*`
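
To compare this number directly with `memory_cache_size`, the size can be converted to MiB. The following is a small sketch with the hypothetical helper names `kib_to_mib` and `module_cache_mib`. Keep in mind that the size on disk only gives a rough idea and is not necessarily the same as the size in memory.

```sh
# kib_to_mib KIB
# Converts KiB to MiB, rounding up.
kib_to_mib() {
  echo $(( ($1 + 1023) / 1024 ))
}

# module_cache_mib DIR...
# Prints the total disk usage of the given directories in MiB.
module_cache_mib() {
  # du -sk prints one "SIZE_IN_KIB<TAB>PATH" line per argument; sum them up
  kib=$(du -sk "$@" | awk '{ sum += $1 } END { print sum + 0 }')
  kib_to_mib "$kib"
}

# Example for CosmWasm 1.5 (same path as above):
# module_cache_mib ~/.myd/wasm/wasm/cache/modules/v8-*
```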

### Complementary strategies

The above setting is the most important thing. But there is more you can do, such as:
- Increase memory
- Observe memory usage. The symptoms are different for every blockchain and every node.
- Consider memory usage alerting
- Enable swap to avoid immediate hard crashes in case of overusage
- Schedule clean node restarts from time to time
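
As a starting point for observing memory usage and alerting, a check along these lines could be run periodically, e.g. from cron. This is only a sketch: the helper names `rss_mib` and `memory_status`, the process name `myd` and the limit of 16000 MiB are assumptions that need to be adapted to your setup. It also assumes that `ps` reports the resident set size in KiB, which is the case on Linux.

```sh
# rss_mib PID
# Prints the resident set size of the process in MiB (rounded down).
# Returns 1 if the process does not exist.
rss_mib() {
  kib=$(ps -o rss= -p "$1" | tr -d ' ')
  [ -n "$kib" ] || return 1
  echo $(( kib / 1024 ))
}

# memory_status USED_MIB LIMIT_MIB
# Prints a status line and returns 1 if the usage is above the limit.
memory_status() {
  if [ "$1" -gt "$2" ]; then
    echo "ALERT: $1 MiB used, limit is $2 MiB"
    return 1
  fi
  echo "OK: $1 MiB used, limit is $2 MiB"
}

# Example (assumes exactly one process named "myd" is running):
# memory_status "$(rss_mib "$(pgrep -x myd)")" 16000
```

The non-zero exit code makes it easy to hook the check into whatever alerting you already use.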

Overall, bear in mind that I am not a node operator and I don't know the specifics of your blockchain or system, so I cannot make complete and final recommendations.

## The bug

The bug can be reproduced locally in a pure-Rust example using heap profiling, as shown in https://github.com/CosmWasm/cosmwasm/pull/1955. The tool shows that the memory usage increases over time but is almost zero when the process ends cleanly. This means this is not a memory leak but rather an undesired memory usage pattern.

<img width="1336" alt="Screenshot 2023-12-23 at 10 30 03" src="https://github.com/CosmWasm/cosmwasm/assets/2603011/39bdb8c4-1790-4929-b541-2f7cbd113834">

This is where the allocations are made. At the time of maximum memory usage (t-gmax), 96% of them come through `cosmwasm_vm::modules::file_system_cache::FileSystemCache::load`.

At this point it is not clear to me if this is a bug in Wasmer, rkyv or cosmwasm-vm.
