Periodical OOM incidents on Testnet storage nodes #1319

Closed
cthulhu-rider opened this issue Apr 21, 2022 · 5 comments
Labels: bug (Something isn't working) · neofs-storage (Storage node application issues) · U3 Regular

@cthulhu-rider (Contributor)

Some NeoFS Testnet storage nodes (nagisa, ai, yu) are periodically killed by the OS OOM killer. All these nodes have ~2 GB of RAM. We need to identify the cause and try to prevent it.

Possible reasons:

  • incoming RPC spikes (server doesn't limit the incoming connections)
  • outgoing RPC spikes (internal routines with API communication)
  • internal work on some event (e.g. new epoch)
  • ???

Observations also show that memory consumption sometimes spikes almost simultaneously on different nodes, which can hint either at an external load spike on the containers or at global event processing.
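
To test the first hypothesis (the server not limiting incoming connections), one crude way is to cap accepted connections and concurrent gRPC streams and see whether the spikes disappear. This is only a sketch with arbitrary limits, not the node's actual configuration:

```go
// A rough sketch (not the node's actual configuration) of bounding inbound
// gRPC load, useful for testing the "incoming RPC spikes" hypothesis.
// The listener address and both limits are arbitrary example values.
package main

import (
	"log"
	"net"

	"golang.org/x/net/netutil"
	"google.golang.org/grpc"
)

func main() {
	ln, err := net.Listen("tcp", ":8080")
	if err != nil {
		log.Fatal(err)
	}
	// Cap the number of simultaneously accepted TCP connections.
	ln = netutil.LimitListener(ln, 128)

	// Cap the number of concurrent streams per client connection.
	srv := grpc.NewServer(grpc.MaxConcurrentStreams(32))

	if err := srv.Serve(ln); err != nil {
		log.Fatal(err)
	}
}
```

If memory still spikes with such caps in place, the incoming-RPC hypothesis can likely be ruled out.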

cthulhu-rider added the bug (Something isn't working), triage, neofs-storage (Storage node application issues), and U2 (Seriously planned) labels on Apr 21, 2022
@cthulhu-rider (Contributor, Author)

Observations:

  1. Config changes (blobovnicza cache)
  2. RAM spike with no problem logs; all nodes consumed ~600 MB

Possible reproductions:

  1. decrease the pprof interval
  2. put a huge object into a container
  3. observe the profile

Virtual memory consumption is bigger than what the Go runtime reports, so maybe some direct OS allocations outside the Go heap cause the memory growth (see the sketch below).
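
One way to check this is to log the Go runtime's view of memory next to the RSS the kernel reports for the process; a large gap points to allocations the Go heap does not account for. Below is a minimal, stand-alone sketch (not part of neofs-node) of the kind of logging that could be embedded in the node process; it assumes a Linux host because it parses /proc/self/status:

```go
// Logs Go runtime memory stats alongside the kernel-reported RSS of the
// current process, so the share of memory outside the Go heap is visible.
package main

import (
	"bufio"
	"log"
	"os"
	"runtime"
	"strings"
	"time"
)

// rssKB returns the VmRSS line from /proc/self/status (Linux only).
func rssKB() string {
	f, err := os.Open("/proc/self/status")
	if err != nil {
		return "unknown"
	}
	defer f.Close()
	s := bufio.NewScanner(f)
	for s.Scan() {
		if strings.HasPrefix(s.Text(), "VmRSS:") {
			return strings.TrimSpace(strings.TrimPrefix(s.Text(), "VmRSS:"))
		}
	}
	return "unknown"
}

func main() {
	for {
		var m runtime.MemStats
		runtime.ReadMemStats(&m)
		log.Printf("go heap=%d MiB, go sys=%d MiB, os rss=%s",
			m.HeapAlloc>>20, m.Sys>>20, rssKB())
		time.Sleep(10 * time.Second)
	}
}
```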

@fyrchik (Contributor) commented Apr 27, 2022

To create an OOM condition in dev-env:

  1. Restrict the amount of available memory with the docker-compose mem_limit: 1g setting.
  2. Try to upload a 1 GB file to a container.

On my machine it consistently fails in the middle of the second object put.

@fyrchik (Contributor) commented Apr 27, 2022

I have conducted some experiments (1 GB memory limit on each node, as described in the previous post):

  1. Restricting the maximum number of concurrently executed Put requests makes no difference.
  2. If the MaxObjectSize network parameter is set to ~10 MiB (or lower), the problem doesn't appear, at least in the case described above.
  3. If the MaxObjectSize network parameter is increased, the problem arises much faster.
  4. The number of replicas in the placement policy doesn't seem to affect the frequency of OOM situations (tested REP 1 vs REP 3).

@fyrchik (Contributor) commented Apr 28, 2022

When we put an object in the blobstor https://github.com/nspcc-dev/neofs-node/blob/master/pkg/local_object_storage/blobstor/put.go#L38, we do the following steps:

  1. Marshal the object to check whether it should go into the file tree or blobovnicza.
  2. Encode the object into a new buffer if compression is enabled.

So if the object size is close to the maximum (64 MiB), we allocate an additional 64 MiB in step 2 and yet another 64 MiB in step 3 (actually, if the object is non-compressible, we allocate twice, because the initial capacity is equal to the object size).
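
For illustration, here is a minimal sketch (not the actual blobstor code) of that allocation pattern: the object is marshaled into one full-size buffer, and the compressor then encodes it into a second buffer whose initial capacity equals the object size. The zstd codec here is an assumption standing in for whatever the node actually uses:

```go
// Sketch of the "two extra object-sized buffers" allocation pattern.
package main

import (
	"bytes"
	"fmt"
	"log"

	"github.com/klauspost/compress/zstd"
)

func putSketch(payload []byte) ([]byte, error) {
	// Step 1: "marshal" the object to decide where it should be stored.
	// In the sketch this is just a copy, i.e. one object-sized allocation.
	marshaled := make([]byte, len(payload))
	copy(marshaled, payload)

	// Step 2: compress into a brand-new buffer sized like the object —
	// a second object-sized allocation. For incompressible data EncodeAll
	// outgrows this capacity and allocates yet again ("we allocate twice").
	enc, err := zstd.NewWriter(nil) // nil writer is fine for EncodeAll-only use
	if err != nil {
		return nil, err
	}
	defer enc.Close()
	dst := make([]byte, 0, len(marshaled))
	return enc.EncodeAll(marshaled, dst), nil
}

func main() {
	payload := bytes.Repeat([]byte{0xAA}, 8<<20) // 8 MiB stand-in for a large object
	compressed, err := putSketch(payload)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Printf("payload=%d compressed=%d\n", len(payload), len(compressed))
}
```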

This explains the sudden increase in memory consumption right before the OOM.
So I propose the following:

  1. Allow a node to override the MaxObjectSize setting (when cutting a big object, take the minimum of the actual network parameter and a local setting).
  2. Support a streaming compression interface: write to a file directly, or reuse the underlying buffer (see the sketch after this list).
  3. *Support a streaming marshaling interface (possibly requires a change in the on-disk format, or exported functions from the SDK).
  4. After (3), we can check the object size in two steps: if the payload alone is big enough, we can decide that the object is big without marshaling it in memory.
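
Item 2 could look roughly like the sketch below, which streams the payload through the compressor straight into the destination file instead of encoding it into a second in-memory buffer. The zstd codec and the function shape are assumptions, not the actual neofs-node interface:

```go
// Sketch of streaming compression: peak memory stays bounded by the copy
// buffer and the encoder window rather than by MaxObjectSize.
package main

import (
	"io"
	"log"
	"os"

	"github.com/klauspost/compress/zstd"
)

func writeCompressed(dstPath string, src io.Reader) error {
	f, err := os.Create(dstPath)
	if err != nil {
		return err
	}
	defer f.Close()

	enc, err := zstd.NewWriter(f)
	if err != nil {
		return err
	}
	// io.Copy moves data in bounded chunks; no object-sized buffer is allocated.
	if _, err := io.Copy(enc, src); err != nil {
		enc.Close()
		return err
	}
	return enc.Close() // flush the final zstd frame to the file
}

func main() {
	if err := writeCompressed("object.zst", os.Stdin); err != nil {
		log.Fatal(err)
	}
}
```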

cthulhu-rider pushed 4 commits that referenced this issue on Apr 29, 2022 (all signed off by Evgenii Stratonikov <evgeniy@nspcc.ru>).
@carpawell (Member)

Was fixed.

aprasolova pushed 4 commits to aprasolova/neofs-node that referenced this issue on Oct 19, 2022 (all signed off by Evgenii Stratonikov <evgeniy@nspcc.ru>).