Periodical OOM incidents on Testnet storage nodes #1319

Closed
cthulhu-rider opened this issue Apr 21, 2022 · 5 comments
Labels: bug (Something isn't working) · neofs-storage (Storage node application issues) · U3 Regular

@cthulhu-rider (Contributor)

Some NeoFS Testnet storage nodes (nagisa, ai, yu) are periodically killed by the OS OOM killer. All these nodes have ~2 GB of RAM. We need to identify the cause and try to prevent it.

Possible reasons:

  • incoming RPC spikes (server doesn't limit the incoming connections)
  • outgoing RPC spikes (internal routines with API communication)
  • internal work on some event (e.g. new epoch)
  • ???

Observations also show that memory consumption sometimes spikes almost simultaneously on different nodes, which can hint either at an external load spike on the containers or at global event processing.
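
To test the first hypothesis (the server not limiting incoming connections), one crude way is to cap accepted connections and concurrent gRPC streams and see whether the spikes disappear. This is only a sketch with arbitrary limits, not the node's actual configuration:

```go
// A rough sketch (not the node's actual configuration) of bounding inbound
// gRPC load, useful for testing the "incoming RPC spikes" hypothesis.
// The listener address and both limits are arbitrary example values.
package main

import (
	"log"
	"net"

	"golang.org/x/net/netutil"
	"google.golang.org/grpc"
)

func main() {
	ln, err := net.Listen("tcp", ":8080")
	if err != nil {
		log.Fatal(err)
	}
	// Cap the number of simultaneously accepted TCP connections.
	ln = netutil.LimitListener(ln, 128)

	// Cap the number of concurrent streams per client connection.
	srv := grpc.NewServer(grpc.MaxConcurrentStreams(32))

	if err := srv.Serve(ln); err != nil {
		log.Fatal(err)
	}
}
```

If memory still spikes with such caps in place, the incoming-RPC hypothesis can likely be ruled out.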

cthulhu-rider added the bug (Something isn't working), triage, neofs-storage (Storage node application issues), and U2 (Seriously planned) labels on Apr 21, 2022
@cthulhu-rider (Contributor, Author)

Observations:

  1. Config changes (blobovnicza cache)
  2. RAM spike with no problem logs; all nodes consumed ~600 MB

Possible reproductions:

  1. decrease the pprof interval
  2. put a huge object into a container
  3. observe the profile

Virtual memory consumption is bigger than what the Go runtime reports, so maybe some direct OS allocations outside the Go heap cause the memory growth (see the sketch below).
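
One way to check this is to log the Go runtime's view of memory next to the RSS the kernel reports for the process; a large gap points to allocations the Go heap does not account for. Below is a minimal, stand-alone sketch (not part of neofs-node) of the kind of logging that could be embedded in the node process; it assumes a Linux host because it parses /proc/self/status:

```go
// Logs Go runtime memory stats alongside the kernel-reported RSS of the
// current process, so the share of memory outside the Go heap is visible.
package main

import (
	"bufio"
	"log"
	"os"
	"runtime"
	"strings"
	"time"
)

// rssKB returns the VmRSS line from /proc/self/status (Linux only).
func rssKB() string {
	f, err := os.Open("/proc/self/status")
	if err != nil {
		return "unknown"
	}
	defer f.Close()
	s := bufio.NewScanner(f)
	for s.Scan() {
		if strings.HasPrefix(s.Text(), "VmRSS:") {
			return strings.TrimSpace(strings.TrimPrefix(s.Text(), "VmRSS:"))
		}
	}
	return "unknown"
}

func main() {
	for {
		var m runtime.MemStats
		runtime.ReadMemStats(&m)
		log.Printf("go heap=%d MiB, go sys=%d MiB, os rss=%s",
			m.HeapAlloc>>20, m.Sys>>20, rssKB())
		time.Sleep(10 * time.Second)
	}
}
```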

@fyrchik (Contributor) commented Apr 27, 2022

To create an OOM condition in dev-env:

  1. Restrict the amount of available memory with the docker-compose mem_limit: 1g setting.
  2. Try to upload a 1 GB file to a container.

On my machine it consistently fails in the middle of the second object put.

@fyrchik (Contributor) commented Apr 27, 2022

I have conducted some experiments (1 GB memory limit on each node, as described in the previous post):

  1. Restricting the maximum number of concurrently executed Put requests makes no difference.
  2. If the MaxObjectSize network parameter is set to ~10 MiB (or lower), the problem doesn't appear, at least in the case described above.
  3. If the MaxObjectSize network parameter is increased, the problem arises much faster.
  4. The number of replicas in the placement policy doesn't seem to affect the frequency of OOM situations (tested REP 1 vs REP 3).

@fyrchik (Contributor) commented Apr 28, 2022

When we put an object in the blobstor https://github.com/nspcc-dev/neofs-node/blob/master/pkg/local_object_storage/blobstor/put.go#L38, we do the following steps:

  1. Marshal the object to check whether it should go into the file tree or blobovnicza.
  2. Encode the object into a new buffer if compression is enabled.

So if the object size is close to the maximum (64 MiB), we allocate an additional 64 MiB in step 2 and yet another 64 MiB in step 3 (actually, if the object is non-compressible, we allocate twice, because the initial capacity is equal to the object size).
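
For illustration, here is a minimal sketch (not the actual blobstor code) of that allocation pattern: the object is marshaled into one full-size buffer, and the compressor then encodes it into a second buffer whose initial capacity equals the object size. The zstd codec here is an assumption standing in for whatever the node actually uses:

```go
// Sketch of the "two extra object-sized buffers" allocation pattern.
package main

import (
	"bytes"
	"fmt"
	"log"

	"github.com/klauspost/compress/zstd"
)

func putSketch(payload []byte) ([]byte, error) {
	// Step 1: "marshal" the object to decide where it should be stored.
	// In the sketch this is just a copy, i.e. one object-sized allocation.
	marshaled := make([]byte, len(payload))
	copy(marshaled, payload)

	// Step 2: compress into a brand-new buffer sized like the object —
	// a second object-sized allocation. For incompressible data EncodeAll
	// outgrows this capacity and allocates yet again ("we allocate twice").
	enc, err := zstd.NewWriter(nil) // nil writer is fine for EncodeAll-only use
	if err != nil {
		return nil, err
	}
	defer enc.Close()
	dst := make([]byte, 0, len(marshaled))
	return enc.EncodeAll(marshaled, dst), nil
}

func main() {
	payload := bytes.Repeat([]byte{0xAA}, 8<<20) // 8 MiB stand-in for a large object
	compressed, err := putSketch(payload)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Printf("payload=%d compressed=%d\n", len(payload), len(compressed))
}
```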

This explains the sudden increase in memory consumption right before the OOM.
So I propose the following:

  1. Allow a node to override the MaxObjectSize setting (when cutting a big object, take the minimum of the actual network parameter and a local setting).
  2. Support a streaming compression interface: write to a file directly, or reuse the underlying buffer (see the sketch after this list).
  3. *Support a streaming marshaling interface (possibly requires a change in the on-disk format, or exported functions from the SDK).
  4. After (3), we can check the object size in two steps: if the payload alone is big enough, we can decide that the object is big without marshaling it in memory.
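
Item 2 could look roughly like the sketch below, which streams the payload through the compressor straight into the destination file instead of encoding it into a second in-memory buffer. The zstd codec and the function shape are assumptions, not the actual neofs-node interface:

```go
// Sketch of streaming compression: peak memory stays bounded by the copy
// buffer and the encoder window rather than by MaxObjectSize.
package main

import (
	"io"
	"log"
	"os"

	"github.com/klauspost/compress/zstd"
)

func writeCompressed(dstPath string, src io.Reader) error {
	f, err := os.Create(dstPath)
	if err != nil {
		return err
	}
	defer f.Close()

	enc, err := zstd.NewWriter(f)
	if err != nil {
		return err
	}
	// io.Copy moves data in bounded chunks; no object-sized buffer is allocated.
	if _, err := io.Copy(enc, src); err != nil {
		enc.Close()
		return err
	}
	return enc.Close() // flush the final zstd frame to the file
}

func main() {
	if err := writeCompressed("object.zst", os.Stdin); err != nil {
		log.Fatal(err)
	}
}
```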

cthulhu-rider pushed 4 commits that referenced this issue on Apr 29, 2022 (all signed off by Evgenii Stratonikov <evgeniy@nspcc.ru>).
@carpawell (Member)

Was fixed.

aprasolova pushed 4 commits to aprasolova/neofs-node that referenced this issue on Oct 19, 2022 (all signed off by Evgenii Stratonikov <evgeniy@nspcc.ru>).