Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix UcpRequests leak #11

Open
wants to merge 11 commits into
base: development
Choose a base branch
from

Conversation

kochkozharov
Copy link

@kochkozharov kochkozharov commented Mar 11, 2025

Why

In this PR I fix OOM (Out of Memory) error, which occurs in test described here.

First of all, I reverted 2 previous PRs solving the same because that solution, involving explicit freeing of direct buffers, is error prone. In #8 I freed buffers, that previously was passed to callback, so callback had a broken reference to buffer. I apologize for this bug. In this PR I've found out why direct buffers was not collected by GC and fixed the root of the problem.

The JUCX library's JNI code creates global references for each UcpRequest and removes them when the request is processed, canceled, or (if the request is created by an endpoint) when closeNonBlockingForce() is called. hadroNIO creates many UcpRequests in FillReceiveBuffer() using RecvTaggedNonBlocking (which is not an endpoint method but a worker method) and only some of them are actually processed. This behavior causes GC to fail to collect request objects in the heap, which in turn reference large direct buffers. As a result, if the application is constantly creating new socket channels, an OOM error occurs.

Heap Dump after OOM

image

How

To solve this problem I introduce ConcurrentLinkedQueue that tracks UcpRequests created by worker. In close method of HadronioSocketChannel I implicitly cancel all remaining requests in queue. Also I suggest replacing endpoint.close() (calls deprecated ucp_ep_destroy) with closeNonBlockingForce() to cancel all endpoint requests.

VisualVM graph of direct buffer count. GC periodically collects buffers.

image (1)

Possible refinements

  1. Maybe queue for requests shouldn't be thread safe? Do you provide any guarantees for SocketChannels in concurrent environment?
  2. The problem with UcpRequests (created by RecvTaggedNonBlocking) not being canceled after closing the worker could be solved on the level of UCX. There was already an attempt to introduce function to cancel all requests by tag, but the PR was closed "since the problem of additional incoming message" and it was suggested to use ActiveMessage API. Is it possible to use the ActiveMessage API in hadroNIO instead of the TaggedMessage API or would that be problematic?

@fruhland
Copy link
Contributor

fruhland commented Mar 12, 2025

I programmed hadroNIO with the philosophy in mind to not perform any memory allocations during request processing. The only memory allocations done when sending/receiving/selecting are performed by JUCX. In fact, you can see that JUCX does not scale very well with a lot of connections in my netty microbenchmark, because of this (Paper, E-Mail me if you cannot access it and are interested in the results).

Your new solution will definetly allocate heap memory for managing the queue. Maybe there is a better way, like a fixed size array?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants