Fix UcpRequests leak #11

kochkozharov · 2025-03-11T01:54:42Z

Why

In this PR I fix the OOM (Out of Memory) error that occurs in the test described here.

First of all, I reverted the 2 previous PRs that addressed the same problem, because their approach of explicitly freeing direct buffers is error-prone. In #8 I freed buffers that had previously been passed to a callback, which left the callback with a broken reference to the buffer. I apologize for this bug. In this PR I found out why the direct buffers were not being collected by the GC and fixed the root cause.

The JUCX library's JNI code creates a global reference for each UcpRequest and removes it when the request is processed, canceled, or (if the request was created by an endpoint) when closeNonBlockingForce() is called. hadroNIO creates many UcpRequests in FillReceiveBuffer() using RecvTaggedNonBlocking (which is a worker method, not an endpoint method), and only some of them are actually processed. The global references of the unprocessed requests are therefore never removed, so the GC cannot collect the request objects on the heap, which in turn reference large direct buffers. As a result, if the application keeps creating new socket channels, an OOM error occurs.

Heap Dump after OOM

How

To solve this problem I introduce a ConcurrentLinkedQueue that tracks the UcpRequests created by the worker. In the close method of HadronioSocketChannel I implicitly cancel all requests remaining in the queue. I also suggest replacing endpoint.close() (which calls the deprecated ucp_ep_destroy) with closeNonBlockingForce() to cancel all endpoint requests.
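The tracking idea described above can be sketched roughly as follows. This is a minimal illustration and not the code from this PR: `PendingRequest`, `RequestCanceller`, and `PendingRequestTracker` are hypothetical stand-ins, so that the example compiles without JUCX. In the real code the request type would be UcpRequest, and the canceller would presumably delegate to the worker's cancel call.

```java
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;

// Hypothetical stand-in for a JUCX request object (assumption, not the JUCX API).
interface PendingRequest {
    boolean isCompleted();
}

// Hypothetical stand-in for whatever cancels a request on the worker.
interface RequestCanceller {
    void cancel(PendingRequest request);
}

// Tracks receive requests posted on the worker so they can be cancelled on close.
final class PendingRequestTracker {
    private final Queue<PendingRequest> pending = new ConcurrentLinkedQueue<>();

    // Remember a request right after it has been posted.
    void track(PendingRequest request) {
        pending.add(request);
    }

    // Drop requests that have already completed, so the queue does not keep them alive.
    void purgeCompleted() {
        pending.removeIf(PendingRequest::isCompleted);
    }

    // Called from the channel's close path: cancel everything still outstanding.
    // Returns the number of requests that were actually cancelled.
    int cancelAll(RequestCanceller canceller) {
        int cancelled = 0;
        PendingRequest request;
        while ((request = pending.poll()) != null) {
            if (!request.isCompleted()) {
                canceller.cancel(request);
                cancelled++;
            }
        }
        return cancelled;
    }

    int size() {
        return pending.size();
    }
}
```

Once `cancelAll` has run, the tracker no longer holds any request, and cancellation gives the native side the chance to drop its global references, which is what lets the GC reclaim the direct buffers.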

VisualVM graph of direct buffer count. GC periodically collects buffers.

Possible refinements

Maybe the request queue doesn't need to be thread-safe? Do you provide any guarantees for SocketChannels in a concurrent environment?
The problem of UcpRequests (created by RecvTaggedNonBlocking) not being canceled after the worker is closed could be solved at the UCX level. There was already an attempt to introduce a function that cancels all requests by tag, but the PR was closed "since the problem of additional incoming message", and it was suggested to use the ActiveMessage API instead. Would it be possible to use the ActiveMessage API in hadroNIO instead of the TaggedMessage API, or would that be problematic?


fruhland · 2025-03-12T13:10:01Z

I designed hadroNIO with the philosophy of not performing any memory allocations during request processing. The only memory allocations made when sending, receiving, or selecting are performed by JUCX. In fact, my netty microbenchmark shows that JUCX does not scale very well with a large number of connections because of this (Paper; e-mail me if you cannot access it and are interested in the results).

Your new solution will definitely allocate heap memory for managing the queue. Maybe there is a better way, like a fixed-size array?
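One way the fixed-size array idea could look is a preallocated ring buffer that never allocates after construction. The sketch below is only an illustration under assumptions: `FixedRequestRing` is a hypothetical name, the capacity would have to match the maximum number of outstanding receive requests per channel, and the class is deliberately not thread-safe.

```java
// Hypothetical fixed-capacity FIFO; all storage is allocated once in the constructor.
final class FixedRequestRing<T> {
    private final Object[] slots;
    private int head;   // index of the oldest stored element
    private int count;  // number of stored elements

    FixedRequestRing(int capacity) {
        if (capacity <= 0) {
            throw new IllegalArgumentException("capacity must be positive");
        }
        slots = new Object[capacity];
    }

    // Stores a request; returns false instead of growing when the ring is full.
    boolean offer(T request) {
        if (count == slots.length) {
            return false;
        }
        slots[(head + count) % slots.length] = request;
        count++;
        return true;
    }

    // Removes and returns the oldest request, or null if the ring is empty.
    @SuppressWarnings("unchecked")
    T poll() {
        if (count == 0) {
            return null;
        }
        T request = (T) slots[head];
        slots[head] = null; // clear the slot so the request can be garbage collected
        head = (head + 1) % slots.length;
        count--;
        return request;
    }

    int size() {
        return count;
    }
}
```

The trade-off compared with ConcurrentLinkedQueue is that the caller has to handle a full ring and has to provide any synchronization itself.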

kochkozharov added 11 commits January 28, 2025 17:16

Revert "Merge pull request hhu-bsinfo#8 from kochkozharov/bugfix/dire…

94c9a7a

…ct-buffer-issue" This reverts commit 66921c6, reversing changes made to 380e7e3.

Revert "Merge pull request hhu-bsinfo#5 from kochkozharov/development"

3feafdd

This reverts commit 045879b, reversing changes made to e680e49.

Cancel all remaining UcpRequests when closing the socket

a810e14

Use thread-safe storage for requests

84ad27c

Reduce INFO logging

ed916d0

Null check at close

b09b250

Replace deprecated close to closeNonBlocking in JucxEndpoint

d93640d

Code style

b1ed959

Remove already compeleted requests from pending queue

84b85f9

Formatting

cd12f81

Fix: remove closing of the worker in JucxEndpoint

e3871b1
