librdmacm: extend rsocket for Redis, iperf3, memcached and more Linux APIs #1702
BatshevaBlack wants to merge 14 commits into linux-rdma:master from
Conversation
force-pushed from 227fd60 to 959cb7e
This commit introduces epoll_create functionality to support a centralized thread for managing all epoll instances. The epoll_create call creates an epoll_inst struct and two epoll file descriptors: a "regular epfd" for handling real file descriptors and another epfd that includes the "regular epfd" added using epoll_ctl. The latter epfd is returned from the epoll_create function. Additionally, the new epoll instance is registered with a global thread that processes all instances in a round-robin fashion, efficiently handling events for both regular and rsocket file descriptors. The global thread manages polling in two steps for each epoll instance. First, it iterates through the list of rsocket fds in the epoll struct, polling each one to check for events. Second, it calls epoll_wait on the "regular epfd" to gather events from the real file descriptors. The thread keeps the events in the struct, and proceeds to the next epoll instance. Signed-off-by: Batsheva Black <bblack@nvidia.com>
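To make the two-epfd arrangement described above more concrete, here is a minimal sketch, assuming an `epoll_inst` layout like the one below; the struct, field names, and `sketch_epoll_create` are illustrative assumptions, not the PR's actual code.

```c
#include <sys/epoll.h>
#include <unistd.h>

struct epoll_inst {
	int user_epfd;                 /* epfd returned to the caller */
	int real_epfd;                 /* "regular epfd" for real (non-rsocket) fds */
	struct epoll_event events[64]; /* ready list filled by the global thread */
	int nevents;
	/* a list of registered rsocket fds would also live here */
};

static int sketch_epoll_create(struct epoll_inst *inst)
{
	struct epoll_event ev;

	inst->real_epfd = epoll_create1(0);
	if (inst->real_epfd < 0)
		return -1;

	inst->user_epfd = epoll_create1(0);
	if (inst->user_epfd < 0) {
		close(inst->real_epfd);
		return -1;
	}

	/* Nest the "regular epfd" inside the epfd handed back to the user. */
	ev.events = EPOLLIN;
	ev.data.fd = inst->real_epfd;
	if (epoll_ctl(inst->user_epfd, EPOLL_CTL_ADD, inst->real_epfd, &ev) < 0) {
		close(inst->real_epfd);
		close(inst->user_epfd);
		return -1;
	}

	inst->nevents = 0;
	/* the instance would then be registered with the global polling thread */
	return inst->user_epfd;
}
```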
This commit implements epoll_ctl with tailored handling for real and rsocket file descriptors. For regular file descriptors, epoll_ctl directly operates on the "regular epfd". Rsocket file descriptors are added to a dedicated list maintained in the epoll instance struct. This list ensures that the global thread can handle these file descriptors during its polling cycle. epoll_ctl triggers the thread to reprocess the epoll instance to update the ready list, reflecting any events on the newly added file descriptors. Signed-off-by: Batsheva Black <bblack@nvidia.com>
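A rough sketch of the dispatch described above, reusing the hypothetical `epoll_inst` layout from the previous sketch; `is_rsocket_fd()` and `sketch_update_rsocket_list()` are placeholders for however the preload identifies rsocket fds and maintains the list.

```c
#include <sys/epoll.h>

/* Hypothetical helpers; not part of the actual PR or librdmacm API. */
int is_rsocket_fd(int fd);
int sketch_update_rsocket_list(struct epoll_inst *inst, int op, int fd,
			       struct epoll_event *event);

static int sketch_epoll_ctl(struct epoll_inst *inst, int op, int fd,
			    struct epoll_event *event)
{
	if (is_rsocket_fd(fd))
		/* rsocket fds go on the instance's list so the global thread
		 * polls them; the thread is then signalled to reprocess. */
		return sketch_update_rsocket_list(inst, op, fd, event);

	/* real fds are handled by the kernel via the "regular epfd" */
	return epoll_ctl(inst->real_epfd, op, fd, event);
}
```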
force-pushed from bee544e to 7e7f2b6
This commit implements epoll_wait to retrieve events processed by the centralized thread for an epoll instance. When epoll_wait is called, it copies the events collected by the global thread from the ready list in the epoll instance to the user-provided events buffer. If no events are available in the `revents` fields, the function triggers the thread to recheck for events. epoll_wait returns the total number of ready events. Signed-off-by: Batsheva Black <bblack@nvidia.com>
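A minimal sketch of that flow, again assuming the hypothetical `epoll_inst` layout from above; locking between the application thread and the global thread, and how the thread is woken, are omitted.

```c
#include <string.h>
#include <sys/epoll.h>

/* Hypothetical: however the global thread is asked to re-poll an instance. */
void sketch_wake_epoll_thread(struct epoll_inst *inst);

static int sketch_epoll_wait(struct epoll_inst *inst,
			     struct epoll_event *events, int maxevents)
{
	int n = inst->nevents < maxevents ? inst->nevents : maxevents;

	if (n == 0) {
		/* nothing collected yet: ask the thread to recheck */
		sketch_wake_epoll_thread(inst);
		return 0;
	}

	/* hand the thread-collected events back to the caller */
	memcpy(events, inst->events, n * sizeof(*events));
	inst->nevents -= n;
	return n;
}
```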
In case of a timeout that causes poll to return, clear any signals that arrived by calling rs_poll_exit. Signed-off-by: Batsheva Black <bblack@nvidia.com>
force-pushed from 0db3217 to fbb8d04
force-pushed from 1b6eb75 to f774322
force-pushed from e3a3c6d to 840c6d1
poll support is non-trivial. Is there a reason why epoll support was implemented over rpoll?
shefty left a comment
See comments. I didn't review the epoll code in detail, as I would have expected the implementation to leverage the rpoll() path.
Without the user_fds mapping, select() could set bits for internal fds instead of the user fds the application passed in, so the wrong (or no) sockets were reported as ready. Keep a list of the fds that are passed to poll so that, when returning revents to the caller's fds list, each rfd can be matched back to the fd it belongs to. Signed-off-by: Batsheva Black <bblack@nvidia.com>
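To illustrate why the mapping matters in a select()-over-poll path, here is a simplified sketch: `pollfds[i]` may refer to an internal (rsocket) fd, while `user_fds[i]` remembers the fd the application actually put in its fd_sets, so results land on the right bit. Function and parameter names are illustrative, not the PR's code.

```c
#include <poll.h>
#include <sys/select.h>

static void sketch_fold_poll_into_fdsets(const struct pollfd *pollfds,
					 const int *user_fds, int nfds,
					 fd_set *readfds, fd_set *writefds)
{
	int i, ufd;

	for (i = 0; i < nfds; i++) {
		ufd = user_fds[i];	/* the fd the application passed in */

		/* select() semantics: clear the bits that are not ready */
		if (readfds && !(pollfds[i].revents & POLLIN))
			FD_CLR(ufd, readfds);
		if (writefds && !(pollfds[i].revents & POLLOUT))
			FD_CLR(ufd, writefds);
	}
}
```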
The accept4 implementation extends accept to support the additional atomic flag-setting functionality provided by accept4. Signed-off-by: Batsheva Black <bblack@nvidia.com>
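A hedged sketch of the idea, using the public raccept()/rfcntl() rsocket calls; the PR's actual code path for applying the flags may differ.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <sys/socket.h>
#include <rdma/rsocket.h>

static int sketch_raccept4(int sockfd, struct sockaddr *addr,
			   socklen_t *addrlen, int flags)
{
	int fd, fl;

	fd = raccept(sockfd, addr, addrlen);
	if (fd < 0)
		return fd;

	/* apply the accept4-style flags to the newly accepted rsocket */
	if (flags & SOCK_NONBLOCK) {
		fl = rfcntl(fd, F_GETFL, 0);
		rfcntl(fd, F_SETFL, fl | O_NONBLOCK);
	}
	/* SOCK_CLOEXEC handling on the underlying fds is elided here */
	return fd;
}
```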
Add preload interception for fcntl64 so rsocket file descriptors support the same flag semantics as the glibc fcntl64 API. Signed-off-by: Batsheva Black <bblack@nvidia.com>
Add support for additional socket options. getsockopt: TCP_INFO, TCP_CONGESTION, SO_BROADCAST and IP_TOS. setsockopt: IP_TOS and TCP_CONGESTION. Signed-off-by: Batsheva Black <bblack@nvidia.com>
rfcntl previously kept all of the file flags together in fd_flags. Adding the new fs_flags field to the rs struct allows the fcntl function to keep the file status flags separate from the file descriptor flags. Signed-off-by: Batsheva Black <bblack@nvidia.com>
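A small sketch of the split, using an illustrative stand-in for the relevant part of the rs struct rather than the PR's actual fields: descriptor flags (F_GETFD/F_SETFD, e.g. FD_CLOEXEC) stay in fd_flags, while status flags (F_GETFL/F_SETFL, e.g. O_NONBLOCK) live in fs_flags.

```c
#include <fcntl.h>

struct rs_flags_sketch {
	int fd_flags;	/* file descriptor flags: F_GETFD / F_SETFD */
	int fs_flags;	/* file status flags:     F_GETFL / F_SETFL */
};

static long sketch_rfcntl(struct rs_flags_sketch *rs, int cmd, long arg)
{
	switch (cmd) {
	case F_GETFD:
		return rs->fd_flags;
	case F_SETFD:
		rs->fd_flags = (int) arg;
		return 0;
	case F_GETFL:
		return rs->fs_flags;
	case F_SETFL:
		rs->fs_flags = (int) arg;
		return 0;
	default:
		return -1;
	}
}
```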
Add preload interception for sendfile64 so applications using the 64-bit offset sendfile64 API work correctly with rsocket file descriptors. Signed-off-by: Batsheva Black <bblack@nvidia.com>
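Since an rsocket fd cannot be handed to the kernel's sendfile path, a sendfile-style hook has to move the data itself. The following is an illustrative fallback only (read from the source fd, push with rsend(); partial sends are not retried), not necessarily how the PR's sendfile64 hook works.

```c
#include <sys/types.h>
#include <unistd.h>
#include <rdma/rsocket.h>

static ssize_t sketch_rsendfile(int out_fd, int in_fd, off_t *offset,
				size_t count)
{
	char buf[8192];
	size_t total = 0;
	ssize_t nread, nsent;

	while (total < count) {
		size_t chunk = count - total;

		if (chunk > sizeof(buf))
			chunk = sizeof(buf);

		nread = offset ? pread(in_fd, buf, chunk, *offset)
			       : read(in_fd, buf, chunk);
		if (nread == 0)
			break;				/* EOF on the source */
		if (nread < 0)
			return total ? (ssize_t) total : -1;

		nsent = rsend(out_fd, buf, (size_t) nread, 0);
		if (nsent < 0)
			return total ? (ssize_t) total : -1;

		total += (size_t) nsent;
		if (offset)
			*offset += nsent;
	}
	return (ssize_t) total;
}
```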
Add preload interception for dup so that duplicating an rsocket file descriptor produces another rsocket fd that refers to the same connection. Signed-off-by: Batsheva Black <bblack@nvidia.com>
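One way to picture "another rsocket fd that refers to the same connection" is a shared entry in the preload's fd table; the reference-count mechanism below is an assumption for illustration, not the PR's actual bookkeeping.

```c
struct fd_entry_sketch {
	int rsocket_fd;		/* underlying rsocket connection */
	int *refcnt;		/* shared by the original fd and its duplicates */
};

static void sketch_dup_entry(const struct fd_entry_sketch *src,
			     struct fd_entry_sketch *dst)
{
	/* both entries now refer to the same rsocket; it is only torn down
	 * once every duplicate has been closed */
	dst->rsocket_fd = src->rsocket_fd;
	dst->refcnt = src->refcnt;
	(*dst->refcnt)++;
}
```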
Previously the rsocket was only added when rconnect() returned EINPROGRESS. Now it is also added when connect succeeds, so the progress thread can drive state and handle disconnects. Signed-off-by: Batsheva Black <bblack@nvidia.com>
The changes to rpoll to use a signaling fd to wake up blocked threads, combined with suspending polling while rsocket states may be changing, _should_ prevent any threads from blocking indefinitely in rpoll() when a desired state change occurs. We periodically wake up any polling thread so that it can recheck its rsocket states. The wake-up interval was set to an arbitrary value of 5 seconds; that is too long for apps that request a connection and depend on the thread waking up, so it is now changed to 0.5 seconds and can be overridden using the config files. Signed-off-by: Batsheva Black <bblack@nvidia.com>
Updated type checks to identify socket types even when additional flags are present in the type field. Changed the comparison to use bitwise AND for more accurate detection. Signed-off-by: Batsheva Black <bblack@nvidia.com>
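For example, a caller may pass `socket(AF_INET, SOCK_STREAM | SOCK_NONBLOCK, 0)`, which an exact `type == SOCK_STREAM` comparison misses. A minimal sketch of the idea follows; the exact mask used in the PR may differ.

```c
#define _GNU_SOURCE
#include <sys/socket.h>

/* test the socket type with the flag bits stripped rather than with == */
static int sketch_is_stream_type(int type)
{
	return (type & ~(SOCK_NONBLOCK | SOCK_CLOEXEC)) == SOCK_STREAM;
}
```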
force-pushed from 840c6d1 to 4d794fa
@shefty
If you can separate the epoll changes into another PR, that should help merge the other changes quicker.

This epoll implementation doesn't behave the same as poll. For example, if you look at rpoll(), loops are required for proper event handling. Similar loops are missing from repoll_wait(). I'll walk through what rpoll() is doing below.

Here are links to an epoll implementation over poll (called ofi_pollfds). The license is suitable for rdma-core. Note that the implementation targets Windows, macOS, and FreeBSD, so there are extra abstractions that Linux-only support may not need. https://github.com/ofiwg/libfabric/blob/main/include/ofi_epoll.h#L276 I can walk through what this code is doing if needed. The abstraction itself is NOT full epoll support. It does not handle attaching an epoll fd to another epoll fd, for example. Trying to support that would add significant complexity.

The rpoll() implementation has 2 while loops. The first loop busy waits for a small period of time before moving to a blocking call. Technically, the use of a loop here isn't needed, as it's an optimization. However, the loop drives progress across the rsockets and checks for events, which is needed. We need to drive progress because an rsocket may already have data to read, sitting in the local buffer. If that's the case, the rsocket must be marked for POLLIN. There's no guarantee additional completions will be written to the rsocket's CQ or events added to its completion channel. The call to rs_poll_check() from rpoll() drives progress on the rsocket, so events aren't lost. rs_poll_check() checks all fds, including non-rsocket fds. This ensures that if we find an event on an rsocket, we don't miss also reporting events on non-rsocket fds.

The second loop goes through the motions of arming the rsocket's CQ prior to blocking in poll(). Note that we call rs_poll_check() again to avoid dropping completions. We have to recheck the CQ for completions after it's been armed to avoid leaving entries on the CQ. If we have any completions to report, we return them, so we don't end up blocked in poll(). Once we're in poll(), if an event occurs, we need to process it. The occurrence of an event doesn't mean that it's an event that should be reported to the user. For example, the user could be waiting for POLLOUT, but we instead receive data. Because the user isn't checking POLLIN, we queue the data, then continue looping until POLLOUT is satisfied.

Looking at repoll_wait() and epoll_wait(), I don't see the above functionality. There's no attempt to drive progress. I don't see where CQ completion channels are armed or where the CQs are being polled. I don't see how multiple threads calling repoll_wait() is handled (e.g. see rs_poll_enter / rs_poll_exit). There's a lot of complexity being handled by rpoll() which isn't there for repoll, which makes me question the implementation. That's why layering epoll over rpoll may be a better option, so we don't have to duplicate that complexity. (Yes, layering epoll over poll isn't exactly trivial either...) Any epoll implementation we have will likely be limited relative to true epoll support. Those limits should be documented.
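A highly simplified sketch of the two-loop flow described above; the real rpoll() in rsocket.c is considerably more involved (it blocks on an internal fd set built from completion-channel fds, not the caller's array), and every `sketch_*` helper below is a placeholder, not real librdmacm API.

```c
#include <poll.h>
#include <stdint.h>

uint64_t sketch_now_ms(void);
int sketch_check_all_fds(struct pollfd *fds, nfds_t nfds);	/* drive progress, report events */
int sketch_arm_rsocket_cqs(struct pollfd *fds, nfds_t nfds);	/* returns >0 if events surfaced */
int sketch_absorb_unwanted_events(struct pollfd *fds, nfds_t nfds);
extern uint64_t sketch_polling_time_ms;

static int sketch_rpoll_flow(struct pollfd *fds, nfds_t nfds, int timeout)
{
	uint64_t start = sketch_now_ms();
	int ret;

	/* Loop 1: busy-wait window. Each pass drives progress on every
	 * rsocket (polling its CQ, moving buffered data) and checks all
	 * fds, rsocket or not, for events. */
	do {
		ret = sketch_check_all_fds(fds, nfds);
		if (ret)
			return ret;
	} while (sketch_now_ms() - start < sketch_polling_time_ms);

	/* Loop 2: arm each rsocket's CQ, recheck so completions that raced
	 * with arming are not dropped, then block in poll(). An event the
	 * caller did not ask for is absorbed and the loop continues. */
	do {
		ret = sketch_arm_rsocket_cqs(fds, nfds);
		if (ret)
			return ret;		/* events found while arming */

		ret = poll(fds, nfds, timeout);
		if (ret <= 0)
			return ret;

		ret = sketch_absorb_unwanted_events(fds, nfds);
	} while (ret == 0);

	return ret;
}
```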
If I'm understanding the code correctly, epoll spawns a thread that spins in epoll_thread(). Is that correct? repoll_wait() also never blocks, effectively ignoring the timeout, which could result in spinning the application thread.

If the above is correct, then I think the changes needed for epoll are to block the thread calling epoll_wait and keep the epoll thread from spinning. If we implement epoll using a thread, that thread should probably be idle until epoll_wait is called (to avoid contention driving progress on the same rsocket from 2 threads). The epoll thread should reuse more of the rpoll calls, rather than duplicating a large function like rs_poll_rs() with epoll_rs(). Replacing epoll_rs() with rs_poll_rs() is likely trivial. But I suspect we'll want to leverage additional calls to avoid spinning. It's okay if the rpoll flow needs to change some to make this happen.
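To make the suggested direction concrete, a minimal sketch of epoll_wait layered over rpoll(): translate the registered interest set into a pollfd array (the translation and the `user_fds` mapping are hypothetical here) and let rpoll() do the blocking, progress, and CQ handling.

```c
#include <poll.h>
#include <sys/epoll.h>
#include <rdma/rsocket.h>

static int sketch_repoll_wait_over_rpoll(struct pollfd *pfds, int *user_fds,
					 int nfds, struct epoll_event *events,
					 int maxevents, int timeout)
{
	int i, nev = 0, ret;

	/* rpoll() blocks, drives rsocket progress and arms CQs for us */
	ret = rpoll(pfds, nfds, timeout);
	if (ret <= 0)
		return ret;

	for (i = 0; i < nfds && nev < maxevents; i++) {
		if (!pfds[i].revents)
			continue;
		/* on Linux, EPOLLIN/EPOLLOUT/EPOLLERR/EPOLLHUP share values
		 * with the corresponding POLL* bits, so revents maps across */
		events[nev].events = pfds[i].revents;
		events[nev].data.fd = user_fds[i];
		nev++;
	}
	return nev;
}
```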
Summary
Extend the rsocket implementation in librdmacm so that applications such as Redis, iperf3, and memcached can use rsocket transparently via LD_PRELOAD (librspreload), and so rsocket aligns with more standard Linux socket and I/O behavior.
Motivation
The rsocket library did not fully support several POSIX/Linux interfaces (epoll, select, accept4, sendfile, fcntl64, and various socket options). Applications that rely on these either failed or fell back to TCP. This change implements or fixes those interfaces in rsocket so the preload can intercept them and route traffic over RDMA.
Changes