Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

possible socket leakage with direct file descriptors #1192

Open
romange opened this issue Jul 29, 2024 · 32 comments
Open

possible socket leakage with direct file descriptors #1192

romange opened this issue Jul 29, 2024 · 32 comments

Comments

@romange
Copy link
Contributor

romange commented Jul 29, 2024

the pseudo code below constantly increases sockets: used metric in /proc/net/sockstat
This reproduces only when registering direct file descriptors (on kernel 6.2.0).

for (unsigned i = 0; i < 10000; ++i) {
  socket_fd = CreateSocket(AF_INET);
  // register
  int res = io_uring_register_files_update(&ring_, 0 &socket_fd, 1);
  ASSERT(res ==1);
  // unregister
  int unreg_id = -1;
  res = io_uring_register_files_update(&ring_, 0 &unreg_id, 1);
  ASSERT(res ==1);
  close(socket_fd);
  usleep(1000);
}

what's interesting is that number of file descriptors does not grow.
killing the process restores socket counts in /proc/net/sockstat to a normal level.

romange added a commit to romange/helio that referenced this issue Jul 29, 2024
Fixing the registration code - the direct fd table is not resizable.
Add a test reproducing the socket leakage when using direct fds.
Opened axboe/liburing#1192 to follow up.

Signed-off-by: Roman Gershman <roman@dragonflydb.io>
@axboe
Copy link
Owner

axboe commented Jul 29, 2024

They should reap eventually, but the putting is async hence why it can take a bit of time. Newer than 6.2 will generally keep up better. I ran your example on 6.2 and it does reap as it goes, just a bit behind. Once I sleep after the loop, the rest goes pretty quickly.

@romange
Copy link
Contributor Author

romange commented Jul 29, 2024

How long should should I wait?
We had a case in production where server had hundreds of thousands of such sockets which translated to gigabytes of memory Slab pages.

@romange
Copy link
Contributor Author

romange commented Jul 29, 2024

I added sleep after the loop but the sockstat still shows me over 100K sockets that do not go down.
image

romange added a commit to romange/helio that referenced this issue Jul 29, 2024
Fixing the registration code - the direct fd table is not resizable.
Add a test reproducing the socket leakage when using direct fds.
Opened axboe/liburing#1192 to follow up.

Finally, disabled direct fds for sockets.

Signed-off-by: Roman Gershman <roman@dragonflydb.io>
@axboe
Copy link
Owner

axboe commented Jul 29, 2024

It should not take that long, unless perhaps something is keeping everything busy on the server? Are you running mainline 6.2, or something else? 6.2-stable isn't a thing, it's been dead since May 2023. Or is it some weird distro thing?

I'd recommend going to 6.6-stable, as that's also an LTS release. This means that if there are fixes for io_uring (and other things) in 6.6, they will get fixed for years rather than be a dead end like 6.2 is.

@romange
Copy link
Contributor Author

romange commented Jul 30, 2024

@axboe please observe the run on 6.8 kernel (ubuntu 24.04).

demo.mp4

@romange
Copy link
Contributor Author

romange commented Jul 30, 2024

strace shows that the calls are consistent with my pseudo code:

....
[pid 1808295] io_uring_register(4, IORING_REGISTER_FILES_UPDATE, {offset=1, fds=[6]}, 1) = 1
[pid 1808295] io_uring_register(4, IORING_REGISTER_FILES_UPDATE, {offset=1, fds=[-1]}, 1) = 1
[pid 1808295] close(6)                  = 0
[pid 1808295] clock_nanosleep(CLOCK_REALTIME, 0, {tv_sec=0, tv_nsec=100000}, NULL) = 0
[pid 1808295] socket(AF_INET, SOCK_STREAM|SOCK_CLOEXEC|SOCK_NONBLOCK, IPPROTO_IP) = 6
[pid 1808295] io_uring_register(4, IORING_REGISTER_FILES_UPDATE, {offset=1, fds=[6]}, 1) = 1
[pid 1808295] io_uring_register(4, IORING_REGISTER_FILES_UPDATE, {offset=1, fds=[-1]}, 1) = 1
[pid 1808295] close(6)                  = 0
[pid 1808295] clock_nanosleep(CLOCK_REALTIME, 0, {tv_sec=0, tv_nsec=100000}, NULL) = 0
[pid 1808295] socket(AF_INET, SOCK_STREAM|SOCK_CLOEXEC|SOCK_NONBLOCK, IPPROTO_IP) = 6
[pid 1808295] io_uring_register(4, IORING_REGISTER_FILES_UPDATE, {offset=1, fds=[6]}, 1) = 1
[pid 1808295] io_uring_register(4, IORING_REGISTER_FILES_UPDATE, {offset=1, fds=[-1]}, 1) = 1
[pid 1808295] close(6)                  = 0
[pid 1808295] clock_nanosleep(CLOCK_REALTIME, 0, {tv_sec=0, tv_nsec=100000}, NULL) = 0
[pid 1808295] socket(AF_INET, SOCK_STREAM|SOCK_CLOEXEC|SOCK_NONBLOCK, IPPROTO_IP) = 6
[pid 1808295] io_uring_register(4, IORING_REGISTER_FILES_UPDATE, {offset=1, fds=[6]}, 1) = 1
[pid 1808295] io_uring_register(4, IORING_REGISTER_FILES_UPDATE, {offset=1, fds=[-1]}, 1) = 1
[pid 1808295] close(6)                  = 0
[pid 1808295] clock_nanosleep(CLOCK_REALTIME, 0, {tv_sec=0, tv_nsec=100000}, NULL) = 0
[pid 1808295] socket(AF_INET, SOCK_STREAM|SOCK_CLOEXEC|SOCK_NONBLOCK, IPPROTO_IP) = 6
[pid 1808295] io_uring_register(4, IORING_REGISTER_FILES_UPDATE, {offset=1, fds=[6]}, 1) = 1
[pid 1808295] io_uring_register(4, IORING_REGISTER_FILES_UPDATE, {offset=1, fds=[-1]}, 1) = 1
[pid 1808295] close(6)                  = 0
[pid 1808295] clock_nanosleep(CLOCK_REALTIME, 0, {tv_sec=0, tv_nsec=100000}, NULL) = 0
[pid 1808295] socket(AF_INET, SOCK_STREAM|SOCK_CLOEXEC|SOCK_NONBLOCK, IPPROTO_IP) = 6
[pid 1808295] io_uring_register(4, IORING_REGISTER_FILES_UPDATE, {offset=1, fds=[6]}, 1) = 1
[pid 1808295] io_uring_register(4, IORING_REGISTER_FILES_UPDATE, {offset=1, fds=[-1]}, 1) = 1
[pid 1808295] close(6)                  = 0
....

romange added a commit to romange/helio that referenced this issue Jul 30, 2024
Fixing the registration code - the direct fd table is not resizable.
Add a test reproducing the socket leakage when using direct fds.
Opened axboe/liburing#1192 to follow up.

Finally, disabled direct fds for sockets.

Signed-off-by: Roman Gershman <roman@dragonflydb.io>
@axboe
Copy link
Owner

axboe commented Jul 30, 2024

I'll take a look. Can you share your .config?

@axboe
Copy link
Owner

axboe commented Jul 30, 2024

And also please include details on the system in question, mostly just how many cores you have (and arch).

@romange
Copy link
Contributor Author

romange commented Jul 30, 2024

@axboe I can confirm that your code does not reproduce the problematic behavior.
I will continue investigating until I find the differentiator that triggers it.

@romange
Copy link
Contributor Author

romange commented Jul 30, 2024

Seems the problem on my side. Again!!! 🤦🏼‍♂️
Sorry for wasting your time. On a bright side, seems that 6.2 is working perfectly well!!!

@romange romange closed this as completed Jul 30, 2024
@axboe
Copy link
Owner

axboe commented Jul 30, 2024

Seems the problem on my side. Again!!! 🤦🏼‍♂️

All good! But would you mind mentioning what the issue was on your side? Might help others that run into the same thing. Always nice to have some sort of conclusion to an issue filed.

@romange romange reopened this Jul 30, 2024
@romange
Copy link
Contributor Author

romange commented Jul 30, 2024

I couldn't identify the root cause in my previous comment, but I have now. I'm pasting the example that reproduces the problem. It seems that another socket is needed to cause the leakage of socket descriptors.

#include <stdio.h>
#include <fcntl.h>
#include <sys/socket.h>
#include <unistd.h>
#include <assert.h>
#include <liburing.h>
#include <pthread.h>
#include <sched.h>
#include <malloc.h>
#include <poll.h>
#include <string.h>
#include <netinet/in.h>

int main(int argc, char *argv[])
{
	int fd, res, i, unreg_id, done = 0;
	struct io_uring ring;
	int listen_fd;
	sockaddr_in server_addr;
	io_uring_sqe* sqe;
	

	io_uring_queue_init(32, &ring, 0);
	io_uring_register_files_sparse(&ring, 2);
	
	listen_fd = socket(AF_INET, SOCK_STREAM|SOCK_CLOEXEC|SOCK_NONBLOCK, IPPROTO_IP);
	res = io_uring_register_files_update(&ring, 0, &listen_fd, 1);
	assert(res == 1);
	
	memset(&server_addr, 0, sizeof(server_addr));
  	server_addr.sin_family = AF_INET;
  	server_addr.sin_port = htons(6666);
  	server_addr.sin_addr.s_addr = INADDR_ANY;

	res = bind(listen_fd, (sockaddr*)&server_addr, sizeof(server_addr));
	assert(res == 0);
	res = listen(listen_fd, 32);
	assert(res == 0);
	// accept4(listen_fd, NULL, NULL, SOCK_NONBLOCK | SOCK_CLOEXEC);
	sqe = io_uring_get_sqe(&ring);
	
	// If I remove the part below - there is no leakage.
	io_uring_prep_poll_add(sqe, 0, POLLIN);
	sqe->flags |= IOSQE_FIXED_FILE;
	io_uring_submit(&ring);
	///////////////////////

	for (i = 0; i < 10000; i++)
	{
		fd = socket(AF_INET, SOCK_STREAM|SOCK_CLOEXEC|SOCK_NONBLOCK, IPPROTO_IP);
		assert(fd >= 0);

		// register
		res = io_uring_register_files_update(&ring, 1, &fd, 1);
		assert(res == 1);

		// unregister
		unreg_id = -1;
		res = io_uring_register_files_update(&ring, 1, &unreg_id, 1);
		done++;
		assert(res == 1);
		close(fd);		
	}

	printf("registered %d sockets\n", done);
	sleep(60);
	io_uring_queue_exit(&ring);

	return 0;
}

@axboe
Copy link
Owner

axboe commented Jul 30, 2024

That's what I suspected, I did manage to make a reproducer for this as well. They are freed in order, but stop if one isn't available to free. Let me poke a bit, don't think this needs to be the case.

@axboe
Copy link
Owner

axboe commented Jul 30, 2024

Ah yours looks different - you are still holding the file open, because you have that poll operation pending. You'll want to cancel the operation(s) on that socket before the file will be fully closed. So don't think your reproducer is valid, it just shows leaving a reference to that fd/socket via a pending operation will prevent it from being closed.

@axboe
Copy link
Owner

axboe commented Jul 30, 2024

But I think it's a combination of the two - you have the one that prevents it from being closed, and that'll prevent the others too. That is certainly not right, and this is the issue I mentioned in the previous-to-last comment.

@romange
Copy link
Contributor Author

romange commented Jul 30, 2024

but I do not want to close listen_fd. It's not closed by design and it uses direct fd=0.
I want to close another 10000 sockets that come afterwards and hold index 1.

@axboe
Copy link
Owner

axboe commented Jul 30, 2024

Oh I agree, your example is way better than mine as it was all just the same. So just to make it clear, I agree this is an issue, it should not hold up other nodes from being freed.

@romange
Copy link
Contributor Author

romange commented Jul 30, 2024

I am just happy that could be helpful this time! (even a broken clock is right twice a day)

@axboe
Copy link
Owner

axboe commented Jul 30, 2024

In terms of work-around, may be a good idea to have long standing things (like a listen) NOT be a fixed file, at least for now. Should be limited, if any, utility of having that fixed, unless you have a high frequency of connections coming in. This doesn't totally solve the issue as even homogeneous type of requests may have drastically different life times, and one switch of nodes (eg when you update to -1) will potentially block if chronologically previous entries are still active.

We will certainly fix this, but it may require a bit of work and hence isn't necessarily something that can go to a stable kernel in a short amount of time.

@romange
Copy link
Contributor Author

romange commented Jul 31, 2024

Yeah, that won't fly. We have high frequency connections usecases, so I can not rely on that for sockets.
But, it's fine - we have not been using direct fds so far, so it's not a blocker for us.

It seems ok for files though, and we have a big feature coming up related to SSD tiering. I am going to use direct fds there.

@axboe
Copy link
Owner

axboe commented Jul 31, 2024

For finite lifetime requests, like any storage IO would be, then direct descriptors would always be fine. It's only these potentially infinite or long lived requests that are problematic, if you get stuck behind reclaim of them. Still pondering a way to make that better, but it's going to be a 6.12 thing, at the earliest.

@romange
Copy link
Contributor Author

romange commented Oct 23, 2024

@axboe has it be resolved by any chance?

@axboe
Copy link
Owner

axboe commented Oct 23, 2024

Unfortunately no, it dropped off my radar. I'll take a look at it very soon.

@axboe
Copy link
Owner

axboe commented Oct 25, 2024

Got a prototype of this going... Stay tuned.

@axboe
Copy link
Owner

axboe commented Oct 26, 2024

Yep confirmed it works, the previous test case would keep 10k sockets open until the ring went away, unless something got canceled/finished, but with the change it's staying steady at 88-89 sockets.

@axboe
Copy link
Owner

axboe commented Oct 26, 2024

Work in progress, patches here so far: https://git.kernel.dk/cgit/linux/log/?h=io_uring-rsrc

@oxalica
Copy link

oxalica commented Nov 6, 2024

I'm experiencing that READ on a connected TCP socket hang indefinitely even after the other CLOSEd, when using io-uring direct fds. netstat shows two sockets a still alive, but /proc/<pid>/fd{,info} shows the peer socket is closed. Could it be the same underlying issue as this?

EDIT: I'm on Linux 6.6.59.

My repro is in Rust with io-uring crate: https://gist.github.com/oxalica/666d7c6146bab4b1d6b0c974561c2daf
But it's basically:

  1. listen(), spawning a thread to connect(), accept(), all using regular syscalls.
  2. Register the accepted and connected fds (TCP stream) as direct fds in an io-uring.
  3. Close all socket fds outside io-uring.
  4. Submit READ from one socket and CLOSE the other socket in io-uring via direct fds. READ SQE should be before CLOSE SQE, to trigger the issue.
  5. io_uring_enter(), only CLOSE reports success, and the other READ hang indefinitely.

@axboe
Copy link
Owner

axboe commented Nov 6, 2024

Direct descriptors keep a reference to a file, so when you close it, there's still another reference on it. You need to use an explicit shutdown to terminate the read/recv in that case. So no, not the same problem, it's just an artifact of how direct descriptors work.

@oxalica
Copy link

oxalica commented Nov 6, 2024

Direct descriptors keep a reference to a file, so when you close it, there's still another reference on it.

Could you elaborate where is that reference held, after being explicitly closed in all places? The READ op is issued on the TCP peer socket, not the closed one. The closed socket is only registered and then closed, without any other operation.

@axboe
Copy link
Owner

axboe commented Nov 6, 2024

  1. You open a socket, now there's 1 reference to the file representing the socket
  2. You register this socket as a direct descriptor. Now there's 2 references to the file. The direct descriptor is wrapped in a resource node, which holds this 2nd reference to the file. The resource node is what is in the ring file descriptor table.
  3. You close the original socket you opened, now there's 1 reference to the file
  4. You have a recv/read against this socket, that holds a reference to the resource node holding the direct descriptor
  5. You unregister the direct descriptor, that drops a reference to the resource node. But the read/recv is still holding a reference to this node, and the file still has a reference of 1 at this point. This reference cannot go away until the read finishes

This is why I say you need to do an explicit io_uring_prep_shutdown() on the socket, which will then in turn trigger the read and it will complete.

@axboe
Copy link
Owner

axboe commented Nov 6, 2024

If I'm misunderstanding your use case, do tell!

@axboe
Copy link
Owner

axboe commented Nov 6, 2024

If I am and it's an unrelated socket, then yes it may indeed be pinned by the per-ring node that 6.6 uses and the patchset referenced in here (which is queued for 6.13) will indeed fix your problem too.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants