
Epoll event loop (linux) #14814

Closed

Conversation

Contributor

@ysbaddaden ysbaddaden commented Jul 16, 2024

Implements a custom event loop on top of epoll for Linux (and Solaris?), following RFC #7. It's not particularly optimized yet (e.g. it still allocates), but seems to be working.

The EventQueue and Timers types aren't the best solutions performance-wise, but they work and abstract their own logic (keeping a list of events per fd / keeping a sorted list of timers). We should be able to improve them later (e.g. linked lists -> ordered linked lists -> skiplists, ...).
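
A minimal sketch of what the Timers abstraction described above could look like (illustrative only, with made-up method names; this is not the PR's actual code):

class Timers
  def initialize
    # sorted by absolute wake-up time (monotonic clock)
    @list = [] of {Time::Span, Fiber}
  end

  # Insert while keeping the list sorted; linear today, an ordered linked
  # list or a skiplist could replace this later.
  def add(wake_at : Time::Span, fiber : Fiber) : Nil
    index = @list.bsearch_index { |(at, _)| at > wake_at } || @list.size
    @list.insert(index, {wake_at, fiber})
  end

  # The next deadline the event loop must wake up for, if any.
  def next_ready? : Time::Span?
    @list.first?.try &.[0]
  end

  # Yields every fiber whose timer expired and drops it from the list.
  def dequeue_ready(now : Time::Span) : Nil
    while (entry = @list.first?) && entry[0] <= now
      @list.shift
      yield entry[1]
    end
  end
end

The event loop would use next_ready? to compute the epoll_wait/timerfd deadline, and call dequeue_ready(Time.monotonic) { |fiber| ... } after waking up.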

ISSUES

  • #after_fork must close & nullify every Fiber#timerfd & update pending events to use the new fd (not needed with one timerfd per eventloop, see below);
  • must reset mutex after fork before exec (preview mt);
  • must "close" fd in all eventloops (preview mt);
  • LibC bindings for aarch64-linux-musl are wrong;
  • investigate segfault runs on CI;
  • support :preview_mt (broken CI);
  • can't modify a registered fd with EPOLLEXCLUSIVE set (disabled for now);
  • BUG: a timerfd file descriptor errored or got closed! on CI;
  • epoll_ctl(EPOLL_CTL_MOD): No such file or directory (RuntimeError) on CI;
  • Unhandled exception: pthread_mutex_unlock: Operation not permitted (RuntimeError) on CI and again
  • infinite loop in interpreter on CI (primitives spec);

OPTIMIZE

  • epoll_event.data should point to the Node instead of the fd, so we can skip searches (see the sketch after this list);
  • use eventfd instead of pipe2 to interrupt the loop;
  • one timerfd per eventloop should be enough (wait until next timer => we can't enqueue another timer on that thread anymore => for MT a parallel enqueue will interrupt epoll_wait); we still need a timerfd because epoll_wait only has millisecond precision; we also need a list of timers to know the next timer, and which timers are ready.
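
A rough sketch of the first point (epoll_event.data carrying a Node pointer), assuming bindings along the lines of the ones added in this PR; LibC::EpollEvent, LibC::EpollDataT and the constant names are assumptions, and the event loop is expected to keep node alive in its own structures so the raw reference stored in the data union stays valid:

# registration: store the node, not the fd
data = LibC::EpollDataT.new
data.ptr = node.as(Void*)
event = LibC::EpollEvent.new(events: LibC::EPOLLIN | LibC::EPOLLRDHUP, data: data)
LibC.epoll_ctl(@epoll_fd, LibC::EPOLL_CTL_ADD, node.fd, pointerof(event))

# wait loop: go straight from the ready event to the node
events = uninitialized LibC::EpollEvent[128]
count = LibC.epoll_wait(@epoll_fd, events.to_unsafe, events.size, timeout_ms)
count.times do |i|
  node = events[i].data.ptr.as(Node)
  # resume this node's pending readers/writers without searching EventQueue
end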

TODO

  • don't raise BUG exceptions, print BUG warnings to STDERR;
  • more testing: please help to break it 🔨
  • extract fix for IO::FileDescriptor & Socket finalizers do far too much #14807 (distinct PR);
  • extract fix for double event dequeue in Fiber#cancel_timeout (distinct PR);
  • cleanup Crystal::Epoll::EventLoop (commented code, debug traces);
  • use :ev_epoll flag instead of forcing it for Linux (not yet: I want CI test runs);

@ysbaddaden
Contributor Author

Urgh, supporting fork is painful. I must close the timerfds for each fiber (easy) but then I must mutate all pending events to use the new fd 😩

Contributor

@yxhuvud yxhuvud left a comment

I trust you have read https://idea.popcount.org/2017-02-20-epoll-is-fundamentally-broken-12/ (and part 2: https://idea.popcount.org/2017-03-20-epoll-is-fundamentally-broken-22/?). The answer to the multithreaded issues may be found in there.

Comment on lines +172 to +173
raise "BUG: #{node.fd} is ready for reading but no registered reader for #{node.fd}!\n" if readable
raise "BUG: #{node.fd} is ready for writing but no registered writer for #{node.fd}!\n" if writable
Contributor

Is it sufficient to do a normal raise here? I imagine it can be a bit weird in the multithreaded scenario if things raise here. As this is called from the scheduler itself, I imagine most places are not really built to handle this well.

Will it bubble up properly and take down the whole thing (as at least I think it should)? Or could we get strange situations with some crashed threads while the program as a whole remains (potentially waiting for the crashed stuff)?

I suppose that goes for other exceptions in this vicinity too.

Contributor Author

It's not meant to stay, and definitely not safe to handle.

All the raised "BUG:" exceptions are meant to loudly detect errors. If one of them raises, there's a 99% probability that I made a mistake, though I did discover the double event dequeue in Fiber#cancel_timeout thanks to one of these.
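
For reference, the "print BUG warnings to STDERR" TODO above could take roughly this shape (a sketch, not the final code):

if readable
  Crystal::System.print_error "BUG: fd=%d is ready for reading but has no registered reader!\n", node.fd
end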

src/crystal/system/unix/epoll/event_queue.cr (resolved)
src/crystal/system/unix/epoll/event_queue.cr (resolved)
src/fiber.cr (outdated, resolved)
src/fiber.cr (outdated, resolved)
@ysbaddaden
Contributor Author

ysbaddaden commented Jul 16, 2024

I took care to use EPOLLEXCLUSIVE (available since 2017). I'm not sure we need EPOLLONESHOT since we register the fd once then dequeue readers/writers one by one. Now, there may still be races, especially in a MT environment, but there's little we can do without blocking the other readers/writers (there's no telling if a reader/writer will keep reading/writing).

I didn't know about the close quirk. That explains why we unregister the file descriptors before we actually close them. Now, if a fd is in multiple epoll instances... I first assumed this was likely already broken today 😨, but it is actually handled today: we remove the events and the fd from all event bases that referenced the IO object.

@ysbaddaden
Contributor Author

ysbaddaden commented Jul 17, 2024

The close quirk needs some experimental verification:

Q6 Will the close of an fd cause it to be removed from all epoll sets automatically?

A6 Yes.

https://linux.die.net/man/4/epoll

The one thing to take care of would be to resume the cached event/fibers from all epoll instances (tricky).

I also found an issue with EPOLLEXCLUSIVE: we can't modify, so either we don't use it or we must del/add when changing the set of events.

@yxhuvud
Contributor

yxhuvud commented Jul 17, 2024

The close quirk

It could be they fixed it at some point; a bunch of years have passed since that post was written, after all.

@ysbaddaden
Contributor Author

Another man page, a longer answer:

Yes, but be aware of the following point. A file descriptor is a reference to an open file description (see open(2)). Whenever a descriptor is duplicated via dup(2), dup2(2), fcntl(2) F_DUPFD, or fork(2), a new file descriptor referring to the same open file description is created. An open file description continues to exist until all file descriptors referring to it have been closed. A file descriptor is removed from an epoll set only after all the file descriptors referring to the underlying open file description have been closed (or before if the descriptor is explicitly removed using epoll_ctl() EPOLL_CTL_DEL). This means that even after a file descriptor that is part of an epoll set has been closed, events may be reported for that file descriptor if other file descriptors referring to the same underlying file description remain open.

https://man.freebsd.org/cgi/man.cgi?query=epoll&apropos=0&sektion=0&manpath=SuSE

@ysbaddaden
Contributor Author

ysbaddaden commented Jul 18, 2024

The preview_mt issues are caused by thread B closing a fd while thread A is waiting on it; since each thread has its own event loop, this creates issues: I end up not cleaning up the fd from thread A's epoll instance.

I'm trying to think of an efficient way to iterate the event loops (something lighter than ThreadLocalValue), but I may resort to iterating all event loops for starters (and check if it fixes the MT issues; update: that fixes the issue).

We can't call EPOLL_CTL_MOD with EPOLLEXCLUSIVE. Let's disable it for
now and see later if we can replace it with a pair of EPOLL_CTL_DEL and
EPOLL_CTL_ADD.
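
The DEL/ADD pair mentioned in this commit message could look roughly like the following sketch (using the LibC bindings assumed to be added by this PR; not code from the PR itself):

# emulate EPOLL_CTL_MOD for an fd registered with EPOLLEXCLUSIVE:
# remove the fd, then add it back with the new event mask
LibC.epoll_ctl(@epoll_fd, LibC::EPOLL_CTL_DEL, fd, nil)

event = LibC::EpollEvent.new(events: new_events | LibC::EPOLLEXCLUSIVE)
if LibC.epoll_ctl(@epoll_fd, LibC::EPOLL_CTL_ADD, fd, pointerof(event)) == -1
  raise RuntimeError.from_errno("epoll_ctl(EPOLL_CTL_ADD)")
end

Note that this leaves a window where the fd isn't registered at all, which is part of why simply disabling EPOLLEXCLUSIVE is the easier choice for now.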
@yxhuvud
Contributor

yxhuvud commented Jul 18, 2024

Perhaps it is possible to have a lookup table fd -> loop. Or, probably a bit less contentious in an MT scenario, have a list in the FileDescriptor (well, not literally in there, but you perhaps understand what I mean better than with other choices of words) that keeps track of which epolls use it. But it is an icky problem :(
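
For what it's worth, the "list in the FileDescriptor" idea could be sketched like this (every name below is hypothetical, including EventLoop#forget; only Crystal::Epoll::EventLoop comes from this PR):

class IO::FileDescriptor
  # event loops whose epoll set currently contains this fd (hypothetical)
  @watching_loops = [] of Crystal::Epoll::EventLoop
  @watching_loops_mutex = Mutex.new

  def add_watching_loop(event_loop : Crystal::Epoll::EventLoop) : Nil
    @watching_loops_mutex.synchronize do
      @watching_loops << event_loop unless @watching_loops.includes?(event_loop)
    end
  end

  # called on close: every loop drops the fd and resumes its waiters
  def remove_from_watching_loops : Nil
    @watching_loops_mutex.synchronize do
      @watching_loops.each(&.forget(self))
      @watching_loops.clear
    end
  end
end

The per-fd mutex keeps the contention local to that fd instead of going through a global fd -> loop table.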

Process.run sometimes hangs forever after fork and before exec, because it tries to close a fd, which requires taking a lock, but another thread may have already acquired that lock; `fork` only duplicates the current thread (the other ones are not duplicated), so the forked process was left waiting for a mutex that would never be unlocked.
That required allocating a Node for the interrupt event, which isn't a bad idea.

byte = 1_u8
LibC.write(@pipe[1], pointerof(byte), 1)
# the atomic makes sure we only write once
Contributor

Is writing more than once a problem when using eventfd? An alternative could be to simply write multiple times and not have the atomic variable at all. The eventfd will just return the sum of the values that have been written to it, after all (unless using the semaphore flag).

(BTW, this method is to interrupt the blocking wait, right? Sometimes the term notify is used for that)

Contributor Author

You're right. It's not needed as it is for the pipe (we can write 0xfffffffffffffffe times before it blocks), but it helps avoid pointless write syscalls, so ⚖️?
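
To make the trade-off concrete, a sketch of the eventfd variant with the atomic guard (LibC.eventfd and LibC::EFD_CLOEXEC are assumed bindings; the PR's actual code may differ):

class Interrupter
  def initialize
    @eventfd = LibC.eventfd(0, LibC::EFD_CLOEXEC)
    @pending = Atomic(Int32).new(0)
  end

  # wake a blocking epoll_wait; the atomic makes sure only the first caller
  # since the last wakeup actually issues the write(2) syscall
  def notify : Nil
    return unless @pending.swap(1) == 0
    counter = 1_u64
    LibC.write(@eventfd, pointerof(counter), 8) # eventfd writes are 8 bytes
  end

  # called by the event loop once epoll reports the eventfd readable
  def acknowledge : Nil
    counter = uninitialized UInt64
    LibC.read(@eventfd, pointerof(counter), 8)
    @pending.set(0)
  end
end

Dropping the atomic and always writing would also work, as noted above, since the counter merely accumulates; the guard only saves redundant syscalls when many enqueues race.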

@ysbaddaden
Contributor Author

It looks like it's starting to work... except for the interpreter that now hangs forever 😭

@ysbaddaden
Contributor Author

Using a release compiler (with libevent), the interpreter segfaults 😭

Building a compiler with epoll (as CI does) in non-release mode, the interpreter enters a kind of infinite loop (100% CPU, silent strace) after printing the first spec dot (.). The last strace logs are:

epoll_create1(EPOLL_CLOEXEC)            = 13
eventfd2(0, EFD_CLOEXEC)                = 14
timerfd_create(CLOCK_MONOTONIC, TFD_CLOEXEC) = 15
epoll_ctl(13, EPOLL_CTL_ADD, 14, {EPOLLIN, {u32=1958253120, u64=140490338505280}}) = 0
epoll_ctl(13, EPOLL_CTL_ADD, 15, {EPOLLIN, {u32=1958253072, u64=140490338505232}}) = 0
write(11, "\33[", 2
rite(11, "32", 232)                      = 2
write(11, "m", 1m)                       = 1
write(11, ".", 1.)                       = 1
write(11, "\33[", 2
rite(11, "0", 10)                       = 1
write(11, "m", 1m)                       = 1

It creates the event loop instance (epoll, eventfd, timerfd), then writes a colored ., then nothing.

Any pointers to debug the interpreter itself?

@ysbaddaden
Contributor Author

I can't get tracing to work in the interpreted code, so I hacked myself into Crystal.trace and I see an infinite list of sched.event_loop traces 🤨

@ysbaddaden
Contributor Author

I only checked whether @events was empty and didn't check if @timers was, too 🤦

@ysbaddaden ysbaddaden marked this pull request as ready for review July 20, 2024 12:55
@ysbaddaden
Contributor Author

CI is finally green and the implementation can be reviewed!

Contributor

@yxhuvud yxhuvud left a comment

What is the plan for rolling this out? Does it need some sort of opt-in flag for a version or two so that people can try it and find issues? Or is it a 2.0 thing?

src/crystal/system/unix/epoll/event_loop.cr (resolved)
src/crystal/system/unix/epoll/event_loop.cr (resolved)
size = LibC.recvfrom(socket.fd, slice, slice.size, 0, sockaddr, pointerof(addrlen))
if size == -1
  if Errno.value == Errno::EAGAIN
    wait_readable(socket.fd, socket.@read_timeout)
Contributor

So one potential issue here is the following potential chain of confusing behavior:

-> recvfrom -> EAGAIN
-> waiting a while, say timeout half time span
-> recvfrom -> EAGAIN
.. etc, potentially waiting forever, never timing out.

But treating the timeout as a deadline is perhaps a different issue than this PR.

Contributor Author

Good catch. I'm 99% sure this problem already exists in the libevent event loop. That doesn't mean we should reproduce it here.
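
For the record, treating the timeout as a deadline instead of re-arming the full timeout after every EAGAIN could look roughly like this (a sketch reusing wait_readable from the snippet above; it assumes a non-nil read_timeout and is not part of this PR):

deadline = Time.monotonic + read_timeout
loop do
  size = LibC.recvfrom(socket.fd, slice, slice.size, 0, sockaddr, pointerof(addrlen))
  return size unless size == -1
  raise IO::Error.from_errno("recvfrom") unless Errno.value == Errno::EAGAIN

  remaining = deadline - Time.monotonic
  raise IO::TimeoutError.new("Read timed out") if remaining <= Time::Span.zero
  wait_readable(socket.fd, remaining)
end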

@ysbaddaden
Contributor Author

ysbaddaden commented Jul 22, 2024

how to roll out

Either opt-in with a flag... or we're bold and make it the default, plus a flag to opt out. That will depend on how stable/performant it gets before the next release.
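
For the opt-in route, the :ev_epoll flag from the TODO list could gate the implementation at require time, something along these lines (file paths are illustrative):

{% if flag?(:linux) && flag?(:ev_epoll) %}
  require "./unix/epoll/event_loop"
{% else %}
  require "./unix/event_loop_libevent"
{% end %}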

getter type : Type
getter fd : Int32

property! time : Time::Span?
Contributor Author

@ysbaddaden ysbaddaden Jul 23, 2024

TODO: IOCP uses wake_at and it's a much better name than time!

@ysbaddaden ysbaddaden marked this pull request as draft July 29, 2024 20:39
@ysbaddaden
Contributor Author

Moving back to draft. There are some fixes in #14829 and... I got great ideas from the Go implementation that would allow skipping most of the synchronization and taking full advantage of epoll/kqueue (see ysbaddaden/execution_context#30).

@straight-shoota
Member

Superseded by #14959

@ysbaddaden ysbaddaden deleted the feature/epoll-event-loop branch September 6, 2024 10:57