
Add a blocked task watchdog #591

Open
njsmith opened this issue Aug 4, 2018 · 5 comments

njsmith commented Aug 4, 2018

In this stackoverflow question, someone gets very confused because they're using time.sleep and never using await, so one of their tasks seems to run correctly but the other doesn't. There's no indication of what's going on. Easy to fix once you know how, but if you don't, then how are you supposed to figure it out?

We could, pretty cheaply, add a watchdog to catch such situations. What I'm imagining is that when entering trio.run, we'd spawn a system task that spins up a thread and watches for the main loop to become stuck. In the main run loop, we'd send it messages occasionally, whenever we (a) start processing a batch of tasks, and (b) finish processing that batch. If the watchdog thread detects that the time between (a) and (b) exceeds, say, 5 seconds, then it should print an informative message and use the faulthandler module to dump everyone's stack traces. I think this might be cheap enough that it could be enabled by default.
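
For example, the "print a message and dump stacks" step could be something as simple as this sketch (the function name, message text, and threshold plumbing are all made up; only faulthandler is real):

# Hypothetical sketch of what the watchdog might do when it fires.
import sys
import faulthandler

def watchdog_fired(stuck_for):
    print(
        f"trio's run loop appears to have been blocked for ~{stuck_for:.1f} seconds.\n"
        "This usually means some task is doing blocking synchronous work\n"
        "(e.g. time.sleep) instead of awaiting. Current stacks of all threads:",
        file=sys.stderr,
    )
    # faulthandler can dump every thread's stack, including the stuck main thread.
    faulthandler.dump_traceback(file=sys.stderr, all_threads=True)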

njsmith commented Feb 13, 2019

There was an attempt to do this in #596, which I think is abandoned. One of the reasons is that we had trouble figuring out how to keep the watchdog overhead low and get the cross-thread communication right.

I just had a thought for a pretty simple approach that doesn't require a Condition or other fancy stuff:

import threading

import attr

@attr.s
class Watchdog:
    timeout = attr.ib(default=5)  # seconds of apparent stuck-ness before firing
    lock = attr.ib(factory=threading.Lock)
    shutdown_event = attr.ib(factory=threading.Event)
    handle_io_count = attr.ib(default=0)

    def watchdog_thread(self):
        while True:
            with self.lock:
                hic1 = self.handle_io_count
            if self.shutdown_event.wait(timeout=self.timeout):
                # The main thread asked us to shut down.
                return
            with self.lock:
                hic2 = self.handle_io_count
            if hic1 == hic2:
                ...  # watchdog fires

# trio.run
while True:
    timeout = ...
    with watchdog.lock:
        watchdog.handle_io_count += 1
        runner.io_manager.handle_io(timeout)
    ...

So obviously this is deadlock-free, since there's only one lock. But does it work?

Every time the main thread takes the lock, it increments the counter.

The watchdog takes the lock twice per pass through the loop. Case 1: the main thread takes the lock in between these. That means it incremented the counter, and thus the two reads of self.handle_io_count will be different, and the watchdog won't fire. Case 2: the main thread doesn't take the lock in between these. That means that timeout seconds passed during which trio was never inside handle_io, so the watchdog should fire. And indeed, we'll find hic1 == hic2, so the watchdog does fire.

The thing that's clever/tricky about this is what happens if handle_io blocks for a long time (if there's nothing to do). Since we hold the lock while calling it, this will cause the watchdog thread to block for just as long! So it doesn't keep uselessly polling while the main thread is sleeping. And then when the watchdog thread is finally released, it may have actually slept for a very long time indeed, way longer than its normal timeout... but it shouldn't fire, and it doesn't.

Also, the watchdog thread's overhead is very small, and bounded. In particular, if the main loop is very busy and generating lots and lots of calls to handle_io, then that doesn't cause any extra work for the watchdog thread, except for slightly higher mutex contention. But the contention should still be very low, since the watchdog thread only wakes up and takes the lock for a very, very brief time, very rarely (assuming a reasonable timeout).

We don't need to worry about handle_io_count overflowing, because Python ints don't overflow. Do we need to worry about it not overflowing, and getting so huge that it causes slowdowns? On my laptop, incrementing 0 takes ~26 ns, and incrementing 2**60 takes ~38 ns. So the slowdown is pretty minuscule. And if we somehow managed to pass through the run loop in 1 ns (which is optimistic by a factor of >1000x), then 2**60 loops would take ~36 years. So I don't think we need to worry about the counter getting too large.
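
(For the record, that kind of measurement can be reproduced with a quick timeit one-off like the following; the exact numbers will of course vary by machine.)

# Rough reproduction of the increment-timing measurement; numbers vary by machine.
import timeit

for start in (0, 2**60):
    per_loop = min(timeit.repeat("n += 1", setup=f"n = {start}", number=10**6)) / 10**6
    print(f"incrementing {start}: ~{per_loop * 1e9:.0f} ns")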

Basically this seems... super simple, and like it should just work.

njsmith commented Jun 2, 2019

Hmm, while trying to fall asleep tonight my brain just pointed out what might be a serious problem for this idea. There is a case where it's quite normal for trio to block and take arbitrarily long between scheduler loops: when the user hits control-Z or sends a SIGSTOP to pause the program.

Suppose this signal arrives just after the watchdog goes to sleep. When the program wakes up again, the sleep will immediately complete. So the watchdog's effective deadline might get reduced arbitrarily close to zero.

Can we detect this somehow? It might be tricky. Technically we could install a SIGCONT handler, I think, but that sounds really finicky.

We could check how long we slept, and if it's longer than we expected to sleep then it's good evidence that something weird is going on and maybe we should start over. Of course, it's possible for someone to un-suspend us exactly at the moment we would have woken up anyway, so this can't be totally reliable, but at least it catches lots of obvious cases.
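
Concretely, that "did we oversleep?" check might look something like this sketch (the 2x fudge factor and the function name are made up):

# Hypothetical sketch: re-run the sleep if it took way longer than requested,
# since that probably means the process was suspended partway through.
import time

def wait_one_interval(shutdown_event, timeout):
    while True:
        start = time.monotonic()
        if shutdown_event.wait(timeout=timeout):
            return "shutdown"
        elapsed = time.monotonic() - start
        if elapsed < 2 * timeout:
            return "check"  # a normal-looking sleep; go compare the counter
        # We slept much longer than requested -- probably SIGSTOP/ctrl-Z --
        # so throw this measurement away and start over.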

It would be really nice if there were some way to find out when you mysteriously missed time, but I don't know of any. Maybe something buried in getrusage or timerfd or similar?

I guess debugging is another case where the watchdog would go off spuriously. Though there at least you hopefully know that you're sitting and looking at a debugger; it's not a message being delivered to some random end user.

njsmith commented Jun 6, 2019

Weird fact that's probably not quite enough to be useful: On Linux, if you use select, and the signal has SA_RESTART set so that the kernel/libc handle restarting syscalls automatically, then the select timeout's countdown will pause while the process is suspended. So e.g. if you request a sleep of 10 seconds, and in the middle the process gets suspended for 5 seconds, then the select ends up returning after 15 seconds of wall-clock time.

This doesn't work on macOS, and it doesn't work with any other syscalls on Linux AFAICT. (In particular it definitely doesn't work with poll, epoll_wait, or whichever condition-related syscall threading.Event.wait uses. It does work with Python's time.sleep, but that's because time.sleep calls select internally.)

Also, it could break if somehow the signal doesn't have SA_RESTART behavior. I guess SIGSTOP effectively always does, and the other go-to-sleep signals like SIGTSTP, SIGTTIN, SIGTTOU do by default, but if someone registers a handler on one of these three then they might or might not use SA_RESTART.

Oh but wait, in our case the select would be running in a child thread, so it's unlikely that we'd get asked to run the signal handler. And we could even use pthread_sigmask to make sure.

So.... I guess that actually does solve this on Linux? We'd need to use a socket to send shutdown messages from the main thread to the watchdog thread, so the watchdog thread can use select to look for them, but that's not a huge deal.
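
Roughly, the watchdog thread's side of that might look like this sketch (all names here are illustrative, not actual trio internals):

# Hypothetical sketch of the Linux variant: the watchdog thread sleeps in
# select() (whose timeout pauses while the process is suspended, given
# SA_RESTART), and the main thread requests shutdown by writing a byte to a
# socketpair.
import select
import signal
import socket

shutdown_r, shutdown_w = socket.socketpair()

def watchdog_thread_setup():
    # Belt-and-braces: keep job-control signals with handlers from being
    # delivered to this particular thread. Must run in the watchdog thread.
    signal.pthread_sigmask(
        signal.SIG_BLOCK, {signal.SIGTSTP, signal.SIGTTIN, signal.SIGTTOU}
    )

def watchdog_sleep(timeout):
    # Returns True if the main thread asked us to shut down.
    readable, _, _ = select.select([shutdown_r], [], [], timeout)
    return bool(readable)

# On shutdown, the main thread would do something like: shutdown_w.send(b"x")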

Unfortunately this doesn't help with macOS... or Windows, for that matter, but on Windows it's very rare to suspend processes, so it's not as big a deal.

njsmith commented Jun 6, 2019

Maybe it would be sufficient to just like... reduce the timeout threshold some, and then require it to fail N times in a row before triggering a warning.
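
For instance, something like this variant of the earlier watchdog_thread (N = 3 and the shortened per-check interval are made-up numbers):

# Hypothetical variant: shorter per-check interval, but only fire after N
# consecutive checks during which the counter never moved.
def watchdog_thread(self):
    last = None
    stuck_checks = 0
    while not self.shutdown_event.wait(timeout=self.timeout / 3):
        with self.lock:
            current = self.handle_io_count
        stuck_checks = stuck_checks + 1 if current == last else 0
        last = current
        if stuck_checks >= 3:
            ...  # watchdog fires
            stuck_checks = 0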

njsmith commented Jun 6, 2019

Oh but on macOS we have kqueue, so this is easy – the watchdog could use EVFILT_SIGNAL to watch for SIGCONT. At least according to the man page, this is reliable and shouldn't interfere with the rest of the program:

     EVFILT_SIGNAL    Takes the signal number to monitor as the identifier and
                      returns when the given signal is generated for the
                      process.  This coexists with the signal() and
                      sigaction() facilities, and has a lower precedence.
                      Only signals sent to the process, not to a particular
                      thread, will trigger the filter. The filter will record
                      all attempts to deliver a signal to a process, even if
                      the signal has been marked as SIG_IGN.  Event
                      notification happens before normal signal delivery
                      processing.
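
From Python, that would presumably look something like this sketch (only where select.kqueue exists, i.e. macOS/BSD; the function name is made up):

# Hypothetical sketch of watching for SIGCONT via kqueue's EVFILT_SIGNAL.
import select
import signal

kq = select.kqueue()
sigcont_event = select.kevent(
    signal.SIGCONT, filter=select.KQ_FILTER_SIGNAL, flags=select.KQ_EV_ADD
)
kq.control([sigcont_event], 0)  # register the filter; don't collect events yet

def sleep_noticing_sigcont(timeout):
    # Returns True if SIGCONT arrived during the sleep, i.e. the process was
    # probably just un-suspended and the watchdog should restart its measurement.
    events = kq.control(None, 1, timeout)
    return bool(events)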
