Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Proposal: make checkpoint_if_cancelled() sync-colored and rename it accordingly #961

Open
oremanj opened this issue Mar 1, 2019 · 5 comments

Comments

@oremanj
Copy link
Member

oremanj commented Mar 1, 2019

I'd like to propose that we deprecate checkpoint_if_cancelled() in favor of a synchronous function that I'm thinking of as trio.check_cancelled() (happy to take suggestions on the name, or move into hazmat). This would raise Cancelled if the current scope is cancelled, and do nothing otherwise. i.e., it is a pure cancel point, not a schedule point, not even a schedule point if we're cancelled.

This becomes possible due to some recent or pending changes:

And I think would have benefits for:

Possible downside: a sync-colored function called from an async function can now call check_cancelled() and wind up raising Cancelled. Previously that couldn't happen. My feeling is that "functions that do something off-the-wall might raise an exception their caller wasn't expecting" is hardly a new problem, but maybe there's some aspect of what we'd be giving up here that I'm missing.

@njsmith
Copy link
Member

njsmith commented Mar 2, 2019

I think the first step here is to implement #474 – update docs and tests, and clean up some of the current clutter for checkpointing error paths. (Of course this last part may take a while and it doesn't have to be perfect, but I think doing at least some cleanup will help make it clearer what we want here.)

implementing efficient checkpoint() as await cancel_shielded_checkpoint(); check_cancelled()

We could do this now. It's probably also easy to implement checkpoint as a dedicated trap, like cancel_shielded_checkpoint.

@njsmith
Copy link
Member

njsmith commented Jun 13, 2019

#901, which means the Cancelled-raising path is just raise Cancelled._init()
#958, which makes it fast to see if we're cancelled or not

These don't handle KeyboardInterrupt, which will be an interesting challenge... (especially when we want to check for cancellation from a thread). In particular making this work will be tricky, because it requires delivering a pending KI through the cancellation system to something that isn't the main task:

def thread_func():
    os.kill(os.getpid(), signal.SIGINT)
    trio.from_thread.check_cancel()

async def main():
    await trio.to_thread.run_sync(thread_func)

trio.run(main)

I guess the simplest+fastest approach, if it worked, would be to make all forms of check-for-cancellation first check runner.ki_pending and then check current_cancel_status.effectively_cancelled.

Right now, we instead special-case the main task, so only it ever checks for ki_pending. There's two reasons for this:

  • When a new KI arrives, we have to try waking up some task to handle it. The main task is an obvious choice, since it always exists. This is totally arbitrary though; as far as this point is concerned, any task could work, or it would be fine if we only poked the main task, but allowed any task who notices a ki_pending to raise KeyboardInterrupt

  • KeyboardInterrupt has exactly-once semantics. And waking up a task is asynchronous. When we call the abort_fn, we're committed to waking that task up – it might immediately take some irrevocable action like calling CancelIoEx – but we won't know until later whether we actually succeeded in delivering the KeyboardInterrupt or not. By limiting ki_pending handling to a single task, we make it totally serial, which makes it much easier: we know that eventually this task will handle it. So it's fine to trigger the abort_fn, and if it works then great, and if not then no harm done, we'll just try again later.

The nasty case is: we triggered the abort_fn, we're still waiting to find out whether it worked, and someone calls check_cancel(). At that moment, we have no idea whether we still have a pending KeyboardInterrupt or not. And right now... our machinery doesn't even know whether we're still in that ambiguous period or not, because the semantics of raise_fn are just "once you have this, you can call it at some arbitrary later moment. Or not."

One approach would be to change our abort function interface, so it provides this information: I guess we'd have Abort.SUCCEEDED, Abort.FAILED, and Abort.MIGHT_SUCCEED_LATER. Plus some way to report back "yeah no, said it might succeed later but it didn't". Then when we need to deliver a KI, we could try calling the abort_fn, and if it says SUCCEEDED we're done, if it says FAILED then it's safe to set pending_ki = True and let any task consume it, and if it says MIGHT_SUCCEED_LATER then we... put pending_ki in some kind of third state, and only set it back to True if we later get a notification that the abort didn't succeed. This sounds possible but is complicated. (And wait_task_rescheduled is already too complicated!)

Another idea: in run_sync_in_thread, start the thread, then block in wait_task_rescheduled. In our blocked state, if our abort_fn is called, stash the raise_fn somewhere where the thread can see it. Then trio.from_thread.check_cancel() is just something like if THREAD_STATE.raise_fn is not None: raise_fn().

This might be simpler, but I bet it will also have some complications by the time we figure out how to forward arbitrary cancellations to the child thread, and also let the thread re-enter trio and deliver cancellations to those tasks.

This overlaps with #642.

@oremanj
Copy link
Member Author

oremanj commented Jun 21, 2019

Yeah, KI being edge-triggered adds all sorts of unfortunate edge cases like this.

I wonder if we can just... drop the ability for the cancellation system to raise KI at all? You still get KI if it's delivered while non-protected user code is running. If it's delivered while KI protected, or during a sleep, then we cancel the system nursery and raise KI once everything unwinds. It could chain with the tree of Cancelled exceptions that propagates out of the main task.

Benefits:

  • only one way to cancel things
  • the main task isn't so special anymore
  • if you Ctrl+C your program because it's deadlocked or something, you get a traceback of all tasks, which is a lot more useful IMO than just getting the main task

Drawbacks: you can't directly catch KeyboardInterrupt within async code. Maybe others I'm not thinking of.

@njsmith
Copy link
Member

njsmith commented Jun 22, 2019

Huh! interesting idea!

We have a nice trick to cancel-everything-and-raise: spawn a system task and have it raise the exception. Using this to inject KeyboardInterrupt would mean we had to rework things a bit so it didn't get converted to TrioInternalError, but we maybe need to do that anyway for #1056. (Basically we would inject a "user-flavored system task" rather than a "trio-flavored system task".)

if you Ctrl+C your program because it's deadlocked or something, you get a traceback of all tasks, which is a lot more useful IMO than just getting the main task

This part would require a special-case hack so that this exception in particular sticks the Cancelled tree..... somewhere, not 100% sure where. (__context__ seems a bit weird since that usually means something else.) And whether you got that special traceback would depend on whether the control-C got injected into user code vs. injected through the fallback mechanism.

I'm thinking maybe we should consider this separately as part of our overall make-deadlock-less-frustrating efforts. (Which start with #1085 and #927. Maybe the stack tree dumper should also be triggered on control-C on Windows when running with python -X dev or something, since windows doesn't have SIGIO? Or if control-C gets hit a bunch of times quickly?)

Another advantage you didn't mention: I think we could simplify wait_task_rescheduled by getting rid of raise_fn entirely.

Drawbacks: you can't directly catch KeyboardInterrupt within async code.

Yeah, I think this is the main drawback. It's a bit tricky to judge: except KeyboardInterrupt always has edge cases where it won't work, but lots of people do it anyway. If the KeyboardInterrupt is raised from trio.run but not inside trio, then people can still catch it if they set up their try in the right place... but not everyone will realize that. So I don't have a great sense of how strong the impact will be.

Something that just occurred to me: the "inject a task that does raise KeyboardInterrupt" trick actually works in any nursery, not just the system nursery. And most Trio programs contain at least one user nursery. We could pick an arbitrary user nursery and use that to inject the KeyboardInterrupt. We would still need to have some policy for handling programs where there are no other nurseries, but it would dramatically reduce how often we needed to use it...

@smurfix
Copy link
Contributor

smurfix commented Jun 22, 2019

If the KeyboardInterrupt is raised from trio.run but not inside trio, then people can still catch it if they set up their try in the right place... but not everyone will realize that.

So document it. Also, people will realize this the first time they hit the problem. We could also play with variable names etc. in the stack trace again, to tell them. ;-)

njsmith added a commit to njsmith/trio that referenced this issue Jun 14, 2020
This is a bit of a kluge, and will hopefully be cleaned up in the
future when we overhaul KeyboardInterrupt (python-triogh-733) and/or the
cancellation checking APIs (python-triogh-961). But 'checkpoint()' is a common
operation, and this speeds it up by ~7x, so... not bad for a ~4 line
change.

Before this change:

- `await checkpoint()`: ~24k/second
- `await cancel_shielded_checkpoint()`: ~180k/second

After this change:

- `await checkpoint()`: ~170k/second
- `await cancel_shielded_checkpoint()`: ~180k/second

Benchmark script:

```python
import time
import trio

LOOP_SIZE = 1_000_000

async def main():
    start = time.monotonic()
    for _ in range(LOOP_SIZE):
        await trio.lowlevel.checkpoint()
        #await trio.lowlevel.cancel_shielded_checkpoint()
    end = time.monotonic()
    print(f"{LOOP_SIZE / (end - start):.2f} schedules/second")

trio.run(main)
```
njsmith added a commit to njsmith/trio that referenced this issue Jun 14, 2020
This is a bit of a kluge, and will hopefully be cleaned up in the
future when we overhaul KeyboardInterrupt (python-triogh-733) and/or the
cancellation checking APIs (python-triogh-961). But 'checkpoint()' is a common
operation, and this speeds it up by ~7x, so... not bad for a ~4 line
change.

Before this change:

- `await checkpoint()`: ~24k/second
- `await cancel_shielded_checkpoint()`: ~180k/second

After this change:

- `await checkpoint()`: ~170k/second
- `await cancel_shielded_checkpoint()`: ~180k/second

Benchmark script:

```python
import time
import trio

LOOP_SIZE = 1_000_000

async def main():
    start = time.monotonic()
    for _ in range(LOOP_SIZE):
        await trio.lowlevel.checkpoint()
        #await trio.lowlevel.cancel_shielded_checkpoint()
    end = time.monotonic()
    print(f"{LOOP_SIZE / (end - start):.2f} schedules/second")

trio.run(main)
```
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants