Proposal: make checkpoint_if_cancelled() sync-colored and rename it accordingly #961

oremanj · 2019-03-01T22:00:13Z

I'd like to propose that we deprecate checkpoint_if_cancelled() in favor of a synchronous function that I'm thinking of as trio.check_cancelled() (happy to take suggestions on the name, or move into hazmat). This would raise Cancelled if the current scope is cancelled, and do nothing otherwise. i.e., it is a pure cancel point, not a schedule point, not even a schedule point if we're cancelled.

This becomes possible due to some recent or pending changes:

Consider relaxing the await-always-checkpoints rule to await-always-checkpoints-on-non-error #474, which permits the sync-color by saying we don't need to checkpoint before raising an exception (I'm assuming Cancelled counts as an exception for the purposes of that rule)
Delay deciding which cancel scope a Cancelled exception belongs to #901, which means the Cancelled-raising path is just raise Cancelled._init()
Refactor cancellation system to eagerly propagate effective state #958, which makes it fast to see if we're cancelled or not

And I think would have benefits for:

Either optimize yield_briefly or de-optimize yield_briefly_no_cancel #70, by implementing efficient checkpoint() as await cancel_shielded_checkpoint(); check_cancelled()
Propagating cancellation through threads #606, allowing run_sync_in_worker_thread functions to call check_cancelled() occasionally without messing with portals (because it's synchronous! and if/when Refactor cancellation system to eagerly propagate effective state #958 lands I think we can make the check_cancelled() operation sufficiently thread-safe)
iirc Discussion: Cancel-myself-now operation? cancel-and-wait-for-it-to-finish operation? #658 indicated some user interest in a sync-colored raise-cancelled operation?

Possible downside: a sync-colored function called from an async function can now call check_cancelled() and wind up raising Cancelled. Previously that couldn't happen. My feeling is that "functions that do something off-the-wall might raise an exception their caller wasn't expecting" is hardly a new problem, but maybe there's some aspect of what we'd be giving up here that I'm missing.

The text was updated successfully, but these errors were encountered:

njsmith · 2019-03-02T02:08:18Z

I think the first step here is to implement #474 – update docs and tests, and clean up some of the current clutter for checkpointing error paths. (Of course this last part may take a while and it doesn't have to be perfect, but I think doing at least some cleanup will help make it clearer what we want here.)

implementing efficient checkpoint() as await cancel_shielded_checkpoint(); check_cancelled()

We could do this now. It's probably also easy to implement checkpoint as a dedicated trap, like cancel_shielded_checkpoint.

njsmith · 2019-06-13T08:09:22Z

#901, which means the Cancelled-raising path is just raise Cancelled._init()
#958, which makes it fast to see if we're cancelled or not

These don't handle KeyboardInterrupt, which will be an interesting challenge... (especially when we want to check for cancellation from a thread). In particular making this work will be tricky, because it requires delivering a pending KI through the cancellation system to something that isn't the main task:

def thread_func():
    os.kill(os.getpid(), signal.SIGINT)
    trio.from_thread.check_cancel()

async def main():
    await trio.to_thread.run_sync(thread_func)

trio.run(main)

I guess the simplest+fastest approach, if it worked, would be to make all forms of check-for-cancellation first check runner.ki_pending and then check current_cancel_status.effectively_cancelled.

Right now, we instead special-case the main task, so only it ever checks for ki_pending. There's two reasons for this:

When a new KI arrives, we have to try waking up some task to handle it. The main task is an obvious choice, since it always exists. This is totally arbitrary though; as far as this point is concerned, any task could work, or it would be fine if we only poked the main task, but allowed any task who notices a ki_pending to raise KeyboardInterrupt
KeyboardInterrupt has exactly-once semantics. And waking up a task is asynchronous. When we call the abort_fn, we're committed to waking that task up – it might immediately take some irrevocable action like calling CancelIoEx – but we won't know until later whether we actually succeeded in delivering the KeyboardInterrupt or not. By limiting ki_pending handling to a single task, we make it totally serial, which makes it much easier: we know that eventually this task will handle it. So it's fine to trigger the abort_fn, and if it works then great, and if not then no harm done, we'll just try again later.

The nasty case is: we triggered the abort_fn, we're still waiting to find out whether it worked, and someone calls check_cancel(). At that moment, we have no idea whether we still have a pending KeyboardInterrupt or not. And right now... our machinery doesn't even know whether we're still in that ambiguous period or not, because the semantics of raise_fn are just "once you have this, you can call it at some arbitrary later moment. Or not."

One approach would be to change our abort function interface, so it provides this information: I guess we'd have Abort.SUCCEEDED, Abort.FAILED, and Abort.MIGHT_SUCCEED_LATER. Plus some way to report back "yeah no, said it might succeed later but it didn't". Then when we need to deliver a KI, we could try calling the abort_fn, and if it says SUCCEEDED we're done, if it says FAILED then it's safe to set pending_ki = True and let any task consume it, and if it says MIGHT_SUCCEED_LATER then we... put pending_ki in some kind of third state, and only set it back to True if we later get a notification that the abort didn't succeed. This sounds possible but is complicated. (And wait_task_rescheduled is already too complicated!)

Another idea: in run_sync_in_thread, start the thread, then block in wait_task_rescheduled. In our blocked state, if our abort_fn is called, stash the raise_fn somewhere where the thread can see it. Then trio.from_thread.check_cancel() is just something like if THREAD_STATE.raise_fn is not None: raise_fn().

This might be simpler, but I bet it will also have some complications by the time we figure out how to forward arbitrary cancellations to the child thread, and also let the thread re-enter trio and deliver cancellations to those tasks.

This overlaps with #642.

oremanj · 2019-06-21T20:35:18Z

Yeah, KI being edge-triggered adds all sorts of unfortunate edge cases like this.

I wonder if we can just... drop the ability for the cancellation system to raise KI at all? You still get KI if it's delivered while non-protected user code is running. If it's delivered while KI protected, or during a sleep, then we cancel the system nursery and raise KI once everything unwinds. It could chain with the tree of Cancelled exceptions that propagates out of the main task.

Benefits:

only one way to cancel things
the main task isn't so special anymore
if you Ctrl+C your program because it's deadlocked or something, you get a traceback of all tasks, which is a lot more useful IMO than just getting the main task

Drawbacks: you can't directly catch KeyboardInterrupt within async code. Maybe others I'm not thinking of.

njsmith · 2019-06-22T01:09:26Z

Huh! interesting idea!

We have a nice trick to cancel-everything-and-raise: spawn a system task and have it raise the exception. Using this to inject KeyboardInterrupt would mean we had to rework things a bit so it didn't get converted to TrioInternalError, but we maybe need to do that anyway for #1056. (Basically we would inject a "user-flavored system task" rather than a "trio-flavored system task".)

if you Ctrl+C your program because it's deadlocked or something, you get a traceback of all tasks, which is a lot more useful IMO than just getting the main task

This part would require a special-case hack so that this exception in particular sticks the Cancelled tree..... somewhere, not 100% sure where. (__context__ seems a bit weird since that usually means something else.) And whether you got that special traceback would depend on whether the control-C got injected into user code vs. injected through the fallback mechanism.

I'm thinking maybe we should consider this separately as part of our overall make-deadlock-less-frustrating efforts. (Which start with #1085 and #927. Maybe the stack tree dumper should also be triggered on control-C on Windows when running with python -X dev or something, since windows doesn't have SIGIO? Or if control-C gets hit a bunch of times quickly?)

Another advantage you didn't mention: I think we could simplify wait_task_rescheduled by getting rid of raise_fn entirely.

Drawbacks: you can't directly catch KeyboardInterrupt within async code.

Yeah, I think this is the main drawback. It's a bit tricky to judge: except KeyboardInterrupt always has edge cases where it won't work, but lots of people do it anyway. If the KeyboardInterrupt is raised from trio.run but not inside trio, then people can still catch it if they set up their try in the right place... but not everyone will realize that. So I don't have a great sense of how strong the impact will be.

Something that just occurred to me: the "inject a task that does raise KeyboardInterrupt" trick actually works in any nursery, not just the system nursery. And most Trio programs contain at least one user nursery. We could pick an arbitrary user nursery and use that to inject the KeyboardInterrupt. We would still need to have some policy for handling programs where there are no other nurseries, but it would dramatically reduce how often we needed to use it...

smurfix · 2019-06-22T07:25:13Z

If the KeyboardInterrupt is raised from trio.run but not inside trio, then people can still catch it if they set up their try in the right place... but not everyone will realize that.

So document it. Also, people will realize this the first time they hit the problem. We could also play with variable names etc. in the stack trace again, to tell them. ;-)

This is a bit of a kluge, and will hopefully be cleaned up in the future when we overhaul KeyboardInterrupt (python-triogh-733) and/or the cancellation checking APIs (python-triogh-961). But 'checkpoint()' is a common operation, and this speeds it up by ~7x, so... not bad for a ~4 line change. Before this change: - `await checkpoint()`: ~24k/second - `await cancel_shielded_checkpoint()`: ~180k/second After this change: - `await checkpoint()`: ~170k/second - `await cancel_shielded_checkpoint()`: ~180k/second Benchmark script: ```python import time import trio LOOP_SIZE = 1_000_000 async def main(): start = time.monotonic() for _ in range(LOOP_SIZE): await trio.lowlevel.checkpoint() #await trio.lowlevel.cancel_shielded_checkpoint() end = time.monotonic() print(f"{LOOP_SIZE / (end - start):.2f} schedules/second") trio.run(main) ```

oremanj added design discussion potential API breaker labels Mar 2, 2019

njsmith mentioned this issue Mar 4, 2019

Refactor cancellation system to eagerly propagate effective state #958

Merged

oremanj mentioned this issue Mar 12, 2019

Maybe rework run_in_worker_thread API for cancellation handling? #60

Open

oremanj added cancellation threads labels May 4, 2019

This was referenced Jun 13, 2019

Some way to take a cancellation in one call to wait_task_rescheduled and forward it to another task #642

Open

Better usability for threads #810

Open

oremanj mentioned this issue Oct 17, 2019

wait_task_rescheduled API revision umbrella issue #1253

Open

njsmith mentioned this issue May 12, 2020

Rethink KeyboardInterrupt protection APIs #733

Open

njsmith mentioned this issue Jun 14, 2020

Make trio.lowlevel.checkpoint much faster #1613

Merged

richardsheridan mentioned this issue Aug 8, 2022

Expand cancellation usability from native trio threads #2392

Merged

arthur-tacca mentioned this issue Apr 12, 2024

Cancellation may not always immediately happen at checkpoints #2973

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Proposal: make checkpoint_if_cancelled() sync-colored and rename it accordingly #961

Proposal: make checkpoint_if_cancelled() sync-colored and rename it accordingly #961

oremanj commented Mar 1, 2019

njsmith commented Mar 2, 2019

njsmith commented Jun 13, 2019

oremanj commented Jun 21, 2019

njsmith commented Jun 22, 2019

smurfix commented Jun 22, 2019

Proposal: make checkpoint_if_cancelled() sync-colored and rename it accordingly #961

Proposal: make checkpoint_if_cancelled() sync-colored and rename it accordingly #961

Comments

oremanj commented Mar 1, 2019

njsmith commented Mar 2, 2019

njsmith commented Jun 13, 2019

oremanj commented Jun 21, 2019

njsmith commented Jun 22, 2019

smurfix commented Jun 22, 2019