Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Discussion: Cancel-myself-now operation? cancel-and-wait-for-it-to-finish operation? #658

Closed
belm0 opened this issue Sep 10, 2018 · 27 comments

Comments

@belm0
Copy link
Member

belm0 commented Sep 10, 2018

it isn't apparent from the docs that cancel() is non-raising, and that when you call it from within a nursery the program flow will continue until the next checkpoint

it seems like a common use case will be to abort from within a nursery scope using cancel() and sleep(0). Perhaps nursery should have a cancel_now() or abort() method

my use case was related to having a task meet the start() protocol invariant of either calling started or raising

@belm0
Copy link
Member Author

belm0 commented Sep 10, 2018

this documentation adds to the confusion:

Attempting to raise Cancelled yourself will cause a RuntimeError. It would not be associated with a cancel scope and thus not be caught by Trio. Use cancel_scope.cancel() instead.

which suggests that cancel_scope.cancel() would be a suitable solution for the case when you want to immediately stop execution from within a nursery scope.

@belm0
Copy link
Member Author

belm0 commented Sep 10, 2018

is there a way to make raise Cancelled from within a nursery scope just work? 🙏

@njsmith
Copy link
Member

njsmith commented Sep 10, 2018

We should definitely be clearer about cancel not being a checkpoint in the docs – it's logical and internally consistent but surprising to folks coming from other systems. It's actually a good example for driving home the right way to think about trio cancellation.

If we do add a method that does cancel+checkpoint, then the cancel scope would be a more obvious place for it than the nursery, because it's generic cancellation-related functionality.

my use case was related to having a task meet the start() protocol invariant of either calling started or raising

I'm having some trouble visualizing this... It's one of those times where the use case is super clear in your head because you're looking at it, but the rest of us have a bit more trouble :-). I guess you're... starting a task, and also passing in the nursery where start was called, and sometimes during startup the task decides to abort everything in that nursery, and this isn't an error so it can't raise an exception, but it's also not a success so it can't call started? Is it... something like that?

@njsmith
Copy link
Member

njsmith commented Sep 10, 2018

is there a way to make raise Cancelled from within a nursery scope just work? 🙏

Nurseries can be nested, so how would trio know which nursery you were throwing the Cancelled to?

@belm0
Copy link
Member Author

belm0 commented Sep 10, 2018

Nurseries can be nested, so how would trio know which nursery you were throwing the Cancelled to?

Assuming we added another cancel-like exception for this case, it would apply to the first nursery catching it. That manager would convert it to a cancel scope + checkpoint.

I'm having some trouble visualizing this...

The specific use case was was trying to upgrade this task in trio-websocket to used the start() protocol, and have it call started() after the HTTP handshake is done. Instead of the break statements in the code I wanted to abort the nursery scope. (Granted this may all go away with trio-websocket design improvements.)

https://github.com/HyperionGray/trio-websocket/blob/94fc77b863f70b4d15c16bf44b602224c5b5426f/trio_websocket/__init__.py#L255-L285

@njsmith
Copy link
Member

njsmith commented Sep 10, 2018

trio-websocket

Agreed that it's probably better to do the handshake before starting that task. But also: I don't see why you can't just call started there? Or raise an exception? (Note that if you raise an exception before calling started, it comes out of the call to start, not the normal nursery path, so your startup code can catch it and handle it.)

@njsmith
Copy link
Member

njsmith commented Sep 10, 2018

it would apply to the first nursery catching it.

We really try to avoid constructs like this, where the result depends on some subtle aspect of the context where the code is called. It creates non-obvious coupling and breaks compositionally/separation of concerns.

And this case is actually a really great example of that :-). As it turns out, start internally creates a temporary nursery, spawns the task into that temporary nursery, and then later moves it to the final nursery (when you call started). This gives nice semantics for start being cancelled or an early exception.

We can get away with this, because there's no way for user code to tell that there's a temporary nursery in there (modulo debugging APIs). With your raise Cancelled idea though, this internal implementation trick would become exposed as part of the public API. Also, it means that in your specific case, the nearest nursery actually isn't the one you expected!

@belm0
Copy link
Member Author

belm0 commented Sep 10, 2018

Thank you for the explanation. I think the priorities are:

  1. clarify cancel() behavior in the docs
  2. improve documentation note about explicitly raising Cancelled(), suggesting sleep(0) after cancel() when requesting the cancel from within the cancel scope to have it end program flow
  3. possibly add a cancel + sleep convenience method to cancel scope

@smurfix
Copy link
Contributor

smurfix commented Sep 10, 2018

Well, we could have cancel() return something with an __await__ attribute, so that await scope.cancel() would actually be that "cancel+sleep" convenience method.

Probably not a good idea from the PoV of verifying code correctness, but …

@oremanj
Copy link
Member

oremanj commented Sep 10, 2018

Confusing the matter further, the "cancel and immediately raise" operation only makes sense if it's called from within the cancel scope. In general, one task can cancel some work in another task just fine, but it wouldn't make sense to perform a naive implementation of this "cancel immediately" operation across tasks.

If we go with await cancel_scope.cancel(), we could add the semantics that if invoked from a different task, the await sleeps until the cancel scope actually gets exited. That seems fairly ergonomic and like it might even be useful sometimes, but maybe a bit of a pain to implement compared to other approaches.

In general trio tends to eschew foo() and await foo() both being valid statements that do different things, so we might want a different name: wait_cancelled()? cancel_and_wait()?

The other possible API would be something that does cancel() and raises a Cancelled exception for that scope. It would not need to be async, but would only make sense inside the cancelled scope, so there would need to be a check for that. cancel_and_raise()?

I'd prefer not to reuse the term "abort" as that already has a meaning in trio (waking up a sleeping task to deliver a Cancelled or a KeyboardInterrupt exception).

@belm0
Copy link
Member Author

belm0 commented Sep 11, 2018

Having a separate async cancel method that does the right thing both inside and outside of the cancel scope sounds great.

The naming will have to be a compromise, cancel_and_wait() sounds like the best we could do. Ideally we'd follow trio naming conventions (cancel() and cancel_nowait()), but that would be an API breakage nightmare. While cancel_and_wait doesn't make it obvious that it will always raise when inside the cancelled scope, I think we'll get used to it.

@njsmith
Copy link
Member

njsmith commented Sep 11, 2018

cancel_and_wait

Interesting idea.

This would interact awkwardly with #607, since an unbound cancel scope might never be entered, and thus never exited either. My intuition is that if we have to pick, then unbound cancel scopes are more useful than the ability to wait for a cancel scope to exit, mostly because I've noticed cases where I wanted unbound cancel scopes but not noticed cases where I wanted to wait for a cancellation. But maybe that's because I was looking :-).

I still don't have any intuition for why cancel_and_checkpoint is an important shorthand to support. Using cancel_and_wait to handle both waiting for the cancel scope and also raising immediately is clever, but it seems confusing to have a single method designed to perform opposite operations like this...

@smurfix
Copy link
Contributor

smurfix commented Sep 11, 2018

cancel_and_wait

Meh. If you're inside the scope to be cancelled, this doesn't wait – it raises Cancelled.

I don't want a single method that affects control flow in two very different ways (and is named for only one of them); if we do that, we might as well have a cancel method that can be awaited (or not).

If we don't want that (and I'm no longer sure whether keeping three names in mind is any worse than keeping three ways of using .cancel in the back of your brain, at least not in this case) we might add something like a cancel_me method that raises a RuntimeError if it's acting on a cancellation scope which the current task is not a member of. This method will thus never succeed, thus it behaves and is named differently than cancel_and_wait.

This would interact awkwardly with #607, since an unbound cancel scope might never be entered, and thus never exited either.

Well, I'd wait only for the tasks to exit that are currently within the scope; if none are, return immediately.

@oremanj
Copy link
Member

oremanj commented Sep 11, 2018

If you're inside the scope to be cancelled, this doesn't wait – it raises Cancelled.

Just like anything else that blocks inside a cancelled cancel scope. :-) Though I understand the potential for confusion.

I'd wait only for the tasks to exit that are currently within the scope; if none are, return immediately.

AIUI, unbound cancel scopes are intended to fix races where you might be cancelling before the scope gets entered, so having cancel_and_wait() return before the scope gets entered could be confusing. It seems hard to pick one behavior (wait for the scope to be exited VS return immediately if not entered yet) that would be intuitive in all cases.

if we have to pick, then unbound cancel scopes are more useful than the ability to wait for a cancel scope to exit

I agree - I don't have specific use cases for "wait for cancel scope to exit" either, and it would entail some internal complexity / arguably some dependency inversion (cancel scopes currently don't have anything async going on).

I'd support cancel_me(), or maybe call it cancel_noreturn() or cancel_and_raise() or something else, implemented as a synchronous function. That provides a useful behavior (synchronously raising a Cancelled exception) that's not possible to obtain with other public APIs, and would have a straightforward implementation.

@belm0
Copy link
Member Author

belm0 commented Sep 12, 2018

I'd support cancel_me(), or maybe call it cancel_noreturn() or cancel_and_raise() or something else, implemented as a synchronous function.

I think it's fair to revisit async options if some real use cases show up.

+1 for "cancel and raise" method, though it really needs the RuntimeError safeguard mentioned by @smurfix for "I'm not within the scope being cancelled".

@belm0
Copy link
Member Author

belm0 commented Sep 18, 2018

Just had a case on my project where someone got burned thinking scope cancel is immediate.

I'd like to confirm: any checkpoint will cause all pending cancel requests to be executed?

@njsmith
Copy link
Member

njsmith commented Sep 18, 2018

Currently, yes, at every checkpoint, trio checks all the current cancel scopes, and if any of them are in the cancelled state then it raises Cancelled.

Though now that you mention it... the theoretical guarantee should probably be a bit weaker than that. I suspect that we may eventually want to start skipping checkpoints sometimes as an efficiency measure. So the theoretical guarantee should be something like, "as long as your code executes checkpoints regularly, then cancellation will be delivered fairly promptly", but without guaranteeing that it will actually be delivered at the next checkpoint; maybe it will be after 2-3 checkpoints or something.

I still don't have an intuition for why you want to be able to raise Cancelled immediately out of some code. Cancellation is designed to be used to interrupt other code that's in some partially-unknown state. If you're already in the code that you want to interrupt, you already have full control and can just raise an error or return or whatever. This doesn't mean you're doing something wrong, necessarily, but it does sound like you're trying to fight against the design, which is often not the easiest path forward.

@belm0
Copy link
Member Author

belm0 commented Sep 18, 2018

In this last case the code taking cancel action was actually outside of the cancel scope.

Those weaker guarantees sound problematic. For example if something is managing tasks, and needs to cancel one and start another while ensuring that their lifetimes are mutually exclusive, it would be nice to just call cancel() and start_soon() and be done with it. That's the case now I believe since, on the next checkpoint, the scheduler will first processes cancels and then schedule new tasks. Or if that's too dependent on implementation, sure we can call sleep(0) after cancel to make sure the cancel is handled first.

But if you're saying that may change to require multiple checkpoints to service outstanding cancels, we'd have to adopt more complex signaling for this case.

@njsmith
Copy link
Member

njsmith commented Sep 18, 2018

It's true that right now the Trio scheduler advances by a series of "ticks", and in each tick it runs each runnable task forward by one checkpoint. But please don't rely on this as an inter-task synchronization mechanism. We definitely reserve the right to switch to more sophisticated scheduling algorithms in the future (see #32), and even if we don't, it's just incredibly fragile. Processing a cancellation can easily take multiple "ticks" for the exception to propagate (e.g. if you have any async with blocks, their __aexit__ often contains checkpoints), and some operations can take arbitrarily long to cancel (e.g. run_sync_in_worker_thread – and note that this is used internally by some functions in trio, like the async file aclose method). Also, even in the simplest case where none of these things apply, doing old_thing.cancel(); await sleep(0); nursery.start_soon(new_thing) still won't guarantee that old_thing and new_thing don't overlap. In fact, this is still true even if you do await sleep(0); await sleep(0):

  1. Task A: old_thing goes to sleep, waiting for something to happen
  2. the thing happens; old_thing is placed onto the run queue
  3. Task B: the code above runs – old_thing's scope is marked as cancelled, and then await sleep(0) suspends task B.
  4. The scheduler starts up for the next tick, looks at the run queue, and sees task A and task B. It arbitrarily decides to run task B first.
  5. Task B runs. If we only had one await sleep(0), this would immediately violate our invariant, because the new_thing task would get started before task A is cancelled. But let's say we have two await sleep(0)'s. So, task B gets scheduled, and it immediately gets suspended again and put back on the run queue.
  6. Then task A runs. It's been cancelled, but it was already on the run queue when that happened so it was too late to deliver the cancellation this time around. It runs to its next checkpoint, then notices the cancellation. It gets placed on the run queue, and when it gets resumed it will raise Cancelled.
  7. The scheduler starts up for the next tick, looks at the run queue, and sees task A and task B. It arbitrarily decides to run task B first.
  8. Task B runs, and it calls start_soon(new_thing), even though Task A hasn't even seen a Cancelled exception yet. Our invariant is violated.

You really don't want to have to do this kind of analysis to figure out if your code is correct.

If the semantics you need are cancel-and-wait-for-cancelled-thing-to-exit, then you should implement something that has those semantics :-). Or like... if you need to make sure exactly one of these tasks is running at a time, can't you just have them grab a Lock, so the new task waits for the old task to exit before doing anything?

@belm0
Copy link
Member Author

belm0 commented Sep 19, 2018

Thank you for shedding light on that fragility-- makes sense.

In this particular case, use of locks in the task is not suitable. For the benefit of the task author's mental model, a task being run implies it has the green light to the resources it needs and won't block on things for indefinite time. This simplifies coding of this particular task type and reduces the necessary skill level of authors. Using locking is much different than general use of context managers to ensure things are restored to an expected state (which make take time) once the task ends or is cancelled.

Given what you described, having an awaitable cancel sounds very useful. Please consider it.

As it is I'll have to implement it myself by wrapping tasks and having a centralized object to wait on task completion.

@njsmith
Copy link
Member

njsmith commented Sep 19, 2018

The challenge for awaitable cancel is what I said above... maybe it's confusing that you could have an awaitable cancel that ends up waiting forever because the cancel scope never gets entered? (Perhaps because the code that was going to enter it got cancelled by something else?) That doesn't mean I'm not considering it, I'm just sharing some of my considering :-).

I guess another complexity is whether we want to allow cancel scopes to be re-entered. If we do, then the definition of "cancel scope has exited" becomes even more unclear. I guess we could avoid this by using the "linked" scope idea from the unbound cancel thread, but that idea makes waiting for a cancellation to finish even more poorly defined.

As a general note, I appreciate that you've been trying to distill these cases down to their essence, and also that you might have reasons that you can't share too many details. But generally if you can, I'd actually prefer much more details rather than less... maybe this is just a personal thing, but I find it very difficult to weigh these kinds of design trade-offs without messy, detailed, concrete examples to think about. And right now I know of multiple concrete examples where unbound cancel scopes are useful (because I ran into them myself :-)), but don't have concrete examples of needing to wait for a cancellation to finish, so I find it hard to form an opinion either way...

@python-trio python-trio deleted a comment from njsmith Sep 19, 2018
@smurfix
Copy link
Contributor

smurfix commented Sep 19, 2018

Hmm. "Cancellation finishes" == "no code is running within the cancel scope's context". Seems easy enough; trying to enter an already-cancelled scope would have to raise an AlreadyCancelledError exception if we want to sustain that invariant.

@njsmith
Copy link
Member

njsmith commented Sep 20, 2018

In the current design, where all cancel scopes are associated with exactly one span of running code, then "has the code inside this scope finished?" is a perfectly well-defined question.

AFAICT basically any way of relaxing that rule though leads to nasty snarls, that I suspect are unsolvable.

Having an AlreadyCancelledError happen when you enter an unbound scope seems like it defeats the purpose of unbound cancel scopes, which is that you don't have to care whether cancel is called before or after the code enters the scope, because it acts the same. You might imagine a refinement where we make entering a scope be a checkpoint, and if it's cancelled it just skips the code inside altogether... but that's not something that async with blocks can actually do in Python, so it's impossible. (And also is probably a bad idea, for similar reasons to why trio lets you spawn tasks into an already-cancelled nursery: you want to give finally blocks a chance to execute to clean up resources.)

#607 also discusses other possible extensions of cancel scope lifetimes, like having a scope that you can exit and re-enter (imagine a loop where one part is subject to cancellation and another part is not), or entering the same scope multiple times simultaneously in different tasks. The specific proposal there is to continue to limit each CancelScope object to at most 1 entry/exit, but also allow something like cancel_scope_2 = CancelScope(inherit=cancel_scope_1), to create a new scope that becomes cancelled whenever the old scope becomes cancelled. In this approach, the original cancel scope that gets passed in will generally never be entered; instead all these forked-off copies of it will be entered and exited. And obviously that creates huge problems for code that wants to check whether the cancelled operation has finished, and can only see the original scope, not the forked-off copies.

This issues seem like they're pretty fundamental: if you give cancel scope users flexibility in defining the boundary of the abstract "operation" that you want to cancel, then it becomes hard to tell where that boundary is when you want to check whether the "operation" has finished.

I guess we could turn it around, and say that "wait for this cancel scope to be exited" is only legal in the limited cases where it makes sense (e.g., the scope has to already be entered, can't have had any forks made, etc.). I'm not sure if that would be useful or not.

@belm0
Copy link
Member Author

belm0 commented Sep 20, 2018

I should say that in my recent case, the need to block on scope cancel() was thwarted. I reviewed the code and restructured it to avoid passing around nurseries or cancel scopes. (That's becoming my general principle with trio.) By making the program flow more "linear" I eliminated several cases where a cancel() call was needed.

@smurfix
Copy link
Contributor

smurfix commented Sep 20, 2018

Right – the options seem to be

  • don't have such a thing as an unbound cancel scope
  • don't provide a method that waits until there are no more tasks within the scope
  • prevent a waited-for scope from being entered
    • raise an exception when trying to wait on a scope that has never been entered?

I'm leaning towards the second option. Creating an rwlock-like object that provides variations of "wait-until-no-more-tasks" is easy – and completely orthogonal to the concept of cancel scopes. So if you want such a thing, do it explicitly.

Also, this way would allow for tasks that use the cancel scope but are ignored by the wait. For instance, you might have one read-and-queue task and a number of queue-processor tasks.

@njsmith
Copy link
Member

njsmith commented Sep 20, 2018

By making the program flow more "linear" I eliminated several cases where a cancel() call was needed.

That sounds like a good outcome! There are times when passing around nurseries and cancel scopes is necessary, but it's always better to do the simpler thing when you can...

@njsmith njsmith changed the title scope cancel() sharp edges Discussion: Cancel-myself-now operation? cancel-and-wait-for-it-to-finish operation? Jan 28, 2019
@njsmith
Copy link
Member

njsmith commented Jan 28, 2019

I guess the cancel-and-wait-for-it-to-finish idea is dead now that unbound cancel scopes are merged (#835). And the cancel-myself-now operation doesn't sound like it's found any compelling use cases. (Also, why not just raise <something meaningful>?) So I'm going to close this, at least for now.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants