Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Discussion] Structured concurrency #6201

Open
gjoseph92 opened this issue Apr 26, 2022 · 6 comments
Open

[Discussion] Structured concurrency #6201

gjoseph92 opened this issue Apr 26, 2022 · 6 comments
Labels
discussion Discussing a topic with no specific actions yet

Comments

@gjoseph92
Copy link
Collaborator

This issue is meant for higher-level discussion around dropping Tornado and our async infrastructure #6047.

I count 42 instances of loop.add_callback or loop.call_later in distributed. That's equivalent to saying we have 42 goto statements in distributed.

Unfamiliar with this comparison? Please read https://vorpus.org/blog/notes-on-structured-concurrency-or-go-statement-considered-harmful. (This whole issue is really just saying "please read this". If you've read it, read it again. It took me a few reads for the ideas to really sink in.)

The basic idea is that when you use add_callback, or asyncio.create_task (or go in golang, etc.), you're starting something to run concurrently, but not dealing with the endgame:

  • what if the callback raises an error?
  • what if an error occurs somewhere else and you now need to cancel the callback?
  • how do you even know when the callback is done?

This can be fine. But because you can just fire off callbacks willy-nilly and not deal with their consequences or lifetimes, it's very hard to reason about and maintain code that does this. (For the same reason code that uses goto statements can work, but is hard to reason about and maintain.) When things go as expected, they probably work, but when anything unexpected happens, it's very easy to go off the rails. Of course, a distributed system is just the kind of system where unexpected things are nearly guaranteed to happen, and it's unacceptable to not handle them correctly when they do.

Here are some recent issues that I think, at their core, stem from using unstructured concurrency. That is, using structured concurrency, we'd either not (be likely/able to) design something that could get into this sort of broken state, or an unhandled error would propagate up and shut down the whole worker, instead of ignoring the error and hobbling along doing the wrong thing as though everything's fine:


I'm not saying "we need to rewrite all of distributed to use trio right now!" I recognize this isn't going to happen immediately.

But I think we should at least take the concepts of structured concurrency to heart. TaskGroups are getting added to asyncio in py3.11, and are available now in aiotools and anyio. When using concurrency, we should always:

  • Have a way for the coroutine to propagate errors upwards
  • Have a way to cancel the coroutine

This is kinda possible to do just with plain asyncio... it's just that to do it easily and consistently and reliably (in the face of signal handlers, __del__ methods, exceptions in finally blocks, etc.) you'd start wishing you were using trio.

Also, switching to using trio/anyio may not be as dramatic as you'd think. Trio can be hard to adopt because it doesn't support asyncio-based libraries (though anyio can help with this). But good news—we're not using any asyncio-based libraries (besides Tornado)!

And we might not have to immediately rewrite every loop.add_callback to do structured concurrency "properly" (an async with trio.nursery() block, etc.). Trio's clever "escape hatch" means we could just store a global Nursery object on the Worker/Scheduler/Client instance, and use it a lot like we use the current event loop. Replace self.loop.add_callback -> self.nursery.start_soon and we're in a strictly better state than we were before, because now previously-unhandled exceptions (aka fatal errors) in callbacks will propagate and cleanly shut down the whole worker/scheduler/client. Dramatic, but correct, and better than deadlocking. And then we can go gradually better-structure these places once the tools are available.

But, that said... structured concurrency is so fundamentally different from the traditional horrid mess of callbacks that there's debate on anyio around whether to even write a tutorial on how to port asyncio libraries: agronholm/anyio#245 (comment). It's considered unlikely that trio's advances could be brought back to asyncio. It's probably reasonable to expect that adopting structured concurrency would, eventually, mean many of our existing systems would get re-architected. But if we can do that incrementally (which we probably can with anyio), I think we'd end up in a much more reliable and maintainable state than it would be possible to have with the current unstructured callbacks.

@gjoseph92 gjoseph92 added the discussion Discussing a topic with no specific actions yet label Apr 26, 2022
@sjperkins
Copy link
Member

Unfamiliar with this comparison? Please read https://vorpus.org/blog/notes-on-structured-concurrency-or-go-statement-considered-harmful. (This whole issue is really just saying "please read this". If you've read it, read it again. It took me a few reads for the ideas to really sink in.)

But I think we should at least take the concepts of structured concurrency to heart.

FWIW I'm +1 on this because this is good resource management. Interestingly, this is RAII applied to concurrency primitives in the form of TaskGroups/Nurseries. It's a fundamental paradigm for Modern C++ Programming and Rust enforces it by default. I note that Graydon Hoare, the initial Rust author commented on the post draft.

This is kinda possible to do just with plain asyncio... it's just that to do it easily and consistently and reliably (in the face of signal handlers, del methods, exceptions in finally blocks, etc.) you'd start wishing you were using trio.

This is one of the other benefits of RAII#Benefits. Aside from with statements, which are pure RAII, Python is garbage-collected by default so there's a tendency to always rely on it for resource management which lead to the above cleanup issues. It's great that Nurseries/TaskGroups have been integrated into a with statement.

Finally, I wonder if a distributed Nursery/TaskGroup primitive? (I'm guessing this is unrelated to TaskGroups in the dask/distributed scheduler).

@dhirschfeld
Copy link
Contributor

dhirschfeld commented May 17, 2022

As a trio user myself (for all of the above reasons) I'd love to be able to easily use trio with distributed so that my own code can gain the benefits of structured concurrency 🚀

There may be some useful design ideas in the trio-parallel project, or tractor.

@graingert
Copy link
Member

@dhirschfeld tractor does indeed look great - however the licence is currently AGPL https://github.com/goodboy/tractor/blob/71f19f217d4b9311d09f9e51bdaa23fa9fe5e76e/LICENSE

@dhirschfeld
Copy link
Contributor

however the licence is currently AGPL

Right, best not to look too closely at that one then! 😅

graingert added a commit to graingert/distributed that referenced this issue Aug 5, 2022
graingert added a commit to graingert/distributed that referenced this issue Aug 5, 2022
graingert added a commit to graingert/distributed that referenced this issue Aug 6, 2022
graingert added a commit to graingert/distributed that referenced this issue Aug 9, 2022
graingert added a commit to graingert/distributed that referenced this issue Aug 9, 2022
graingert added a commit to graingert/distributed that referenced this issue Aug 12, 2022
@graingert
Copy link
Member

@graingert
Copy link
Member

graingert added a commit to graingert/distributed that referenced this issue Aug 16, 2022
graingert added a commit to graingert/distributed that referenced this issue Aug 17, 2022
graingert added a commit to graingert/distributed that referenced this issue Aug 18, 2022
graingert added a commit to graingert/distributed that referenced this issue Aug 19, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
discussion Discussing a topic with no specific actions yet
Projects
None yet
Development

No branches or pull requests

4 participants