Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add validation to BatchedSend and convert to asyncio #6389

Open
Tracked by #6384
gjoseph92 opened this issue May 20, 2022 · 3 comments
Open
Tracked by #6384

Add validation to BatchedSend and convert to asyncio #6389

gjoseph92 opened this issue May 20, 2022 · 3 comments

Comments

@gjoseph92
Copy link
Collaborator

BatchedSend has a few edge cases which can silently put it into a broken state. (See my review on #5457 for an enumeration of its issues.) Despite being a core piece of networking infrastructure, it has little internal validation or documentation of its methods' contracts. This means that future contributors can easily make innocuous-looking refactors to code that uses BatchedSend, and introduce edge-case race conditions or deadlocks without receiving any error. This is exactly what happened in #3493: a small change that would have failed simple assertions within BatchedSend introduced deadlocks (#5480) which were not resolved for two years.

We attempted to do this in #6329. However, that proved difficult to merge because BatchedSend is actually used in different ways, with different expectations, on the worker vs scheduler vs client (in large part, because it doesn't have this validation or clear contracts).

With #6350, the primary edge-case use of BatchedSend has been eliminated (restarting an instance with a new comm object). It should now be pretty easy to add some internal state validation and documentation to BatchedSend, to prevent problems like this in the future. In the process, we can also modernize convert Tornado and gen_coroutine uses to asyncio.

@fjetter
Copy link
Member

fjetter commented Jun 21, 2022

@gjoseph92 can you be more specific about what kind of validations are lacking and what needs to be done?

@gjoseph92
Copy link
Collaborator Author

See my review on #5457 for an enumeration of its issues

Pretty much the work of this ticket is identifying them, so I'd rather not list them all out here.

@gjoseph92
Copy link
Collaborator Author

FYI, a year ago I worked on switching BatchedSend to asyncio. It's pretty simple, but I also think it didn't quite work—there's a reason I didn't open a PR (possibly something around calling start from inside or outside the event loop?). Now that usage has changed with #6350, maybe this will be easier now.

https://github.com/dask/distributed/compare/main...gjoseph92:distributed:asyncio-batched-send?expand=1

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants