Avoid deadlock on channel mutex when stopping pool #148

Open
wants to merge 2 commits into
base: master
from


emaxx-google

@emaxx-google emaxx-google commented Dec 20, 2024

Let the message_manager_loop run at least until the PoolManager is fully stopped. This prevents the workers from hanging on a write to a full pipe while holding the channel mutex, which would result in a deadlock, causing a 60-second delay, a BrokenProcessPool exception and an unclean shutdown of the workers.

This fixes #147.
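
To illustrate the "full pipe" part of the failure mode, here is a minimal, standalone sketch (not code from this PR or from the library) showing that a pipe accepts only a bounded number of bytes when nobody reads from it. The helper name `fill_pipe` and the chunk size are made up for this illustration, and the sketch assumes a POSIX-like platform. A blocking writer would hang at the point where this non-blocking one gets `BlockingIOError`; if that writer held a shared mutex at the time, every other process waiting on the mutex would be stuck as well.

```python
import os


def fill_pipe(chunk_size=4096):
    """Write to a pipe that nobody reads until the kernel buffer is full.

    Returns the number of bytes accepted before the pipe refused more data.
    """
    read_fd, write_fd = os.pipe()
    # Non-blocking mode turns "would hang forever" into an exception,
    # so the demonstration itself cannot deadlock.
    os.set_blocking(write_fd, False)
    written = 0
    try:
        while True:
            try:
                written += os.write(write_fd, b"x" * chunk_size)
            except BlockingIOError:
                # The buffer is full. A blocking writer would stall here
                # until some reader drained the pipe.
                return written
    finally:
        os.close(read_fd)
        os.close(write_fd)


if __name__ == "__main__":
    print("pipe accepted", fill_pipe(), "bytes before filling up")
```

The exact capacity is platform-dependent, so the sketch deliberately does not state an expected number.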

f.cancel()
time.sleep(EPS * CNT / 2)
pool.stop()
pool.join()
Author


I haven't found a good way to assert the state here: neither that no BrokenProcessPool was raised in pool_manager_loop(), nor that we didn't sleep for LOCK_TIMEOUT:

  • For the former, I tried changing the PoolContext.status setter to allow transitioning from STOPPED to ERROR, but apparently there are other places that try to set ERROR even in "good" shutdown scenarios.
  • For the latter, it seems brittle to rely on clocks in the tests. Unless, perhaps, we override LOCK_TIMEOUT with a really big number in this test?
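
A rough sketch of the "override LOCK_TIMEOUT" idea, using `unittest.mock.patch.object`. Everything here is an assumption for illustration: `fake_channel` is a stand-in namespace, not the real module that defines LOCK_TIMEOUT, and `acquire_timeout` is a hypothetical helper. The approach only works if the real code reads the constant at call time rather than capturing it at import time (for example as a default argument).

```python
import types
from unittest import mock

# Hypothetical stand-in for whichever module defines LOCK_TIMEOUT.
fake_channel = types.SimpleNamespace(LOCK_TIMEOUT=60)


def acquire_timeout():
    """Hypothetical helper: looks the constant up on every call,
    so a patch applied by the test is visible here."""
    return fake_channel.LOCK_TIMEOUT


def run_with_huge_timeout():
    # Temporarily replace the constant; the original value is restored
    # when the with-block exits, even if the body raises.
    with mock.patch.object(fake_channel, "LOCK_TIMEOUT", 10**6):
        return acquire_timeout()
```

With a very large timeout, a regression would make the test hang instead of recovering silently after 60 seconds, so the test would then need some outer timeout from the test runner to fail rather than block forever.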

Development

Successfully merging this pull request may close these issues.

ProcessPool join hangs for 60 seconds due to intermittent deadlock