-
-
Notifications
You must be signed in to change notification settings - Fork 734
Reinstate: AMM ReduceReplicas to iterate only on replicated tasks #5341
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Reinstate: AMM ReduceReplicas to iterate only on replicated tasks #5341
Conversation
Cherry picked candidate fix from #5337 |
I tried adding parent._transitions({key: "released"}, client_msgs, worker_msgs)
+ assert not client_msgs and not worker_msgs
self.send_all(client_msgs, worker_msgs) but it doesn't trip any unit tests.
@fjetter I'm out of ideas... help is appreciated |
@fjetter correction: the below does not trip any unit tests at all! + assert False
parent._transitions({key: "released"}, client_msgs, worker_msgs)
self.send_all(client_msgs, worker_msgs) It's a whole code path that is only triggered in production and this makes me extremely uncomfortable. However, it's been tested to work by our reporter and I think at this point it may be out of scope for this PR? What is your opinion about merging this as it is? |
exceptions in our transition state machine are not properly propagated. you might actually hit it but it is logged and lost. if you are "lucky" a unit test would deadlock but the unit test will not fail due to the exception. I'll take another look and see if I can provide any guidance.
The problem is that we're removing any call to this handler in #5046 and we have also reports for deadlocking with that PR. However this is triggered, if we go on with the worker state refactoring, we'd introduce the regression without noticing |
Split out #5359. Note that, as of this PR, |
All test failures unrelated. The code involved in the deadlock is no longer in this PR. Merging tomorrow unless anybody objects. |
Reinstate #5297 after it was reverted in #5335.
DO NOT MERGE - this PR has been flagged to cause a scheduler deadlock; further investigation is needed.
XREFS