Skip to content

Reinstate: AMM ReduceReplicas to iterate only on replicated tasks #5341

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 6 commits into from
Sep 28, 2021

Conversation

crusaderky
Copy link
Collaborator

@crusaderky crusaderky commented Sep 22, 2021

@crusaderky
Copy link
Collaborator Author

Cherry picked candidate fix from #5337

@crusaderky
Copy link
Collaborator Author

crusaderky commented Sep 22, 2021

I tried adding

            parent._transitions({key: "released"}, client_msgs, worker_msgs)
+           assert not client_msgs and not worker_msgs
            self.send_all(client_msgs, worker_msgs)

but it doesn't trip any unit tests.
By reading the code I understand that the dicts should be filled by transition_processing_released, specifically at https://github.com/crusaderky/distributed/blob/3c2bcfd66a49bb7e092f9840c2438d9343bbdd94/distributed/scheduler.py#L3006-L3010

transition_erred_released and transition_memory_released also write to the msgs dicts, but the two states are not in worker.PROCESSING (since the message originates from here: https://github.com/crusaderky/distributed/blob/3c2bcfd66a49bb7e092f9840c2438d9343bbdd94/distributed/worker.py#L2748-L2750) so they would only happen in case of grave misalignment between worker and scheduler (I can't think of a legit use case where the worker thinks the task is still processing while the scheduler thinks it's already erred or in memory?)

@fjetter I'm out of ideas... help is appreciated

@crusaderky
Copy link
Collaborator Author

@fjetter correction: the below does not trip any unit tests at all!

+           assert False
            parent._transitions({key: "released"}, client_msgs, worker_msgs)
            self.send_all(client_msgs, worker_msgs)

It's a whole code path that is only triggered in production and this makes me extremely uncomfortable. However, it's been tested to work by our reporter and I think at this point it may be out of scope for this PR? What is your opinion about merging this as it is?

@crusaderky crusaderky marked this pull request as ready for review September 23, 2021 11:53
@crusaderky crusaderky changed the title [DO NOT MERGE] Reinstate: AMM ReduceReplicas to iterate only on replicated tasks Reinstate: AMM ReduceReplicas to iterate only on replicated tasks Sep 23, 2021
@fjetter
Copy link
Member

fjetter commented Sep 23, 2021

the below does not trip any unit tests at all!

exceptions in our transition state machine are not properly propagated. you might actually hit it but it is logged and lost. if you are "lucky" a unit test would deadlock but the unit test will not fail due to the exception. I'll take another look and see if I can provide any guidance.

What is your opinion about merging this as it is?

The problem is that we're removing any call to this handler in #5046 and we have also reports for deadlocking with that PR. However this is triggered, if we go on with the worker state refactoring, we'd introduce the regression without noticing

@crusaderky
Copy link
Collaborator Author

Split out #5359. Note that, as of this PR, stimulus_missing_data (which is dead code) still invokes the old, non-encapsulated changes to who_has/has_what.

@crusaderky
Copy link
Collaborator Author

All test failures unrelated. The code involved in the deadlock is no longer in this PR. Merging tomorrow unless anybody objects.

@crusaderky crusaderky merged commit cc41212 into dask:main Sep 28, 2021
@crusaderky crusaderky deleted the reducereplicas_on_repl_tasks branch September 28, 2021 14:30
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants