
Waiting on tasks on workers that no longer exist #6198

@mrocklin


I was chatting with @bnaul today. He's run into a stuck cluster and has an interesting situation.

The cluster is mostly done with everything, but about 100 tasks have yet to complete. However, nothing is currently running them.

[Screenshot: 2022-04-25, 1:52:36 PM]

Looking at the info pages, it looks like a few tasks are in the processing state.

[Screenshot: 2022-04-25, 1:53:04 PM]

Interestingly these tasks already know their type, so presumably they've run to completion before.

Also interestingly, if I click through to the page for the worker processing that task, I get a 404, meaning that the worker is no longer in the scheduler state.

Somehow, the scheduler thinks that a task is running on a worker that no longer exists.
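To check for this condition directly rather than by clicking through info pages, something like the following might help. It is only a sketch: `find_orphaned_processing` is a hypothetical helper, and it assumes the scheduler exposes `tasks`, `workers`, and per-task `processing_on` (with an `address`) the way `distributed.Scheduler` does. Stand-in objects are used here so it runs without a cluster.

```python
from types import SimpleNamespace


def find_orphaned_processing(scheduler):
    """Return {task key: worker address} for every task that is assigned to
    a worker the scheduler no longer knows about.

    Hypothetical helper; assumes `scheduler.tasks`, `scheduler.workers`
    (keyed by address) and `task.processing_on.address` exist.
    """
    orphaned = {}
    for key, ts in scheduler.tasks.items():
        ws = ts.processing_on
        if ws is not None and ws.address not in scheduler.workers:
            orphaned[key] = ws.address
    return orphaned


# Stand-in objects (not real scheduler state) so the sketch is runnable.
alive = SimpleNamespace(address="tcp://10.0.0.1:1234")
gone = SimpleNamespace(address="tcp://10.0.0.2:1234")
fake_scheduler = SimpleNamespace(
    workers={alive.address: alive},  # `gone` has been removed
    tasks={
        "x": SimpleNamespace(processing_on=alive),
        "y": SimpleNamespace(processing_on=gone),  # the stuck case
        "z": SimpleNamespace(processing_on=None),
    },
)

orphans = find_orphaned_processing(fake_scheduler)
print(orphans)
```

On a live cluster, the same function could presumably be passed through `client.run_on_scheduler` with a wrapper that takes a `dask_scheduler` argument. I haven't tried that on the affected cluster.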

I tried getting a story out of the scheduler, but got no results. I suspect that this is because we've run past the length of the deque that holds the transition log, so the relevant entries have been dropped. I've asked @bnaul to increase the length to infinity and we'll try again the next time there is a failure.
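For anyone unfamiliar with why a bounded log would produce an empty story, here is a toy illustration using a plain `collections.deque`. The `story` function and the log entries are made up for the example and are not the scheduler's actual implementation or record format.

```python
from collections import deque


def story(log, key):
    """Toy stand-in for a story lookup: every log entry about `key`."""
    return [entry for entry in log if entry[0] == key]


bounded = deque(maxlen=3)  # old entries fall off the left when full
unbounded = deque()        # maxlen=None, nothing is ever dropped

for log in (bounded, unbounded):
    # Early transitions for the task we care about...
    log.append(("stuck-task", "released", "waiting"))
    log.append(("stuck-task", "waiting", "processing"))
    # ...followed by enough unrelated traffic to overflow the bounded log.
    for i in range(5):
        log.append((f"other-{i}", "processing", "memory"))

bounded_story = story(bounded, "stuck-task")
unbounded_story = story(unbounded, "stuck-task")
print(bounded_story)
print(len(unbounded_story))
```

With the bounded log, the entries for `stuck-task` are gone by the time we ask for them, which matches what I saw. The unbounded one keeps them, at the cost of memory growing with the number of transitions.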

cc @fjetter

    Labels

    deadlock: The cluster appears to not make any progress
