Increase robustness of Semaphore.release #4151

Merged
merged 6 commits into dask:master from lr4d:add_retry_to_semaphore_release on Oct 15, 2020

Conversation

@lr4d (Contributor) commented on Oct 6, 2020

The Semaphore.release method does not handle connection errors gracefully. If a connection error occurs, the resulting exception may cause a catastrophic failure of the computation.

This PR adds retries to the semaphore release method, configured via "distributed.comm.retry.count". This can be useful when the scheduler is under high load and some release calls fail.

This PR also avoids raising an exception when a release fails: instead, a message is logged at ERROR level and Semaphore.release returns False. The lease will eventually be cleaned up by the semaphore lease validation check, which can be configured via "distributed.scheduler.locks.lease-validation-interval" and "distributed.scheduler.locks.lease-timeout".

Regarding the "Released too often" RuntimeError that affects #4147: one could argue that this exception could also be converted into a log message, but I have decided not to change that until we have more clarity on the background of that issue.

cc @fjetter

@fjetter (Member) left a comment

For the record, this has nothing to do with #4147. As it looks right now, #4147 might be an issue where the leases are never properly acquired. That is effectively the scenario this exception should protect you from; therefore, I'm inclined to keep raising this exception for now.

@fjetter (Member) commented on Oct 8, 2020

While we're at it, the semaphore refresh-leases coroutine might also fail due to flaky connections. Since it is scheduled via the add_callback functionality, I'm not sure what the behaviour there will be. Can you have a look into this as well? (It might be more difficult to provoke in test cases, though.)
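
A hedged sketch of what guarding that periodic refresh might look like (the wrapper name and the `sem._refresh_leases()` call are hypothetical placeholders, not the actual Semaphore internals):

```python
import logging

from distributed.comm.core import CommClosedError

logger = logging.getLogger("distributed.semaphore")


async def _refresh_leases_safely(sem):
    """Hypothetical wrapper: keep the periodic lease refresh alive across
    flaky connections instead of letting the scheduled callback die."""
    try:
        # The real refresh would be an RPC call to the scheduler here.
        await sem._refresh_leases()
    except (OSError, CommClosedError) as exc:
        # A flaky connection should not kill the callback; log and let the
        # next interval try again.
        logger.error("Refreshing leases failed (%r); retrying next interval", exc)


# Registered roughly the way a periodic callback normally is, e.g. via
# tornado's PeriodicCallback with an interval in milliseconds:
# pc = PeriodicCallback(lambda: _refresh_leases_safely(sem), callback_time=5000)
# pc.start()
```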

@lr4d (Contributor, author) commented on Oct 8, 2020

@fjetter I'll have a look, but I would probably open a different PR for that.

@lr4d (Contributor, author) commented on Oct 9, 2020

By the way, @fjetter: while adding retries to the refresh-leases method, I've seen that these flaky connections can also raise an exception in Semaphore.get_value(). We could probably refactor the code a bit to apply retry operations to all methods that involve scheduler communication in a cleaner way than is currently done.

lr4d added 5 commits October 9, 2020 12:56
@lr4d force-pushed the add_retry_to_semaphore_release branch from 001107f to e950a51 on October 9, 2020 11:04
@fjetter (Member) commented on Oct 12, 2020

> We could probably refactor the code a bit to apply retry operations to all methods that involve scheduler communication in a cleaner way than is currently done.

One could probably put this directly in the client.scheduler hook, but I suggest doing that in another PR. I guess it has non-trivial implications.
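
A rough sketch of that idea, as a thin retrying wrapper around scheduler calls using the existing retry_operation helper (the handler name below is illustrative; this is not the implementation chosen in this PR):

```python
from distributed.utils_comm import retry_operation


async def _scheduler_call_with_retries(client, op, **kwargs):
    # Look up the scheduler handler (e.g. "semaphore_release"; the name is
    # illustrative) and route the call through retry_operation, which honours
    # the "distributed.comm.retry.*" configuration.
    handler = getattr(client.scheduler, op)
    return await retry_operation(handler, operation=op, **kwargs)
```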

@fjetter (Member) commented on Oct 12, 2020

Test failures seem to be unrelated. I re-triggered the run and reported #4163.

@TomAugspurger (Member) commented

Hmm, the failure happened again... I don't recall seeing it on other branches, but I agree that it doesn't look related.

I'm looking into it this morning.

@TomAugspurger (Member) commented

Maybe good news: that test also failed in https://github.com/dask/distributed/pull/4164/checks?check_run_id=1242648960#step:8:115, so I think it can be ignored here. I'll continue to try to debug it.

@fjetter merged commit 2b43b40 into dask:master on Oct 15, 2020
@lr4d deleted the add_retry_to_semaphore_release branch on October 15, 2020 11:53