-
-
Notifications
You must be signed in to change notification settings - Fork 734
Increase robustness of Semaphore.release #4151
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
While we're at it, the semaphore refresh leases coroutine might also have the potential to fail due to flaky connections. Since it is scheduler via the |
@fjetter I'll have a look but would probably open a different PR for that |
btw @fjetter while adding retries to the refresh-leases method, I've seen that these flaky connections can also raise an exception in |
This adds retries to the semaphore release method, which must be configured using "distributed.comm.retry.count". This may come to use when the scheduler is under high load and some release calls may fail. This PR also avoids raising an exception in case a release fails, instead a warning is logged at ERROR level, and the Semaphore.release method will return `False`. The lease will eventually be cleaned up during the semaphore lease validation check, which may be configured using "distributed.scheduler.locks.lease-validation-interval" and ""distributed.scheduler.locks.lease-timeout".
001107f
to
e950a51
Compare
One could probably put this directly in the |
Test failures seem to be unrelated. I re-triggered the run and reported #4163 |
Hmm the failure happened again... I don't recall seeing it on other branches, but I agree that it doesn't look related. I'm looking into it this morning. |
Maybe good news, that test failed in https://github.com/dask/distributed/pull/4164/checks?check_run_id=1242648960#step:8:115, so I think it can be ignored here. I'll continue to try and debug it. |
The Semaphore.release method does not handle connection errors gracefully. If a connection error appears, the exception may cause a catastrophic failure of the computation.
This PR adds retries to the semaphore release method, which must be
configured using "distributed.comm.retry.count". This may come to
use when the scheduler is under high load and some release
calls may fail.
This PR also avoids raising an exception in case a release fails,
instead a warning is logged at ERROR level, and the Semaphore.release
method will return
False
. The lease will eventually be cleaned upduring the semaphore lease validation check, which may be configured using
"distributed.scheduler.locks.lease-validation-interval" and
"distributed.scheduler.locks.lease-timeout".
Wrt the "Released too often" RuntimeError that affects #4147, one may argue that exception could also be converted into a log message, but I have decided not to change that until we have some more clarity on the background of that issue.
cc @fjetter