Increase robustness of Semaphore.release #4151

Merged
merged 6 commits into dask:master from lr4d:add_retry_to_semaphore_release on Oct 15, 2020

Conversation

@lr4d (Contributor) commented on Oct 6, 2020

The Semaphore.release method does not handle connection errors gracefully. If a connection error occurs, the resulting exception may cause a catastrophic failure of the computation.

This PR adds retries to the semaphore release method, configured via "distributed.comm.retry.count". This can be useful when the scheduler is under high load and some release calls fail.

This PR also avoids raising an exception when a release fails: instead, a message is logged at ERROR level and Semaphore.release returns False. The lease will eventually be cleaned up by the semaphore lease validation check, which can be configured via "distributed.scheduler.locks.lease-validation-interval" and "distributed.scheduler.locks.lease-timeout".

Regarding the "Released too often" RuntimeError that affects #4147: one could argue that this exception could also be converted into a log message, but I have decided not to change that until we have more clarity on the background of that issue.

cc @fjetter

@fjetter (Member) left a comment

For the record, this has nothing to do with #4147. As it looks right now, #4147 might be an issue where the leases are never properly acquired. That is effectively the scenario this exception should protect you from; therefore, I'm inclined to keep raising this exception for now.

@fjetter (Member) commented on Oct 8, 2020

While we're at it, the semaphore refresh-leases coroutine might also fail due to flaky connections. Since it is scheduled via the add_callback functionality, I'm not sure what the behaviour there will be. Can you have a look into this as well? (It might be more difficult to provoke in test cases, though.)
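
A hedged sketch of what guarding that periodic refresh might look like (the wrapper name and the `sem._refresh_leases()` call are hypothetical placeholders, not the actual Semaphore internals):

```python
import logging

from distributed.comm.core import CommClosedError

logger = logging.getLogger("distributed.semaphore")


async def _refresh_leases_safely(sem):
    """Hypothetical wrapper: keep the periodic lease refresh alive across
    flaky connections instead of letting the scheduled callback die."""
    try:
        # The real refresh would be an RPC call to the scheduler here.
        await sem._refresh_leases()
    except (OSError, CommClosedError) as exc:
        # A flaky connection should not kill the callback; log and let the
        # next interval try again.
        logger.error("Refreshing leases failed (%r); retrying next interval", exc)


# Registered roughly the way a periodic callback normally is, e.g. via
# tornado's PeriodicCallback with an interval in milliseconds:
# pc = PeriodicCallback(lambda: _refresh_leases_safely(sem), callback_time=5000)
# pc.start()
```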

@lr4d (Contributor, author) commented on Oct 8, 2020

@fjetter I'll have a look, but I would probably open a different PR for that.

@lr4d (Contributor, author) commented on Oct 9, 2020

By the way, @fjetter: while adding retries to the refresh-leases method, I've seen that these flaky connections can also raise an exception in Semaphore.get_value(). We could probably refactor the code a bit to apply retry operations to all methods that involve scheduler communication in a cleaner way than is currently done.

lr4d added 5 commits October 9, 2020 12:56
@lr4d force-pushed the add_retry_to_semaphore_release branch from 001107f to e950a51 on October 9, 2020 11:04
@fjetter (Member) commented on Oct 12, 2020

> We could probably refactor the code a bit to apply retry operations to all methods that involve scheduler communication in a cleaner way than is currently done.

One could probably put this directly in the client.scheduler hook, but I suggest doing that in another PR. I guess it has non-trivial implications.
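
A rough sketch of that idea, as a thin retrying wrapper around scheduler calls using the existing retry_operation helper (the handler name below is illustrative; this is not the implementation chosen in this PR):

```python
from distributed.utils_comm import retry_operation


async def _scheduler_call_with_retries(client, op, **kwargs):
    # Look up the scheduler handler (e.g. "semaphore_release"; the name is
    # illustrative) and route the call through retry_operation, which honours
    # the "distributed.comm.retry.*" configuration.
    handler = getattr(client.scheduler, op)
    return await retry_operation(handler, operation=op, **kwargs)
```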

@fjetter (Member) commented on Oct 12, 2020

Test failures seem to be unrelated. I re-triggered the run and reported #4163.

@TomAugspurger (Member) commented

Hmm, the failure happened again... I don't recall seeing it on other branches, but I agree that it doesn't look related.

I'm looking into it this morning.

@TomAugspurger (Member) commented

Maybe good news: that test also failed in https://github.com/dask/distributed/pull/4164/checks?check_run_id=1242648960#step:8:115, so I think it can be ignored here. I'll continue to try to debug it.

@fjetter merged commit 2b43b40 into dask:master on Oct 15, 2020
@lr4d deleted the add_retry_to_semaphore_release branch on October 15, 2020 11:53