-
Notifications
You must be signed in to change notification settings - Fork 6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Retry and exception for hang on memory store full #5143
Conversation
Test PASSed. |
Shouldn't this be done in plasma? |
Probably, this is just an initial investigation.. |
Test FAILed. |
@ericl what should be done in plasma, the retries? |
Test FAILed. |
Test FAILed. |
Test FAILed. |
Test FAILed. |
Test PASSed. |
Test PASSed. |
Test FAILed. |
python/ray/worker.py
Outdated
@@ -931,16 +931,33 @@ def _process_task(self, task, function_execution_info): | |||
finally: | |||
self._current_task = None | |||
|
|||
# Store the outputs in the local object store. | |||
retries_left = 5 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's make this and delay
constants in ray_constants.py
.
python/ray/worker.py
Outdated
if num_returns == 1: | ||
outputs = (outputs, ) | ||
self._store_outputs_in_object_store(return_object_ids, outputs) | ||
while retries_left: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should actually put this retry loop in a lower-level function, either Worker.put_object
or Worker.store_and_register
. The current fix might work for the return value problem, but we'll probably run into the same issue with ray.put
. Also, it seems like this might not work if there are multiple return values and only one of them fails, since then we'll try to put all of the return values into the store again.
Is this fixed? |
Test FAILed. |
Test FAILed. |
Test PASSed. |
Test PASSed. |
Test PASSed. |
Test FAILed. |
Test FAILed. |
Test FAILed. |
Test FAILed. |
Test FAILed. |
python/ray/tests/test_actor.py
Outdated
return np.zeros(10**7 + 2, dtype=np.uint8) | ||
|
||
actor = LargeMemoryActor.remote() | ||
with pytest.raises(ray.exceptions.RayActorError): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shouldn't this exception also be plasma.PlasmaStoreFull
?
python/ray/worker.py
Outdated
"so are are falling back to cloudpickle.".format(type(value))) | ||
logger.warning(warning_message) | ||
self.store_and_register(object_id, value) | ||
def try_store_and_register(): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you make this a separate method on Worker
instead of a closure?
logger.warning(warning_message) | ||
self.store_and_register(object_id, value) | ||
|
||
delay = ray_constants.DEFAULT_PUT_OBJECT_DELAY |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you add a comment here explaining what happens in the loop?
Test FAILed. |
Test PASSed. |
Test PASSed. |
Test PASSed. |
Looks like |
What do these changes do?
Aims to address #4878.
TODO:
Related issue number
Linter
scripts/format.sh
to lint the changes in this PR.