Retry and exception for hang on memory store full #5143

Merged
richardliaw merged 20 commits into ray-project:master from the hangmem branch on Jul 27, 2019

Conversation

@richardliaw richardliaw commented Jul 8, 2019

What do these changes do?

Aims to address #4878.

TODO:

Related issue number

Linter

  • I've run scripts/format.sh to lint the changes in this PR.

@richardliaw richardliaw changed the title from "Retry and exception for hang on memory store full" to "[do not merge, wip] Retry and exception for hang on memory store full" on Jul 8, 2019
@AmplabJenkins
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/15193/

@ericl
Contributor

ericl commented Jul 8, 2019

Shouldn't this be done in plasma?

@richardliaw
Contributor Author

Probably. This is just an initial investigation.

@AmplabJenkins
Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/15287/

@richardliaw
Contributor Author

@ericl what should be done in plasma, the retries?

@richardliaw richardliaw changed the title from "[do not merge, wip] Retry and exception for hang on memory store full" to "Retry and exception for hang on memory store full" on Jul 10, 2019
@AmplabJenkins
Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/15286/

@AmplabJenkins
Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-Perf-Integration-PRB/1596/

@AmplabJenkins
Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-Perf-Integration-PRB/1597/

@AmplabJenkins
Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-Perf-Integration-PRB/1598/

@AmplabJenkins
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/15288/

@AmplabJenkins
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/15358/

@AmplabJenkins
Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-Perf-Integration-PRB/1662/

@@ -931,16 +931,33 @@ def _process_task(self, task, function_execution_info):
finally:
self._current_task = None

# Store the outputs in the local object store.
retries_left = 5
Contributor

Let's make this retry count and the delay constants in ray_constants.py.

if num_returns == 1:
outputs = (outputs, )
self._store_outputs_in_object_store(return_object_ids, outputs)
while retries_left:
Contributor

I think we should actually put this retry loop in a lower-level function, either Worker.put_object or Worker.store_and_register. The current fix might work for the return value problem, but we'll probably run into the same issue with ray.put. Also, it seems like this might not work if there are multiple return values and only one of them fails, since then we'll try to put all of the return values into the store again.
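
A minimal sketch of this suggestion, not Ray's actual code: `StoreFullError`, `put_with_retry`, and `store_outputs` are hypothetical names, and the retry count and delay are illustrative defaults. Because the retry lives in the per-object put, each return value is retried on its own and values that were already stored are not written again.

```python
import time


class StoreFullError(Exception):
    """Hypothetical stand-in for the object store's 'store full' error."""


def put_with_retry(put_fn, object_id, value, retries=5, delay=0.0):
    """Call put_fn(object_id, value), retrying while the store reports full.

    Returns the number of retries that were needed. The defaults are
    illustrative and are not Ray's constants.
    """
    for attempt in range(retries + 1):
        try:
            put_fn(object_id, value)
            return attempt
        except StoreFullError:
            if attempt == retries:
                # Out of retries: surface the error instead of hanging.
                raise
            time.sleep(delay)


def store_outputs(put_fn, object_ids, outputs, **retry_kwargs):
    """Store each output independently so one failure does not re-put the rest."""
    return [
        put_with_retry(put_fn, object_id, output, **retry_kwargs)
        for object_id, output in zip(object_ids, outputs)
    ]
```

With this shape, the same helper could also back a user-facing put, which is the `ray.put` case mentioned above.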

@stephanie-wang stephanie-wang self-assigned this Jul 15, 2019
@ericl
Contributor

ericl commented Jul 25, 2019

Is this fixed?

@AmplabJenkins
Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/15628/

@AmplabJenkins
Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/15629/

@AmplabJenkins
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/15626/

@AmplabJenkins
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/15630/

@AmplabJenkins
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/15632/

@AmplabJenkins
Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-Perf-Integration-PRB/1799/

@AmplabJenkins
Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-Perf-Integration-PRB/1801/

@AmplabJenkins
Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-Perf-Integration-PRB/1802/

@AmplabJenkins
Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-Perf-Integration-PRB/1803/

@AmplabJenkins
Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-Perf-Integration-PRB/1805/

return np.zeros(10**7 + 2, dtype=np.uint8)

actor = LargeMemoryActor.remote()
with pytest.raises(ray.exceptions.RayActorError):
Contributor

Shouldn't this exception also be plasma.PlasmaStoreFull?

python/ray/tests/test_actor.py (two outdated review threads, resolved)
"so are are falling back to cloudpickle.".format(type(value)))
logger.warning(warning_message)
self.store_and_register(object_id, value)
def try_store_and_register():
Contributor

Can you make this a separate method on Worker instead of a closure?

logger.warning(warning_message)
self.store_and_register(object_id, value)

delay = ray_constants.DEFAULT_PUT_OBJECT_DELAY
Contributor

Can you add a comment here explaining what happens in the loop?
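
One way the commented loop could look, written as a sketch rather than the merged code. Only `DEFAULT_PUT_OBJECT_DELAY` is named in this PR; its value here, `DEFAULT_PUT_OBJECT_RETRIES`, `StoreFullError`, `SketchWorker`, and the doubling of the delay are all assumptions made for illustration. The store attempt is a separate method rather than a closure, as requested in the earlier review comment.

```python
import time

# Hypothetical values. The PR names ray_constants.DEFAULT_PUT_OBJECT_DELAY
# but this excerpt does not show its value or a retry-count constant.
DEFAULT_PUT_OBJECT_DELAY = 1
DEFAULT_PUT_OBJECT_RETRIES = 5


class StoreFullError(Exception):
    """Hypothetical stand-in for the object store's 'store full' error."""


class SketchWorker:
    """Illustrative worker; `store` is any object with put(object_id, value)."""

    def __init__(self, store, sleep=time.sleep):
        self.store = store
        self.sleep = sleep  # injectable so the loop can be tested without waiting

    def _try_store_and_register(self, object_id, value):
        # A separate method rather than a closure inside put_object.
        self.store.put(object_id, value)

    def put_object(self, object_id, value):
        delay = DEFAULT_PUT_OBJECT_DELAY
        # Try once, then retry up to DEFAULT_PUT_OBJECT_RETRIES more times.
        # A full store may free space as other objects are evicted, so we
        # wait between attempts, doubling the wait each time (an assumed
        # backoff policy). When no retries remain, re-raise so the caller
        # sees an exception instead of a hang.
        for retries_left in reversed(range(DEFAULT_PUT_OBJECT_RETRIES + 1)):
            try:
                self._try_store_and_register(object_id, value)
                return
            except StoreFullError:
                if retries_left == 0:
                    raise
                self.sleep(delay)
                delay *= 2
```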

@AmplabJenkins
Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/15658/

@AmplabJenkins
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/15665/

@AmplabJenkins
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/15687/

@AmplabJenkins
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/15692/

@richardliaw
Contributor Author

richardliaw commented Jul 27, 2019

Looks like fork_consistency fails quite often, including in other PRs, and the failures do not seem to be related to this change.

@richardliaw richardliaw merged commit 9c00616 into ray-project:master Jul 27, 2019
@richardliaw richardliaw deleted the hangmem branch July 27, 2019 08:20
edoakes pushed a commit to edoakes/ray that referenced this pull request Aug 9, 2019

6 participants