Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[xray] Handle duplicate actor creation notifications nicely #2043

Conversation

stephanie-wang
Copy link
Contributor

What do these changes do?

An actor creation notification may be processed twice by a raylet if the initial publication is slow. The check for duplicate actor notifications is too strict, causing the raylet to crash if a duplicate notification is received. This PR relaxes the check to just make sure that if a duplicate notification is received, the node manager IDs match.

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/5339/
Test FAILed.

Copy link
Collaborator

@robertnishihara robertnishihara left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good to me, what are the situations where a double notification can happen (without a failure)? Are actor notifications broadcast multiple times?

@robertnishihara
Copy link
Collaborator

jenkins retest this please

@stephanie-wang
Copy link
Contributor Author

The scenario where I've seen it happen so far is when a node manager receives an actor task that it doesn't have a location for yet, so it does a lookup. Then it receives the published notification along with the lookup response, so it gets a duplicate notification.

I think it's necessary to keep the lookup step, though, in case there's a node that joins the cluster after some actor notifications have already been published.

@robertnishihara
Copy link
Collaborator

Got it, thanks.

Copy link
Contributor

@elibol elibol left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/5384/
Test PASSed.

@robertnishihara robertnishihara merged commit ad48e47 into ray-project:master May 14, 2018
@robertnishihara robertnishihara deleted the duplicate-actor-notification branch May 14, 2018 21:26
alok added a commit to alok/ray that referenced this pull request May 14, 2018
* master:
  Create RemoteFunction class, remove FunctionProperties, simplify worker Python code. (ray-project#2052)
  Don't crash on duplicate actor notifications (ray-project#2043)
  Fixed attribute name in code example (ray-project#2054)
  [xray] Add Travis build for testing xray on Linux. (ray-project#2047)
  Added missing comma to code example (ray-project#2050)
  Use more CPUs for testMultipleWaitsAndGets. (ray-project#2051)
  use jobid_nil (ray-project#2044)
  Fix typo in tune. (ray-project#2046)
  Fix error in api.rst. (ray-project#2048)
  Improve shared_ptr usage (ray-project#2030)
alok added a commit to alok/ray that referenced this pull request May 15, 2018
* master:
  Create RemoteFunction class, remove FunctionProperties, simplify worker Python code. (ray-project#2052)
  Don't crash on duplicate actor notifications (ray-project#2043)
  Fixed attribute name in code example (ray-project#2054)
  [xray] Add Travis build for testing xray on Linux. (ray-project#2047)
  Added missing comma to code example (ray-project#2050)
  Use more CPUs for testMultipleWaitsAndGets. (ray-project#2051)
  use jobid_nil (ray-project#2044)
  Fix typo in tune. (ray-project#2046)
  Fix error in api.rst. (ray-project#2048)
  Improve shared_ptr usage (ray-project#2030)
alok added a commit to alok/ray that referenced this pull request May 15, 2018
* fix-a3c-torch:
  Fix shape error in conv nets
  Ensure that values are flat list
  rm all use of torch Variables
  Create RemoteFunction class, remove FunctionProperties, simplify worker Python code. (ray-project#2052)
  Don't crash on duplicate actor notifications (ray-project#2043)
  rm unnecessary Variable wrapper
  Fixed attribute name in code example (ray-project#2054)
  [xray] Add Travis build for testing xray on Linux. (ray-project#2047)
  Added missing comma to code example (ray-project#2050)
  Use more CPUs for testMultipleWaitsAndGets. (ray-project#2051)
  use jobid_nil (ray-project#2044)
  Fix typo in tune. (ray-project#2046)
  Fix error in api.rst. (ray-project#2048)
  Improve shared_ptr usage (ray-project#2030)
  replace deprecated function
  Fmt
  Fix shapes of tensors
  Rename argument name to out_size
  Use correct pytorch functions
  Use F.softmax instead of a pointless network layer
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants