-
Notifications
You must be signed in to change notification settings - Fork 6.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[xray] Handle duplicate actor creation notifications nicely #2043
[xray] Handle duplicate actor creation notifications nicely #2043
Conversation
fd4fde2
to
e136a32
Compare
Test FAILed. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks good to me, what are the situations where a double notification can happen (without a failure)? Are actor notifications broadcast multiple times?
jenkins retest this please |
The scenario where I've seen it happen so far is when a node manager receives an actor task that it doesn't have a location for yet, so it does a lookup. Then it receives the published notification along with the lookup response, so it gets a duplicate notification. I think it's necessary to keep the lookup step, though, in case there's a node that joins the cluster after some actor notifications have already been published. |
Got it, thanks. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM.
Test PASSed. |
* master: Create RemoteFunction class, remove FunctionProperties, simplify worker Python code. (ray-project#2052) Don't crash on duplicate actor notifications (ray-project#2043) Fixed attribute name in code example (ray-project#2054) [xray] Add Travis build for testing xray on Linux. (ray-project#2047) Added missing comma to code example (ray-project#2050) Use more CPUs for testMultipleWaitsAndGets. (ray-project#2051) use jobid_nil (ray-project#2044) Fix typo in tune. (ray-project#2046) Fix error in api.rst. (ray-project#2048) Improve shared_ptr usage (ray-project#2030)
* master: Create RemoteFunction class, remove FunctionProperties, simplify worker Python code. (ray-project#2052) Don't crash on duplicate actor notifications (ray-project#2043) Fixed attribute name in code example (ray-project#2054) [xray] Add Travis build for testing xray on Linux. (ray-project#2047) Added missing comma to code example (ray-project#2050) Use more CPUs for testMultipleWaitsAndGets. (ray-project#2051) use jobid_nil (ray-project#2044) Fix typo in tune. (ray-project#2046) Fix error in api.rst. (ray-project#2048) Improve shared_ptr usage (ray-project#2030)
* fix-a3c-torch: Fix shape error in conv nets Ensure that values are flat list rm all use of torch Variables Create RemoteFunction class, remove FunctionProperties, simplify worker Python code. (ray-project#2052) Don't crash on duplicate actor notifications (ray-project#2043) rm unnecessary Variable wrapper Fixed attribute name in code example (ray-project#2054) [xray] Add Travis build for testing xray on Linux. (ray-project#2047) Added missing comma to code example (ray-project#2050) Use more CPUs for testMultipleWaitsAndGets. (ray-project#2051) use jobid_nil (ray-project#2044) Fix typo in tune. (ray-project#2046) Fix error in api.rst. (ray-project#2048) Improve shared_ptr usage (ray-project#2030) replace deprecated function Fmt Fix shapes of tensors Rename argument name to out_size Use correct pytorch functions Use F.softmax instead of a pointless network layer
What do these changes do?
An actor creation notification may be processed twice by a raylet if the initial publication is slow. The check for duplicate actor notifications is too strict, causing the raylet to crash if a duplicate notification is received. This PR relaxes the check to just make sure that if a duplicate notification is received, the node manager IDs match.