
Fix actor garbage collection by breaking cyclic references #1064

Conversation

stephanie-wang
Contributor

#902 introduced a cyclic reference between an ActorHandle instance and its dictionary of ActorMethod instances. This replaces the backreference to the actor handle with a weak reference, so that the Python garbage collector can safely collect an ActorHandle.

Fixes #1060.
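For context, the cycle and the fix look roughly like this. This is a minimal sketch with illustrative names (ActorMethod, ActorHandle, _submit), not the actual ray.actor implementation:

import weakref


class ActorMethod(object):
    def __init__(self, actor_handle, method_name):
        # Keep only a weak reference back to the handle, so the
        # handle -> methods dict -> handle cycle is broken and the handle
        # can be reclaimed by reference counting alone.
        self._actor_ref = weakref.ref(actor_handle)
        self._method_name = method_name

    def remote(self, *args):
        actor = self._actor_ref()
        if actor is None:
            raise RuntimeError("Lost reference to the actor handle.")
        return actor._submit(self._method_name, args)


class ActorHandle(object):
    def __init__(self, method_names):
        # Previously each ActorMethod held a strong reference to the handle,
        # which created the cycle that kept the handle alive until the cyclic
        # garbage collector happened to run.
        self._methods = {
            name: ActorMethod(self, name) for name in method_names}

    def _submit(self, method_name, args):
        print("submitting {}{}".format(method_name, args))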

@ericl
Contributor

Looks good. Could we add the example in the issue as a test (or anything similar)?

@robertnishihara
Collaborator

I added a test for #1060 (it doesn't seem like it's quite fixed but I'll look into it more) as well as for #783 (which is reintroduced here).
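A test for this could look roughly like the sketch below (hypothetical code, not the exact test that was added): create an actor, record its process ID, drop the only handle, and check that the actor process exits, using the wait_for_pid_to_exit helper discussed further down.

import os

import ray


def test_actor_deletion():
    ray.init()

    @ray.remote
    class Actor(object):
        def getpid(self):
            return os.getpid()

    actor = Actor.remote()
    pid = ray.get(actor.getpid.remote())

    # Dropping the only handle should let the ActorHandle be garbage
    # collected, which in turn should terminate the actor's worker process.
    del actor
    wait_for_pid_to_exit(pid)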

@@ -118,6 +118,7 @@ def _pid_alive(pid):
     """
     try:
         os.kill(pid, 0)
+        return True
@robertnishihara
Collaborator
Oct 3, 2017

The lack of this return statement was a bug: _pid_alive implicitly returned None, so any call to wait_for_pid_to_exit succeeded immediately. I'm fixing it here so it can be used in the actor test.
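For reference, with the missing return in place the helper pair behaves roughly as follows (an illustrative sketch, not the verbatim test utilities):

import os
import time


def _pid_alive(pid):
    """Return True if a process with the given pid appears to be running."""
    try:
        # Signal 0 sends nothing; it only checks whether the process exists
        # and raises OSError if it does not.
        os.kill(pid, 0)
        return True  # Without this line the function returned None (falsy).
    except OSError:
        return False


def wait_for_pid_to_exit(pid, timeout=20):
    start_time = time.time()
    while time.time() - start_time < timeout:
        if not _pid_alive(pid):
            return
        time.sleep(0.1)
    raise Exception("Timed out waiting for process {} to exit.".format(pid))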

@AmplabJenkins

Merged build finished. Test FAILed.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/2030/

@robertnishihara
Collaborator

robertnishihara commented Oct 3, 2017

OK, there was a somewhat subtle bug. When a worker or actor dies, that potentially frees up CPU resources, so we need to attempt to dispatch more tasks.

So I added a call to dispatch_all_tasks from handle_actor_worker_disconnect and from handle_worker_removed.
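The scheduling logic can be illustrated with a small self-contained toy; the real code lives in Ray's C local scheduler, and the names and bodies below are illustrative only:

class SchedulerState(object):
    def __init__(self, num_cpus):
        self.available_cpus = num_cpus
        self.pending_tasks = []  # Tasks queued while waiting for resources.


def dispatch_all_tasks(state):
    # Hand out queued tasks for as long as there are free CPUs.
    while state.pending_tasks and state.available_cpus > 0:
        task = state.pending_tasks.pop(0)
        state.available_cpus -= 1
        print("dispatching", task)


def handle_worker_removed(state, worker_cpus=1):
    # A worker or actor that dies returns its CPUs to the pool ...
    state.available_cpus += worker_cpus
    # ... so we have to try dispatching again right away; otherwise queued
    # tasks would sit idle until some unrelated event triggered a dispatch.
    dispatch_all_tasks(state)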

@AmplabJenkins

Merged build finished. Test FAILed.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/2031/

@AmplabJenkins

Merged build finished. Test FAILed.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/2032/

@AmplabJenkins

Merged build finished. Test FAILed.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/2033/

@AmplabJenkins

Merged build finished. Test FAILed.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/2040/

@robertnishihara robertnishihara force-pushed the fix-actor-garbage-collection branch from 5ed46cd to f2c02d8 Compare October 3, 2017 22:28
@robertnishihara
Collaborator

Just rebased; hopefully that will fix the Jenkins tests.

@AmplabJenkins

Merged build finished. Test FAILed.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/2043/

@AmplabJenkins

Merged build finished. Test FAILed.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/2045/

@robertnishihara
Collaborator

Somehow this change is causing the Jenkins many_drivers_test.py to fail. I don't see how it could be causing that, but I'm still looking into it.

@AmplabJenkins

Merged build finished. Test PASSed.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/2054/

@@ -44,7 +44,7 @@ def driver(redis_address, driver_index):
     for i in range(driver_index - max_concurrent_drivers + 1):
         _wait_for_event("DRIVER_{}_DONE".format(i), redis_address)
 
-    def try_to_create_actor(actor_class, timeout=100):
+    def try_to_create_actor(actor_class, timeout=500):
Collaborator

For some reason, this test seems to fail unless we increase the timeout here. I still don't see why this PR affects this test.

In a follow-up PR, I'll try releasing GPU resources as soon as an actor exits and see if that speeds up this test.

Another possibility is that the monitor is the bottleneck because it has to process so many exiting drivers.
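For illustration, a helper with that signature might retry actor creation until the deadline passes, along these lines (hypothetical sketch; the actual helper in the test may differ):

import time


def try_to_create_actor(actor_class, timeout=500):
    # Keep retrying until creation succeeds or `timeout` seconds elapse.
    # Creation can fail transiently while resources (e.g. GPUs) held by
    # recently exited actors have not yet been released.
    start_time = time.time()
    while time.time() - start_time < timeout:
        try:
            return actor_class.remote()
        except Exception:
            time.sleep(0.1)
    raise Exception("Timed out while trying to create an actor.")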

@robertnishihara robertnishihara merged commit aebe9f9 into ray-project:master Oct 5, 2017
@robertnishihara robertnishihara deleted the fix-actor-garbage-collection branch October 5, 2017 07:55