
Fix actor garbage collection by breaking cyclic references #1064

Conversation

stephanie-wang
Contributor

#902 introduced a cyclic reference between an ActorHandle instance and its dictionary of ActorMethod instances. This replaces the backreference to the actor handle with a weak reference, so that the Python garbage collector can safely collect an ActorHandle.

Fixes #1060.
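For context, the cycle and the fix look roughly like this. This is a minimal sketch with illustrative names (ActorMethod, ActorHandle, _submit), not the actual ray.actor implementation:

import weakref


class ActorMethod(object):
    def __init__(self, actor_handle, method_name):
        # Keep only a weak reference back to the handle, so the
        # handle -> methods dict -> handle cycle is broken and the handle
        # can be reclaimed by reference counting alone.
        self._actor_ref = weakref.ref(actor_handle)
        self._method_name = method_name

    def remote(self, *args):
        actor = self._actor_ref()
        if actor is None:
            raise RuntimeError("Lost reference to the actor handle.")
        return actor._submit(self._method_name, args)


class ActorHandle(object):
    def __init__(self, method_names):
        # Previously each ActorMethod held a strong reference to the handle,
        # which created the cycle that kept the handle alive until the cyclic
        # garbage collector happened to run.
        self._methods = {
            name: ActorMethod(self, name) for name in method_names}

    def _submit(self, method_name, args):
        print("submitting {}{}".format(method_name, args))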

@ericl
Contributor

Looks good. Could we add the example in the issue as a test (or anything similar)?

@robertnishihara
Collaborator

I added a test for #1060 (it doesn't seem like it's quite fixed but I'll look into it more) as well as for #783 (which is reintroduced here).
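A test for this could look roughly like the sketch below (hypothetical code, not the exact test that was added): create an actor, record its process ID, drop the only handle, and check that the actor process exits, using the wait_for_pid_to_exit helper discussed further down.

import os

import ray


def test_actor_deletion():
    ray.init()

    @ray.remote
    class Actor(object):
        def getpid(self):
            return os.getpid()

    actor = Actor.remote()
    pid = ray.get(actor.getpid.remote())

    # Dropping the only handle should let the ActorHandle be garbage
    # collected, which in turn should terminate the actor's worker process.
    del actor
    wait_for_pid_to_exit(pid)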

@@ -118,6 +118,7 @@ def _pid_alive(pid):
     """
     try:
         os.kill(pid, 0)
+        return True
@robertnishihara
Collaborator
Oct 3, 2017

The lack of this return statement was a bug: _pid_alive implicitly returned None, so any call to wait_for_pid_to_exit succeeded immediately. I'm fixing it here so it can be used in the actor test.
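For reference, with the missing return in place the helper pair behaves roughly as follows (an illustrative sketch, not the verbatim test utilities):

import os
import time


def _pid_alive(pid):
    """Return True if a process with the given pid appears to be running."""
    try:
        # Signal 0 sends nothing; it only checks whether the process exists
        # and raises OSError if it does not.
        os.kill(pid, 0)
        return True  # Without this line the function returned None (falsy).
    except OSError:
        return False


def wait_for_pid_to_exit(pid, timeout=20):
    start_time = time.time()
    while time.time() - start_time < timeout:
        if not _pid_alive(pid):
            return
        time.sleep(0.1)
    raise Exception("Timed out waiting for process {} to exit.".format(pid))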

@AmplabJenkins

Merged build finished. Test FAILed.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/2030/

@robertnishihara
Collaborator

robertnishihara commented Oct 3, 2017

OK, there was a somewhat subtle bug. When a worker or actor dies, that potentially frees up CPU resources, so we need to attempt to dispatch more tasks.

So I added a call to dispatch_all_tasks from handle_actor_worker_disconnect and from handle_worker_removed.
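The scheduling logic can be illustrated with a small self-contained toy; the real code lives in Ray's C local scheduler, and the names and bodies below are illustrative only:

class SchedulerState(object):
    def __init__(self, num_cpus):
        self.available_cpus = num_cpus
        self.pending_tasks = []  # Tasks queued while waiting for resources.


def dispatch_all_tasks(state):
    # Hand out queued tasks for as long as there are free CPUs.
    while state.pending_tasks and state.available_cpus > 0:
        task = state.pending_tasks.pop(0)
        state.available_cpus -= 1
        print("dispatching", task)


def handle_worker_removed(state, worker_cpus=1):
    # A worker or actor that dies returns its CPUs to the pool ...
    state.available_cpus += worker_cpus
    # ... so we have to try dispatching again right away; otherwise queued
    # tasks would sit idle until some unrelated event triggered a dispatch.
    dispatch_all_tasks(state)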

@AmplabJenkins

Merged build finished. Test FAILed.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/2031/

@AmplabJenkins

Merged build finished. Test FAILed.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/2032/

@AmplabJenkins

Merged build finished. Test FAILed.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/2033/

@AmplabJenkins

Merged build finished. Test FAILed.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/2040/

@robertnishihara robertnishihara force-pushed the fix-actor-garbage-collection branch from 5ed46cd to f2c02d8 Compare October 3, 2017 22:28
@robertnishihara
Collaborator

Just rebased; hopefully that will fix the Jenkins tests.

@AmplabJenkins

Merged build finished. Test FAILed.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/2043/

@AmplabJenkins

Merged build finished. Test FAILed.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/2045/

@robertnishihara
Collaborator

Somehow this change is causing the Jenkins many_drivers_test.py to fail. I don't see how it could be causing that, but I'm still looking into it.

@AmplabJenkins

Merged build finished. Test PASSed.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/2054/

@@ -44,7 +44,7 @@ def driver(redis_address, driver_index):
     for i in range(driver_index - max_concurrent_drivers + 1):
         _wait_for_event("DRIVER_{}_DONE".format(i), redis_address)
 
-    def try_to_create_actor(actor_class, timeout=100):
+    def try_to_create_actor(actor_class, timeout=500):
Collaborator

For some reason, this test seems to fail unless we increase the timeout here. I still don't see why this PR affects this test.

In a follow-up PR, I'll try releasing GPU resources as soon as an actor exits and see if that speeds up this test.

Another possibility is that the monitor is the bottleneck because it has to process so many exiting drivers.
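For illustration, a helper with that signature might retry actor creation until the deadline passes, along these lines (hypothetical sketch; the actual helper in the test may differ):

import time


def try_to_create_actor(actor_class, timeout=500):
    # Keep retrying until creation succeeds or `timeout` seconds elapse.
    # Creation can fail transiently while resources (e.g. GPUs) held by
    # recently exited actors have not yet been released.
    start_time = time.time()
    while time.time() - start_time < timeout:
        try:
            return actor_class.remote()
        except Exception:
            time.sleep(0.1)
    raise Exception("Timed out while trying to create an actor.")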

@robertnishihara robertnishihara merged commit aebe9f9 into ray-project:master Oct 5, 2017
@robertnishihara robertnishihara deleted the fix-actor-garbage-collection branch October 5, 2017 07:55