Actor dummy object garbage collection #3593

stephanie-wang · 2018-12-21T00:53:39Z

What do these changes do?

This minimizes the number of pinned dummy objects per actor by only pinning the ones necessary to execute new actor tasks. This implements the solution described in #3308. Note that this does not solve the issue of garbage-collecting actor handles that are no longer needed, or the issue of general actor garbage collection.

Related issue number

Closes #3308.

AmplabJenkins · 2018-12-21T01:37:49Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/10236/
Test FAILed.

ericl · 2018-12-21T04:31:45Z

How do we test the map size doesn't grow indefinitely? I don't think the existing tests cover this. Run python code and check the C++ stats?

raulchen

I took a brief look and left a few comments. I'll continue reviewing tomorrow.

Also, there are a bunch of unrelated changes (converting UniqueID::nil() to UniqueID()). Are they unintentional?

python/ray/actor.py

raulchen · 2018-12-21T11:58:11Z

src/ray/raylet/node_manager.cc

+          // TODO(swang): We use a copy of the task so that for actor tasks, we
+          // can keep the original execution dependencies in the copy in the
+          // scheduling queue, but ideally the task in the lineage cache would
+          // match the queued task exactly.


Why do we need to keep the original execution dependencies in the copy in the scheduling queue now? And why was this not needed before this PR?

Yeah this is a little unfortunate...we need access to the original execution dependency because that is the object that gets released in the actor's frontier once the task completes. I thought about doing the dummy object accounting when the task is assigned, but you don't know yet whether the task succeeded.

Oh hmm actually just realized that we can separate the logic so that we release the previous object when the task is assigned and add the new object when the task finishes. Then I can get rid of this code. :)
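The split described in the reply above can be sketched in a few lines. This is only an illustration in Python, not the raylet code: the class `ActorFrontier`, its method names, and the string object IDs are all hypothetical, and it models a single handle so that no reference counting is needed.

```python
class ActorFrontier:
    """Hypothetical, single-handle model of an actor's pinned dummy objects."""

    def __init__(self, creation_dummy_object):
        # The actor creation task's dummy object is the first one pinned.
        self.pinned = {creation_dummy_object}
        self.released = []

    def on_task_assigned(self, execution_dependency):
        # Once the task is handed to the actor, nothing else on this handle
        # waits on its execution dependency, so that object can be unpinned.
        self.pinned.discard(execution_dependency)
        self.released.append(execution_dependency)

    def on_task_finished(self, dummy_object):
        # Only a task that actually finished extends the frontier.
        self.pinned.add(dummy_object)


frontier = ActorFrontier("dummy-0")
for i in range(1, 4):
    frontier.on_task_assigned(f"dummy-{i - 1}")  # release the previous object
    frontier.on_task_finished(f"dummy-{i}")      # pin the new object
```

In this sketch the pinned set never holds more than one object for the handle, and a task that is assigned but never finishes adds nothing to it.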

raulchen · 2018-12-21T12:17:57Z

src/ray/raylet/node_manager.cc

+  // until this first task is submitted.
+  for (auto &new_handle_id : task.GetTaskSpecification().NewActorHandles()) {
+    // An actor creation task is the first task, so it cannot have new handles.
+    RAY_CHECK(task.GetTaskSpecification().IsActorTask());


nit, this might be cleaner:
RAY_CHECK(!(task.GetTaskSpecification().IsActorCreationTask() && task.GetTaskSpecification().NewActorHandles().size() > 0))

stephanie-wang · 2018-12-27T22:19:04Z

@ericl, yeah we don't have a way to automatically test this right now. Can you try it out on one of the RLlib workloads that would be affected?

AmplabJenkins · 2018-12-27T23:41:41Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/10464/
Test FAILed.

AmplabJenkins · 2018-12-27T23:46:48Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/10465/
Test FAILed.

AmplabJenkins · 2018-12-28T22:24:29Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/10485/
Test FAILed.

AmplabJenkins · 2018-12-28T22:29:52Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/10486/
Test FAILed.

robertnishihara

Thanks @stephanie-wang, this solution seems pretty nice.

robertnishihara · 2018-12-28T23:21:11Z

src/ray/raylet/actor_registration.cc

+                                  const ObjectID &execution_dependency) {
+  if (frontier_.find(handle_id) == frontier_.end()) {
+    frontier_[handle_id] = FrontierLeaf{
+        .task_counter = 0, .execution_dependency = execution_dependency,


I can't remember the pros/cons, but I don't think we're using this form of initialization anywhere else in the codebase currently.

Hmm, I wasn't aware of any pros/cons. Do you have any suggestions in mind?

Ok, when we converted from C to C++, e.g., in #321, we had to make some changes like

https://github.com/ray-project/ray/pull/321/files#diff-450748b4523710897a16506dabc79950L573

and

https://github.com/ray-project/ray/pull/321/files#diff-303e99e77aad9f39624c3769dd615ea0L220

which switched away from using this syntax. I thought it wasn't valid C++, though if it's compiling in your PR maybe I'm remembering it incorrectly. I'm sure @mehrdadn knows the answer.

As for suggestions, I'd defer to other people here. If it's valid C++11 then it's probably fine especially since it's super simple. I guess the alternative would be to manually set the fields after the declaration or to define a simple constructor.

These are called designated initializers in C. They were extensions I believe until C99? I had no idea but apparently they've finally made it into C++20. I'd avoid using them given they're not valid C++11.

Thanks @mehrdadn!

Ok, in that case let's either do

frontier_[handle_id] = FrontierLeaf();
frontier_[handle_id].task_counter = 0;
frontier_[handle_id].execution_dependency = execution_dependency;

or define a constructor.

If you want to initialize fields externally I'd get a reference and avoid looking up the same object every time. If you make a constructor that should be fine too, though semantically it'd be creating a new object to move onto the old object rather than modifying it or constructing it in place. (Note that using operator[] to index into a map automatically calls the default constructor, so if you don't want to call that, you'd want to use insert.)

Okay, thanks!

robertnishihara · 2018-12-29T02:08:16Z

src/ray/raylet/node_manager.cc

+  RAY_CHECK(actor_entry != actor_registry_.end());
+  // Extend the actor's frontier to include the executed task.
+  auto dummy_object = task.GetTaskSpecification().ActorDummyObject();
+  ObjectID object_to_release =


same with dummy_object above

test/actor_test.py

robertnishihara · 2018-12-29T06:08:45Z

src/ray/raylet/node_manager.cc

+  ActorHandleID actor_handle_id;
+  if (task.GetTaskSpecification().IsActorCreationTask()) {
+    actor_id = task.GetTaskSpecification().ActorCreationId();
+    actor_handle_id = ActorHandleID();


Maybe clearer to use ActorHandleID::nil() here. Alternatively could remove the line entirely.

src/ray/raylet/actor_registration.h

robertnishihara · 2018-12-29T06:30:42Z

python/ray/actor.py

+        # NOTE(swang): If the new actor handle fails to be used (e.g., due
+        # to a failure to register a named actor), then this may cause a
+        # memory leak in the backend.
+        self._ray_new_actor_handles.append(actor_handle_id)


In the pickling case, we set actor_handle_id = self._ray_actor_handle_id. Why not use a nil ID (or some fixed ID like that) instead?

Hmm actually I guess we could just set a random actor handle ID in Python. We don't want to use nil because that represents the original actor handle.

robertnishihara · 2018-12-29T06:31:50Z

python/ray/actor.py

+        # not release the cursor for any new handles until the first task for
+        # each of the new handles is submitted.
+        # NOTE(swang): If the new actor handle fails to be used (e.g., due
+        # to a failure to register a named actor), then this may cause a


This could happen even without any failure to register a named actor, right? E.g.,

a = Actor.remote()
@ray.remote
def f(a):
    return f.remote(a)

What precisely gets leaked? The dummy object ID that the forked ID depends on?

Hmm yeah, I guess the more general problem is GC for actor handles.

src/ray/raylet/actor_registration.h

AmplabJenkins · 2019-01-03T02:31:36Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/10558/
Test FAILed.

robertnishihara · 2019-01-03T07:24:46Z

@stephanie-wang this looks good to me, though it looks like Travis is failing.

AmplabJenkins · 2019-01-04T04:39:48Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/10578/
Test FAILed.

raulchen · 2019-01-04T08:17:51Z

src/ray/gcs/format/gcs.fbs

+  // If this is an actor task, then this will be populated with all of the new
+  // actor handles that were forked from this handle since the last task on
+  // this handle was submitted.
+  new_actor_handles: [string];


The Java CI failure happens because when we change the TaskInfo definition, the corresponding Java file needs to be changed as well.

Unfortunately, generating this file isn't automatic for now, because the generated file lacks a certain API and needs a manual patch. I proposed adding the required API to flatbuffers, but they're not willing to do that (google/flatbuffers#5092).

For this PR, you can apply this patch to fix it: https://gist.github.com/raulchen/1fcecac74ddb809b836ab9bbf9f9e178

We'll add a script to simplify this process.

The TL;DR explanation of this issue is: we define some ObjectID fields as string in fbs, but these fields are actually byte arrays. For C++ and Python 2, strings and byte arrays are equivalent, but for Java and Python 3 they're different.

The Flatbuffers Java API decodes the bytes as UTF-8 if a field is defined as string, but the content is not actually UTF-8-decodable. The Python API doesn't do that for now, but it may in the future, because Flatbuffers prefers to distinguish strings from byte arrays. If so, our Python code will face the same problem.

An alternative solution is to store hex strings in fbs, but that might require substantial code changes.
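The string-versus-bytes problem is easy to reproduce in Python 3. The snippet below is a minimal sketch, not Ray code: the 20-byte ID length and the helper name `try_decode_as_string` are assumptions made for illustration. It shows that arbitrary ID bytes can fail UTF-8 decoding, while a hex string round-trips losslessly.

```python
# A made-up 20-byte ID whose first byte (0xff) can never appear in valid UTF-8.
object_id = bytes([0xFF, 0xFE]) + bytes(18)


def try_decode_as_string(raw):
    """Return the UTF-8 decoding of raw, or None if the bytes are not valid UTF-8."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return None


# Treating the ID as a string fails, as a decoder for a string field would.
decoded = try_decode_as_string(object_id)

# Storing the ID as a hex string avoids the problem at the cost of doubling its size.
hex_id = object_id.hex()
restored = bytes.fromhex(hex_id)
```

Here `decoded` is `None`, `hex_id` is 40 characters long, and `restored` equals the original bytes.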

Ah, thanks!

raulchen · 2019-01-04T10:49:32Z

src/ray/raylet/node_manager.cc

+        GetExpectedTaskCounter(actor_registry_, spec.ActorId(), spec.ActorHandleId());
+    RAY_CHECK(spec.ActorCounter() == expected_task_counter)
+        << "Expected actor counter: " << expected_task_counter
+        << ", got: " << spec.ActorCounter();


Also print the task id here? I found it very useful for debugging.

Sure, thank you!

raulchen · 2019-01-04T11:32:43Z

src/ray/raylet/actor_registration.h

+  /// Once all handles have released a dummy object, it will be removed from
+  /// this map. This object is safe to evict, since no handle will submit
+  /// another method dependent on that object.
+  std::unordered_map<ObjectID, int64_t> dummy_objects_;


It took me quite a while to understand the above comment.
Based on my understanding, I'm trying to summarize the comment and make it easier for future readers to understand.

This map is used to track all the unreleased dummy objects for this actor. The map key is the dummy object ID, and the map value is the number of actor handles that depend on this dummy object. When the map value decreases to 0, the dummy object is safe to release from the object manager.

An actor handle depends on a dummy object when its next unfinished task depends on the dummy object. For a given dummy object (say D), there could be 2 types of such actor handles:

The actor handle (say H) on which D's creating task (say T) was submitted. If T's next task hasn't finished yet, H still depends on D.

Any handles that were forked from H after T finished, and before T's next task finishes. Such handles depend on D until their first tasks finish.
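The counting scheme summarized above can be modeled with a small reference-count map. This is a hedged Python sketch rather than the C++ in actor_registration.h: the class `DummyObjectCounts` and its methods are hypothetical names, and plain strings stand in for object IDs.

```python
from collections import Counter


class DummyObjectCounts:
    """Hypothetical model: dummy object ID -> number of dependent handles."""

    def __init__(self):
        self._counts = Counter()

    def add_dependent(self, object_id):
        # Called when a handle (original or forked) starts depending on the object.
        self._counts[object_id] += 1

    def release(self, object_id):
        """Drop one dependent handle; return True if the object may now be evicted."""
        if object_id not in self._counts:
            raise KeyError(f"{object_id} is not tracked")
        self._counts[object_id] -= 1
        if self._counts[object_id] == 0:
            del self._counts[object_id]
            return True
        return False

    def __len__(self):
        return len(self._counts)


counts = DummyObjectCounts()
counts.add_dependent("D")                # handle H submitted the task that created D
counts.add_dependent("D")                # a handle forked from H also depends on D
released_by_h = counts.release("D")      # H's next task finished
released_by_fork = counts.release("D")   # the forked handle's first task finished
```

Only the second release returns `True`, after which the map is empty, which is the property that keeps it from growing without bound.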

Thanks! I agree that this is clearer.

I can't quite understand this. Since all calls on an actor form a single line in the timeline, shouldn't all the handles for the same actor depend on the same latest dummy object, instead of each handle depending on its own last task's dummy object?

Hmm I'm not sure exactly what you're asking, but the dummy objects represent the last task that was known to execute on an actor. So when a task is submitted, it depends on the last task submitted on the same handle. When a task is executed, we might update that dependency to reflect the last task that was actually executed on the actor, which may have come from a different handle.

Then what I mean is: why should we keep a dependency for each ActorHandle of the same actor instead of just keeping the latest one? Since we always update the ActorHandle dependency to the latest one when we execute the actor task, what is the point of keeping a dependency for each ActorHandle when submitting the task?

I think I understand your question now. We keep the dependency for each ActorHandle so that we know which tasks have executed so far and which task from each handle can execute next. The former is mostly important for failure scenarios (e.g., making sure you don't re-execute the same task twice on an actor), and the latter comes up during normal execution when there are multiple tasks from the same handle that could execute.

Both of these things don't strictly need the "dependency"; you could accomplish the same thing with task counters for each handle. However, it's convenient to use dependencies since this is also how we determine whether non-actor tasks can be executed.
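The task-counter alternative mentioned above could look roughly like this. It is a sketch under assumptions, not the raylet's logic: the function names, the three status strings, and the convention that counters start at 0 are all invented for illustration.

```python
def classify_task(expected_counters, handle_id, task_counter):
    """Compare a task's counter with the next counter expected from its handle."""
    expected = expected_counters.get(handle_id, 0)
    if task_counter < expected:
        return "duplicate"   # already executed, e.g. resubmitted after a failure
    if task_counter == expected:
        return "runnable"    # the next task from this handle
    return "waiting"         # an earlier task from this handle must run first


def mark_executed(expected_counters, handle_id):
    """Advance the handle's expected counter after one of its tasks executes."""
    expected_counters[handle_id] = expected_counters.get(handle_id, 0) + 1


expected_counters = {}
first = classify_task(expected_counters, "handle-1", 0)
mark_executed(expected_counters, "handle-1")
replayed = classify_task(expected_counters, "handle-1", 0)
too_early = classify_task(expected_counters, "handle-1", 2)
```

Counters are kept per handle, so ordering is enforced within each handle independently; nothing in this sketch orders tasks across handles.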

That is to say, when re-executing an actor's method sequence after a failure, we can only guarantee the call order within one ActorHandle, but not the global order (across all handles) relative to the previous actor method sequence?

stephanie-wang · 2019-01-05T04:30:51Z

Thanks for the comments, @raulchen! I tried to address all of them, so please let me know what you think. I tried for a bit to figure out how to implement the same logic that I did in Python in the Java client, but wasn't quite sure how to proceed. Maybe you or @guoyuhong could try it in a separate PR? I'm also happy to do it, but just need some pointers on where I should be looking.

AmplabJenkins · 2019-01-05T06:22:24Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/10604/
Test FAILed.

raulchen · 2019-01-05T07:38:26Z

@stephanie-wang Java part looks good. We'll implement the logic in a separate PR later.

AmplabJenkins · 2019-01-08T06:03:14Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/10662/
Test PASSed.

stephanie-wang added 14 commits December 17, 2018 15:33

Convert UniqueID::nil() to a constructor

369d73d

Cleanup actor handle pickling code

2ce3ffc

Add new actor handles to the task spec

21aa182

Pass in new actor handles

8ba002a

Add new handles to the actor registration

ae2a3be

Regression test for actor handle forking and GC

e21f2b7

lint and doc

dc3d750

Handle pickled actor handles in the backend and some refactoring

a7b5164

Add regression test for dummy object GC and pickled actor handles

c3d521f

Check for duplicate actor tasks on submission

17cf5eb

Regression test for forking twice, fix failed named actor leak

7c85f74

Fix bug for forking twice

3da85e5

Merge remote-tracking branch 'upstream/master' into actor-dummy-objec…

8e001b4

…t-gc

lint

81b811a

stephanie-wang requested review from robertnishihara, ericl and raulchen December 21, 2018 00:53

raulchen reviewed Dec 21, 2018

ericl added the stability-blocker label Dec 22, 2018

stephanie-wang added 4 commits December 27, 2018 14:28

Revert "Fix bug for forking twice"

9ce972b

This reverts commit 3da85e5.

Add new actor handles when task is assigned, not finished

061ac11

Merge remote-tracking branch 'upstream/master' into actor-dummy-objec…

453b39d

…t-gc

Remove comment

0eaf057

remove UniqueID()

374a651

stephanie-wang force-pushed the actor-dummy-object-gc branch from cc7ad15 to 374a651 Compare December 28, 2018 22:09

robertnishihara reviewed Dec 29, 2018

Updates

609420a

robertnishihara approved these changes Jan 3, 2019

update

4d47cc4

raulchen reviewed Jan 4, 2019

stephanie-wang added 4 commits January 4, 2019 19:19

fix

d908d08

Merge remote-tracking branch 'upstream/master' into actor-dummy-objec…

ae68411

…t-gc

fix java

441415e

fixes

8d8913d

stephanie-wang added 2 commits January 7, 2019 19:04

Merge remote-tracking branch 'upstream/master' into actor-dummy-objec…

94118fb

…t-gc

fix

4059cce

stephanie-wang merged commit 04f31db into ray-project:master Jan 9, 2019

stephanie-wang deleted the actor-dummy-object-gc branch January 9, 2019 18:37

jovany-wang mentioned this pull request Jan 22, 2019

Implement actor dummy object gc in java #3822

Merged

raulchen mentioned this pull request Jun 26, 2019

[Core worker] Serialize ActorHandle in core worker. Make ActorHandle thread safe. #5034

Merged

1 task

mehrdadn commented Jan 3, 2019 (edited)