
Fix bug in actor task dispatch. #552

Merged: 2 commits merged into ray-project:master from actorfix on May 16, 2017
Conversation

@robertnishihara (Collaborator) commented May 16, 2017

Consider a script which (A) creates an actor and (B) submits a task to that actor.

(A) will send a message, call it message (1), via Redis to the local scheduler responsible for the actor. When message (1) arrives, the local scheduler will create a new worker to run the actor.

(B) will send a message, call it message (2), to the same local scheduler. When message (2) arrives, the local scheduler will queue the task and potentially give it to the newly created worker to execute.

In the rare situation where message (2) arrives before message (1), we cannot give the task to the worker yet because the worker hasn't been created. However, it looks like we were trying to do so anyway.

Now that I think about it, another fix would be to change dispatch_actor_task to return without doing anything if message (1) has not arrived yet, instead of failing. UPDATE: I changed it to this approach.
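
For illustration, here is a minimal, self-contained sketch of that approach. Only the names dispatch_actor_task and actor_mapping come from this thread; the types and everything else are simplified stand-ins invented for the example, not the actual patch:

    // Sketch only: simplified stand-ins for Ray's internal types.
    #include <unordered_map>

    typedef int ActorID;

    struct ActorMapEntry {
      int local_scheduler_id;
    };

    struct LocalSchedulerState {
      // Actors this local scheduler knows about, keyed by actor ID.
      std::unordered_map<ActorID, ActorMapEntry> actor_mapping;
    };

    // Try to hand queued tasks for actor_id to the actor's worker. If
    // message (1) (the actor-creation notification) has not been processed
    // yet, there is no entry in actor_mapping, so return without doing
    // anything; the task stays queued and is dispatched once the
    // notification arrives.
    void dispatch_actor_task(LocalSchedulerState *state, ActorID actor_id) {
      if (state->actor_mapping.count(actor_id) == 0) {
        return;  // Message (2) arrived before message (1); not an error.
      }
      // ... give the queued task to the actor's worker ...
    }

The review discussion below refers to this hunk from the patch: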

      /* This means that an actor has been assigned to this local scheduler, and a
       * task for that actor has been received by this local scheduler, but this
       * local scheduler has not yet processed the notification about the actor
       * creation. This may be possible though should be very uncommon. If it does
       * happen, it's ok. */
      DCHECK(DBClientID_equal(state->actor_mapping[actor_id].local_scheduler_id,
                              get_db_client_id(state->db)));
    } else {
      LOG_INFO(
          "handle_actor_task_scheduled called on local scheduler but the "
          "corresponding actor_map_entry is not present. This should be rare.");
@robertnishihara (Collaborator, Author):

This might look funny, but I just moved the comment from the if to the else block. I think it was in the wrong place.

@stephanie-wang stephanie-wang merged commit 9018dff into ray-project:master May 16, 2017
@robertnishihara robertnishihara deleted the actorfix branch May 16, 2017 20:16
@robertnishihara (Collaborator, Author)

@stephanie-wang it just occurred to me that this bug never showed up before the Redis sharding PR because the two messages went through the same Redis server and so arrived in the order that they were issued. However, now that the two messages can go through different Redis shards, out-of-order delivery is more likely.

I vaguely remember that there are other places (I think in the local scheduler code) where we rely on the fact that certain Redis commands happen in order. This is a class of bugs that we'll have to be careful about.
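
To make that concrete, here is a toy, self-contained illustration (not Ray code) of why sharding breaks the ordering assumption: each shard delivers its own messages in order, but nothing orders deliveries across shards.

    #include <iostream>
    #include <queue>
    #include <string>

    int main() {
      // Two shards, each with its own in-order delivery queue.
      std::queue<std::string> shard_a;  // receives message (1)
      std::queue<std::string> shard_b;  // receives message (2)

      // The driver issues (1) then (2), but they land on different shards.
      shard_a.push("(1) actor created");
      shard_b.push("(2) task submitted to actor");

      // If the local scheduler happens to process shard_b first, it sees the
      // task before the actor-creation notification: the situation fixed
      // above. With a single Redis server, both messages would have shared
      // one queue and arrived in issue order.
      std::cout << shard_b.front() << std::endl;
      std::cout << shard_a.front() << std::endl;
      return 0;
    }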

@stephanie-wang (Contributor)

Yes, we'll have to go through all the Redis messages to make sure. One case I can think of off the top of my head is that the result_table_add call might be processed after a reconstruction call (currently a fatal error).
