
Fix bug in actor task dispatch. #552

Merged: 2 commits merged into ray-project:master from actorfix on May 16, 2017
Conversation

@robertnishihara (Collaborator) commented May 16, 2017

Consider a script which (A) creates an actor and (B) submits a task to that actor.

(A) will send a message, call it message (1), via Redis to the local scheduler responsible for the actor. When message (1) arrives, the local scheduler will create a new worker to run the actor.

(B) will send a message, call it message (2), to the same local scheduler. When message (2) arrives, the local scheduler will queue the task and potentially give it to the newly created worker to execute.

In the rare situation where message (2) arrives before message (1), we cannot give the task to the worker yet because the worker hasn't been created. However, it looks like we were trying to do so anyway.

Now that I think about it, another fix would be to change dispatch_actor_task to return without doing anything if message (1) has not arrived yet, instead of failing. UPDATE: I changed it to this approach.
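
For illustration, here is a minimal, self-contained sketch of that approach. Only the names dispatch_actor_task and actor_mapping come from this thread; the types and everything else are simplified stand-ins invented for the example, not the actual patch:

    // Sketch only: simplified stand-ins for Ray's internal types.
    #include <unordered_map>

    typedef int ActorID;

    struct ActorMapEntry {
      int local_scheduler_id;
    };

    struct LocalSchedulerState {
      // Actors this local scheduler knows about, keyed by actor ID.
      std::unordered_map<ActorID, ActorMapEntry> actor_mapping;
    };

    // Try to hand queued tasks for actor_id to the actor's worker. If
    // message (1) (the actor-creation notification) has not been processed
    // yet, there is no entry in actor_mapping, so return without doing
    // anything; the task stays queued and is dispatched once the
    // notification arrives.
    void dispatch_actor_task(LocalSchedulerState *state, ActorID actor_id) {
      if (state->actor_mapping.count(actor_id) == 0) {
        return;  // Message (2) arrived before message (1); not an error.
      }
      // ... give the queued task to the actor's worker ...
    }

The review discussion below refers to this hunk from the patch: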

      /* This means that an actor has been assigned to this local scheduler, and a
       * task for that actor has been received by this local scheduler, but this
       * local scheduler has not yet processed the notification about the actor
       * creation. This may be possible though should be very uncommon. If it does
       * happen, it's ok. */
      DCHECK(DBClientID_equal(state->actor_mapping[actor_id].local_scheduler_id,
                              get_db_client_id(state->db)));
    } else {
      LOG_INFO(
          "handle_actor_task_scheduled called on local scheduler but the "
          "corresponding actor_map_entry is not present. This should be rare.");
@robertnishihara (Collaborator, Author):

This might look funny, but I just moved the comment from the if to the else block. I think it was in the wrong place.

@stephanie-wang stephanie-wang merged commit 9018dff into ray-project:master May 16, 2017
@robertnishihara robertnishihara deleted the actorfix branch May 16, 2017 20:16
@robertnishihara (Collaborator, Author)

@stephanie-wang it just occurred to me that this bug never showed up before the Redis sharding PR because the two messages went through the same Redis server and so arrived in the order that they were issued. However, now that the two messages can go through different Redis shards, out-of-order delivery is more likely.

I vaguely remember that there are other places (I think in the local scheduler code) where we rely on the fact that certain Redis commands happen in order. This is a class of bugs that we'll have to be careful about.
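
To make that concrete, here is a toy, self-contained illustration (not Ray code) of why sharding breaks the ordering assumption: each shard delivers its own messages in order, but nothing orders deliveries across shards.

    #include <iostream>
    #include <queue>
    #include <string>

    int main() {
      // Two shards, each with its own in-order delivery queue.
      std::queue<std::string> shard_a;  // receives message (1)
      std::queue<std::string> shard_b;  // receives message (2)

      // The driver issues (1) then (2), but they land on different shards.
      shard_a.push("(1) actor created");
      shard_b.push("(2) task submitted to actor");

      // If the local scheduler happens to process shard_b first, it sees the
      // task before the actor-creation notification: the situation fixed
      // above. With a single Redis server, both messages would have shared
      // one queue and arrived in issue order.
      std::cout << shard_b.front() << std::endl;
      std::cout << shard_a.front() << std::endl;
      return 0;
    }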

@stephanie-wang (Contributor)

Yes, we'll have to go through all the Redis messages to make sure. One case I can think of off the top of my head is that the result_table_add call might be processed after a reconstruction call (currently a fatal error).
