
Prevent plasma store and manager from dying when a client dies. #203

Merged
merged 2 commits into ray-project:master on Jan 18, 2017

Conversation

robertnishihara (Collaborator)

No description provided.

if p.poll() is not None:
  continue
successfully_shut_down = False
process_types = ["workers", "local_schedulers", "plasma_managers",

Contributor:

Remove this and the assertion, and replace the thing iterated over in the loop with all_processes.keys().
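
A rough sketch of what the suggested cleanup could look like; the helper name and its parameter are hypothetical, not taken from the PR:

def processes_successfully_shut_down(all_processes):
  # Sketch of the suggestion: iterate over all_processes.keys() instead of
  # keeping a separate, hard-coded list of process types. all_processes is
  # assumed to map process-type strings to lists of subprocess.Popen handles.
  for process_type in all_processes.keys():
    for p in all_processes[process_type]:
      if p.poll() is None:  # poll() returns None while the child is still running.
        return False
  return True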


-def all_processes_alive():
-  return all([p.poll() is None for p in all_processes])
+def all_processes_alive(exceptions=[]):

Contributor:

"exceptions" reminds one of programming exceptions; maybe call it "excluded"?

    exceptions: Ignore the processes whose types are in this list.
  """
  for process_type, processes in all_processes.items():
    if not all([p.poll() is None for p in processes]) and process_type not in exceptions:

Contributor:

For the untrained eye, can you leave a note that p.poll() is None indicates an alive process?

Collaborator Author:

Good point!
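
For illustration, the kind of inline note being asked for could look roughly like the sketch below; it is self-contained but hypothetical (the "sleep" child and the registry contents are stand-ins and assume a Unix environment), not the exact code from the PR:

import subprocess

# Hypothetical stand-in for the module-level registry of started processes.
all_processes = {"workers": [subprocess.Popen(["sleep", "60"])]}

def all_processes_alive(exceptions=[]):
  """Return True if all tracked processes outside the excluded types are alive.

  Popen.poll() returns None while the child is still running and returns its
  exit code once it has terminated, so "p.poll() is None" means "alive".
  """
  for process_type, processes in all_processes.items():
    if process_type in exceptions:
      continue
    if not all(p.poll() is None for p in processes):
      return False
  return True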

("plasma_managers", []),
("plasma_stores", []),
("global_schedulers", []),
("redis_servers", [])])

Contributor:

These string constants appear in a few places in the code. I think it might help to define them in an enum-like style in one place. E.g.,

PROCTYPE_WORKER="workers"
PROCTYPE_LS="local_schedulers"
PROCTYPE_PLASMA_MANAGER=...
PROCTYPE_GS=
PROCTYPE_REDIS_SERVER="redis_servers"
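
A sketch of that suggestion, using the type strings from the list above; the constant names follow the reviewer's proposal and are not necessarily what the PR ended up using:

# Define each process-type string once and refer to the constants everywhere else.
PROCTYPE_WORKER = "workers"
PROCTYPE_LOCAL_SCHEDULER = "local_schedulers"
PROCTYPE_PLASMA_MANAGER = "plasma_managers"
PROCTYPE_PLASMA_STORE = "plasma_stores"
PROCTYPE_GLOBAL_SCHEDULER = "global_schedulers"
PROCTYPE_REDIS_SERVER = "redis_servers"

# Illustrative construction of the registry keyed by the constants.
all_processes = dict([(PROCTYPE_WORKER, []),
                      (PROCTYPE_LOCAL_SCHEDULER, []),
                      (PROCTYPE_PLASMA_MANAGER, []),
                      (PROCTYPE_PLASMA_STORE, []),
                      (PROCTYPE_GLOBAL_SCHEDULER, []),
                      (PROCTYPE_REDIS_SERVER, [])])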

os.kill(p.pid, signal.SIGINT) # Give process signal to write profiler data.
time.sleep(0.1) # Wait for profiling data to be written.
p.kill()
time.sleep(0.05) # is this necessary?

Contributor:

You need to give the victim process a chance to handle its pending signals, so, minimally, it would be good to yield here. I think it could be sufficient to just do time.sleep(0).
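
A runnable sketch of the kill sequence with the suggested yield (the "sleep" child is just a stand-in and assumes a Unix environment):

import os
import signal
import subprocess
import time

def kill_process(p):
  os.kill(p.pid, signal.SIGINT)  # Ask the process to write out its profiler data.
  time.sleep(0.1)                # Give it time to write the data.
  p.kill()                       # Then terminate it.
  time.sleep(0)                  # Yield the CPU so the child can handle its pending signals.

if __name__ == "__main__":
  child = subprocess.Popen(["sleep", "60"])
  kill_process(child)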

if (write_message(w->sock, EXECUTE_TASK, task_spec_size(spec),
(uint8_t *) spec) < 0) {
/* TODO(rkn): If this happens, the task should be added back to the task
* queue. */

Contributor:

It might make sense to just do it (add the task back to the queue) as part of this PR? You have the state.

if (status < 0) {
LOG_WARN("Failed to send a message to client on fd %d", client_sock);
}
}

Contributor:

I feel like this helper function is a bit of overkill, especially given that its utility is limited: e.g., it couldn't be used in other places in this same PR where you want to do more than just log a warning message on socket I/O failure. Why not just stick with if (command() < 0) { LOG_WARN(); /* other stmts */ }?

warn_if_failed(plasma_send_WaitReply(
wait_req->client_conn->fd, manager_state->builder,
wait_req->object_requests, wait_req->num_object_requests),
wait_req->client_conn->fd);

Contributor:

A general comment for all socket I/O handling in this PR: it might be better to handle errno == EPIPE specially and log a more useful warning message (e.g. remote client hung up). Later on, we may want to differentiate further between failure cases, so how about adding a handle_errno(errno) helper function here, initially with two cases (EPIPE and everything else), possibly even a single case that borrows the errno message logging from the CHECK #define.

PROCCESS_TYPE_PLASMA_MANAGER = "plasma_manager"
PROCCESS_TYPE_PLASMA_STORE = "plasma_store"
PROCCESS_TYPE_GLOBAL_SCHEDULER = "global_scheduler"
PROCCESS_TYPE_REDIS_SERVER = "redis_server"

Contributor:

s/proccess/process/g
But really, PROCTYPE would've been much better, IMHO.

Collaborator Author:

Good catch :)

@@ -76,11 +83,11 @@ def kill_process(p):
 os.kill(p.pid, signal.SIGINT)  # Give process signal to write profiler data.
 time.sleep(0.1)  # Wait for profiling data to be written.
 p.kill()
-time.sleep(0.05)  # is this necessary?
+time.sleep(0.01)  # is this necessary?

Contributor:

Have you tried it with 0.0? Unless the time module special-cases zero, this should result in a usleep syscall, which would yield the core.

Collaborator Author:

sleep(0) seems to work :)

@pcmoritz pcmoritz merged commit 303d0fe into ray-project:master Jan 18, 2017
@pcmoritz pcmoritz deleted the robust branch January 18, 2017 04:34