-
Notifications
You must be signed in to change notification settings - Fork 6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Prevent plasma store and manager from dying when a client dies. #203
Conversation
if p.poll is not None: | ||
continue | ||
successfully_shut_down = False | ||
process_types = ["workers", "local_schedulers", "plasma_managers", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
remove this and the assertion and replace the thing iterated over in the loop with all_processes.keys()
|
||
def all_processes_alive(): | ||
return all([p.poll() is None for p in all_processes]) | ||
def all_processes_alive(exceptions=[]): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"exception" reminds of programming exception, maybe excluded?
exceptions: Ignore the processes whose types are in this list. | ||
""" | ||
for process_type, processes in all_processes.items(): | ||
if not all([p.poll() is None for p in processes]) and process_type not in exceptions: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
for the untrained eye, can you help leave the note that p.poll is None indicates an alive process?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point!
("plasma_managers", []), | ||
("plasma_stores", []), | ||
("global_schedulers", []), | ||
("redis_servers", [])]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
these string constants appear in a few places in the code. I think it might help to define them in the enum-like style in one place. E.g.,
PROCTYPE_WORKER="workers"
PROCTYPE_LS="local_schedulers"
PROCTYPE_PLASMA_MANAGER=...
PROCTYPE_GS=
PROCTYPE_REDIS_SERVER="redis_servers"
os.kill(p.pid, signal.SIGINT) # Give process signal to write profiler data. | ||
time.sleep(0.1) # Wait for profiling data to be written. | ||
p.kill() | ||
time.sleep(0.05) # is this necessary? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you need to let the victim process get a chance to process its pending signals. So, minimally, would be good to yield. I think it could be sufficient to just do time.sleep(0)
if (write_message(w->sock, EXECUTE_TASK, task_spec_size(spec), | ||
(uint8_t *) spec) < 0) { | ||
/* TODO(rkn): If this happens, the task should be added back to the task | ||
* queue. */ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
might make sense to just do it (add back to the queue) as part of this PR? You have the state
.
if (status < 0) { | ||
LOG_WARN("Failed to send a message to client on fd %d", client_sock); | ||
} | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I feel like this helper function is a bit of an overkill. Esp. given that its utility is limited, e.g. it couldn't be used in other places in this same PR where you want to do more than just log a warning message on socket i/o failure. Why not just stick with if ( command() < 0) { LOG_WARN(); /* other stmts*/ }
warn_if_failed(plasma_send_WaitReply( | ||
wait_req->client_conn->fd, manager_state->builder, | ||
wait_req->object_requests, wait_req->num_object_requests), | ||
wait_req->client_conn->fd); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
a general comment for all socket i/o handling in this PR. It might be better to specially handle errno == EPIPE and log a more useful warning message (e.g. remote client hung up). Later on, we may want to add more differentiation between various failure cases. So how about adding a handle_errno(errno) helper function here, initially with two cases: EPIPE and other, possibly even a single case borrowing errno msg logging code from the CHECK #define.
PROCCESS_TYPE_PLASMA_MANAGER = "plasma_manager" | ||
PROCCESS_TYPE_PLASMA_STORE = "plasma_store" | ||
PROCCESS_TYPE_GLOBAL_SCHEDULER = "global_scheduler" | ||
PROCCESS_TYPE_REDIS_SERVER = "redis_server" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
s/proccess/process/g
But really, PROCTYPE would've been much better, IMHO.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch :)
@@ -76,11 +83,11 @@ def kill_process(p): | |||
os.kill(p.pid, signal.SIGINT) # Give process signal to write profiler data. | |||
time.sleep(0.1) # Wait for profiling data to be written. | |||
p.kill() | |||
time.sleep(0.05) # is this necessary? | |||
time.sleep(0.01) # is this necessary? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
have you tried it with 0.0 ? Unless the time module is special-casing zero, this should result in a usleep syscall, which would yield the core.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sleep(0)
seems to work :)
…because the arguments to warn_if_sigpipe can be evaluated out of order.
No description provided.