reattach: don't kill process on failed reconnection #320

Merged
tgross merged 1 commit into main from no-kill-on-failed-reconnect on Oct 21, 2024

Conversation

tgross (Member) commented Oct 18, 2024

During reattachment, we look to see if the process corresponding to the stored PID is running. If so, we try to connect to that process. If that fails, we kill the process under the presumption it's not working, and return ErrProcessNotFound.

But during reattachment we don't know that the PID we have is still valid, which means the process we're trying to attach to may have exited and a different process may have spawned with the same PID. The result is that some unrelated process gets silently killed.

This impacts Nomad when running the `rawexec` or `exec` task drivers, because the Nomad agent spawns an "executor" process via go-plugin to control the workloads, and these executors are left running when Nomad exits. If the executors die in the meantime (or the host is rebooted), then we can potentially kill a random process on the host.

Because there's no way for go-plugin to know whether the process is a go-plugin server without connecting, this kill is never really safe. Remove it.
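As a rough illustration of the flow described above: this is a minimal sketch written against the standard library, not go-plugin's actual source. The `ReattachConfig` shape, the direct `net.DialTimeout` call, and the error text are simplifications for illustration; the commented-out `Kill` marks the behavior this PR removes.

```go
// Package reattach is a simplified sketch of the reattach path discussed in
// this PR; it is not go-plugin's real implementation.
package reattach

import (
	"errors"
	"net"
	"os"
	"syscall"
	"time"
)

// ErrProcessNotFound mirrors the error named in the description above.
var ErrProcessNotFound = errors.New("reattachment process not found")

// ReattachConfig is a pared-down stand-in for go-plugin's reattach config.
type ReattachConfig struct {
	Pid  int
	Addr net.Addr
}

func reattach(rc *ReattachConfig) (net.Conn, error) {
	// On Unix, FindProcess always succeeds; sending signal 0 is the
	// conventional liveness check for the stored PID.
	p, err := os.FindProcess(rc.Pid)
	if err != nil || p.Signal(syscall.Signal(0)) != nil {
		return nil, ErrProcessNotFound
	}

	// Try to connect to the plugin server at the stored address.
	conn, err := net.DialTimeout(rc.Addr.Network(), rc.Addr.String(), 5*time.Second)
	if err != nil {
		// Before this change, the process was killed here on the assumption
		// it was a broken plugin server. The PID may belong to an unrelated
		// process by now, so the kill is removed and only the error remains.
		//
		//	p.Kill() // removed by this PR
		//
		return nil, ErrProcessNotFound
	}
	return conn, nil
}
```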

Ref: hashicorp/nomad#23969
Ref: https://hashicorp.atlassian.net/browse/NET-11233

schmichael approved these changes Oct 18, 2024
tgross force-pushed the no-kill-on-failed-reconnect branch from 8bcfca2 to c1eccbd on October 21, 2024
tgross (Member, Author) commented Oct 21, 2024

Rebased on main to pick up your CI changes.

tgross merged commit df94fce into main on Oct 21, 2024
3 checks passed
peter-harmann-tfs commented Oct 22, 2024

@tgross Is there ever a chance the process really is a nomad executor that is not working? Could this cause these processes to leak and keep running forever?

Because there's no way for go-plugin to know whether the process is a go-plugin server without connecting, this kill is never really safe.

Also storing the process creation time and killing only if it matches should solve this, since PID + process creation time together should be unique.
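As a sketch of that suggestion, assuming the third-party gopsutil library for reading a process's creation time (go-plugin does not currently depend on it); `StoredPlugin` and `safeToKill` are hypothetical names for illustration:

```go
// Package pidcheck sketches the PID + creation time check suggested above.
package pidcheck

import (
	"github.com/shirou/gopsutil/v3/process"
)

// StoredPlugin is a hypothetical record persisted alongside the PID.
type StoredPlugin struct {
	Pid        int32
	CreateTime int64 // milliseconds since the epoch, recorded at launch
}

// safeToKill reports whether the process at sp.Pid is still the same process
// that was originally launched, i.e. the PID has not been recycled.
func safeToKill(sp StoredPlugin) bool {
	p, err := process.NewProcess(sp.Pid)
	if err != nil {
		return false // process no longer exists
	}
	ct, err := p.CreateTime()
	if err != nil {
		return false
	}
	return ct == sp.CreateTime
}
```

The launcher would record the creation time immediately after starting the plugin, so a later mismatch indicates the PID has been reused by an unrelated process.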

tgross (Member, Author) commented Oct 22, 2024

Is there ever a chance the process really is a nomad executor that is not working? Could this cause these processes to leak and keep running forever?

@peter-harmann-tfs we had an internal chat about that case, and the consensus was that if there were a bug such that the plugin process was alive but unable to accept connections, you'd want to know about it so it can be fixed, rather than silently killing the process and having that bug blow up in your face later.
