docs: add RFD for linux job worker service #1
Conversation
rfd/0001-linux-job-worker-service.md (Outdated)

> `Start` will also create a new cgroup for the job and set the CPU, memory, and IO limits. It will be up to the `jobExecutorPath` to append the PID to the cgroup's `cgroup.procs` file as described above.
>
> The configured `exec.Cmd` will execute `jobExecutorPath`. It is expected that `jobExecutorPath` is a binary that will do the following:
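As a rough illustration of the quoted steps, here is a minimal cgroup v2 sketch in Go; the `/sys/fs/cgroup/jobworker` hierarchy, the limit values, and the function names are assumptions for illustration, not part of the RFD.

```go
import (
	"fmt"
	"os"
	"path/filepath"
)

// setupCgroup sketches what Start could do: create a per-job cgroup and
// write illustrative CPU, memory, and IO limits. It assumes the cpu, memory,
// and io controllers are already enabled in the parent's cgroup.subtree_control.
func setupCgroup(jobID string) (string, error) {
	dir := filepath.Join("/sys/fs/cgroup", "jobworker", jobID)
	if err := os.MkdirAll(dir, 0o755); err != nil {
		return "", err
	}
	limits := map[string]string{
		"cpu.max":    "50000 100000", // 50% of one CPU
		"memory.max": "268435456",    // 256 MiB
		"io.max":     "8:0 rbps=1048576 wbps=1048576",
	}
	for file, value := range limits {
		if err := os.WriteFile(filepath.Join(dir, file), []byte(value), 0o644); err != nil {
			return "", err
		}
	}
	return dir, nil
}

// addSelfToCgroup sketches the jobExecutorPath side: append its own PID to
// the cgroup.procs file of the cgroup created by Start.
func addSelfToCgroup(cgroupDir string) error {
	f, err := os.OpenFile(filepath.Join(cgroupDir, "cgroup.procs"), os.O_WRONLY|os.O_APPEND, 0)
	if err != nil {
		return err
	}
	defer f.Close()
	_, err = fmt.Fprintf(f, "%d\n", os.Getpid())
	return err
}
```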
I think you're missing a step in this list. While the new process will end up in a mount namespace, remounting /proc will happen on a shared subtree.
Sorry, I'm not following your comment. I've tried to re-word/organize this. It's not obvious to me what I'm missing here.
My understanding (very limited, to be honest) is that when `JobExecutorPath`, running within a new mount namespace, mounts a new proc filesystem, it is not reflected in the host's mount namespace. Yep, in the current design, if the user wants to run `ps -ef` then they would see `JobExecutorPath` as PID 1 and `ps -ef`'s PID, but no other processes will be displayed.
> that it is not reflected on the host's mount namespace

I recommend testing out mounting /proc from a new mount namespace. Look at what happens to /proc on the host.
Oops, I think I see what you're saying now. I added some clarity about specifying an unshare flag for the mount namespace so new mounts don't show up within the host's mount namespace. I also added clarity that `JobExecutorPath` should create a new directory to mount a new proc filesystem to.
Please let me know if that's what you were referring to. Sorry for the confusion - I definitely wasn't clear here.
If you're referring to another step that I'm missing here then I definitely do not know what I'm missing.
You need to mount `/` in a private subtree within the namespace first, in order to prevent the /proc mount from applying in the host namespace. `/` is often mounted as a shared subtree.
I'm definitely misunderstanding something.
If `Start()` runs `JobExecutorPath` with the following flags:
```go
cmd := exec.Command(JobExecutorPath, arguments...)
cmd.SysProcAttr = &syscall.SysProcAttr{
	Cloneflags:   syscall.CLONE_NEWPID | syscall.CLONE_NEWNS | syscall.CLONE_NEWNET,
	Unshareflags: syscall.CLONE_NEWNS,
}
```
and then `JobExecutorPath` mounts a new proc filesystem (`syscall.Mount("proc", "/proc", "proc", 0, "")`), then I'm not seeing any issues with the host's mounts or proc filesystem.

If I drop the `Unshareflags`, then when `JobExecutorPath` mounts a new proc filesystem I can confirm the host's proc filesystem is messed up. If the `Unshareflags` isn't provided, then doing what you said (`JobExecutorPath` mounting `/` with `syscall.MS_PRIVATE|syscall.MS_REC`) does prevent messing up the host's proc filesystem as well.

Is there something else I'm missing here? I'm cool adding the mounting of `/` with the `MS_PRIVATE|MS_REC` flags, but I don't understand what it's doing when the `Unshareflags` is specified.
Edit: I removed the previously added step to create a temp directory for new proc. That's not needed.
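For reference, a minimal sketch of the mount sequence being discussed, assuming it runs inside `JobExecutorPath` when the unshare flag is not used; the function name is illustrative:

```go
import "syscall"

// remountProc makes the subtree private before mounting proc so the mount
// does not propagate back to the host's (shared) mount namespace.
func remountProc() error {
	// Recursively mark / as private within the new mount namespace.
	if err := syscall.Mount("", "/", "", syscall.MS_PRIVATE|syscall.MS_REC, ""); err != nil {
		return err
	}
	// Mount a fresh proc that reflects only the new PID namespace.
	return syscall.Mount("proc", "/proc", "proc", 0, "")
}
```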
Interesting! Last time I tested this, mounting / as private was necessary to escape the shared subtree. Could be distro-specific (I think systemd on some distros will make / private by default), or maybe this has always worked with `unshare()`. All good if you're confident it's working 🙂
You got me curious. Running `findmnt -o TARGET,PROPAGATION` in the host mount namespace shows `/` is mounted as shared.

This is also true when running in the new mount namespace without specifying the unshare flag. If the unshare flag for `CLONE_NEWNS` is provided, then running `findmnt -o TARGET,PROPAGATION` within the new mount namespace shows `/` is mounted as private.
I'll try to look around for some documentation to back this up, mostly for my own curiosity.
rfd/0001-linux-job-worker-service.md (Outdated)

> `Status` returns the current status of the job. Refer to [JobStatus](#type-jobstatus) for more information.
>
> ##### func (*Job) Stream(context.Context) <-chan []byte
Is a channel the best choice for this API? What if I want to consume no more than a specific number of bytes? What if I want to stop receiving data?
The goal of the context was to cancel/stop receiving data.
I need to think on this. I will definitely admit an API using a context and returning a channel feels very clunky/unintuitive.
> an API using a context and returning a channel feels very clunky/unintuitive
Exactly. Just focusing on the API design, how would you usually stream bytes in Go?
I've updated the docs for `Stream` to no longer take any parameters; it now returns an `OutputReader` that implements `io.Reader` (and ultimately is able to read a process's output from another type, `Output`). I've added more information about `OutputReader` and `Output` towards the bottom of the RFD.
Thank you for calling this out! This is a way better design that doesn't require the user to close or cancel anything.
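A minimal sketch of that shape, assuming `Output` exposes a `ReadAt`-style method; the field names and blocking semantics here are assumptions rather than the RFD's final design:

```go
import "io"

// Output is assumed to expose random-access reads over the captured process output.
type Output interface {
	ReadAt(p []byte, off int64) (int, error)
}

// OutputReader adapts Output to io.Reader by tracking its own offset,
// so each caller of Stream gets an independent cursor.
type OutputReader struct {
	output Output
	offset int64
}

var _ io.Reader = (*OutputReader)(nil)

func (r *OutputReader) Read(p []byte) (int, error) {
	n, err := r.output.ReadAt(p, r.offset)
	r.offset += int64(n)
	return n, err
}
```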
Much better. I might lean towards returning io.Reader or io.ReadCloser directly instead of a custom type, but either way works. A custom type is more open to extension in the future.
Looks good to me in general - just a few questions in the comments.
The only major thing I can think of is that I'd like to see a brief discussion of what the glue between the gRPC API and the `Job` library will look like. Things like how the server application will:
- track the running/finished jobs,
- extract the user's CN from the client certificate and use that for authorization (a rough sketch of this follows below),
- handle concurrent job start requests, and suchlike.
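As a point of reference only, this is one way a gRPC handler can pull the CN out of a verified mTLS client certificate; the function name and error handling are illustrative assumptions, not part of the RFD:

```go
import (
	"context"
	"errors"

	"google.golang.org/grpc/credentials"
	"google.golang.org/grpc/peer"
)

// clientCN extracts the Common Name from the mTLS client certificate
// attached to the gRPC connection, for use in authorization decisions.
func clientCN(ctx context.Context) (string, error) {
	p, ok := peer.FromContext(ctx)
	if !ok {
		return "", errors.New("no peer info in context")
	}
	tlsInfo, ok := p.AuthInfo.(credentials.TLSInfo)
	if !ok || len(tlsInfo.State.PeerCertificates) == 0 {
		return "", errors.New("no verified client certificate")
	}
	return tlsInfo.State.PeerCertificates[0].Subject.CommonName, nil
}
```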
rfd/0001-linux-job-worker-service.md (Outdated)

> ##### func (*Job) Stop() error
>
> `Stop` will kill the process.
How will the job processes be killed? Will the jobs be given a grace period to finish up once signalled?
I added a note that the process will be killed via `kill`. Since JobExecutorPath is PID 1 in the new pid namespace, when it terminates, all other processes running in the new pid namespace are also killed.

I also added a note that graceful termination could be handled in the future by sending SIGTERM to JobExecutorPath, which could handle SIGTERM by sending a SIGTERM signal to the child process (the user's desired command). If the process has not terminated in some amount of time, then SIGKILL is used.
Please let me know if that's cool with you or if you would like to see graceful termination supported in this.
Having `Stop()` send SIGKILL seems unusual, especially since your RFD calls out support for long-running processes explicitly. Many commands are not intended to be killed by the kernel without any notice.
It's definitely important not to go too far out of scope for the challenge, but we expect the job lifecycle management to be relatively robust 🙂
Sounds good. I'll update the RFD so that the library's `Stop` sends `SIGTERM` to the `JobExecutorPath` (which will appropriately send `SIGTERM` to the child process), wait some amount of time (probably something like 10 seconds as a hardcoded value), check if PID 1 is alive, and if it is, then send `SIGKILL`.
This sounds great! And of course, think about how you'll deal with child processes.
Remember that you're providing the logic for JobExecutorPath, so if it receives SIGTERM, you'll want to specify how it handles that signal. If it quits immediately, all processes will get SIGKILL.
I updated the RFD to handle graceful shutdowns.
JobExecutorPath is now expected to handle SIGTERM. When it receives SIGTERM, it will send SIGTERM to the child process (user's command). JobExecutorPath must not terminate until the child process does.
On the library side, `Stop` will send SIGTERM to JobExecutorPath. If after 10 seconds JobExecutorPath is alive, then `Stop` will send SIGKILL to JobExecutorPath.
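A minimal sketch of that library-side flow; the `cmd` and `done` fields are assumptions for illustration:

```go
import (
	"syscall"
	"time"
)

// Stop asks jobExecutorPath (PID 1 in the new pid namespace) to shut down
// gracefully, then force-kills it after a fixed grace period.
func (j *Job) Stop() error {
	if err := j.cmd.Process.Signal(syscall.SIGTERM); err != nil {
		return err
	}
	select {
	case <-j.done: // assumed to be closed once cmd.Wait() returns
		return nil
	case <-time.After(10 * time.Second):
		// Killing PID 1 of the pid namespace takes the remaining
		// processes in that namespace with it.
		return j.cmd.Process.Kill()
	}
}
```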
Sounds great. Note that you may have multiple processes under JobExecutorPath, and gracefully terminating all of them may not be as simple as `cmd.Wait()`.
This prevents the host mount namespace from seeing the new proc filesystem mounted within the new mount namespace.
Add note that graceful termination could be supported by:
1. Send a SIGTERM to JobExecutorPath.
2. JobExecutorPath handles SIGTERM by sending SIGTERM to the child process (user's command).
3. Wait some amount of time.
4. Send SIGKILL if JobExecutorPath is still around.
5. Since JobExecutorPath is PID 1, all processes in the new pid namespace will be killed.
I've added a few lines on how the server works.
Thank you so much for the feedback!
Returning an OutputReader that implements io.Reader lets the user decide how much data to read and when to stop reading, without the unintuitive contexts-and-channels API of the previous design.
Stop sends SIGTERM to `JobExecutorPath`, waits 10 seconds, then sends SIGKILL. JobExecutorPath should handle SIGTERM by sending SIGTERM to the child process and waiting for it to terminate before exiting.
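A sketch of the `JobExecutorPath` side of that contract, forwarding SIGTERM to its direct child only (as noted above, cleaning up a full process tree may take more than `cmd.Wait()`); the names here are illustrative:

```go
import (
	"os"
	"os/exec"
	"os/signal"
	"syscall"
)

// runChild starts the user's command and forwards SIGTERM to it, waiting
// for the child to exit before jobExecutorPath itself exits.
func runChild(name string, args ...string) error {
	child := exec.Command(name, args...)
	child.Stdout = os.Stdout
	child.Stderr = os.Stderr
	if err := child.Start(); err != nil {
		return err
	}

	sigs := make(chan os.Signal, 1)
	signal.Notify(sigs, syscall.SIGTERM)
	go func() {
		<-sigs
		// Forward the signal; do not exit until the child has.
		_ = child.Process.Signal(syscall.SIGTERM)
	}()

	return child.Wait()
}
```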
Just a few more nits
This has more type safety than `sync.Map`'s `any` usage.
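For illustration, a minimal mutex-guarded, typed alternative to `sync.Map`; the `jobStore` name and fields are assumptions:

```go
import "sync"

// jobStore tracks running and finished jobs by ID with a typed map,
// guarded by an RWMutex for concurrent start/status/stop requests.
type jobStore struct {
	mu   sync.RWMutex
	jobs map[string]*Job
}

func (s *jobStore) add(id string, j *Job) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.jobs[id] = j
}

func (s *jobStore) get(id string) (*Job, bool) {
	s.mu.RLock()
	defer s.mu.RUnlock()
	j, ok := s.jobs[id]
	return j, ok
}
```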
All great points - thank you for taking the time to make a thorough review!
rfd/0001-linux-job-worker-service.md

> The Job's `Stream` function will create a new `OutputReader` per invocation that is provided the Job's `Output`.
>
> `OutputReader` will implement [io.Reader](https://pkg.go.dev/io#Reader). `OutputReader`'s `Read` will invoke the underlying `Output`'s `ReadAt` to read the next bytes.
Note that you may need an io.ReadCloser here, if you store bytes for each client. You can use as much memory as you'd like, but you can't leak memory from clients that have finished streaming.
Wait for @tcsc to approve before merging
This RFD lays out how to design a Linux job worker service for running arbitrary Linux processes with some degree of isolation and resource limits.
Users should be able to interact with the service via a CLI to start jobs, stop jobs, check status of jobs, and stream a job's output.