docs: add RFD for linux job worker service #1

Merged: 22 commits merged into main on Jul 30, 2024

Conversation

dustinspecker (Owner)

This RFD lays out how to design a Linux job worker service for running arbitrary Linux processes with some degree of isolation and resource limits.

Users should be able to interact with the service via a CLI to start jobs, stop jobs, check the status of jobs, and stream a job's output.

`Start` will also create a new cgroup for the job and set the CPU, memory, and IO limits. It will be up to the `jobExecutorPath` to append the PID to the cgroup's `cgroup.procs` file as described above.

The configured `exec.Cmd` will execute `jobExecutorPath`. It is expected that `jobExecutorPath` is a binary that will do the following:
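As a rough illustration of the cgroup setup described above (not the elided list of `jobExecutorPath` steps), here is a minimal sketch assuming cgroup v2 mounted at /sys/fs/cgroup; the helper name `createCgroup` and the example limit values are hypothetical and not taken from the RFD:

```go
package job

import (
	"fmt"
	"os"
	"path/filepath"
)

// createCgroup creates a per-job cgroup and writes the resource limits.
// Appending the job's PID to cgroup.procs is left to jobExecutorPath.
func createCgroup(jobID, cpuMax, memoryMax, ioMax string) (string, error) {
	cgroupPath := filepath.Join("/sys/fs/cgroup", "job-"+jobID)
	if err := os.Mkdir(cgroupPath, 0o755); err != nil {
		return "", fmt.Errorf("creating cgroup: %w", err)
	}

	// cgroup v2 controller interface files.
	limits := map[string]string{
		"cpu.max":    cpuMax,    // e.g. "50000 100000" (50% of one CPU)
		"memory.max": memoryMax, // e.g. "268435456" (256 MiB)
		"io.max":     ioMax,     // e.g. "8:0 rbps=1048576 wbps=1048576"
	}
	for file, value := range limits {
		if err := os.WriteFile(filepath.Join(cgroupPath, file), []byte(value), 0o644); err != nil {
			return "", fmt.Errorf("writing %s: %w", file, err)
		}
	}
	return cgroupPath, nil
}
```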

Collaborator

I think you're missing a step in this list. While the new process will end up in a mount namespace, remounting /proc will happen on a shared subtree.

Owner Author

Sorry, I'm not following your comment. I've tried to re-word/organize this. It's not obvious to me what I'm missing here.

My understanding (very limited, to be honest) is that when JobExecutorPath, running within a new mount namespace, mounts a new proc filesystem, it is not reflected on the host's mount namespace. Yep, in the current design, if the user wants to run ps -ef then they would see JobExecutorPath as PID 1 and ps -ef's PID, but no other processes will be displayed.

Collaborator

that it is not reflected on the host's mount namespace

I recommend testing out mounting /proc from a new mount namespace. Look at what happens to /proc on the host.

Owner Author

Oops, I think I see what you're saying now. I added some clarity about specifying an unshare flag for the mount namespace so new mounts don't show up within the host's mount namespace. I also added clarity that JobExecutorPath should create a new directory to mount a new proc filesystem to.

Please let me know if that's what you were referring to. Sorry for the confusion - I definitely wasn't clear here.

If you're referring to another step that I'm missing here then I definitely do not know what I'm missing.

Collaborator

You need to mount / in a private subtree within the namespace first, in order to prevent the /proc mount from applying in the host namespace. / is often mounted as a shared subtree.

https://docs.kernel.org/filesystems/sharedsubtree.html

@dustinspecker (Owner Author), Jul 29, 2024

I'm definitely misunderstanding something.

If Start() runs a JobExecutorPath with the following flags:

```go
cmd := exec.Command(JobExecutorPath, arguments...)
cmd.SysProcAttr = &syscall.SysProcAttr{
    Cloneflags:   syscall.CLONE_NEWPID | syscall.CLONE_NEWNS | syscall.CLONE_NEWNET,
    Unshareflags: syscall.CLONE_NEWNS,
}
```

and then JobExecutorPath mounts a new proc filesystem (`syscall.Mount("proc", "/proc", "proc", 0, "")`), then I'm not seeing any issues with the host's mounts or proc filesystem.

If I drop the Unshareflags, then when JobExecutorPath mounts a new proc filesystem I can confirm the host's proc filesystem is messed up. If the Unshareflags isn't provided, doing what you said (JobExecutorPath mounting / with syscall.MS_PRIVATE|syscall.MS_REC) does prevent messing up the host's proc filesystem as well.

Is there something else I'm missing here? I'm cool with adding the mounting of / with MS_PRIVATE|MS_REC flags, but I don't understand what it's doing when the Unshareflags is specified.

Edit: I removed the previously added step to create a temp directory for the new proc filesystem. That's not needed.
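For illustration, a minimal sketch of the mount steps discussed in this thread that JobExecutorPath could perform once it is running in the new namespaces; the function name `setupMounts` is hypothetical and not part of the RFD:

```go
package main

import "syscall"

// setupMounts remounts / as private and mounts a fresh proc filesystem.
func setupMounts() error {
	// Recursively mark / private so the proc mount below cannot propagate
	// back to the host's mount namespace (redundant when Unshareflags:
	// CLONE_NEWNS already makes the new mount namespace private, as
	// observed above).
	if err := syscall.Mount("none", "/", "", syscall.MS_REC|syscall.MS_PRIVATE, ""); err != nil {
		return err
	}
	// Mount a fresh proc filesystem so process listings only show
	// processes in the new pid namespace.
	return syscall.Mount("proc", "/proc", "proc", 0, "")
}
```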

Collaborator

Interesting! Last time I tested this, mounting / as private was necessary to escape the shared subtree. Could be distro-specific (I think systemd on some distros will make / private by default), or maybe this has always worked with unshare(). All good if you're confident it's working 🙂

Owner Author

You got me curious. Running findmnt -o TARGET,PROPAGATION in the host mount namespace shows / is mounted as shared.

This is also true when running in the new mount namespace without specifying the unshare flag. If the unshare flag for CLONE_NEWNS is provided then running findmnt -o TARGET,PROPAGATION within the new mount namespace shows / is mounted as private.

I'll try to look around for some documentation to back this up, mostly for my own curiosity.


`Status` returns the current status of the job. Refer to [JobStatus](#type-jobstatus) for more information.

##### func (*Job) Stream(context.Context) <-chan []byte
Collaborator

Is a channel the best choice for this API? What if I want to consume no more than a specific number of bytes? What if I want to stop receiving data?

Owner Author

The goal of the context was to cancel/stop receiving data.

I need to think on this. I will definitely admit an API using a context and returning a channel feels very clunky/unintuitive.

Collaborator

an API using a context and returning a channel feels very clunky/unintuitive

Exactly. Just focusing on the API design, how would you usually stream bytes in Go?

Owner Author

I've updated the docs so that Stream no longer takes any parameters and now returns an OutputReader that implements io.Reader (and is ultimately able to read a process's output from another type, Output). I've added more information about OutputReader and Output towards the bottom of the RFD.

Thank you for calling this out! This is a way better design that doesn't require the user to close or cancel anything.

Collaborator

Much better. I might lean towards returning io.Reader or io.ReadCloser directly instead of a custom type, but either way works. A custom type is more open to extension in the future.

@dustinspecker self-assigned this on Jul 25, 2024
@tcsc (Collaborator) left a comment

Looks good to me in general - just a few questions in the comments.

The only major thing I can think of is that I'd like to see a brief discussion of what the glue between the gRPC API and the Job library will look like. Things like how the server application will

  • track the running/finished jobs,
  • extract the user's CN from the client certificate and use that for authorization
  • handle concurrent job start requests and suchlike.


##### func (*Job) Stop() error

`Stop` will kill the process.
Collaborator

How will the job processes be killed? Will the jobs be given a grace period to finish up once signalled?

Owner Author

I added a note that the process will be killed via kill. Since JobExecutorPath is PID 1 in the new pid namespace, when it terminates, all other processes running in the new pid namespace are also killed.

I also added a note that graceful termination could be handled in the future by sending SIGTERM to JobExecutorPath, which could handle SIGTERM by sending a SIGTERM signal to the child process (the user's desired command). If the process has not terminated in some amount of time, then SIGKILL is used.

Please let me know if that's cool with you or if you would like to see graceful termination supported in this.

Collaborator

Having Stop() send SIGKILL seems unusual, especially since your RFD calls out support for long-running processes explicitly. Many commands are not intended to be killed by the kernel without any notice.

It's definitely important not to go too far out of scope for the challenge, but we expect the job lifecycle management to be relatively robust 🙂

Owner Author

Sounds good. I'll update the RFD so that the library's Stop sends SIGTERM to JobExecutorPath (which appropriately sends SIGTERM to the child process), waits some amount of time (probably something like 10 seconds as a hardcoded value for this), checks if PID 1 is alive, and if it is, sends SIGKILL.

Collaborator

This sounds great! And of course, think about how you'll deal with child processes.

Remember that you're providing the logic for JobExecutorPath, so if it receives SIGTERM, you'll want to specify how it handles that signal. If it quits immediately, all processes will get SIGKILL.

Owner Author

I updated the RFD to handle graceful shutdowns.

JobExecutorPath is now expected to handle SIGTERM. When it receives SIGTERM, it will send SIGTERM to the child process (user's command). JobExecutorPath must not terminate until the child process does.

On the library side, Stop will send SIGTERM to JobExecutorPath. If, after 10 seconds, JobExecutorPath is still alive, then Stop will send SIGKILL to JobExecutorPath.
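A minimal sketch of that Stop flow, under the assumption that the Job struct tracks the exec.Cmd running JobExecutorPath and a done channel closed once cmd.Wait returns; these field names are hypothetical and not prescribed by the RFD:

```go
package job

import (
	"fmt"
	"os/exec"
	"syscall"
	"time"
)

// Hypothetical layout; the RFD does not prescribe these fields.
type Job struct {
	cmd  *exec.Cmd     // runs JobExecutorPath
	done chan struct{} // closed once cmd.Wait returns
}

// Stop sends SIGTERM, waits up to 10 seconds for JobExecutorPath to exit,
// and falls back to SIGKILL.
func (j *Job) Stop() error {
	if err := j.cmd.Process.Signal(syscall.SIGTERM); err != nil {
		return fmt.Errorf("sending SIGTERM: %w", err)
	}
	select {
	case <-j.done:
		// JobExecutorPath exited within the grace period; as PID 1 of the
		// pid namespace, its exit also tears down any remaining children.
		return nil
	case <-time.After(10 * time.Second):
		// Grace period expired; force-kill PID 1 of the namespace.
		return j.cmd.Process.Kill()
	}
}
```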

Collaborator

Sounds great. Note that you may have multiple processes under JobExecutorPath, and gracefully terminating all of them may not be as simple as cmd.Wait()

This prevents the host mount namespace from seeing the new proc
filesystem mounted within the new mount namespace.
Add note that graceful termination could be supported by
1. Send a SIGTERM to JobExecutorPath
2. JobExecutorPath handles SIGTERM by sending SIGTERM to the child process
   (user's command)
3. Wait some amount of time
4. Send SIGKILL if JobExecutorPath is still around.
5. Since JobExecutorPath is PID 1, all processes in the new pid namespace
   will be killed.
@dustinspecker (Owner Author)

Looks good to me in general - just a few questions in the comments.

The only major thing I can think of is that I'd like to see a brief discussion of what the glue between the gRPC API and the Job library will look like. Things like how the server application will

  • track the running/finished jobs,
  • extract the user's CN from the client certificate and use that for authorization
  • handle concurrent job start requests and suchlike.

I've added a few lines on how the server works.

  • The server can store jobs in a map, preferably a sync.Map
  • I've added information about how the CN can be extracted in each handler via the context (a sketch follows below).
  • I've added a couple of notes about gRPC handlers each running in a new goroutine and the "job store" needing to be safe for concurrent use. I also added a note in the library that the Job struct should be safe for concurrent changes as well.

Thank you so much for the feedback!
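A minimal sketch of those two server-side pieces, using the standard grpc-go peer/credentials types to reach the mutual-TLS state; the package layout, the `jobs` variable, and the helper name `commonNameFromContext` are hypothetical:

```go
package server

import (
	"context"
	"fmt"
	"sync"

	"google.golang.org/grpc/credentials"
	"google.golang.org/grpc/peer"
)

// Hypothetical job store shared by the gRPC handlers; each RPC is served on
// its own goroutine, so the store must be safe for concurrent use.
var jobs sync.Map // job ID -> job

// commonNameFromContext pulls the client certificate's CN out of the request
// context for authorization decisions.
func commonNameFromContext(ctx context.Context) (string, error) {
	p, ok := peer.FromContext(ctx)
	if !ok {
		return "", fmt.Errorf("no peer information in context")
	}
	tlsInfo, ok := p.AuthInfo.(credentials.TLSInfo)
	if !ok {
		return "", fmt.Errorf("connection is not using TLS")
	}
	chains := tlsInfo.State.VerifiedChains
	if len(chains) == 0 || len(chains[0]) == 0 {
		return "", fmt.Errorf("no verified client certificate")
	}
	return chains[0][0].Subject.CommonName, nil
}
```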

Returning an OutputReader that implements io.Reader enables
the user to decide how much data they want to read and when they
want to be done reading, without the unintuitive API of the previous
design with contexts and channels.
Stop sends SIGTERM to `JobExecutorPath`. Waits 10 seconds. Then sends
SIGKILL.

JobExecutorPath should handle SIGTERM. Send SIGTERM to child process.
Wait for child process to terminate before exiting.
@tcsc (Collaborator) left a comment

Just a few more nits

@dustinspecker (Owner Author)

Just a few more nits

All great points - thank you for taking the time to make a thorough review!


The Job's `Stream` function will create a new `OutputReader` per invocation that is provided the Job's `Output`.

`OutputReader` will implement [io.Reader](https://pkg.go.dev/io#Reader). `OutputReader`'s `Read` will invoke the underlying `Output`'s `ReadAt` to read the next bytes.
Collaborator

Note that you may need an io.ReadCloser here, if you store bytes for each client. You can use as much memory as you'd like, but you can't leak memory from clients that have finished streaming.
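A minimal sketch of the OutputReader described above, assuming the RFD's Output type exposes a ReadAt method that either blocks for more data or returns io.EOF once the job has finished and all output has been consumed:

```go
// OutputReader reads a job's output from a shared Output, tracking its own
// offset so each client streams independently. Output is the RFD's type and
// is not defined here.
type OutputReader struct {
	output *Output
	offset int64 // next byte this particular reader will consume
}

// Read satisfies io.Reader by delegating to the underlying Output's ReadAt.
func (r *OutputReader) Read(p []byte) (int, error) {
	n, err := r.output.ReadAt(p, r.offset)
	r.offset += int64(n)
	return n, err
}
```

With this shape, each reader only holds an offset into the shared Output rather than a per-client buffer, so there is nothing to release when a client stops streaming; if per-client state were added later, promoting the type to an io.ReadCloser would be the natural place to free it.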


@sclevine (Collaborator) left a comment

Wait for @tcsc to approve before merging

@dustinspecker merged commit 3be3399 into main on Jul 30, 2024
@dustinspecker deleted the rfd branch on July 30, 2024 17:59