docs: add RFD for linux job worker service #1

Merged: 22 commits merged into main on Jul 30, 2024

Conversation

dustinspecker (Owner)

This RFD lays out how to design a Linux job worker service for running arbitrary Linux processes with some degree of isolation and resource limits.

Users should be able to interact with the service via a CLI to start jobs, stop jobs, check the status of jobs, and stream a job's output.

`Start` will also create a new cgroup for the job and set the CPU, memory, and IO limits. It will be up to the `jobExecutorPath` to append the PID to the cgroup's `cgroup.procs` file as described above.

The configured `exec.Cmd` will execute `jobExecutorPath`. It is expected that `jobExecutorPath` is a binary that will do the following:
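As a rough illustration of the cgroup setup described above (not the elided list of `jobExecutorPath` steps), here is a minimal sketch assuming cgroup v2 mounted at /sys/fs/cgroup; the helper name `createCgroup` and the example limit values are hypothetical and not taken from the RFD:

```go
package job

import (
	"fmt"
	"os"
	"path/filepath"
)

// createCgroup creates a per-job cgroup and writes the resource limits.
// Appending the job's PID to cgroup.procs is left to jobExecutorPath.
func createCgroup(jobID, cpuMax, memoryMax, ioMax string) (string, error) {
	cgroupPath := filepath.Join("/sys/fs/cgroup", "job-"+jobID)
	if err := os.Mkdir(cgroupPath, 0o755); err != nil {
		return "", fmt.Errorf("creating cgroup: %w", err)
	}

	// cgroup v2 controller interface files.
	limits := map[string]string{
		"cpu.max":    cpuMax,    // e.g. "50000 100000" (50% of one CPU)
		"memory.max": memoryMax, // e.g. "268435456" (256 MiB)
		"io.max":     ioMax,     // e.g. "8:0 rbps=1048576 wbps=1048576"
	}
	for file, value := range limits {
		if err := os.WriteFile(filepath.Join(cgroupPath, file), []byte(value), 0o644); err != nil {
			return "", fmt.Errorf("writing %s: %w", file, err)
		}
	}
	return cgroupPath, nil
}
```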

Collaborator

I think you're missing a step in this list. While the new process will end up in a mount namespace, remounting /proc will happen on a shared subtree.

Owner Author

Sorry, I'm not following your comment. I've tried to re-word/organize this. It's not obvious to me what I'm missing here.

My understanding (very limited, to be honest) is that when JobExecutorPath, running within a new mount namespace, mounts a new proc filesystem, it is not reflected on the host's mount namespace. Yep, in the current design, if the user wants to run ps -ef then they would see JobExecutorPath as PID 1 and ps -ef's PID, but no other processes will be displayed.

Collaborator

that it is not reflected on the host's mount namespace

I recommend testing out mounting /proc from a new mount namespace. Look at what happens to /proc on the host.

Owner Author

Oops, I think I see what you're saying now. I added some clarity about specifying an unshare flag for the mount namespace so new mounts don't show up within the host's mount namespace. I also added clarity that JobExecutorPath should create a new directory to mount a new proc filesystem to.

Please let me know if that's what you were referring to. Sorry for the confusion - I definitely wasn't clear here.

If you're referring to another step that I'm missing here then I definitely do not know what I'm missing.

Collaborator

You need to mount / in a private subtree within the namespace first, in order to prevent the /proc mount from applying in the host namespace. / is often mounted as a shared subtree.

https://docs.kernel.org/filesystems/sharedsubtree.html

@dustinspecker (Owner Author), Jul 29, 2024

I'm definitely misunderstanding something.

If Start() runs a JobExecutorPath with the following flags:

```go
cmd := exec.Command(JobExecutorPath, arguments...)
cmd.SysProcAttr = &syscall.SysProcAttr{
    Cloneflags:   syscall.CLONE_NEWPID | syscall.CLONE_NEWNS | syscall.CLONE_NEWNET,
    Unshareflags: syscall.CLONE_NEWNS,
}
```

and then JobExecutorPath mounts a new proc filesystem (`syscall.Mount("proc", "/proc", "proc", 0, "")`), then I'm not seeing any issues with the host's mounts or proc filesystem.

If I drop the Unshareflags, then when JobExecutorPath mounts a new proc filesystem I can confirm the host's proc filesystem is messed up. If the Unshareflags isn't provided, doing what you said (JobExecutorPath mounting / with syscall.MS_PRIVATE|syscall.MS_REC) does prevent messing up the host's proc filesystem as well.

Is there something else I'm missing here? I'm cool with adding the mounting of / with MS_PRIVATE|MS_REC flags, but I don't understand what it's doing when the Unshareflags is specified.

Edit: I removed the previously added step to create a temp directory for the new proc filesystem. That's not needed.
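For illustration, a minimal sketch of the mount steps discussed in this thread that JobExecutorPath could perform once it is running in the new namespaces; the function name `setupMounts` is hypothetical and not part of the RFD:

```go
package main

import "syscall"

// setupMounts remounts / as private and mounts a fresh proc filesystem.
func setupMounts() error {
	// Recursively mark / private so the proc mount below cannot propagate
	// back to the host's mount namespace (redundant when Unshareflags:
	// CLONE_NEWNS already makes the new mount namespace private, as
	// observed above).
	if err := syscall.Mount("none", "/", "", syscall.MS_REC|syscall.MS_PRIVATE, ""); err != nil {
		return err
	}
	// Mount a fresh proc filesystem so process listings only show
	// processes in the new pid namespace.
	return syscall.Mount("proc", "/proc", "proc", 0, "")
}
```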

Collaborator

Interesting! Last time I tested this, mounting / as private was necessary to escape the shared subtree. Could be distro-specific (I think systemd on some distros will make / private by default), or maybe this has always worked with unshare(). All good if you're confident it's working 🙂

Owner Author

You got me curious. Running findmnt -o TARGET,PROPAGATION in the host mount namespace shows / is mounted as shared.

This is also true when running in the new mount namespace without specifying the unshare flag. If the unshare flag for CLONE_NEWNS is provided then running findmnt -o TARGET,PROPAGATION within the new mount namespace shows / is mounted as private.

I'll try to look around for some documentation to back this up, mostly for my own curiosity.


`Status` returns the current status of the job. Refer to [JobStatus](#type-jobstatus) for more information.

##### func (*Job) Stream(context.Context) <-chan []byte
Collaborator

Is a channel the best choice for this API? What if I want to consume no more than a specific number of bytes? What if I want to stop receiving data?

Owner Author

The goal of the context was to cancel/stop receiving data.

I need to think on this. I will definitely admit an API using a context and returning a channel feels very clunky/unintuitive.

Collaborator

an API using a context and returning a channel feels very clunky/unintuitive

Exactly. Just focusing on the API design, how would you usually stream bytes in Go?

Owner Author

I've updated the docs so that Stream no longer takes any parameters and now returns an OutputReader that implements io.Reader (and is ultimately able to read a process's output from another type, Output). I've added more information about OutputReader and Output towards the bottom of the RFD.

Thank you for calling this out! This is a way better design that doesn't require the user to close or cancel anything.

Collaborator

Much better. I might lean towards returning io.Reader or io.ReadCloser directly instead of a custom type, but either way works. A custom type is more open to extension in the future.

@dustinspecker self-assigned this on Jul 25, 2024
@tcsc (Collaborator) left a comment

Looks good to me in general - just a few questions in the comments.

The only major thing I can think of is that I'd like to see a brief discussion of what the glue between the gRPC API and the Job library will look like. Things like how the server application will

  • track the running/finished jobs,
  • extract the user's CN from the client certificate and use that for authorization
  • handle concurrent job start requests and suchlike.


##### func (*Job) Stop() error

`Stop` will kill the process.
Collaborator

How will the job processes be killed? Will the jobs be given a grace period to finish up once signalled?

Owner Author

I added a note that the process will be killed via kill. Since JobExecutorPath is PID 1 in the new pid namespace, when it terminates, all other processes running in the new pid namespace are also killed.

I also added a note that graceful termination could be handled in the future by sending SIGTERM to JobExecutorPath, which could handle SIGTERM by sending a SIGTERM signal to the child process (the user's desired command). If the process has not terminated in some amount of time, then SIGKILL is used.

Please let me know if that's cool with you or if you would like to see graceful termination supported in this.

Collaborator

Having Stop() send SIGKILL seems unusual, especially since your RFD calls out support for long-running processes explicitly. Many commands are not intended to be killed by the kernel without any notice.

It's definitely important not to go too far out of scope for the challenge, but we expect the job lifecycle management to be relatively robust 🙂

Owner Author

Sounds good. I'll update the RFD so that the library's Stop sends SIGTERM to JobExecutorPath (which appropriately sends SIGTERM to the child process), waits some amount of time (probably something like 10 seconds as a hardcoded value for this), checks if PID 1 is alive, and if it is, sends SIGKILL.

Collaborator

This sounds great! And of course, think about how you'll deal with child processes.

Remember that you're providing the logic for JobExecutorPath, so if it receives SIGTERM, you'll want to specify how it handles that signal. If it quits immediately, all processes will get SIGKILL.

Owner Author

I updated the RFD to handle graceful shutdowns.

JobExecutorPath is now expected to handle SIGTERM. When it receives SIGTERM, it will send SIGTERM to the child process (user's command). JobExecutorPath must not terminate until the child process does.

On the library side, Stop will send SIGTERM to JobExecutorPath. If, after 10 seconds, JobExecutorPath is still alive, then Stop will send SIGKILL to JobExecutorPath.
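A minimal sketch of that Stop flow, under the assumption that the Job struct tracks the exec.Cmd running JobExecutorPath and a done channel closed once cmd.Wait returns; these field names are hypothetical and not prescribed by the RFD:

```go
package job

import (
	"fmt"
	"os/exec"
	"syscall"
	"time"
)

// Hypothetical layout; the RFD does not prescribe these fields.
type Job struct {
	cmd  *exec.Cmd     // runs JobExecutorPath
	done chan struct{} // closed once cmd.Wait returns
}

// Stop sends SIGTERM, waits up to 10 seconds for JobExecutorPath to exit,
// and falls back to SIGKILL.
func (j *Job) Stop() error {
	if err := j.cmd.Process.Signal(syscall.SIGTERM); err != nil {
		return fmt.Errorf("sending SIGTERM: %w", err)
	}
	select {
	case <-j.done:
		// JobExecutorPath exited within the grace period; as PID 1 of the
		// pid namespace, its exit also tears down any remaining children.
		return nil
	case <-time.After(10 * time.Second):
		// Grace period expired; force-kill PID 1 of the namespace.
		return j.cmd.Process.Kill()
	}
}
```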

Collaborator

Sounds great. Note that you may have multiple processes under JobExecutorPath, and gracefully terminating all of them may not be as simple as cmd.Wait()

This prevents the host mount namespace from seeing the new proc
filesystem mounted within the new mount namespace.
Add note that graceful termination could be supported by
1. Send a SIGTERM to JobExecutorPath
2. JobExecutorPath handles SIGTERM by sending SIGTERM to the child process
   (user's command)
3. Wait some amount of time
4. Send SIGKILL if JobExecutorPath is still around.
5. Since JobExecutorPath is PID 1, all processes in the new pid namespace
   will be killed.
@dustinspecker (Owner Author)

Looks good to me in general - just a few questions in the comments.

The only major thing I can think of is that I'd like to see a brief discussion of what the glue between the gRPC API and the Job library will look like. Things like how the server application will

  • track the running/finished jobs,
  • extract the user's CN from the client certificate and use that for authorization
  • handle concurrent job start requests and suchlike.

I've added a few lines on how the server works.

  • The server can store jobs in a map, preferably a sync.Map
  • I've added information about how the CN can be extracted in each handler via the context (a sketch follows below).
  • I've added a couple of notes about gRPC handlers each running in a new goroutine and the "job store" needing to be safe for concurrent use. I also added a note in the library that the Job struct should be safe for concurrent changes as well.

Thank you so much for the feedback!
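A minimal sketch of those two server-side pieces, using the standard grpc-go peer/credentials types to reach the mutual-TLS state; the package layout, the `jobs` variable, and the helper name `commonNameFromContext` are hypothetical:

```go
package server

import (
	"context"
	"fmt"
	"sync"

	"google.golang.org/grpc/credentials"
	"google.golang.org/grpc/peer"
)

// Hypothetical job store shared by the gRPC handlers; each RPC is served on
// its own goroutine, so the store must be safe for concurrent use.
var jobs sync.Map // job ID -> job

// commonNameFromContext pulls the client certificate's CN out of the request
// context for authorization decisions.
func commonNameFromContext(ctx context.Context) (string, error) {
	p, ok := peer.FromContext(ctx)
	if !ok {
		return "", fmt.Errorf("no peer information in context")
	}
	tlsInfo, ok := p.AuthInfo.(credentials.TLSInfo)
	if !ok {
		return "", fmt.Errorf("connection is not using TLS")
	}
	chains := tlsInfo.State.VerifiedChains
	if len(chains) == 0 || len(chains[0]) == 0 {
		return "", fmt.Errorf("no verified client certificate")
	}
	return chains[0][0].Subject.CommonName, nil
}
```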

Returning an OutputReader that implements io.Reader enables
the user to decide how much data they want to read and when they
want to be done reading, without the unintuitive API of the previous
design with contexts and channels.
Stop sends SIGTERM to `JobExecutorPath`. Waits 10 seconds. Then sends
SIGKILL.

JobExecutorPath should handle SIGTERM. Send SIGTERM to child process.
Wait for child process to terminate before exiting.
@tcsc (Collaborator) left a comment

Just a few more nits

@dustinspecker (Owner Author)

Just a few more nits

All great points - thank you for taking the time to make a thorough review!


The Job's `Stream` function will create a new `OutputReader` per invocation that is provided the Job's `Output`.

`OutputReader` will implement [io.Reader](https://pkg.go.dev/io#Reader). `OutputReader`'s `Read` will invoke the underlying `Output`'s `ReadAt` to read the next bytes.
Collaborator

Note that you may need an io.ReadCloser here, if you store bytes for each client. You can use as much memory as you'd like, but you can't leak memory from clients that have finished streaming.
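A minimal sketch of the OutputReader described above, assuming the RFD's Output type exposes a ReadAt method that either blocks for more data or returns io.EOF once the job has finished and all output has been consumed:

```go
// OutputReader reads a job's output from a shared Output, tracking its own
// offset so each client streams independently. Output is the RFD's type and
// is not defined here.
type OutputReader struct {
	output *Output
	offset int64 // next byte this particular reader will consume
}

// Read satisfies io.Reader by delegating to the underlying Output's ReadAt.
func (r *OutputReader) Read(p []byte) (int, error) {
	n, err := r.output.ReadAt(p, r.offset)
	r.offset += int64(n)
	return n, err
}
```

With this shape, each reader only holds an offset into the shared Output rather than a per-client buffer, so there is nothing to release when a client stops streaming; if per-client state were added later, promoting the type to an io.ReadCloser would be the natural place to free it.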


@sclevine (Collaborator) left a comment

Wait for @tcsc to approve before merging

@dustinspecker merged commit 3be3399 into main on Jul 30, 2024
@dustinspecker deleted the rfd branch on July 30, 2024 17:59