Description
Nomad version
Nomad v1.3.0-rc.1 (31b0a18)
Operating system and Environment details
Ubuntu 22.04 Jammy Jellyfish 5.15.0-27-generic #28-Ubuntu SMP Thu Apr 14 04:55:28 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux
Issue
Something seems to go wrong with cgroup v2 support. If I create a job whose task exits immediately, Nomad stops restarting it after a couple of attempts. Nomad 1.2.3 (the last version I can use because of the plugin breakage in #12071) seems to work fine, although I plan on downgrading Nomad again to double-check.
Given the job file included below, the Nomad UI for the allocation shows:
May 03, '22 18:24:45 -0600 | Alloc Unhealthy | Task not running by deadline
May 03, '22 18:19:45 -0600 | Restarting | Task restarting in 1.09865699s
May 03, '22 18:19:45 -0600 | Terminated | Exit Code: 0
May 03, '22 18:19:45 -0600 | Started | Task started by client
May 03, '22 18:15:37 -0600 | Restarting | Task restarting in 1.165452572s
May 03, '22 18:15:37 -0600 | Terminated | Exit Code: 0
May 03, '22 18:15:37 -0600 | Started | Task started by client
May 03, '22 18:15:36 -0600 | Task Setup | Building Task Directory
May 03, '22 18:15:36 -0600 | Received | Task received by client
It is currently 18:26 and no other restart attempts have been made. The logmon process for the alloc is still running, but according to lsof -n +D there are no processes underneath it or using the allocation dir.
If I change the constraint to an Ubuntu 20.04 host, the task restarts roughly every second, as expected.
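Since the two hosts differ in Ubuntu release, it may help to confirm which cgroup hierarchy each one actually mounts. The following is a minimal sketch, not part of the original report; the helper name `cgroup_mode` is made up for illustration. It relies on `stat -f` reporting `cgroup2fs` when `/sys/fs/cgroup` is the unified (v2) hierarchy and `tmpfs` on v1 or hybrid setups.

```shell
#!/bin/sh
# cgroup_mode (hypothetical helper): map the filesystem type mounted at
# /sys/fs/cgroup to a human-readable cgroup mode label.
cgroup_mode() {
  case "$1" in
    cgroup2fs) echo "v2 (unified)" ;;
    tmpfs)     echo "v1 or hybrid" ;;
    *)         echo "unknown" ;;
  esac
}

# stat -f reports on the filesystem; %T prints its type in human-readable form.
fstype=$(stat -fc %T /sys/fs/cgroup 2>/dev/null || echo "unavailable")
echo "cgroup filesystem: ${fstype} -> $(cgroup_mode "${fstype}")"
```

Running this on both the 22.04 and the 20.04 host would show whether the behavioural difference lines up with the cgroup version.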
Time | Type | Description
May 03, '22 18:32:32 -0600 | Restarting | Task restarting in 1.018849262s
May 03, '22 18:32:32 -0600 | Terminated | Exit Code: 0
May 03, '22 18:32:32 -0600 | Started | Task started by client
May 03, '22 18:32:30 -0600 | Restarting | Task restarting in 1.234701267s
May 03, '22 18:32:30 -0600 | Terminated | Exit Code: 0
May 03, '22 18:32:30 -0600 | Started | Task started by client
May 03, '22 18:32:29 -0600 | Restarting | Task restarting in 1.196971407s
May 03, '22 18:32:29 -0600 | Terminated | Exit Code: 0
May 03, '22 18:32:29 -0600 | Started | Task started by client
May 03, '22 18:32:28 -0600 | Restarting | Task restarting in 1.107809535s
... many more snipped
Other issues I have not been able to reproduce with any success:
[ERROR] client.cpuset.v2: failed to set cgroup: path=/sys/fs/cgroup/nomad.slice/eb0be10c-0359-00cc-915b-c8ecae499c19.run.scope err="openat2 /sys/fs/cgroup/nomad.slice/eb0be10c-0359-00cc-915b-c8ecae499c19.run.scope/cpuset.cpus: no such file or directory"
[ERROR] client.alloc_runner.task_runner: running driver failed: alloc_id=4db29e19-c6d2-832b-b30e-a1b6f7f62d53 task=run error="failed to launch command with executor: rpc error: code = Unknown desc = failed to set v2 cgroup resources: failed to call BPF_PROG_DETACH (BPF_CGROUP_DEVICE) on old filter program: can't detach program: no such file or directory"
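The first error complains that `cpuset.cpus` does not exist inside the allocation's scope. On cgroup v2, a controller's interface files only appear in a child cgroup when that controller is listed in the parent's `cgroup.subtree_control`, so one thing worth checking is whether `cpuset` is enabled for `nomad.slice`. This is a diagnostic sketch only, not a confirmed cause; `has_controller` is a hypothetical helper, and the slice path is taken from the log line above.

```shell
#!/bin/sh
# has_controller LIST NAME (hypothetical helper): succeed if NAME appears
# as a whole word in the space-separated LIST.
has_controller() {
  case " $1 " in
    *" $2 "*) return 0 ;;
    *)        return 1 ;;
  esac
}

slice=/sys/fs/cgroup/nomad.slice
if [ -r "${slice}/cgroup.subtree_control" ]; then
  # cgroup.subtree_control lists the controllers enabled for child cgroups.
  enabled=$(cat "${slice}/cgroup.subtree_control")
  if has_controller "${enabled}" cpuset; then
    echo "cpuset is enabled for children of ${slice}"
  else
    echo "cpuset is NOT enabled for children of ${slice}"
  fi
else
  echo "${slice}/cgroup.subtree_control is not readable on this host"
fi
```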
Also, this might be a bug in the job itself, but /dev/null seems to disappear. Edit: it happens sometimes, for some jobs, but not all the time; this is how I noticed that restarts were not, err, restarting. I'm still working to nail this down, but it feels like it might be related. These are raw_exec jobs that create their own restricted mount namespace; the namespace includes /dev/null, and it is writable. The same jobs seem to work fine on Nomad 1.2.3 on the same host.
PermissionError: [Errno 1] Operation not permitted: '/dev/null'
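To tell apart "the node is missing", "the node is not a device", and "the node exists but opening it is denied", a check along these lines could be run from inside the task. It is a sketch under the assumption that the task has a POSIX shell available; `check_null_device` is a hypothetical helper, not something from Nomad.

```shell
#!/bin/sh
# check_null_device PATH (hypothetical helper): print a one-word diagnosis
# for PATH and return non-zero unless it behaves like a usable null device.
check_null_device() {
  dev="$1"
  if [ ! -e "${dev}" ]; then echo "missing"; return 1; fi
  if [ ! -c "${dev}" ]; then echo "not a character device"; return 1; fi
  # Try to open the device for writing in a subshell, so that a failed
  # redirection (for example EPERM) cannot abort the calling shell.
  if ! ( : > "${dev}" ); then echo "not writable"; return 1; fi
  echo "ok"
}

check_null_device /dev/null || true
# On Linux the null device is character device major 1, minor 3;
# %t and %T print the major and minor numbers in hex.
stat -c 'major:minor = %t:%T' /dev/null || true
```

An "ok" result with the expected 1:3 device numbers would point away from the mount namespace setup, while "not writable" on an otherwise correct node would be consistent with the `PermissionError` above.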
Job file (if appropriate)
job "test-env" {
  datacenters = ["cd01"]
  type        = "service"

  constraint {
    attribute = "${attr.unique.hostname}"
    value     = "..."
  }

  group "group" {
    restart {
      attempts = 5
      mode     = "delay"
      delay    = "1s"
      interval = "5s"
    }

    task "try" {
      driver = "raw_exec"

      config {
        command = "/usr/bin/bash"
        # see if /dev/null disappears
        args = ["-c", "dd if=/dev/zero of=/dev/null count=1 || (echo \"busted\"; sleep 1000)"]
      }
    }
  }
}