This is my first time playing around with these projects. I ran `hack/kind.sh` and everything seemed to work correctly.
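For completeness, this is roughly what I mean by "seemed to work" (standard kind/kubectl checks; the CRD grep is just a convenience and assumes the operator registers Slurm-related CRDs):

```sh
# Generic post-install checks on the kind cluster (nothing slurm-bridge specific)
kind get clusters                  # the cluster created by hack/kind.sh is listed
kubectl get nodes                  # all kind nodes report Ready
kubectl get crds | grep -i slurm   # Slurm-related CRDs are registered
```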
```
~/Work/slurm-bridge/hack$ k get pods -A
NAMESPACE NAME READY STATUS RESTARTS AGE
cert-manager cert-manager-58dd99f969-kqc4n 1/1 Running 0 24m
cert-manager cert-manager-cainjector-55cd9f77b5-4dbzc 1/1 Running 0 24m
cert-manager cert-manager-webhook-7987476d56-fggz9 1/1 Running 0 24m
jobset-system jobset-controller-7db89b4dd7-tctqb 1/1 Running 0 22m
kube-system coredns-674b8bbfcf-4csjc 1/1 Running 0 31m
kube-system coredns-674b8bbfcf-gd84f 1/1 Running 0 31m
kube-system etcd-kind-control-plane 1/1 Running 0 31m
kube-system kindnet-dnvwp 1/1 Running 0 31m
kube-system kindnet-j59q4 1/1 Running 0 31m
kube-system kindnet-ktf9k 1/1 Running 0 31m
kube-system kindnet-pg4ns 1/1 Running 0 31m
kube-system kindnet-r74jm 1/1 Running 0 31m
kube-system kindnet-wtxwk 1/1 Running 0 31m
kube-system kindnet-zkpkq 1/1 Running 0 31m
kube-system kindnet-zqvsj 1/1 Running 0 31m
kube-system kube-apiserver-kind-control-plane 1/1 Running 0 31m
kube-system kube-controller-manager-kind-control-plane 1/1 Running 0 31m
kube-system kube-proxy-89477 1/1 Running 0 31m
kube-system kube-proxy-dtn4t 1/1 Running 0 31m
kube-system kube-proxy-hmcdz 1/1 Running 0 31m
kube-system kube-proxy-lhl2s 1/1 Running 0 31m
kube-system kube-proxy-prw2n 1/1 Running 0 31m
kube-system kube-proxy-tmjlw 1/1 Running 0 31m
kube-system kube-proxy-vbbb8 1/1 Running 0 31m
kube-system kube-proxy-x4xrg 1/1 Running 0 31m
kube-system kube-scheduler-kind-control-plane 1/1 Running 0 31m
local-path-storage local-path-provisioner-7dc846544d-4dr76 1/1 Running 0 31m
scheduler-plugins scheduler-plugins-controller-79bcf99c68-tkq5m 1/1 Running 0 22m
slinky slurm-operator-bb5c58dc6-bvq67 1/1 Running 0 22m
slinky slurm-operator-webhook-87bc59884-26cpr 1/1 Running 0 22m
slurm-bridge job-sleep-large-89tjs 1/1 Running 0 3m51s
slurm-bridge job-sleep-large-jzt2n 1/1 Running 0 3m51s
slurm slurm-accounting-0 1/1 Running 0 22m
slurm slurm-compute-slurm-bridge-0 2/2 Running 0 22m
slurm slurm-compute-slurm-bridge-1 2/2 Running 0 22m
slurm slurm-compute-slurm-bridge-2 2/2 Running 0 22m
slurm slurm-controller-0 3/3 Running 0 22m
slurm slurm-exporter-74f65b46b9-ptj4v 1/1 Running 0 22m
slurm slurm-mariadb-0 1/1 Running 0 22m
slurm slurm-restapi-57cd76c785-4hkxx 1/1 Running 0 22m
```
I am able to submit a Job and a JobSet and they run without issue, but I don't see anything in Slurm accounting, so I am wondering whether Slurm is actually intercepting these workloads.
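One way to check whether the bridge actually handled these pods (assuming slurm-bridge works by binding pods through its own scheduler rather than default-scheduler) is to look at the schedulerName and scheduling events on the submitted pods:

```sh
# Which scheduler bound each pod in the slurm-bridge namespace?
# "default-scheduler" here would suggest the workloads bypassed slurm-bridge.
kubectl get pods -n slurm-bridge \
  -o custom-columns=NAME:.metadata.name,SCHEDULER:.spec.schedulerName

# The "Scheduled" events also name the component that bound each pod.
kubectl get events -n slurm-bridge --field-selector reason=Scheduled
```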
If I run `hack/bridge_watch.sh`, I see:
```
SLURM PODS
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
slurm-compute-slurm-bridge-0 2/2 Running 0 27m 10.244.3.6 kind-worker5 <none> <none>
slurm-compute-slurm-bridge-1 2/2 Running 0 27m 10.244.9.4 kind-worker6 <none> <none>
slurm-compute-slurm-bridge-2 2/2 Running 0 27m 10.244.7.7 kind-worker7 <none> <none>
SLURM BRIDGE PODS
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
job-sleep-large-89tjs 1/1 Running 0 8m43s 10.244.8.11 kind-worker4 <none> <none>
job-sleep-large-jzt2n 1/1 Running 0 8m43s 10.244.3.13 kind-worker5 <none> <none>
PODGROUP STATUS
No resources found in slurm-bridge namespace.
JOB STATUS
NAME STATUS COMPLETIONS DURATION AGE
job-sleep-large Running 0/3 8m43s 8m43s
JOBSET STATUS
No resources found in slurm-bridge namespace.
SINFO
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
slurm-bridge up infinite 3 idle slurm-bridge-[0-2]
all* up infinite 3 idle slurm-bridge-[0-2]
SQUEUE PENDING
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
SQUEUE RUNNING
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
SQUEUE COMPLETE
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
```
It's been a while since I used Slurm, but I would expect to see jobs under SQUEUE RUNNING or SQUEUE COMPLETE in this case, so I am thinking something went wrong with my setup.
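To rule out the watch script itself, it might also help to query the controller directly. A sketch of what I mean, assuming the Slurm client tools live in a container named `slurmctld` inside `slurm-controller-0` (the container name is a guess; adjust to whatever the chart actually uses):

```sh
# Confirm what the controller itself sees, independent of bridge_watch.sh.
# NOTE: the container name "slurmctld" is an assumption; list the real names with:
#   kubectl get pod -n slurm slurm-controller-0 -o jsonpath='{.spec.containers[*].name}'
kubectl exec -n slurm slurm-controller-0 -c slurmctld -- squeue
kubectl exec -n slurm slurm-controller-0 -c slurmctld -- sacct --allusers
```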