Unable to confirm slurm scheduler is working with kind script #3

Closed
@kannon92

Description

So this is my first time playing around with these projects.

I ran hack/kind.sh and things all seemed to work correctly.

~/Work/slurm-bridge/hack$ k get pods -A
NAMESPACE            NAME                                            READY   STATUS    RESTARTS   AGE
cert-manager         cert-manager-58dd99f969-kqc4n                   1/1     Running   0          24m
cert-manager         cert-manager-cainjector-55cd9f77b5-4dbzc        1/1     Running   0          24m
cert-manager         cert-manager-webhook-7987476d56-fggz9           1/1     Running   0          24m
jobset-system        jobset-controller-7db89b4dd7-tctqb              1/1     Running   0          22m
kube-system          coredns-674b8bbfcf-4csjc                        1/1     Running   0          31m
kube-system          coredns-674b8bbfcf-gd84f                        1/1     Running   0          31m
kube-system          etcd-kind-control-plane                         1/1     Running   0          31m
kube-system          kindnet-dnvwp                                   1/1     Running   0          31m
kube-system          kindnet-j59q4                                   1/1     Running   0          31m
kube-system          kindnet-ktf9k                                   1/1     Running   0          31m
kube-system          kindnet-pg4ns                                   1/1     Running   0          31m
kube-system          kindnet-r74jm                                   1/1     Running   0          31m
kube-system          kindnet-wtxwk                                   1/1     Running   0          31m
kube-system          kindnet-zkpkq                                   1/1     Running   0          31m
kube-system          kindnet-zqvsj                                   1/1     Running   0          31m
kube-system          kube-apiserver-kind-control-plane               1/1     Running   0          31m
kube-system          kube-controller-manager-kind-control-plane      1/1     Running   0          31m
kube-system          kube-proxy-89477                                1/1     Running   0          31m
kube-system          kube-proxy-dtn4t                                1/1     Running   0          31m
kube-system          kube-proxy-hmcdz                                1/1     Running   0          31m
kube-system          kube-proxy-lhl2s                                1/1     Running   0          31m
kube-system          kube-proxy-prw2n                                1/1     Running   0          31m
kube-system          kube-proxy-tmjlw                                1/1     Running   0          31m
kube-system          kube-proxy-vbbb8                                1/1     Running   0          31m
kube-system          kube-proxy-x4xrg                                1/1     Running   0          31m
kube-system          kube-scheduler-kind-control-plane               1/1     Running   0          31m
local-path-storage   local-path-provisioner-7dc846544d-4dr76         1/1     Running   0          31m
scheduler-plugins    scheduler-plugins-controller-79bcf99c68-tkq5m   1/1     Running   0          22m
slinky               slurm-operator-bb5c58dc6-bvq67                  1/1     Running   0          22m
slinky               slurm-operator-webhook-87bc59884-26cpr          1/1     Running   0          22m
slurm-bridge         job-sleep-large-89tjs                           1/1     Running   0          3m51s
slurm-bridge         job-sleep-large-jzt2n                           1/1     Running   0          3m51s
slurm                slurm-accounting-0                              1/1     Running   0          22m
slurm                slurm-compute-slurm-bridge-0                    2/2     Running   0          22m
slurm                slurm-compute-slurm-bridge-1                    2/2     Running   0          22m
slurm                slurm-compute-slurm-bridge-2                    2/2     Running   0          22m
slurm                slurm-controller-0                              3/3     Running   0          22m
slurm                slurm-exporter-74f65b46b9-ptj4v                 1/1     Running   0          22m
slurm                slurm-mariadb-0                                 1/1     Running   0          22m
slurm                slurm-restapi-57cd76c785-4hkxx                  1/1     Running   0          22m

I am able to submit a Job and a JobSet and they run without issue. But I don't see anything in Slurm accounting, so I am wondering whether Slurm is actually intercepting these workloads.
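
For reference, the Job I'm submitting looks roughly like the following (reconstructed from memory; the image, sleep duration, and schedulerName value are my own assumptions and may not match the sample manifest exactly):

apiVersion: batch/v1
kind: Job
metadata:
  name: job-sleep-large
  namespace: slurm-bridge
spec:
  completions: 3
  parallelism: 2
  template:
    spec:
      # assumption: the bridge only intercepts pods that request its scheduler by name
      schedulerName: slurm-bridge-scheduler
      restartPolicy: Never
      containers:
        - name: sleep
          image: busybox:stable
          command: ["sleep", "300"]
          resources:
            requests:
              cpu: "1"
              memory: 100Mi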

If I run hack/bridge_watch.sh, I see:

SLURM PODS
NAME                           READY   STATUS    RESTARTS   AGE   IP           NODE           NOMINATED NODE   READINESS GATES
slurm-compute-slurm-bridge-0   2/2     Running   0          27m   10.244.3.6   kind-worker5   <none>           <none>
slurm-compute-slurm-bridge-1   2/2     Running   0          27m   10.244.9.4   kind-worker6   <none>           <none>
slurm-compute-slurm-bridge-2   2/2     Running   0          27m   10.244.7.7   kind-worker7   <none>           <none>

SLURM BRIDGE PODS
NAME                    READY   STATUS    RESTARTS   AGE     IP            NODE           NOMINATED NODE   READINESS GATES
job-sleep-large-89tjs   1/1     Running   0          8m43s   10.244.8.11   kind-worker4   <none>           <none>
job-sleep-large-jzt2n   1/1     Running   0          8m43s   10.244.3.13   kind-worker5   <none>           <none>

PODGROUP STATUS
No resources found in slurm-bridge namespace.

JOB STATUS
NAME              STATUS    COMPLETIONS   DURATION   AGE
job-sleep-large   Running   0/3           8m43s      8m43s

JOBSET STATUS
No resources found in slurm-bridge namespace.

SINFO
PARTITION    AVAIL  TIMELIMIT  NODES  STATE NODELIST
slurm-bridge    up   infinite      3   idle slurm-bridge-[0-2]
all*            up   infinite      3   idle slurm-bridge-[0-2]

SQUEUE PENDING
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)

SQUEUE RUNNING
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)

SQUEUE COMPLETE
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
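
One extra check I can think of (not part of the bridge_watch output, just a sketch on my side) is to confirm which scheduler actually placed the bridge pods; if this prints default-scheduler instead of the bridge's scheduler name, that would explain why nothing ever shows up on the Slurm side:

kubectl get pods -n slurm-bridge \
  -o custom-columns=NAME:.metadata.name,SCHEDULER:.spec.schedulerName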

It's been a while since I used Slurm, but I would expect to see jobs under SQUEUE RUNNING or SQUEUE COMPLETE in this case. So I am thinking something went wrong with my setup.
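
For completeness, the accounting check I have in mind is something like the following against the controller pod (it runs several containers, so kubectl will pick the default one unless -c is given; I haven't verified the container names):

kubectl exec -n slurm slurm-controller-0 -- \
  sacct --allusers --starttime=today --format=JobID,JobName,Partition,State,Elapsed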
