Help with "Cluster" label and "-c" flag #15

Open

mattcprotzman opened this issue Jul 26, 2024 · 1 comment
mattcprotzman commented Jul 26, 2024

Hey friends! AMAZING work - I'm so thrilled to be implementing this on our system.

It seems that when translating this to our system, the word "cluster" could be used interchangeably with Slurm queues. We separate our nodes into different queues; for example, the "a100" queue has nodes gpu[01-18], while the "ica100" queue has icgpu[01-10].

I've separated this in the Prometheus config:

 "labels": {
      "cluster": "a100"
    },
    "targets": [
      "gpu02:9100",
.
.
.
  "labels": {
      "cluster": "ica100"
    },
    "targets": [
      "icgpu01:9100",
.
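To sanity-check that these per-queue labels actually made it into Prometheus, I can list the values of the cluster label via the standard Prometheus HTTP API. This is only a sketch, and the http://localhost:9090 address is a placeholder for our server:

# Sketch: list the values of the "cluster" label known to Prometheus.
# Assumes Prometheus is reachable at http://localhost:9090 (placeholder URL).
import json
import urllib.request

PROM = "http://localhost:9090"

with urllib.request.urlopen(f"{PROM}/api/v1/label/cluster/values") as resp:
    reply = json.load(resp)

print(reply["data"])  # expect something like ['a100', 'ica100']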

My config.py file for jobstats looks like:

# translate cluster names in Slurm DB to informal names
CLUSTER_TRANS = {"a100":"slurm"}
#CLUSTER_TRANS = {}  # if no translations then use an empty dictionary
CLUSTER_TRANS_INV = dict(zip(CLUSTER_TRANS.values(), CLUSTER_TRANS.keys()))

I've experimented with various CLUSTER_TRANS options to try to get this to work, including (obviously not all at one time):

CLUSTER_TRANS = {"a100":"slurm"}
CLUSTER_TRANS = {"ica100":"slurm"}
CLUSTER_TRANS = {"a100":"slurm","ica100":"slurm"}
CLUSTER_TRANS = {}

It seems like I'll need to use the -c flag to specify which queue|cluster the job ran on. For example:
jobstats -c ica100 $ICA100JOBID
jobstats -c a100 $A100JOBID
etc.

However, if I use the -c flag on the command line, I run into an issue: if the job I'm specifying ran on the ica100 queue|cluster while the config says a100 (or vice versa), I can't get it to work. Having both in the dictionary also doesn't seem to work.

For example, 15373156 is a job that ran on ica100.

With this config:
CLUSTER_TRANS = {"a100":"slurm"}

[root@login01 jobstats]# jobstats -d -c ica100 15373156
Traceback (most recent call last):
  File "/apps/jobstats/jobstats/jobstats", line 54, in <module>
    color=color)
  File "/apps/jobstats/jobstats/jobstats.py", line 63, in __init__
    if not self.__get_job_info():
  File "/apps/jobstats/jobstats/jobstats.py", line 178, in __get_job_info
    self.error(f"Failed to lookup jobid %s on {clstr}. Make sure you specified the correct cluster." % self.jobid)
  File "/apps/jobstats/jobstats/jobstats.py", line 115, in error
    raise Exception(msg)
Exception: Failed to lookup jobid 15373156 on ica100. Make sure you specified the correct cluster.

With this config:
CLUSTER_TRANS = {}

[root@login01 jobstats]# jobstats -d -c ica100 15373156
Traceback (most recent call last):
  File "/apps/jobstats/jobstats/jobstats", line 54, in <module>
    color=color)
  File "/apps/jobstats/jobstats/jobstats.py", line 63, in __init__
    if not self.__get_job_info():
  File "/apps/jobstats/jobstats/jobstats.py", line 178, in __get_job_info
    self.error(f"Failed to lookup jobid %s on {clstr}. Make sure you specified the correct cluster." % self.jobid)
  File "/apps/jobstats/jobstats/jobstats.py", line 115, in error
    raise Exception(msg)
Exception: Failed to lookup jobid 15373156 on ica100. Make sure you specified the correct cluster.

With this config:
CLUSTER_TRANS = {"ica100":"slurm"}

[root@login01 jobstats]# jobstats -d -c ica100 15373156
DEBUG: jobidraw=15373156, start=1721966681, end=1721966921, cluster=slurm, tres=billing=16,cpu=4,gres/gpu=1,mem=16000M,node=1, data=, user=scrubbed, account=scrubbed state=COMPLETED, timelimit=240, nodes=1, ncpus=16, reqmem=16000M, qos=scrubbed, partition=ica100, jobname=scrubbed
DEBUG: jobid=15373156, jobidraw=15373156, start=1721966681, end=1721966921, gpus=1, diff=240, cluster=ica100, data=, timelimitraw=240
DEBUG: query=max_over_time(cgroup_memory_total_bytes{cluster='ica100',cgroup=~'.*15373156',step='',task=''}[240s]), time=1721966921
DEBUG: query result={'status': 'success', 'data': {'resultType': 'vector', 'result': [{'metric': {'cgroup': '/slurm/uid_3968/job_15373156', 'cluster': 'ica100', 'instance': 'icgpu02', 'job': 'Rockfish GPU Nodes'}, 'value': [1721966921, '67108864000']}]}}
DEBUG: query=max_over_time(cgroup_memory_rss_bytes{cluster='ica100',cgroup=~'.*15373156',step='',task=''}[240s]), time=1721966921
DEBUG: query result={'status': 'success', 'data': {'resultType': 'vector', 'result': [{'metric': {'cgroup': '/slurm/uid_3968/job_15373156', 'cluster': 'ica100', 'instance': 'icgpu02', 'job': 'Rockfish GPU Nodes'}, 'value': [1721966921, '886964224']}]}}
DEBUG: query=max_over_time(cgroup_cpu_total_seconds{cluster='ica100',cgroup=~'.*15373156',step='',task=''}[240s]), time=1721966921
DEBUG: query result={'status': 'success', 'data': {'resultType': 'vector', 'result': [{'metric': {'cgroup': '/slurm/uid_3968/job_15373156', 'cluster': 'ica100', 'instance': 'icgpu02', 'job': 'Rockfish GPU Nodes'}, 'value': [1721966921, '186.529817411']}]}}
DEBUG: query=max_over_time(cgroup_cpus{cluster='ica100',cgroup=~'.*15373156',step='',task=''}[240s]), time=1721966921
DEBUG: query result={'status': 'success', 'data': {'resultType': 'vector', 'result': [{'metric': {'cgroup': '/slurm/uid_3968/job_15373156', 'cluster': 'ica100', 'instance': 'icgpu02', 'job': 'Rockfish GPU Nodes'}, 'value': [1721966921, '16']}]}}
DEBUG: query=max_over_time((nvidia_gpu_memory_total_bytes{cluster='ica100'} and nvidia_gpu_jobId == 15373156)[240s:]), time=1721966921
DEBUG: query result={'status': 'success', 'data': {'resultType': 'vector', 'result': [{'metric': {'cluster': 'ica100', 'instance': 'icgpu02', 'job': 'Rockfish GPU Nodes', 'minor_number': '0', 'name': 'NVIDIA A100 80GB PCIe', 'ordinal': '0', 'uuid': 'GPU-c0419cd1-5928-47f9-6c9d-f4e8fccce0ad'}, 'value': [1721966921, '85899345920']}]}}
DEBUG: query=max_over_time((nvidia_gpu_memory_used_bytes{cluster='ica100'} and nvidia_gpu_jobId == 15373156)[240s:]), time=1721966921
DEBUG: query result={'status': 'success', 'data': {'resultType': 'vector', 'result': [{'metric': {'cluster': 'ica100', 'instance': 'icgpu02', 'job': 'Rockfish GPU Nodes', 'minor_number': '0', 'name': 'NVIDIA A100 80GB PCIe', 'ordinal': '0', 'uuid': 'GPU-c0419cd1-5928-47f9-6c9d-f4e8fccce0ad'}, 'value': [1721966921, '1621295104']}]}}
DEBUG: query=avg_over_time((nvidia_gpu_duty_cycle{cluster='ica100'} and nvidia_gpu_jobId == 15373156)[240s:]), time=1721966921
DEBUG: query result={'status': 'success', 'data': {'resultType': 'vector', 'result': [{'metric': {'cluster': 'ica100', 'instance': 'icgpu02', 'job': 'Rockfish GPU Nodes', 'minor_number': '0', 'name': 'NVIDIA A100 80GB PCIe', 'ordinal': '0', 'uuid': 'GPU-c0419cd1-5928-47f9-6c9d-f4e8fccce0ad'}, 'value': [1721966921, '9.75']}]}}

================================================================================
                              Slurm Job Statistics
================================================================================
         Job ID: 15373156
  NetID/Account: scrubbed
       Job Name: scrubbed
          State: COMPLETED
          Nodes: 1
      CPU Cores: 16
     CPU Memory: 16GB (1GB per CPU-core)
           GPUs: 1
  QOS/Partition: scrubbed/ica100
        Cluster: ica100
     Start Time: Fri Jul 26, 2024 at 12:04 AM
       Run Time: 00:04:00
     Time Limit: 04:00:00

                              Overall Utilization
================================================================================
  CPU utilization  [||                                              5%]
  CPU memory usage [                                                1%]
  GPU utilization  [|||||                                          10%]
  GPU memory usage [|                                               2%]

                              Detailed Utilization
================================================================================
  CPU utilization per node (CPU time used/run time)
      icgpu02: 00:03:06/01:04:00 (efficiency=4.9%)

  CPU memory usage per node - used/allocated
      icgpu02: 845.9MB/62.5GB (52.9MB/3.9GB per core of 16)

  GPU utilization per node
      icgpu02 (GPU 0): 9.8%

  GPU memory usage per node - maximum used/total
      icgpu02 (GPU 0): 1.5GB/80.0GB (1.9%)

                                     Notes
================================================================================
  * The overall GPU utilization of this job is only 10%. This value is low
    compared to the cluster mean value of 50%. Please investigate the reason
    for the low utilization. For more info:
      https://researchcomputing.princeton.edu/support/knowledge-base/gpu-computing#util

  * Have a nice day!

With this config:
CLUSTER_TRANS = {"ica100":"slurm","a100":"slurm"}

[root@login01 jobstats]# jobstats -d -c ica100 15373156
DEBUG: jobidraw=15373156, start=1721966681, end=1721966921, cluster=slurm, tres=billing=16,cpu=4,gres/gpu=1,mem=16000M,node=1, data=, user=scrubbed, account=scrubbed, state=COMPLETED, timelimit=240, nodes=1, ncpus=16, reqmem=16000M, qos=scrubbed, partition=ica100, jobname=scrubbed
DEBUG: jobid=15373156, jobidraw=15373156, start=1721966681, end=1721966921, gpus=1, diff=240, cluster=a100, data=, timelimitraw=240
DEBUG: query=max_over_time(cgroup_memory_total_bytes{cluster='a100',cgroup=~'.*15373156',step='',task=''}[240s]), time=1721966921
DEBUG: query result={'status': 'success', 'data': {'resultType': 'vector', 'result': []}}
DEBUG: query=max_over_time(cgroup_memory_rss_bytes{cluster='a100',cgroup=~'.*15373156',step='',task=''}[240s]), time=1721966921
DEBUG: query result={'status': 'success', 'data': {'resultType': 'vector', 'result': []}}
DEBUG: query=max_over_time(cgroup_cpu_total_seconds{cluster='a100',cgroup=~'.*15373156',step='',task=''}[240s]), time=1721966921
DEBUG: query result={'status': 'success', 'data': {'resultType': 'vector', 'result': []}}
DEBUG: query=max_over_time(cgroup_cpus{cluster='a100',cgroup=~'.*15373156',step='',task=''}[240s]), time=1721966921
DEBUG: query result={'status': 'success', 'data': {'resultType': 'vector', 'result': []}}
DEBUG: query=max_over_time((nvidia_gpu_memory_total_bytes{cluster='a100'} and nvidia_gpu_jobId == 15373156)[240s:]), time=1721966921
DEBUG: query result={'status': 'success', 'data': {'resultType': 'vector', 'result': []}}
DEBUG: query=max_over_time((nvidia_gpu_memory_used_bytes{cluster='a100'} and nvidia_gpu_jobId == 15373156)[240s:]), time=1721966921
DEBUG: query result={'status': 'success', 'data': {'resultType': 'vector', 'result': []}}
DEBUG: query=avg_over_time((nvidia_gpu_duty_cycle{cluster='a100'} and nvidia_gpu_jobId == 15373156)[240s:]), time=1721966921
DEBUG: query result={'status': 'success', 'data': {'resultType': 'vector', 'result': []}}
Traceback (most recent call last):
  File "/apps/jobstats/jobstats/jobstats", line 58, in <module>
    stats.report_job()
  File "/apps/jobstats/jobstats/jobstats.py", line 586, in report_job
    +f'If the run time was very short then try running "seff {self.jobid}".')
  File "/apps/jobstats/jobstats/jobstats.py", line 115, in error
    raise Exception(msg)
Exception: No stats found for job 15373156, either because it is too old or because
it expired from jobstats database. If you are not running this command on the
cluster where the job was run then use the -c option to specify the cluster.
If the run time was very short then try running "seff 15373156".
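
Side note on the cluster=a100 in the debug output just above: I believe this comes from the inverse mapping in config.py. Because CLUSTER_TRANS_INV is built with dict(zip(CLUSTER_TRANS.values(), CLUSTER_TRANS.keys())), two partitions that both translate to "slurm" collapse into a single inverse entry and the last one wins. A minimal sketch of that standard Python behavior (not jobstats code):

# Two partitions mapped to the same informal name collapse in the inverse dict.
CLUSTER_TRANS = {"ica100": "slurm", "a100": "slurm"}
CLUSTER_TRANS_INV = dict(zip(CLUSTER_TRANS.values(), CLUSTER_TRANS.keys()))
print(CLUSTER_TRANS_INV)  # {'slurm': 'a100'} -- the ica100 entry is silently dropped

That would explain why jobstats ends up querying Prometheus with cluster='a100' even though the job ran on ica100.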

So, ultimately the question is: am I doing something wrong in the Prometheus config where I set up the labels for the different queues? I don't think I am, because it does work when I specify the matching entry in the dictionary.

Or am I doing something wrong in the translation dictionary? Is there a way to set this up so that I don't need to provide a "queue" or the -c flag for jobs that run in different partitions/queues?

Thank you in advance for any and all help. We are so looking forward to being able to provide this extra insight for our users.

@plazonic
Collaborator

Sorry for the late reply.

I don't really understand why you are attempting to do this. The only reason we have a cluster label at all is that we share the same Prometheus server across a few different clusters, so the label is needed to uniquely distinguish jobid data across those clusters. Otherwise there is no need to put queues or partitions into Prometheus as the cluster label. If you want to add them as extra labels, go ahead, but jobstats is job oriented and the cluster label should remain the same across that Slurm cluster.
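
For illustration only (the hostnames are just the ones from your example, the file name is arbitrary, and the partition label is optional), a file_sd target file along those lines might look something like this, with a single cluster label shared by all nodes:

[
  {
    "labels": { "cluster": "slurm", "partition": "a100" },
    "targets": [ "gpu01:9100", "gpu02:9100" ]
  },
  {
    "labels": { "cluster": "slurm", "partition": "ica100" },
    "targets": [ "icgpu01:9100", "icgpu02:9100" ]
  }
]

With one cluster value that matches the Slurm cluster name, CLUSTER_TRANS can presumably stay empty and the -c flag should not be needed.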
