Help with "Cluster" label and "-c" flag #15

Open

mattcprotzman opened this issue Jul 26, 2024 · 1 comment
mattcprotzman commented Jul 26, 2024

Hey friends! AMAZING work - I'm so thrilled to be implementing this on our system.

It seems that when translating this to our system, the word "cluster" could be used interchangeably with Slurm queues. We separate our nodes into different queues; for example, the "a100" queue has nodes gpu[01-18], while the "ica100" queue has icgpu[01-10].

I've separated this in the Prometheus config:

 "labels": {
      "cluster": "a100"
    },
    "targets": [
      "gpu02:9100",
.
.
.
  "labels": {
      "cluster": "ica100"
    },
    "targets": [
      "icgpu01:9100",
.
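To sanity-check that these per-queue labels actually made it into Prometheus, I can list the values of the cluster label via the standard Prometheus HTTP API. This is only a sketch, and the http://localhost:9090 address is a placeholder for our server:

# Sketch: list the values of the "cluster" label known to Prometheus.
# Assumes Prometheus is reachable at http://localhost:9090 (placeholder URL).
import json
import urllib.request

PROM = "http://localhost:9090"

with urllib.request.urlopen(f"{PROM}/api/v1/label/cluster/values") as resp:
    reply = json.load(resp)

print(reply["data"])  # expect something like ['a100', 'ica100']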

My config.py file for jobstats looks like:

# translate cluster names in Slurm DB to informal names
CLUSTER_TRANS = {"a100":"slurm"}
#CLUSTER_TRANS = {}  # if no translations then use an empty dictionary
CLUSTER_TRANS_INV = dict(zip(CLUSTER_TRANS.values(), CLUSTER_TRANS.keys()))

I've experimented with various CLUSTER_TRANS options to try to get this to work, including (obviously not all at one time):

CLUSTER_TRANS = {"a100":"slurm"}
CLUSTER_TRANS = {"ica100":"slurm"}
CLUSTER_TRANS = {"a100":"slurm","ica100":"slurm"}
CLUSTER_TRANS = {}

It seems like I'll need to use the -c flag to specify which queue|cluster the job ran on. For example:
jobstats -c ica100 $ICA100JOBID
jobstats -c a100 $A100JOBID
etc.

However, if I use the -c flag on the command line, I run into an issue: if the job I'm specifying ran on the ica100 queue|cluster while the config says a100 (or vice versa), I can't get it to work. Having both in the dictionary also doesn't seem to work.

For example, 15373156 is a job that ran on ica100.

With this config:
CLUSTER_TRANS = {"a100":"slurm"}

[root@login01 jobstats]# jobstats -d -c ica100 15373156
Traceback (most recent call last):
  File "/apps/jobstats/jobstats/jobstats", line 54, in <module>
    color=color)
  File "/apps/jobstats/jobstats/jobstats.py", line 63, in __init__
    if not self.__get_job_info():
  File "/apps/jobstats/jobstats/jobstats.py", line 178, in __get_job_info
    self.error(f"Failed to lookup jobid %s on {clstr}. Make sure you specified the correct cluster." % self.jobid)
  File "/apps/jobstats/jobstats/jobstats.py", line 115, in error
    raise Exception(msg)
Exception: Failed to lookup jobid 15373156 on ica100. Make sure you specified the correct cluster.

With this config:
CLUSTER_TRANS = {}

[root@login01 jobstats]# jobstats -d -c ica100 15373156
Traceback (most recent call last):
  File "/apps/jobstats/jobstats/jobstats", line 54, in <module>
    color=color)
  File "/apps/jobstats/jobstats/jobstats.py", line 63, in __init__
    if not self.__get_job_info():
  File "/apps/jobstats/jobstats/jobstats.py", line 178, in __get_job_info
    self.error(f"Failed to lookup jobid %s on {clstr}. Make sure you specified the correct cluster." % self.jobid)
  File "/apps/jobstats/jobstats/jobstats.py", line 115, in error
    raise Exception(msg)
Exception: Failed to lookup jobid 15373156 on ica100. Make sure you specified the correct cluster.

With this config:
CLUSTER_TRANS = {"ica100":"slurm"}

[root@login01 jobstats]# jobstats -d -c ica100 15373156
DEBUG: jobidraw=15373156, start=1721966681, end=1721966921, cluster=slurm, tres=billing=16,cpu=4,gres/gpu=1,mem=16000M,node=1, data=, user=scrubbed, account=scrubbed state=COMPLETED, timelimit=240, nodes=1, ncpus=16, reqmem=16000M, qos=scrubbed, partition=ica100, jobname=scrubbed
DEBUG: jobid=15373156, jobidraw=15373156, start=1721966681, end=1721966921, gpus=1, diff=240, cluster=ica100, data=, timelimitraw=240
DEBUG: query=max_over_time(cgroup_memory_total_bytes{cluster='ica100',cgroup=~'.*15373156',step='',task=''}[240s]), time=1721966921
DEBUG: query result={'status': 'success', 'data': {'resultType': 'vector', 'result': [{'metric': {'cgroup': '/slurm/uid_3968/job_15373156', 'cluster': 'ica100', 'instance': 'icgpu02', 'job': 'Rockfish GPU Nodes'}, 'value': [1721966921, '67108864000']}]}}
DEBUG: query=max_over_time(cgroup_memory_rss_bytes{cluster='ica100',cgroup=~'.*15373156',step='',task=''}[240s]), time=1721966921
DEBUG: query result={'status': 'success', 'data': {'resultType': 'vector', 'result': [{'metric': {'cgroup': '/slurm/uid_3968/job_15373156', 'cluster': 'ica100', 'instance': 'icgpu02', 'job': 'Rockfish GPU Nodes'}, 'value': [1721966921, '886964224']}]}}
DEBUG: query=max_over_time(cgroup_cpu_total_seconds{cluster='ica100',cgroup=~'.*15373156',step='',task=''}[240s]), time=1721966921
DEBUG: query result={'status': 'success', 'data': {'resultType': 'vector', 'result': [{'metric': {'cgroup': '/slurm/uid_3968/job_15373156', 'cluster': 'ica100', 'instance': 'icgpu02', 'job': 'Rockfish GPU Nodes'}, 'value': [1721966921, '186.529817411']}]}}
DEBUG: query=max_over_time(cgroup_cpus{cluster='ica100',cgroup=~'.*15373156',step='',task=''}[240s]), time=1721966921
DEBUG: query result={'status': 'success', 'data': {'resultType': 'vector', 'result': [{'metric': {'cgroup': '/slurm/uid_3968/job_15373156', 'cluster': 'ica100', 'instance': 'icgpu02', 'job': 'Rockfish GPU Nodes'}, 'value': [1721966921, '16']}]}}
DEBUG: query=max_over_time((nvidia_gpu_memory_total_bytes{cluster='ica100'} and nvidia_gpu_jobId == 15373156)[240s:]), time=1721966921
DEBUG: query result={'status': 'success', 'data': {'resultType': 'vector', 'result': [{'metric': {'cluster': 'ica100', 'instance': 'icgpu02', 'job': 'Rockfish GPU Nodes', 'minor_number': '0', 'name': 'NVIDIA A100 80GB PCIe', 'ordinal': '0', 'uuid': 'GPU-c0419cd1-5928-47f9-6c9d-f4e8fccce0ad'}, 'value': [1721966921, '85899345920']}]}}
DEBUG: query=max_over_time((nvidia_gpu_memory_used_bytes{cluster='ica100'} and nvidia_gpu_jobId == 15373156)[240s:]), time=1721966921
DEBUG: query result={'status': 'success', 'data': {'resultType': 'vector', 'result': [{'metric': {'cluster': 'ica100', 'instance': 'icgpu02', 'job': 'Rockfish GPU Nodes', 'minor_number': '0', 'name': 'NVIDIA A100 80GB PCIe', 'ordinal': '0', 'uuid': 'GPU-c0419cd1-5928-47f9-6c9d-f4e8fccce0ad'}, 'value': [1721966921, '1621295104']}]}}
DEBUG: query=avg_over_time((nvidia_gpu_duty_cycle{cluster='ica100'} and nvidia_gpu_jobId == 15373156)[240s:]), time=1721966921
DEBUG: query result={'status': 'success', 'data': {'resultType': 'vector', 'result': [{'metric': {'cluster': 'ica100', 'instance': 'icgpu02', 'job': 'Rockfish GPU Nodes', 'minor_number': '0', 'name': 'NVIDIA A100 80GB PCIe', 'ordinal': '0', 'uuid': 'GPU-c0419cd1-5928-47f9-6c9d-f4e8fccce0ad'}, 'value': [1721966921, '9.75']}]}}

================================================================================
                              Slurm Job Statistics
================================================================================
         Job ID: 15373156
  NetID/Account: scrubbed
       Job Name: scrubbed
          State: COMPLETED
          Nodes: 1
      CPU Cores: 16
     CPU Memory: 16GB (1GB per CPU-core)
           GPUs: 1
  QOS/Partition: scrubbed/ica100
        Cluster: ica100
     Start Time: Fri Jul 26, 2024 at 12:04 AM
       Run Time: 00:04:00
     Time Limit: 04:00:00

                              Overall Utilization
================================================================================
  CPU utilization  [||                                              5%]
  CPU memory usage [                                                1%]
  GPU utilization  [|||||                                          10%]
  GPU memory usage [|                                               2%]

                              Detailed Utilization
================================================================================
  CPU utilization per node (CPU time used/run time)
      icgpu02: 00:03:06/01:04:00 (efficiency=4.9%)

  CPU memory usage per node - used/allocated
      icgpu02: 845.9MB/62.5GB (52.9MB/3.9GB per core of 16)

  GPU utilization per node
      icgpu02 (GPU 0): 9.8%

  GPU memory usage per node - maximum used/total
      icgpu02 (GPU 0): 1.5GB/80.0GB (1.9%)

                                     Notes
================================================================================
  * The overall GPU utilization of this job is only 10%. This value is low
    compared to the cluster mean value of 50%. Please investigate the reason
    for the low utilization. For more info:
      https://researchcomputing.princeton.edu/support/knowledge-base/gpu-computing#util

  * Have a nice day!

With this config:
CLUSTER_TRANS = {"ica100":"slurm","a100":"slurm"}

[root@login01 jobstats]# jobstats -d -c ica100 15373156
DEBUG: jobidraw=15373156, start=1721966681, end=1721966921, cluster=slurm, tres=billing=16,cpu=4,gres/gpu=1,mem=16000M,node=1, data=, user=scrubbed, account=scrubbed, state=COMPLETED, timelimit=240, nodes=1, ncpus=16, reqmem=16000M, qos=scrubbed, partition=ica100, jobname=scrubbed
DEBUG: jobid=15373156, jobidraw=15373156, start=1721966681, end=1721966921, gpus=1, diff=240, cluster=a100, data=, timelimitraw=240
DEBUG: query=max_over_time(cgroup_memory_total_bytes{cluster='a100',cgroup=~'.*15373156',step='',task=''}[240s]), time=1721966921
DEBUG: query result={'status': 'success', 'data': {'resultType': 'vector', 'result': []}}
DEBUG: query=max_over_time(cgroup_memory_rss_bytes{cluster='a100',cgroup=~'.*15373156',step='',task=''}[240s]), time=1721966921
DEBUG: query result={'status': 'success', 'data': {'resultType': 'vector', 'result': []}}
DEBUG: query=max_over_time(cgroup_cpu_total_seconds{cluster='a100',cgroup=~'.*15373156',step='',task=''}[240s]), time=1721966921
DEBUG: query result={'status': 'success', 'data': {'resultType': 'vector', 'result': []}}
DEBUG: query=max_over_time(cgroup_cpus{cluster='a100',cgroup=~'.*15373156',step='',task=''}[240s]), time=1721966921
DEBUG: query result={'status': 'success', 'data': {'resultType': 'vector', 'result': []}}
DEBUG: query=max_over_time((nvidia_gpu_memory_total_bytes{cluster='a100'} and nvidia_gpu_jobId == 15373156)[240s:]), time=1721966921
DEBUG: query result={'status': 'success', 'data': {'resultType': 'vector', 'result': []}}
DEBUG: query=max_over_time((nvidia_gpu_memory_used_bytes{cluster='a100'} and nvidia_gpu_jobId == 15373156)[240s:]), time=1721966921
DEBUG: query result={'status': 'success', 'data': {'resultType': 'vector', 'result': []}}
DEBUG: query=avg_over_time((nvidia_gpu_duty_cycle{cluster='a100'} and nvidia_gpu_jobId == 15373156)[240s:]), time=1721966921
DEBUG: query result={'status': 'success', 'data': {'resultType': 'vector', 'result': []}}
Traceback (most recent call last):
  File "/apps/jobstats/jobstats/jobstats", line 58, in <module>
    stats.report_job()
  File "/apps/jobstats/jobstats/jobstats.py", line 586, in report_job
    +f'If the run time was very short then try running "seff {self.jobid}".')
  File "/apps/jobstats/jobstats/jobstats.py", line 115, in error
    raise Exception(msg)
Exception: No stats found for job 15373156, either because it is too old or because
it expired from jobstats database. If you are not running this command on the
cluster where the job was run then use the -c option to specify the cluster.
If the run time was very short then try running "seff 15373156".
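
Side note on the cluster=a100 in the debug output just above: I believe this comes from the inverse mapping in config.py. Because CLUSTER_TRANS_INV is built with dict(zip(CLUSTER_TRANS.values(), CLUSTER_TRANS.keys())), two partitions that both translate to "slurm" collapse into a single inverse entry and the last one wins. A minimal sketch of that standard Python behavior (not jobstats code):

# Two partitions mapped to the same informal name collapse in the inverse dict.
CLUSTER_TRANS = {"ica100": "slurm", "a100": "slurm"}
CLUSTER_TRANS_INV = dict(zip(CLUSTER_TRANS.values(), CLUSTER_TRANS.keys()))
print(CLUSTER_TRANS_INV)  # {'slurm': 'a100'} -- the ica100 entry is silently dropped

That would explain why jobstats ends up querying Prometheus with cluster='a100' even though the job ran on ica100.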

So, ultimately the question is: am I doing something wrong in the Prometheus config where I set up the labels for the different queues? I don't think I am, because it does work when I specify the matching entry in the dictionary.

Or am I doing something wrong in the translation dictionary? Is there a way to set this up so that I don't need to provide a "queue" or the -c flag for jobs that run in different partitions/queues?

Thank you in advance for any and all help. We are so looking forward to being able to provide this extra insight for our users.

@plazonic
Collaborator

Sorry for the late reply.

I don't really understand why you are attempting to do this. The only reason we have a cluster label at all is that we share the same Prometheus server across a few different clusters, so the label is needed to uniquely distinguish jobid data across those clusters. Otherwise there is no need to put queues or partitions into Prometheus as the cluster label. If you want to add them as extra labels, go ahead, but jobstats is job oriented and the cluster label should remain the same across that Slurm cluster.
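
For illustration only (the hostnames are just the ones from your example, the file name is arbitrary, and the partition label is optional), a file_sd target file along those lines might look something like this, with a single cluster label shared by all nodes:

[
  {
    "labels": { "cluster": "slurm", "partition": "a100" },
    "targets": [ "gpu01:9100", "gpu02:9100" ]
  },
  {
    "labels": { "cluster": "slurm", "partition": "ica100" },
    "targets": [ "icgpu01:9100", "icgpu02:9100" ]
  }
]

With one cluster value that matches the Slurm cluster name, CLUSTER_TRANS can presumably stay empty and the -c flag should not be needed.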
