A high-performance monitoring system that collects and analyzes SLURM job efficiency metrics, optimized for large-scale HPC environments.
SLURM's accounting database purges detailed job metrics (CPU usage, memory usage) after 30 days. This tool captures and preserves that data in efficient Parquet format for long-term analysis of resource utilization patterns.
- 📊 Captures comprehensive efficiency metrics from all job states
- 💾 Efficient Parquet storage - columnar format optimized for analytics
- 🔄 Smart incremental processing - tracks completed dates to minimize re-processing
- 📈 Rich visualizations - bar charts for resource usage, efficiency, and node utilization
- 👥 Group-based analytics - track usage by research groups/teams
- 🖥️ Node utilization tracking - analyze per-node CPU and GPU usage
- ⚡ Parallel collection - multi-threaded data collection by default
- ⏰ Cron-ready - designed for automated daily collection
- 🎯 Intelligent re-collection - only re-fetches incomplete job states
For each job:
- Job metadata: ID, user, name, partition, state, node list
- Time info: submit, start, end times, elapsed duration
- Allocated resources: CPUs, memory, GPUs, nodes
- Actual usage: CPU seconds used (TotalCPU), peak memory (MaxRSS)
- Calculated metrics:
- CPU efficiency % (actual CPU time / allocated CPU time)
- Memory efficiency % (peak memory / allocated memory)
- CPU hours wasted
- Memory GB-hours wasted
- Total reserved resources (CPU/GPU/memory hours)
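To make the relationships between these metrics concrete, here is a worked example in plain Python with made-up numbers. The field names match the output schema documented below; the exact formulas the tool uses for the "wasted" and "reserved" quantities may differ slightly, so treat this as an illustration rather than the implementation:

```python
# Worked example with illustrative numbers; field names follow the output schema.
alloc_cpus = 16
elapsed_seconds = 7200        # 2-hour runtime
total_cpu_seconds = 23040.0   # SLURM's TotalCPU
req_mem_mb = 64000.0          # requested memory
max_rss_mb = 15000.0          # peak memory (MaxRSS)

cpu_efficiency = 100 * total_cpu_seconds / (alloc_cpus * elapsed_seconds)   # 20.0 %
memory_efficiency = 100 * max_rss_mb / req_mem_mb                           # ~23.4 %

cpu_hours_reserved = alloc_cpus * elapsed_seconds / 3600                    # 32.0 CPU-hours
cpu_hours_wasted = cpu_hours_reserved * (1 - cpu_efficiency / 100)          # 25.6 CPU-hours

memory_gb_hours_reserved = (req_mem_mb / 1024) * (elapsed_seconds / 3600)   # 125.0 GB-hours
memory_gb_hours_wasted = memory_gb_hours_reserved * (1 - memory_efficiency / 100)
```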
- uv - Python package and project manager (will auto-install dependencies)
- SLURM with accounting enabled
- sacct command access
That's it! The script uses uv inline script dependencies, so all Python packages are automatically installed when you run the script.
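"Inline script dependencies" refers to PEP 723 metadata at the top of `slurm_usage.py`, which uv reads and resolves into an isolated environment before running the script. The block looks roughly like this (the dependency list and Python version shown here are illustrative, not the script's actual ones):

```python
# /// script
# requires-python = ">=3.10"
# dependencies = [
#     "polars",
#     "typer",
#     "rich",
# ]
# ///
```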
# Run directly with uvx (uv tool run)
uvx slurm-usage --help
# Or for a specific command
uvx slurm-usage collect --days 7

# Install globally with uv
uv tool install slurm-usage
# Or with pip
pip install slurm-usage
# Then use directly
slurm-usage --help

# Clone the repository
git clone https://github.com/basnijholt/slurm-usage
cd slurm-usage
# Run the script directly (dependencies auto-installed by uv)
./slurm_usage.py --help
# Or with Python
python slurm_usage.py --help

The following commands are available:
Usage: slurm_usage.py [OPTIONS] COMMAND [ARGS]...
SLURM Job Monitor - Collect and analyze job efficiency metrics
╭─ Options ────────────────────────────────────────────────────────────────────╮
│ --install-completion Install completion for the current shell. │
│ --show-completion Show completion for the current shell, to copy │
│ it or customize the installation. │
│ --help Show this message and exit. │
╰──────────────────────────────────────────────────────────────────────────────╯
╭─ Commands ───────────────────────────────────────────────────────────────────╮
│ collect Collect job data from SLURM using parallel date-based queries. │
│ analyze Analyze collected job data. │
│ status Show monitoring system status. │
│ current Display current cluster usage statistics from squeue. │
│ nodes Display node information from SLURM. │
│ test Run a quick test of the system. │
╰──────────────────────────────────────────────────────────────────────────────╯
# Collect data (uses 4 parallel workers by default)
slurm-usage collect
# Collect last 7 days of data
slurm-usage collect --days 7
# Collect with more parallel workers
slurm-usage collect --n-parallel 8
# Analyze collected data
slurm-usage analyze --days 7
# Display current cluster usage
slurm-usage current
# Display node information
slurm-usage nodes
# Check system status
slurm-usage status
# Test system configuration
slurm-usage test

Note: If running from source, use `./slurm_usage.py` instead of `slurm-usage`.
- `--days/-d`: Days to look back (default: 1)
- `--data-dir`: Data directory location (default: `./data`)
- `--summary/--no-summary`: Show analysis after collection (default: True)
- `--n-parallel/-n`: Number of parallel workers (default: 4)
- `--days/-d`: Days to analyze (default: 7)
- `--data-dir`: Data directory location
- `--data-dir`: Data directory location
Shows real-time cluster utilization from squeue, broken down by user and partition.
Shows information about cluster nodes including CPU and GPU counts.
data/
├── raw/ # Raw SLURM data (archived)
│ ├── 2025-08-19.parquet # Daily raw records
│ ├── 2025-08-20.parquet
│ └── ...
├── processed/ # Processed job metrics
│ ├── 2025-08-19.parquet # Daily processed data
│ ├── 2025-08-20.parquet
│ └── ...
└── .date_completion_tracker.json # Tracks fully processed dates
═══ Resource Usage by User ═══
┌─────────────┬──────┬───────────┬──────────────┬───────────┬─────────┬──────────┐
│ User │ Jobs │ CPU Hours │ Memory GB-hrs│ GPU Hours │ CPU Eff │ Mem Eff │
├─────────────┼──────┼───────────┼──────────────┼───────────┼─────────┼──────────┤
│ alice │ 124 │ 12,847 │ 48,291 │ 1,024 │ 45.2% │ 23.7% │
│ bob │ 87 │ 8,234 │ 31,456 │ 512 │ 38.1% │ 18.4% │
└─────────────┴──────┴───────────┴──────────────┴───────────┴─────────┴──────────┘
═══ Node Usage Analysis ═══
┌────────────┬──────┬───────────┬───────────┬───────────┐
│ Node │ Jobs │ CPU Hours │ GPU Hours │ CPU Util% │
├────────────┼──────┼───────────┼───────────┼───────────┤
│ cluster-1 │ 234 │ 45,678 │ 2,048 │ 74.3% │
│ cluster-2 │ 198 │ 41,234 │ 1,536 │ 67.1% │
└────────────┴──────┴───────────┴───────────┴───────────┘
The monitor intelligently handles job state transitions:
- Complete dates: Once all jobs for a date reach final states (COMPLETED, FAILED, CANCELLED, etc.), the date is marked complete and won't be re-queried
- Incomplete jobs: Jobs in states like RUNNING, PENDING, or SUSPENDED are automatically re-collected on subsequent runs
- Efficient updates: Only changed jobs are updated, minimizing processing time
The following job states indicate a job may change and will trigger re-collection:
- Active: `RUNNING`, `PENDING`, `SUSPENDED`
- Transitional: `COMPLETING`, `CONFIGURING`, `STAGE_OUT`, `SIGNALING`
- Requeue: `REQUEUED`, `REQUEUE_FED`, `REQUEUE_HOLD`
- Other: `RESIZING`, `REVOKED`, `SPECIAL_EXIT`
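A minimal sketch of this rule (the names here are illustrative, not the tool's internals):

```python
# States that mean a job -- and hence its date -- may still change.
INCOMPLETE_STATES = {
    "RUNNING", "PENDING", "SUSPENDED",                      # active
    "COMPLETING", "CONFIGURING", "STAGE_OUT", "SIGNALING",  # transitional
    "REQUEUED", "REQUEUE_FED", "REQUEUE_HOLD",              # requeue
    "RESIZING", "REVOKED", "SPECIAL_EXIT",                  # other
}

def job_is_complete(state: str) -> bool:
    """A job is final once its state is no longer in the set above."""
    return state not in INCOMPLETE_STATES

def date_is_complete(states: list[str]) -> bool:
    """A date is marked complete only when every job on that date is final."""
    return all(job_is_complete(s) for s in states)
```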
Create a configuration file to define your organization's research groups and optionally specify the data directory. The configuration file is searched in the following locations:
- `$XDG_CONFIG_HOME/slurm-usage/config.yaml`
- `~/.config/slurm-usage/config.yaml`
- `/etc/slurm-usage/config.yaml`
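A sketch of that search order, for reference only (this mirrors the list above, it is not the tool's actual code):

```python
import os
from pathlib import Path

# Candidate config locations, checked in order; the first existing file wins.
xdg = os.environ.get("XDG_CONFIG_HOME", str(Path.home() / ".config"))
candidates = [
    Path(xdg) / "slurm-usage" / "config.yaml",
    Path.home() / ".config" / "slurm-usage" / "config.yaml",
    Path("/etc/slurm-usage/config.yaml"),
]
config_path = next((p for p in candidates if p.is_file()), None)
```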
The data directory for storing collected metrics can be configured in three ways (in order of priority):
- Command line: Use `--data-dir /path/to/data` with any command (highest priority)
- Configuration file: Set `data_dir: /path/to/data` in the config file
- Default: If not specified, data is stored in `./data` (the current working directory)
This allows flexible deployment:
- Default installation: Data stored in the `./data` subdirectory
- System-wide deployment: Set `data_dir: /var/lib/slurm-usage` in `/etc/slurm-usage/config.yaml`
- Shared installations: Use a network storage path in the config
- Per-run override: Use the `--data-dir` flag to override for specific commands
Example config.yaml:
# Example configuration file for slurm-usage
# Copy this file to one of the following locations:
# - $XDG_CONFIG_HOME/slurm-usage/config.yaml
# - ~/.config/slurm-usage/config.yaml
# - /etc/slurm-usage/config.yaml (for system-wide configuration)
# Group configuration - organize users into research groups
groups:
physics:
- alice
- bob
- charlie
chemistry:
- david
- eve
- frank
biology:
- grace
- henry
- irene
# Data directory configuration (optional)
# - If not specified or set to null, defaults to ./data (current working directory)
# - Set to an explicit path to use a custom location
# - Useful for shared installations where data should be stored centrally
#
# Examples:
# data_dir: null # Use default ./data directory
# data_dir: /var/lib/slurm-usage # System-wide data directory
# data_dir: /shared/slurm-data # Shared network location
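The same group definitions can be reused in your own analysis scripts, since the YAML maps cleanly onto a user-to-group lookup. A small sketch (assumes PyYAML is installed; the aggregation mirrors the Polars example further below):

```python
from pathlib import Path
import polars as pl
import yaml

# Build a user -> group mapping from the config file.
cfg = yaml.safe_load(Path("~/.config/slurm-usage/config.yaml").expanduser().read_text())
user_to_group = {u: g for g, users in cfg.get("groups", {}).items() for u in users}

# Join the mapping onto one day's processed data; users not in any group fall back to "ungrouped".
groups_df = pl.DataFrame({"user": list(user_to_group), "group": list(user_to_group.values())})
df = pl.read_parquet("data/processed/2025-08-19.parquet")
df = df.join(groups_df, on="user", how="left").with_columns(pl.col("group").fill_null("ungrouped"))

print(df.group_by("group").agg(pl.col("cpu_hours_reserved").sum()))
```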
# Add to crontab (runs daily at 2 AM)
crontab -e
# If installed with uv tool or pip:
0 2 * * * /path/to/slurm-usage collect --days 2
# Or if running from source:
0 2 * * * /path/to/slurm-usage/slurm_usage.py collect --days 2

| Field | Type | Description |
|---|---|---|
| job_id | str | SLURM job ID |
| user | str | Username |
| job_name | str | Job name (max 50 chars) |
| partition | str | SLURM partition |
| state | str | Final job state |
| submit_time | datetime.datetime \| None | Job submission time |
| start_time | datetime.datetime \| None | Job start time |
| end_time | datetime.datetime \| None | Job end time |
| node_list | str | Nodes where job ran |
| elapsed_seconds | int | Runtime in seconds |
| alloc_cpus | int | CPUs allocated |
| req_mem_mb | float | Memory requested (MB) |
| max_rss_mb | float | Peak memory used (MB) |
| total_cpu_seconds | float | Actual CPU time used |
| alloc_gpus | int | GPUs allocated |
| cpu_efficiency | float | CPU efficiency % (0-100) |
| memory_efficiency | float | Memory efficiency % (0-100) |
| cpu_hours_wasted | float | Wasted CPU hours |
| memory_gb_hours_wasted | float | Wasted memory GB-hours |
| cpu_hours_reserved | float | Total CPU hours reserved |
| memory_gb_hours_reserved | float | Total memory GB-hours reserved |
| gpu_hours_reserved | float | Total GPU hours reserved |
| is_complete | bool | Whether job has reached final state |
- Date completion tracking: Dates with only finished jobs are marked complete and skipped
- Parallel collection: Default 4 workers fetch different dates simultaneously
- Smart merging: Only updates changed jobs when re-collecting
- Efficient storage: Parquet format provides ~10x compression over CSV
- Date-based partitioning: Data organized by date for efficient queries
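Because the data is partitioned by date, whole-history queries can scan all daily files lazily through a glob instead of loading each file by hand. A sketch (paths follow the layout shown earlier):

```python
import polars as pl

# Lazily scan every daily file; Polars reads only the columns the query needs.
top_waste = (
    pl.scan_parquet("data/processed/*.parquet")
    .group_by("user")
    .agg(pl.col("cpu_hours_wasted").sum())
    .sort("cpu_hours_wasted", descending=True)
    .collect()
    .head(10)
)
print(top_waste)
```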
- 30-day window: SLURM purges detailed metrics after 30 days. Run collection at least weekly to ensure no data is lost.
- Batch steps: Actual usage metrics (TotalCPU, MaxRSS) are stored in the `.batch` step, not the parent job record.
- State normalization: All CANCELLED variants are normalized to "CANCELLED" for consistency.
- GPU tracking: GPU allocation is extracted from the AllocTRES field.
- Raw data archival: Raw SLURM records are preserved in case reprocessing is needed.
You can use Polars to analyze the collected data. Here's an example:
from datetime import datetime, timedelta
from pathlib import Path
import polars as pl
# Load processed data for last 7 days
dfs = []
for i in range(7):
    date = (datetime.now() - timedelta(days=i)).strftime("%Y-%m-%d")
    file = Path(f"data/processed/{date}.parquet")
    if file.exists():
        dfs.append(pl.read_parquet(file))

if dfs:
    df = pl.concat(dfs)

    # Find users with worst CPU efficiency
    worst_users = (
        df.filter(pl.col("state") == "COMPLETED")
        .group_by("user")
        .agg(pl.col("cpu_efficiency").mean())
        .sort("cpu_efficiency")
        .head(5)
    )
    print("## Users with Worst CPU Efficiency")
    print(worst_users)

    # Find most wasted resources by partition
    waste_by_partition = (
        df.group_by("partition")
        .agg(pl.col("cpu_hours_wasted").sum())
        .sort("cpu_hours_wasted", descending=True)
    )
    print("\n## CPU Hours Wasted by Partition")
    print(waste_by_partition)
else:
    print("No data files found. Run `./slurm_usage.py collect` first.")

No efficiency data?
- Check if SLURM accounting is configured: `scontrol show config | grep JobAcct`
- Verify jobs have `.batch` steps: `sacct -j JOBID`
Collection is slow?
- Increase parallel workers: `slurm-usage collect --n-parallel 8`
- The first run processes historical data and will be slower
Missing user groups?
- Create or update the configuration file in `~/.config/slurm-usage/config.yaml`
- Ungrouped users will appear as "ungrouped" in group statistics
Script won't run?
- Ensure `uv` is installed: `curl -LsSf https://astral.sh/uv/install.sh | sh`
- Check SLURM access: `slurm-usage test` (or `./slurm_usage.py test` if running from source)
MIT