Description
On Slurm 25.11, enabling the new metrics/openmetrics feature causes
slurmctld to periodically crash with glibc heap errors:
double free or corruption (fasttop)
double free or corruption (!prev)
Disabling only the metrics feature (removing MetricsType=metrics/openmetrics
from slurm.conf) fully resolves the problem. All other components and workflows
remain unchanged.
This appears to be a memory corruption issue triggered by the combination of:
- metrics/openmetrics,
- frequent node state updates (REQUEST_UPDATE_NODE) issued by a Kubernetes slurm-operator, and
- concurrent RPC requests (REQUEST_NODE_INFO, REQUEST_JOB_INFO, REQUEST_PARTITION_INFO) coming from local services.
Once the metrics plugin is disabled, slurmctld becomes stable under the same workload.
Steps to Reproduce
Environment:
- slurmctld 25.11 (tested on both amd64 and aarch64)
- Kubernetes deployment (controller + workers as Pods)
- slurm-operator (v1.0.0)
- Metrics plugin enabled:
slurm.conf
CgroupPlugin=autodetect
IgnoreSystemd=yes
EnableControllers=yes
ConstrainCores=yes
ConstrainRAMSpace=yes
ConstrainDevices=yes
ConstrainSwapSpace=yes
AutoDetect=nvidia
MetricsType=metrics/openmetrics
Reproduction scenario:
- Start slurmctld with metrics/openmetrics enabled.
- Allow the slurm-operator to manage node states (Pod creation & deletion).
This results in bursts of REQUEST_UPDATE_NODE, REQUEST_NODE_INFO, REQUEST_PARTITION_INFO, and MESSAGE_NODE_REGISTRATION_STATUS.
- Trigger or wait for:
- reconfigures, or
- metric scrapes from monitoring systems.
- Observe that slurmctld frequently aborts with:
double free or corruption (fasttop)
SIGABRT (core dumped)
Disabling only the metrics plugin makes all crashes disappear,
even under the same load, operator churn, and reconfigure events.
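The reproduction scenario above could be approximated without the operator by a small load generator. This is only a rough sketch, not something I have run against the failing cluster. It assumes that draining and resuming a node with scontrol produces the same REQUEST_UPDATE_NODE traffic as the operator's cordon/uncordon, that the metrics plugin answers HTTP requests on the slurmctld port, and that the node name `node-0` and the URL path `/metrics/nodes` are placeholders to be replaced with real values.

```python
import subprocess
import threading
import time
import urllib.request

# Hypothetical values, not taken from the failing cluster: adjust before use.
NODE = "node-0"
METRICS_URL = "http://localhost:6817/metrics/nodes"  # assumed endpoint path


def update_node_cmds(node):
    """scontrol commands that drain and then resume one node."""
    return [
        ["scontrol", "update", f"NodeName={node}", "State=DRAIN", "Reason=repro"],
        ["scontrol", "update", f"NodeName={node}", "State=RESUME"],
    ]


def info_cmds():
    """scontrol queries that issue node, job, and partition info RPCs."""
    return [
        ["scontrol", "show", "node"],
        ["scontrol", "show", "job"],
        ["scontrol", "show", "partition"],
    ]


def run_loop(cmds, stop, delay):
    """Run each command in turn until the stop event is set."""
    while not stop.is_set():
        for cmd in cmds:
            subprocess.run(cmd, stdout=subprocess.DEVNULL,
                           stderr=subprocess.DEVNULL, check=False)
        time.sleep(delay)


def scrape_loop(url, stop, delay):
    """Fetch the metrics URL until the stop event is set."""
    while not stop.is_set():
        try:
            urllib.request.urlopen(url, timeout=5).read()
        except Exception:
            pass  # slurmctld may be unreachable, e.g. right after an abort
        time.sleep(delay)


def main(duration=60.0):
    """Run node updates, info queries, and scrapes concurrently."""
    stop = threading.Event()
    threads = [
        threading.Thread(target=run_loop, args=(update_node_cmds(NODE), stop, 0.1)),
        threading.Thread(target=run_loop, args=(info_cmds(), stop, 0.1)),
        threading.Thread(target=scrape_loop, args=(METRICS_URL, stop, 1.0)),
    ]
    for t in threads:
        t.start()
    time.sleep(duration)
    stop.set()
    for t in threads:
        t.join()


# Call main() on a host where scontrol can reach the test controller.
```

Adding `["scontrol", "reconfigure"]` to one of the command lists would also cover the reconfigure trigger from the steps above.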
Expected Behavior
slurmctld should remain stable with metrics/openmetrics enabled,
including during:
- RECONFIGURE events,
- frequent node state updates from operators,
- concurrent metric scrapes,
- high RPC activity, and
- elevated debug logging (debug5).
Additional Context
Crash examples:
double free or corruption (!prev)
2025-12-09 05:16:11,712 WARN exited: slurmctld (SIGABRT, core dumped)
Crashes consistently align with:
- multiple simultaneous RPC connections from localhost:6817,
- rapid REQUEST_UPDATE_NODE (cordon/uncordon) calls,
- scheduler loop execution, and
- metric scrape or reconfigure timing.
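To check this timing correlation more rigorously, the crash lines could be pulled out of the process supervisor log and compared with scrape and reconfigure times. The following is a minimal sketch that assumes the log format of the crash example above (a `YYYY-MM-DD HH:MM:SS,mmm` prefix on supervisor lines, none on glibc lines). The function name `find_crashes` is made up for illustration.

```python
import re

# Signatures taken from the crash output quoted in this report.
CRASH_PATTERNS = [
    re.compile(r"double free or corruption \((fasttop|!prev)\)"),
    re.compile(r"exited: slurmctld \(SIGABRT"),
]
# Timestamp prefix as seen in the supervisor line; glibc lines carry none.
TIMESTAMP = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}),\d+")


def find_crashes(lines):
    """Return (timestamp or None, text) for every line matching a crash signature."""
    hits = []
    for line in lines:
        if any(p.search(line) for p in CRASH_PATTERNS):
            m = TIMESTAMP.match(line)
            hits.append((m.group(1) if m else None, line.rstrip("\n")))
    return hits


sample = [
    "double free or corruption (!prev)\n",
    "2025-12-09 05:16:11,712 WARN exited: slurmctld (SIGABRT, core dumped)\n",
    "some unrelated log line\n",
]
for stamp, text in find_crashes(sample):
    print(stamp, "|", text)
```

The same function can be applied to a real log by passing it an open file object instead of the sample list.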
Relevant configuration snippets:
MetricsType=metrics/openmetrics
SlurmctldDebug=debug5
CgroupPlugin=autodetect
TaskPlugin=task/cgroup,task/affinity
JobAcctGatherType=jobacct_gather/cgroup
GresTypes=gpu
AutoDetect=nvidia
Because I am working on AWS, I could not run gdb to collect a backtrace (bt) from the core dumps.
This issue report is a summary generated by ChatGPT from the conversation in which I debugged the issue.