[Bug]: slurmctld crashes with double free or corruption when MetricsType=metrics/openmetrics enabled on 25.11 (Kubernetes + slurm-operator) #90

@narcotis

Description

On Slurm 25.11, enabling the new metrics/openmetrics feature causes
slurmctld to periodically crash with glibc heap errors:

double free or corruption (fasttop)
double free or corruption (!prev)

Disabling only the metrics feature (removing MetricsType=metrics/openmetrics
from slurm.conf) fully resolves the problem. All other components and workflow
remain unchanged.

This appears to be a memory corruption issue triggered by the combination of:

  • metrics/openmetrics,
  • frequent node state updates (REQUEST_UPDATE_NODE) issued by a Kubernetes
    slurm-operator, and
  • concurrent RPC requests (REQUEST_NODE_INFO, REQUEST_JOB_INFO,
    REQUEST_PARTITION_INFO) coming from local services.

Once the metrics plugin is disabled, slurmctld becomes stable under the same workload.
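
For reference, here is a minimal load-generation sketch that approximates this traffic pattern without Kubernetes, by shelling out to the Slurm CLI tools from the controller host. It is not the slurm-operator's actual code path; the node name and timing intervals are placeholders for testing:

```python
#!/usr/bin/env python3
"""Approximate the operator-driven RPC churn without Kubernetes.

Assumptions: run on the controller host, NODE is an existing node name,
and scontrol/sinfo/squeue are on PATH. This only mimics the traffic
pattern described above, not the operator itself.
"""
import subprocess
import threading
import time

NODE = "node-0"  # hypothetical node name; replace with a real one

def node_update_churn():
    # Generates REQUEST_UPDATE_NODE bursts (cordon/uncordon style).
    while True:
        subprocess.run(["scontrol", "update", f"NodeName={NODE}",
                        "State=DRAIN", "Reason=churn-test"], check=False)
        subprocess.run(["scontrol", "update", f"NodeName={NODE}",
                        "State=RESUME"], check=False)
        time.sleep(0.2)

def info_queries():
    # Generates REQUEST_NODE_INFO / REQUEST_JOB_INFO / REQUEST_PARTITION_INFO.
    while True:
        for cmd in (["sinfo", "-N"], ["squeue"], ["scontrol", "show", "partition"]):
            subprocess.run(cmd, stdout=subprocess.DEVNULL, check=False)
        time.sleep(0.1)

if __name__ == "__main__":
    threads = [threading.Thread(target=node_update_churn, daemon=True)]
    threads += [threading.Thread(target=info_queries, daemon=True) for _ in range(4)]
    for t in threads:
        t.start()
    time.sleep(600)  # run for 10 minutes, then exit
```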

Steps to Reproduce

Environment:

  • Slurmctld 25.11 (tested on both amd64 and aarch64)
  • Kubernetes deployment (controller + workers as Pods)
  • slurm-operator (v1.0.0)
  • Metrics plugin enabled:

slurm.conf

CgroupPlugin=autodetect
IgnoreSystemd=yes
EnableControllers=yes
ConstrainCores=yes
ConstrainRAMSpace=yes
ConstrainDevices=yes
ConstrainSwapSpace=yes

AutoDetect=nvidia

MetricsType=metrics/openmetrics

Reproduction scenario:

  1. Start slurmctld with metrics/openmetrics enabled.
  2. Allow the slurm-operator to manage node states (Pod creation & deletion).
     This results in bursts of:
       • REQUEST_UPDATE_NODE
       • REQUEST_NODE_INFO
       • REQUEST_PARTITION_INFO
       • MESSAGE_NODE_REGISTRATION_STATUS
  3. Trigger or wait for (see the sketch after this list):
       • reconfigures, or
       • metric scrapes from monitoring systems.
  4. Observe that slurmctld frequently aborts with:

     double free or corruption (fasttop)
     SIGABRT (core dumped)
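
For step 3, a minimal sketch that periodically forces reconfigures while the churn above is running (assumes scontrol is on PATH; actual metric scrapes depend on how the monitoring system is wired up, so they are not simulated here):

```python
#!/usr/bin/env python3
"""Periodically trigger RECONFIGURE RPCs against slurmctld (step 3)."""
import subprocess
import time

INTERVAL_SECONDS = 30  # arbitrary test interval

while True:
    # "scontrol reconfigure" asks slurmctld to re-read slurm.conf,
    # one of the events the crashes line up with.
    subprocess.run(["scontrol", "reconfigure"], check=False)
    time.sleep(INTERVAL_SECONDS)
```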

Disabling only the metrics plugin makes all crashes disappear,
even under the same load, operator churn, and reconfigure events.

Expected Behavior

slurmctld should remain stable with metrics/openmetrics enabled,
including during:

  • RECONFIGURE events,
  • frequent node state updates from operators,
  • concurrent metric scrapes,
  • high RPC activity, and
  • elevated debug logging (debug5).

Additional Context

Crash examples:

double free or corruption (!prev)
2025-12-09 05:16:11,712 WARN exited: slurmctld (SIGABRT, core dumped)

Crashes consistently align with:

  • multiple simultaneous RPC connections from localhost:6817,
  • rapid REQUEST_UPDATE_NODE (cordon/uncordon) calls,
  • scheduler loop execution, and
  • metric scrape or reconfigure timing.

Relevant configuration snippets:

MetricsType=metrics/openmetrics
SlurmctldDebug=debug5
CgroupPlugin=autodetect
TaskPlugin=task/cgroup,task/affinity
JobAcctGatherType=jobacct_gather/cgroup
GresTypes=gpu
AutoDetect=nvidia

Because this is running on AWS, I was not able to run gdb against the core dumps to capture a backtrace.
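
If someone can reproduce this with core dumps available, a batch-mode gdb invocation like the following sketch should capture the all-thread backtrace (paths are hypothetical; assumes gdb and slurmctld debug symbols are installed in the container):

```python
#!/usr/bin/env python3
"""Dump all-thread backtraces from a slurmctld core file using gdb batch mode."""
import subprocess

SLURMCTLD_BIN = "/usr/sbin/slurmctld"    # hypothetical install path
CORE_FILE = "/var/crash/slurmctld.core"  # hypothetical core dump location

with open("slurmctld-backtrace.txt", "w") as out:
    subprocess.run(
        ["gdb", "-batch",
         "-ex", "thread apply all bt full",
         SLURMCTLD_BIN, CORE_FILE],
        stdout=out, stderr=subprocess.STDOUT, check=False,
    )
```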

This issue report is a summary written by ChatGPT from my debugging conversation about the problem.
