[Bug]: slurmctld crashes with double free or corruption when MetricsType=metrics/openmetrics enabled on 25.11 (Kubernetes + slurm-operator)

## Description
On **Slurm 25.11**, enabling the new **`metrics/openmetrics`** feature causes
`slurmctld` to periodically crash with glibc heap errors:

```
double free or corruption (fasttop)
double free or corruption (!prev)
```

Removing only `MetricsType=metrics/openmetrics` from `slurm.conf` (that is,
disabling only the metrics feature) fully resolves the problem. All other
components and workflows were left unchanged.

This appears to be a memory corruption issue triggered by the combination of:

- `metrics/openmetrics`,
- frequent node state updates (`REQUEST_UPDATE_NODE`) issued by a Kubernetes
  **slurm-operator**, and
- concurrent RPC requests (`REQUEST_NODE_INFO`, `REQUEST_JOB_INFO`,
  `REQUEST_PARTITION_INFO`) coming from local services.

Once the metrics plugin is disabled, `slurmctld` becomes stable under the same workload.


## Steps to Reproduce

Environment:
- slurmctld **25.11** (tested on both amd64 and aarch64)
- Kubernetes deployment (controller + workers as Pods)
- slurm-operator (v1.0.0)
- Metrics plugin enabled:

slurm.conf
```
CgroupPlugin=autodetect
IgnoreSystemd=yes
EnableControllers=yes
ConstrainCores=yes
ConstrainRAMSpace=yes
ConstrainDevices=yes
ConstrainSwapSpace=yes

AutoDetect=nvidia

MetricsType=metrics/openmetrics
```

Reproduction scenario:

1. Start `slurmctld` with `metrics/openmetrics` enabled.
2. Allow the slurm-operator to manage node states (Pod creation and deletion).
   This results in bursts of:
   - `REQUEST_UPDATE_NODE`
   - `REQUEST_NODE_INFO`
   - `REQUEST_PARTITION_INFO`
   - `MESSAGE_NODE_REGISTRATION_STATUS`
3. Trigger or wait for:
   - reconfigures, or
   - metric scrapes from monitoring systems.
4. Observe that `slurmctld` frequently aborts with:
   ```
   double free or corruption (fasttop)
   SIGABRT (core dumped)
   ```

Disabling only the metrics plugin makes all crashes disappear,
even under the same load, operator churn, and reconfigure events.
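
To make the "metric scrapes" part of the scenario easier to trigger on demand, something like the following could be used to issue scrapes concurrently instead of waiting for the monitoring system. This is only a sketch: `scrape_burst`, `http_get`, and the example URL are hypothetical names introduced here, and the actual metrics endpoint must be replaced with whatever URL your monitoring system scrapes.

```python
import concurrent.futures
import urllib.request
from typing import Callable, Dict, List


def http_get(url: str, timeout: float = 5.0) -> int:
    """Fetch `url` and return the HTTP status code."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.status


def scrape_burst(
    urls: List[str],
    rounds: int,
    workers: int,
    fetch: Callable[[str], int] = http_get,
) -> Dict[str, int]:
    """Issue rounds * len(urls) GET requests from `workers` threads.

    Returns a count of successful (HTTP 200) and failed requests.
    """
    outcomes = {"ok": 0, "failed": 0}

    def one(url: str) -> bool:
        try:
            return fetch(url) == 200
        except Exception:
            # Connection errors are expected once slurmctld has aborted.
            return False

    jobs = [u for _ in range(rounds) for u in urls]
    with concurrent.futures.ThreadPoolExecutor(max_workers=workers) as pool:
        for success in pool.map(one, jobs):
            outcomes["ok" if success else "failed"] += 1
    return outcomes


# Example call (the URL is an assumption, not taken from this report):
# scrape_burst(["http://localhost:6817/metrics/nodes"], rounds=100, workers=8)
```

A sudden switch from `ok` to `failed` in the result would suggest that `slurmctld` went down during the burst, which can then be checked against the supervisor log.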


## Expected Behavior

`slurmctld` should remain stable with `metrics/openmetrics` enabled,
including during:
- RECONFIGURE events,
- frequent node state updates from operators,
- concurrent metric scrapes,
- high RPC activity, and
- elevated debug logging (debug5).

## Additional Context
Crash examples:
```
double free or corruption (!prev)
2025-12-09 05:16:11,712 WARN exited: slurmctld (SIGABRT, core dumped)
```

Crashes consistently align with:
- multiple simultaneous RPC connections from localhost:6817,
- rapid `REQUEST_UPDATE_NODE` (cordon/uncordon) calls,
- scheduler loop execution, and
- metric scrape or reconfigure timing.
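
To check this alignment more systematically, the crash timestamps can be pulled out of the supervisor log and compared against scrape or reconfigure times. The sketch below assumes every crash line has the same shape as the `WARN exited: slurmctld (...)` line shown above; `CRASH_RE` and `find_crashes` are hypothetical helper names.

```python
import re
from datetime import datetime
from typing import Iterable, List, Tuple

# Matches lines such as:
# 2025-12-09 05:16:11,712 WARN exited: slurmctld (SIGABRT, core dumped)
CRASH_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}) WARN exited: "
    r"slurmctld \((?P<signal>SIG[A-Z0-9]+)(?:, core dumped)?\)"
)


def find_crashes(lines: Iterable[str]) -> List[Tuple[datetime, str]]:
    """Return (timestamp, signal) for every slurmctld exit line found."""
    crashes = []
    for line in lines:
        match = CRASH_RE.match(line.strip())
        if match:
            ts = datetime.strptime(match.group("ts"), "%Y-%m-%d %H:%M:%S,%f")
            crashes.append((ts, match.group("signal")))
    return crashes
```

Passing an open log file to `find_crashes` yields a list of crash times that can be lined up with the monitoring system's scrape schedule.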

Relevant configuration snippets:
```
MetricsType=metrics/openmetrics
SlurmctldDebug=debug5
CgroupPlugin=autodetect
TaskPlugin=task/cgroup,task/affinity
JobAcctGatherType=jobacct_gather/cgroup
GresTypes=gpu
AutoDetect=nvidia
```

Because I'm working on AWS, I couldn't run `gdb` (`bt`) against the core dumps to capture a backtrace.
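
If a core file can be copied somewhere `gdb` is available, a non-interactive backtrace could be collected along these lines. This is a sketch under assumptions: the binary and core paths in the example are placeholders, not paths from this deployment, and `gdb_backtrace_argv` is a hypothetical helper that only builds the command line.

```python
import shlex
from typing import List


def gdb_backtrace_argv(binary: str, core: str) -> List[str]:
    """Build a gdb command that prints full backtraces of all threads and exits."""
    return [
        "gdb",
        "-batch",                           # run the commands, then exit
        "-ex", "thread apply all bt full",  # backtrace every thread with locals
        binary,
        core,
    ]


# Placeholder paths; substitute the real slurmctld binary and core file.
argv = gdb_backtrace_argv("/usr/sbin/slurmctld", "/tmp/core.slurmctld")
print(shlex.join(argv))
```

The resulting command can be run with `subprocess.run(argv, capture_output=True, text=True)` and its output attached to this issue. Debug symbols matching the running build are needed for the frames to be readable.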


This issue report is a summary created by ChatGPT from the conversation in which I debugged the issue with it.
