Description
Array job performance on Slurm deployed via Slinky Helm chart is 5-8x slower compared to on-premise Slurm installations. For 1000-task array jobs, K8s deployment takes 2m 54s vs 35s on-premise. Root cause analysis indicates hardcoded CRD configuration values prevent critical scheduler optimizations, specifically the inability to configure SelectType and related scheduler parameters.
The performance gap is primarily attributed to scheduler overhead rather than infrastructure issues (network latency, I/O, CPU throttling all tested and cleared).
Steps to Reproduce
Environment:
- Slinky Helm chart version: 0.4.1
- Slurm version: 25.11-ubuntu24.04
- Kubernetes cluster with 2 worker pods (always running, no cold start)
- Resources: <70% memory utilization, no throttling detected
Test job script:
#!/bin/bash
#SBATCH --job-name=stress_test
#SBATCH --output=/tmp/stress_test_%A_%a.out
#SBATCH --time=2
#SBATCH --mem=500
#SBATCH --array=1-1000
echo "Task: $SLURM_ARRAY_TASK_ID"Steps:
- Deploy Slurm using Slinky Helm chart with default configuration
- Submit the test array job:
sbatch stress_test.sh
- Monitor completion time from submission until all tasks complete (one way to measure this is sketched after this list)
- Observe: 2m 54s average completion time (range: 88-271s across runs)
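One way to measure the end-to-end time is to submit the job and poll the queue until all tasks have left it. A minimal sketch, assuming stress_test.sh is in the current directory and using an arbitrary 2-second poll interval:
#!/bin/bash
# Submit the array job and record wall-clock time until no tasks remain in the queue.
start=$(date +%s)
jobid=$(sbatch --parsable stress_test.sh)
while squeue -h -j "$jobid" | grep -q .; do
    sleep 2
done
end=$(date +%s)
echo "Job $jobid: all tasks completed in $((end - start)) seconds"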
Performance metrics:
- 100 array tasks: 10-12s
- 500 array tasks: 47-70s
- 1000 array tasks: 2m 54s average
- Non-linear scaling (10x the tasks takes roughly 15x longer than the 100-task case) indicates a saturating bottleneck
Observed vs Slurm accounting discrepancy:
- Observed total time: 2m 54s (real end-to-end)
- Slurm accounting shows: 1.2s (actual task execution)
- Gap: ~173s spent in scheduling/dispatch overhead
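The discrepancy is visible in Slurm accounting after the run. A minimal sketch using sacct, with <jobid> as a placeholder for the submitted array job ID:
# -X restricts output to allocations (one row per array task, no job steps).
# Compare per-task Elapsed (execution) against Submit and the End time of the last task (end-to-end).
sacct -j <jobid> -X --format=JobID,Submit,Start,End,Elapsed --parsable2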
Controller logs show repeated partition calculations:
[2025-11-10T08:20:06] select/cons_tres: part_data_create_array: preparing for 2 partitions
[2025-11-10T08:20:06] select/cons_tres: part_data_create_array: preparing for 2 partitions
[2025-11-10T08:20:06] select/cons_tres: part_data_create_array: preparing for 2 partitions
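The rate of these recalculations can be checked from the controller pod logs. A minimal sketch, assuming a slurm namespace and an app.kubernetes.io/name=slurmctld label on the controller pod (both are assumptions about the Slinky deployment):
# Count partition-data rebuilds logged by cons_tres during the last run.
kubectl logs -n slurm -l app.kubernetes.io/name=slurmctld --since=10m | grep -c 'part_data_create_array'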
Expected Behavior
Array job performance should be comparable to on-premise deployments. For 1000 array tasks with minimal compute (echo command), expected completion time is 30-40 seconds, not 2-3 minutes.
Baseline comparison (on-premise Rocky8/OpenStack):
- 1000 array tasks: 35.3s average
- Linear scaling maintained across different array sizes
Expected configuration to be user-controllable:
- SelectType (currently defaults to cons_tres behavior; should allow linear)
- SchedulerTimeSlice (currently 30s; should be configurable to 5s for quick jobs)
- SchedulerParameters (currently not set; should allow array job tuning)
- MaxArraySize (currently 1001; insufficient for large-scale workloads)
Additional Context
Diagnostic Results
Infrastructure is not the bottleneck:
- ✅ Network latency: 0.071ms average between controller and worker pods
- ✅ No CPU/memory throttling detected
- ✅ Database queries completing quickly
- ✅ Task spawn rate: 100 tasks/second (excellent)
- ✅ No shared storage I/O bottleneck
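For completeness, a pod-to-pod latency check of this kind can be reproduced as follows (a sketch; the pod names and namespace are placeholders, and it assumes ping is available in the controller image):
# Measure round-trip latency from the controller pod to one worker pod.
worker_ip=$(kubectl get pod -n slurm slurm-worker-0 -o jsonpath='{.status.podIP}')
kubectl exec -n slurm slurm-controller-0 -- ping -c 10 "$worker_ip"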
Root Cause: Hardcoded CRD Configuration
The Golang CRD generates slurm.conf with values that cannot be overridden via Helm controller.extraConfMap:
Current generated base config:
SelectType=<not explicitly set, defaults to cons_tres behavior>
SelectTypeParameters=CR_Core_Memory
SchedulerTimeSlice=30
SchedulerParameters=<not set>
MaxArraySize=1001
MaxNodeCount=1024
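The values actually in effect can be confirmed from inside the controller pod with scontrol (a sketch; pod name and namespace are placeholders):
# Show the scheduler-related settings slurmctld is currently running with.
kubectl exec -n slurm slurm-controller-0 -- scontrol show config | grep -E 'SelectType|SchedulerTimeSlice|SchedulerParameters|MaxArraySize|MaxNodeCount'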
Attempted workaround:
Adding optimizations via controller.extraConfMap:
controller:
  extraConfMap:
    SelectType: select/linear
    SchedulerTimeSlice: 5
    SchedulerParameters:
      - max_array_tasks=1000
      - default_queue_depth=10000
Problems encountered:
- Values are appended to end of config, sometimes too late to take effect
- MaxNodeCount=1024 (hardcoded in base) is incompatible with SelectType=select/linear, causing:
sbatch: error: MaxNodeCount only compatible with cons_tres
sbatch: fatal: Unable to process configuration file
- Cannot remove or override base hardcoded values
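That extraConfMap values only get appended after the hardcoded base settings can be verified by inspecting the rendered config inside the controller pod (a sketch; the pod name, namespace, and /etc/slurm/slurm.conf path are assumptions):
# Print matching lines with line numbers; appended overrides appear after the base values.
kubectl exec -n slurm slurm-controller-0 -- grep -nE 'SelectType|SchedulerTimeSlice|SchedulerParameters|MaxNodeCount' /etc/slurm/slurm.conf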
Scheduler Algorithm Overhead
select/cons_tres (the current default) recalculates resource allocation for every array task, so scheduling overhead grows sharply with array size. select/linear is significantly faster for homogeneous quick jobs but cannot be properly configured.
Proposed Solutions
Option 1: Make critical parameters configurable in CRD
controller:
  schedulerConfig:
    selectType: "select/linear"  # or "select/cons_tres"
    selectTypeParameters: "CR_Core_Memory"
    schedulerTimeSlice: 5
    schedulerParameters:
      - max_array_tasks=1000
      - default_queue_depth=10000
      - batch_sched_delay=1
    maxArraySize: 100000
    maxNodeCount: null  # Optional, only for cons_tres
Option 2: Performance profile system
controller:
  performanceProfile: "array-jobs"  # Applies optimized preset
Option 3: Make base config values overridable
Allow controller.extraConfMap to override (not just append to) base CRD-generated values.
Impact
This bottleneck significantly impacts workloads common in:
- Scientific computing (parameter sweeps)
- Data processing pipelines (parallel batch jobs)
- ML/AI training (hyperparameter optimization)
- Rendering farms (frame-by-frame processing)
For organizations migrating from on-premise to Kubernetes-based Slurm, this 5-8x performance degradation is a blocker for adoption.
Diagnostic Tools Available
Full diagnostic scripts and detailed performance measurements available. Happy to provide additional profiling data or test proposed fixes.