
[Bug]: Array job performance is abnormally slow #75

@GridexX

Description

Array job performance on Slurm deployed via the Slinky Helm chart is 5-8x slower than on-premise Slurm installations. For 1000-task array jobs, the K8s deployment takes 2m 54s vs 35s on-premise. Root cause analysis indicates hardcoded CRD configuration values prevent critical scheduler optimizations, specifically the inability to configure SelectType and related scheduler parameters.

The performance gap is primarily attributed to scheduler overhead rather than infrastructure issues; network latency, I/O, and CPU throttling were all tested and ruled out.

Steps to Reproduce

Environment:

  • Slinky Helm chart version: 0.4.1
  • Slurm version: 25.11-ubuntu24.04
  • Kubernetes cluster with 2 worker pods (always running, no cold start)
  • Resources: <70% memory utilization, no throttling detected

Test job script:

#!/bin/bash
#SBATCH --job-name=stress_test
#SBATCH --output=/tmp/stress_test_%A_%a.out
#SBATCH --time=2
#SBATCH --mem=500
#SBATCH --array=1-1000
echo "Task: $SLURM_ARRAY_TASK_ID"

Steps:

  1. Deploy Slurm using Slinky Helm chart with default configuration
  2. Submit the test array job: sbatch stress_test.sh
  3. Monitor completion time from submission until all tasks complete (one way to time this is sketched after this list)
  4. Observe: 2m 54s average completion time (range: 88-271s across runs)
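
For reference, one way to capture the end-to-end completion time from step 3 is a blocking submission timed from the shell (a minimal sketch; it assumes the script above is saved as stress_test.sh):

# Submit and block until every array task has finished, timing the whole run
time sbatch --wait stress_test.sh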

Performance metrics:

  • 100 array tasks: 10-12s
  • 500 array tasks: 47-70s
  • 1000 array tasks: 2m 54s average
  • Non-linear scaling indicates bottleneck saturation

Observed vs Slurm accounting discrepancy:

  • Observed total time: 2m 54s (real end-to-end)
  • Slurm accounting shows: 1.2s (actual task execution)
  • Gap: ~173s spent in scheduling/dispatch overhead
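
The gap can be surfaced directly from accounting (a sketch; JOBID is the array job ID printed by sbatch):

# Per-task Elapsed stays around a second, while the spread between the first
# Submit and the last End reflects the ~3 minute wall clock
sacct -j "$JOBID" -X --format=JobID,Submit,Start,End,Elapsed,State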

Controller logs show repeated partition calculations:

[2025-11-10T08:20:06] select/cons_tres: part_data_create_array: preparing for 2 partitions
[2025-11-10T08:20:06] select/cons_tres: part_data_create_array: preparing for 2 partitions
[2025-11-10T08:20:06] select/cons_tres: part_data_create_array: preparing for 2 partitions
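
For reference, these messages can be surfaced at runtime without restarting the controller (a sketch; the namespace and pod names below are placeholders, not the chart's actual resource names):

# Raise slurmctld verbosity and enable select-plugin debug output
scontrol setdebug debug2
scontrol setdebugflags +SelectType

# Follow the controller logs from outside the cluster
kubectl logs -f -n <namespace> <controller-pod>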

Expected Behavior

Array job performance should be comparable to on-premise deployments. For 1000 array tasks with minimal compute (echo command), expected completion time is 30-40 seconds, not 2-3 minutes.

Baseline comparison (on-premise Rocky8/OpenStack):

  • 1000 array tasks: 35.3s average
  • Linear scaling maintained across different array sizes

Expected configuration to be user-controllable:

  • SelectType (currently defaults to cons_tres behavior, should allow linear)
  • SchedulerTimeSlice (currently 30s, should be configurable to 5s for quick jobs)
  • SchedulerParameters (currently not set, should allow array job tuning)
  • MaxArraySize (currently 1001, insufficient for large-scale workloads)

Additional Context

Diagnostic Results

Infrastructure is not the bottleneck:

  • ✅ Network latency: 0.071ms average between controller and worker pods
  • ✅ No CPU/memory throttling detected
  • ✅ Database queries completing quickly
  • ✅ Task spawn rate: 100 tasks/second (excellent)
  • ✅ No shared storage I/O bottleneck

Root Cause: Hardcoded CRD Configuration

The Golang CRD generates slurm.conf with values that cannot be overridden via Helm controller.extraConfMap:

Current generated base config:

SelectType=<not explicitly set, defaults to cons_tres behavior>
SelectTypeParameters=CR_Core_Memory
SchedulerTimeSlice=30
SchedulerParameters=<not set>
MaxArraySize=1001
MaxNodeCount=1024
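
The effective values can be confirmed against the running slurmctld with scontrol (a sketch; run it anywhere the Slurm client tools can reach this cluster):

# Dump the scheduler-related settings the controller is actually using
scontrol show config | grep -Ei 'selecttype|schedulertimeslice|schedulerparam|maxarraysize|maxnodecount'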

Attempted workaround:
Adding optimizations via controller.extraConfMap:

controller:
  extraConfMap:
    SelectType: select/linear
    SchedulerTimeSlice: 5
    SchedulerParameters:
      - max_array_tasks=1000
      - default_queue_depth=10000

Problems encountered:

  1. Values are appended to the end of the generated config, sometimes too late to take effect
  2. MaxNodeCount=1024 (hardcoded in base) is incompatible with SelectType=select/linear, causing error:
    sbatch: error: MaxNodeCount only compatible with cons_tres
    sbatch: fatal: Unable to process configuration file
    
  3. Cannot remove or override base hardcoded values
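
To see where the extraConfMap entries actually land, the rendered configuration can be inspected inside the controller pod (a sketch; the pod name and the /etc/slurm/slurm.conf path are assumptions about this chart's layout):

# Show where the overrides end up relative to the hardcoded base values
kubectl exec -n <namespace> <controller-pod> -- \
  grep -n -Ei 'selecttype|schedulertimeslice|schedulerparameters|maxnodecount' /etc/slurm/slurm.conf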

Scheduler Algorithm Overhead

select/cons_tres (current default) recalculates resource allocation for every array task, creating exponential overhead for large arrays. select/linear is significantly faster for homogeneous quick jobs but cannot be properly configured.
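
Scheduler cycle cost can be observed directly with sdiag while an array is being dispatched (a sketch; sdiag ships with the standard Slurm client tools):

# Mean/last scheduler cycle time and queue depth during the array run
sdiag | grep -A 12 'Main schedule statistics'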

Proposed Solutions

Option 1: Make critical parameters configurable in CRD

controller:
  schedulerConfig:
    selectType: "select/linear"  # or "select/cons_tres"
    selectTypeParameters: "CR_Core_Memory"
    schedulerTimeSlice: 5
    schedulerParameters:
      - max_array_tasks=1000
      - default_queue_depth=10000
      - batch_sched_delay=1
    maxArraySize: 100000
    maxNodeCount: null  # Optional, only for cons_tres

Option 2: Performance profile system

controller:
  performanceProfile: "array-jobs"  # Applies optimized preset

Option 3: Make base config values overridable
Allow controller.extraConfMap to override (not just append to) base CRD-generated values.

Impact

This bottleneck significantly impacts workloads common in:

  • Scientific computing (parameter sweeps)
  • Data processing pipelines (parallel batch jobs)
  • ML/AI training (hyperparameter optimization)
  • Rendering farms (frame-by-frame processing)

For organizations migrating from on-premise to Kubernetes-based Slurm, this 5-8x performance degradation is a blocker for adoption.

Diagnostic Tools Available

Full diagnostic scripts and detailed performance measurements available. Happy to provide additional profiling data or test proposed fixes.
