
[Bug]: Array job performance is abnormally slow #75

@GridexX

Description

Array job performance on Slurm deployed via the Slinky Helm chart is 5-8x slower than on-premise Slurm installations. For 1000-task array jobs, the K8s deployment takes 2m 54s vs 35s on-premise. Root cause analysis indicates hardcoded CRD configuration values prevent critical scheduler optimizations, specifically the inability to configure SelectType and related scheduler parameters.

The performance gap is primarily attributed to scheduler overhead rather than infrastructure issues; network latency, I/O, and CPU throttling were all tested and ruled out.

Steps to Reproduce

Environment:

  • Slinky Helm chart version: 0.4.1
  • Slurm version: 25.11-ubuntu24.04
  • Kubernetes cluster with 2 worker pods (always running, no cold start)
  • Resources: <70% memory utilization, no throttling detected

Test job script:

#!/bin/bash
#SBATCH --job-name=stress_test
#SBATCH --output=/tmp/stress_test_%A_%a.out
#SBATCH --time=2
#SBATCH --mem=500
#SBATCH --array=1-1000
echo "Task: $SLURM_ARRAY_TASK_ID"

Steps:

  1. Deploy Slurm using Slinky Helm chart with default configuration
  2. Submit the test array job: sbatch stress_test.sh
  3. Monitor completion time from submission until all tasks complete (one way to time this is sketched after this list)
  4. Observe: 2m 54s average completion time (range: 88-271s across runs)
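
For reference, one way to capture the end-to-end completion time from step 3 is a blocking submission timed from the shell (a minimal sketch; it assumes the script above is saved as stress_test.sh):

# Submit and block until every array task has finished, timing the whole run
time sbatch --wait stress_test.sh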

Performance metrics:

  • 100 array tasks: 10-12s
  • 500 array tasks: 47-70s
  • 1000 array tasks: 2m 54s average
  • Non-linear scaling indicates bottleneck saturation

Observed vs Slurm accounting discrepancy:

  • Observed total time: 2m 54s (real end-to-end)
  • Slurm accounting shows: 1.2s (actual task execution)
  • Gap: ~173s spent in scheduling/dispatch overhead
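
The gap can be surfaced directly from accounting (a sketch; JOBID is the array job ID printed by sbatch):

# Per-task Elapsed stays around a second, while the spread between the first
# Submit and the last End reflects the ~3 minute wall clock
sacct -j "$JOBID" -X --format=JobID,Submit,Start,End,Elapsed,State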

Controller logs show repeated partition calculations:

[2025-11-10T08:20:06] select/cons_tres: part_data_create_array: preparing for 2 partitions
[2025-11-10T08:20:06] select/cons_tres: part_data_create_array: preparing for 2 partitions
[2025-11-10T08:20:06] select/cons_tres: part_data_create_array: preparing for 2 partitions
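
For reference, these messages can be surfaced at runtime without restarting the controller (a sketch; the namespace and pod names below are placeholders, not the chart's actual resource names):

# Raise slurmctld verbosity and enable select-plugin debug output
scontrol setdebug debug2
scontrol setdebugflags +SelectType

# Follow the controller logs from outside the cluster
kubectl logs -f -n <namespace> <controller-pod>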

Expected Behavior

Array job performance should be comparable to on-premise deployments. For 1000 array tasks with minimal compute (echo command), expected completion time is 30-40 seconds, not 2-3 minutes.

Baseline comparison (on-premise Rocky8/OpenStack):

  • 1000 array tasks: 35.3s average
  • Linear scaling maintained across different array sizes

Expected configuration to be user-controllable:

  • SelectType (currently defaults to cons_tres behavior, should allow linear)
  • SchedulerTimeSlice (currently 30s, should be configurable to 5s for quick jobs)
  • SchedulerParameters (currently not set, should allow array job tuning)
  • MaxArraySize (currently 1001, insufficient for large-scale workloads)

Additional Context

Diagnostic Results

Infrastructure is not the bottleneck:

  • ✅ Network latency: 0.071ms average between controller and worker pods
  • ✅ No CPU/memory throttling detected
  • ✅ Database queries completing quickly
  • ✅ Task spawn rate: 100 tasks/second (excellent)
  • ✅ No shared storage I/O bottleneck

Root Cause: Hardcoded CRD Configuration

The Golang CRD generates slurm.conf with values that cannot be overridden via Helm controller.extraConfMap:

Current generated base config:

SelectType=<not explicitly set, defaults to cons_tres behavior>
SelectTypeParameters=CR_Core_Memory
SchedulerTimeSlice=30
SchedulerParameters=<not set>
MaxArraySize=1001
MaxNodeCount=1024
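
The effective values can be confirmed against the running slurmctld with scontrol (a sketch; run it anywhere the Slurm client tools can reach this cluster):

# Dump the scheduler-related settings the controller is actually using
scontrol show config | grep -Ei 'selecttype|schedulertimeslice|schedulerparam|maxarraysize|maxnodecount'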

Attempted workaround:
Adding optimizations via controller.extraConfMap:

controller:
  extraConfMap:
    SelectType: select/linear
    SchedulerTimeSlice: 5
    SchedulerParameters:
      - max_array_tasks=1000
      - default_queue_depth=10000

Problems encountered:

  1. Values are appended to the end of the generated config, sometimes too late to take effect
  2. MaxNodeCount=1024 (hardcoded in base) is incompatible with SelectType=select/linear, causing error:
    sbatch: error: MaxNodeCount only compatible with cons_tres
    sbatch: fatal: Unable to process configuration file
    
  3. Cannot remove or override base hardcoded values
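
To see where the extraConfMap entries actually land, the rendered configuration can be inspected inside the controller pod (a sketch; the pod name and the /etc/slurm/slurm.conf path are assumptions about this chart's layout):

# Show where the overrides end up relative to the hardcoded base values
kubectl exec -n <namespace> <controller-pod> -- \
  grep -n -Ei 'selecttype|schedulertimeslice|schedulerparameters|maxnodecount' /etc/slurm/slurm.conf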

Scheduler Algorithm Overhead

select/cons_tres (current default) recalculates resource allocation for every array task, creating exponential overhead for large arrays. select/linear is significantly faster for homogeneous quick jobs but cannot be properly configured.
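
Scheduler cycle cost can be observed directly with sdiag while an array is being dispatched (a sketch; sdiag ships with the standard Slurm client tools):

# Mean/last scheduler cycle time and queue depth during the array run
sdiag | grep -A 12 'Main schedule statistics'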

Proposed Solutions

Option 1: Make critical parameters configurable in CRD

controller:
  schedulerConfig:
    selectType: "select/linear"  # or "select/cons_tres"
    selectTypeParameters: "CR_Core_Memory"
    schedulerTimeSlice: 5
    schedulerParameters:
      - max_array_tasks=1000
      - default_queue_depth=10000
      - batch_sched_delay=1
    maxArraySize: 100000
    maxNodeCount: null  # Optional, only for cons_tres

Option 2: Performance profile system

controller:
  performanceProfile: "array-jobs"  # Applies optimized preset

Option 3: Make base config values overridable
Allow controller.extraConfMap to override (not just append to) base CRD-generated values.

Impact

This bottleneck significantly impacts workloads common in:

  • Scientific computing (parameter sweeps)
  • Data processing pipelines (parallel batch jobs)
  • ML/AI training (hyperparameter optimization)
  • Rendering farms (frame-by-frame processing)

For organizations migrating from on-premise to Kubernetes-based Slurm, this 5-8x performance degradation is a blocker for adoption.

Diagnostic Tools Available

Full diagnostic scripts and detailed performance measurements available. Happy to provide additional profiling data or test proposed fixes.
