Description
Describe the Bug
When I create an AppWrapper whose custompodresources CPU requests and limits exceed the available cluster CPU, the MCAD controller gets stuck in an infinite reconciliation loop: it reconciles the AppWrapper every couple of milliseconds, completely cluttering the MCAD log.
The AppWrapper I created (taken from the CodeFlare operator e2e test suite, with requests and limits adjusted):
apiVersion: mcad.ibm.com/v1beta1
kind: AppWrapper
metadata:
  name: mnist
spec:
  resources:
    GenericItems:
      - allocated: 0
        custompodresources:
          - limits:
              cpu: '4'
              memory: 1G
            replicas: 1
            requests:
              cpu: '4'
              memory: 512Mi
        generictemplate:
          apiVersion: batch/v1
          kind: Job
          metadata:
            name: mnist
            namespace: test-ns-xqlv6
          spec:
            completions: 1
            parallelism: 1
            template:
              metadata:
                creationTimestamp: null
              spec:
                containers:
                  - command:
                      - /bin/sh
                      - '-c'
                      - >-
                        pip install -r /test/requirements.txt && torchrun
                        /test/mnist.py
                    image: 'pytorch/pytorch:1.11.0-cuda11.3-cudnn8-runtime'
                    name: job
                    resources: {}
                    volumeMounts:
                      - mountPath: /test
                        name: test
                restartPolicy: Never
                volumes:
                  - configMap:
                      name: mnist-mcad
                    name: test
          status: {}
        priority: 0
        priorityslope: 0
        replicas: 1
  schedulingSpec:
    dispatchDuration: {}
    requeuing:
      growthType: exponential
      maxNumRequeuings: 0
      maxTimeInSeconds: 0
      numRequeuings: 0
      timeInSeconds: 300
  service:
    spec: {}
The behavior was observed in MCAD 1.34.0. It is a regression from 1.33.0, where I could not reproduce it.
Codeflare Stack Component Versions
Please specify the component versions in which you have encountered this bug.
Codeflare SDK: N/A
MCAD: v1.34.0
Instascale: N/A
Codeflare Operator: latest version from main branch, running locally
Other: Tested on OpenShift CRC
Steps to Reproduce the Bug
- Deploy the latest CodeFlare operator and MCAD (e.g. using OLM)
- Check cluster node CPU
- Adjust the sample AppWrapper to require more CPU than the cluster node has
- Create AppWrapper on cluster
- Check MCAD logs
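The comparison behind the third step can be sketched as follows. This is only an illustration, not code from MCAD or the CodeFlare operator: `cpuMillis` and `exceedsNode` are hypothetical helpers, only the plain CPU quantity forms (`4`, `0.5`, `500m`) are handled, and the `3500m` allocatable value is made up.

```go
package main

import (
	"fmt"
	"math"
	"strconv"
	"strings"
)

// cpuMillis converts a CPU quantity such as "4", "0.5" or "500m" into
// millicores. Hypothetical helper: only these plain forms are supported.
func cpuMillis(q string) (int64, error) {
	q = strings.TrimSpace(q)
	if strings.HasSuffix(q, "m") {
		return strconv.ParseInt(strings.TrimSuffix(q, "m"), 10, 64)
	}
	cores, err := strconv.ParseFloat(q, 64)
	if err != nil {
		return 0, err
	}
	return int64(math.Round(cores * 1000)), nil
}

// exceedsNode reports whether the requested CPU is larger than the
// allocatable CPU of a node, so the workload cannot fit on that node.
func exceedsNode(request, allocatable string) (bool, error) {
	r, err := cpuMillis(request)
	if err != nil {
		return false, err
	}
	a, err := cpuMillis(allocatable)
	if err != nil {
		return false, err
	}
	return r > a, nil
}

func main() {
	// "4" is the CPU request from the AppWrapper above;
	// "3500m" is an assumed allocatable value for a single-node cluster.
	over, err := exceedsNode("4", "3500m")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("request exceeds node CPU:", over)
}
```

If the result is true for every node, the AppWrapper should stay unschedulable, which is the precondition for reproducing the loop.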
What Have You Already Tried to Debug the Issue?
I compared the MCAD logs from 1.33.0 and 1.34.0 to see the difference in log size and content. I have not done any other debugging.
Expected Behavior
Reconciliation of unschedulable AppWrappers should respect the retry intervals, as it does in MCAD 1.33.0.
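To make "respecting retry intervals" concrete, here is a minimal sketch of a growing delay between reconcile attempts. It is not MCAD's implementation: `retryDelaySeconds` is a hypothetical helper, the doubling rule and the "zero means no cap" convention are assumptions, and the 300-second base is borrowed from the sample AppWrapper purely for illustration.

```go
package main

import "fmt"

// retryDelaySeconds returns how long to wait before the next reconcile
// attempt of an AppWrapper that could not be scheduled.
// Hypothetical helper, not part of MCAD: the delay starts at baseSeconds
// and doubles on every attempt. maxSeconds caps the delay, and
// maxSeconds <= 0 is assumed here to mean "no cap".
func retryDelaySeconds(attempt, baseSeconds, maxSeconds int) int {
	if attempt < 0 {
		attempt = 0
	}
	// Clamp the number of doublings so the arithmetic stays bounded.
	if attempt > 16 {
		attempt = 16
	}
	delay := baseSeconds
	for i := 0; i < attempt; i++ {
		// Stop doubling once the cap has been reached.
		if maxSeconds > 0 && delay >= maxSeconds {
			break
		}
		delay *= 2
	}
	if maxSeconds > 0 && delay > maxSeconds {
		delay = maxSeconds
	}
	return delay
}

func main() {
	// Print the waits for the first few attempts with a 300s base and no cap.
	for attempt := 0; attempt < 4; attempt++ {
		fmt.Printf("attempt %d: wait %ds\n", attempt, retryDelaySeconds(attempt, 300, 0))
	}
}
```

The point of the sketch is that every attempt is separated by a non-zero wait, in contrast to the millisecond-scale reconciliation described in this report.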
Screenshots, Console Output, Logs, etc.
N/A
Affected Releases
v1.34.0
Additional Context
Add as applicable and when known:
- OS: Linux
- OS Version: Fedora 38
- Browser (UI issues): N/A
- Browser Version (UI issues): N/A
- Cloud: on-premise
- Kubernetes: OpenShift CRC; also observed in KinD
- OpenShift or K8s version: OCP 4.13