
[Regression] Infinite reconciliation for unschedulable AppWrappers #618

Closed
@sutaakar

Description

Describe the Bug

When I create an AppWrapper whose custompodresources CPU requests and limits exceed the available cluster CPU, the MCAD controller gets stuck in an infinite reconciliation loop: it starts reconciling the AppWrapper every few milliseconds, completely cluttering the MCAD log.

Created AppWrapper (taken from the CodeFlare operator e2e test suite, with requests and limits adjusted):

apiVersion: mcad.ibm.com/v1beta1
kind: AppWrapper
metadata:
  name: mnist
spec:
  resources:
    GenericItems:
      - allocated: 0
        custompodresources:
          - limits:
              cpu: '4' # adjusted to exceed the cluster's total allocatable CPU
              memory: 1G
            replicas: 1
            requests:
              cpu: '4' # matching request; this is what makes the AppWrapper unschedulable
              memory: 512Mi
        generictemplate:
          apiVersion: batch/v1
          kind: Job
          metadata:
            name: mnist
            namespace: test-ns-xqlv6
          spec:
            completions: 1
            parallelism: 1
            template:
              metadata:
                creationTimestamp: null
              spec:
                containers:
                  - command:
                      - /bin/sh
                      - '-c'
                      - >-
                        pip install -r /test/requirements.txt && torchrun
                        /test/mnist.py
                    image: 'pytorch/pytorch:1.11.0-cuda11.3-cudnn8-runtime'
                    name: job
                    resources: {}
                    volumeMounts:
                      - mountPath: /test
                        name: test
                restartPolicy: Never
                volumes:
                  - configMap:
                      name: mnist-mcad
                    name: test
          status: {}
        priority: 0
        priorityslope: 0
        replicas: 1
  schedulingSpec:
    dispatchDuration: {}
    requeuing:
      growthType: exponential
      maxNumRequeuings: 0
      maxTimeInSeconds: 0
      numRequeuings: 0
      timeInSeconds: 300
  service:
    spec: {}

The behavior was observed in MCAD 1.34.0. It is a regression from 1.33.0, where I could not reproduce it.

Codeflare Stack Component Versions

Codeflare SDK: N/A
MCAD: v1.34.0
Instascale: N/A
Codeflare Operator: latest version from main branch, running locally
Other: Tested on OpenShift CRC

Steps to Reproduce the Bug

  1. Deploy the latest CodeFlare operator and MCAD (e.g., using OLM)
  2. Check the cluster's total allocatable node CPU (see the sketch after this list)
  3. Adjust the sample AppWrapper to request more CPU than the cluster nodes provide
  4. Create the AppWrapper on the cluster
  5. Check the MCAD logs
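
For step 2, here is a minimal client-go sketch (not part of MCAD or the e2e suite; kubeconfig location and error handling are simplified) that sums allocatable CPU across all nodes, so the sample AppWrapper's requests can be bumped above the printed total:

package main

import (
	"context"
	"fmt"

	"k8s.io/apimachinery/pkg/api/resource"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Load kubeconfig from the default location (~/.kube/config).
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	nodes, err := client.CoreV1().Nodes().List(context.TODO(), metav1.ListOptions{})
	if err != nil {
		panic(err)
	}

	// Sum allocatable CPU over all nodes; any custompodresources CPU request
	// above this total should leave the AppWrapper permanently unschedulable.
	total := resource.NewQuantity(0, resource.DecimalSI)
	for _, n := range nodes.Items {
		total.Add(*n.Status.Allocatable.Cpu())
	}
	fmt.Printf("total allocatable CPU: %s\n", total)
}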

What Have You Already Tried to Debug the Issue?

I compared the MCAD logs between 1.33.0 and 1.34.0 to see the difference in log size and content. No other debugging was done.

Expected Behavior

Reconciliation of unschedulable AppWrappers should respect the configured retry intervals and back off between attempts, as it did in MCAD 1.33.0.
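
To illustrate, a hedged sketch of the intended pattern, assuming a controller-runtime-style reconciler (the field names mirror the schedulingSpec.requeuing block above; this is not MCAD's actual implementation): an unschedulable AppWrapper should be retried via a growing RequeueAfter (300s, 600s, 1200s, ...) rather than requeued immediately.

package main

import (
	"fmt"
	"time"

	ctrl "sigs.k8s.io/controller-runtime"
)

// nextRequeueDelay doubles the base interval for each prior requeue when
// growthType is "exponential", capping at maxTimeInSeconds when it is set.
func nextRequeueDelay(timeInSeconds, numRequeuings, maxTimeInSeconds int) time.Duration {
	delay := timeInSeconds << numRequeuings // exponential growth
	if maxTimeInSeconds > 0 && delay > maxTimeInSeconds {
		delay = maxTimeInSeconds
	}
	return time.Duration(delay) * time.Second
}

// requeueUnschedulable returns the result a reconciler would hand back to the
// workqueue: RequeueAfter delays the retry instead of reprocessing the object
// immediately, which is the millisecond-level looping reported in this issue.
func requeueUnschedulable(numRequeuings int) (ctrl.Result, error) {
	return ctrl.Result{RequeueAfter: nextRequeueDelay(300, numRequeuings, 0)}, nil
}

func main() {
	for i := 0; i < 4; i++ {
		res, _ := requeueUnschedulable(i)
		fmt.Printf("retry %d scheduled after %s\n", i+1, res.RequeueAfter)
	}
}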

Screenshots, Console Output, Logs, etc.

N/A

Affected Releases

v1.34.0

Additional Context

  • OS: Linux
  • OS Version: Fedora 38
  • Browser (UI issues): N/A
  • Browser Version (UI issues): N/A
  • Cloud: on-premise
  • Kubernetes: OpenShift CRC; also observed on KinD
  • OpenShift or K8s version: OCP 4.13