
Pods remain in Completing state, inconsistent with specified lifecycle policy #1956

Closed
kye308 opened this issue Jan 13, 2022 · 16 comments
Labels
kind/bug: Categorizes issue or PR as related to a bug.
lifecycle/stale: Denotes an issue or PR has remained open with no activity and has become stale.

Comments


kye308 commented Jan 13, 2022

What happened: Pods in the job remain Running, and the job is stuck in Completing, even though another task with a CompleteJob policy on the TaskCompleted event has finished.

What you expected to happen: The remaining pods are terminated and the job completes shortly after that task completes.

How to reproduce it (as minimally and precisely as possible):

vcjob yaml used

apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: vc14-test-1
spec:
  minAvailable: 3
  plugins:
    env: []
    svc: []
  policies:
  - action: RestartTask
    events:
    - PodEvicted
  - action: TerminateJob
    events:
    - PodFailed
  queue: default
  tasks:
  - name: driver
    policies:
    - action: CompleteJob
      events:
      - TaskCompleted
    replicas: 1
    template:
      spec:
        containers:
        - args:
          - -c
          - "echo Driver Started!; sleep 20; echo Still working...; sleep 20; echo Job Finished!" 
          command:
            - /bin/sh
          image: ubuntu:20.04
          name: container-0
          volumeMounts:
          - mountPath: /dev/shm
            name: bigger-shm
        restartPolicy: Never
        terminationGracePeriodSeconds: 90
        volumes:
        - emptyDir:
            medium: Memory
          name: bigger-shm
  - name: worker
    replicas: 2
    template:
      spec:
        containers:
        - args:
          - -c
          - "echo worker Started!; sleep 450; echo Still working...; sleep 450; echo Job Finished!" 
          command:
            - /bin/sh
          image: ubuntu:20.04
          name: container-0
          volumeMounts:
          - mountPath: /dev/shm
            name: bigger-shm
        restartPolicy: Never
        terminationGracePeriodSeconds: 90
        volumes:
        - emptyDir:
            medium: Memory
          name: bigger-shm

vcjob status

status:
  minAvailable: 3
  running: 2
  runningDuration: 45.576421357s
  state:
    lastTransitionTime: "2022-01-13T22:16:09Z"
    phase: Completing
  succeeded: 1
  taskStatusCount:
    driver:
      phase:
        Succeeded: 1
    worker:
      phase:
        Running: 2
  version: 3

The pods were finally completed after almost 24 hours.

Anything else we need to know?:
This was previously working on version 1.3.0. Pods would be marked completed within 10 minutes.

Environment:

  • Volcano Version: 1.4.0
kye308 added the kind/bug label Jan 13, 2022

kye308 commented Jan 13, 2022

For this job, the pods reached the Completed state after about an hour.

kye308 commented Jan 15, 2022

kye308 commented Jan 15, 2022


hwdef commented Jan 15, 2022

/cc


hwdef commented Jan 18, 2022

#1746

Please use a Volcano version that includes this commit.


kye308 commented Jan 18, 2022

@hwdef we are using the v1.4.0 tag which appears to include this change already. Is my understanding correct?


hwdef commented Jan 19, 2022

Yes, v1.4.0 includes this commit; the issue may be caused by something else.

william-wang commented

@hwdef @shinytang6 We need to try to reproduce the issue based on kye308's input in our environment.


kye308 commented Jan 27, 2022

Hi all,

I was able to fix this issue by cherry-picking #1719. Would it be possible to cherry-pick that change into the v1.4.0 release?

william-wang commented

@kye308 Sure. It's reasonable to cherry-pick it to v1.4.0.


kye308 commented Jan 29, 2022

@william-wang do you have any estimate on when this can be included in the v1.4.0 release on dockerhub?

william-wang commented

@kye308 The bugfix has been merged to the release-1.4 branch. We plan to update the image on Docker Hub this week.


kye308 commented Feb 7, 2022

@william-wang 👍 thanks for the update


kye308 commented Feb 18, 2022

@william-wang any updates on the dockerhub image? would it include this change as well: #2026


stale bot commented May 31, 2022

Hello 👋 It looks like there has been no activity on this issue for the last 90 days.
Do you mind updating us on the status? Is this still reproducible or needed? If yes, just comment on this issue or push a commit. Thanks! 🤗
If there is no activity for another 60 days, this issue will be closed (we can always reopen an issue if needed!).

stale bot added the lifecycle/stale label May 31, 2022

stale bot commented Jul 30, 2022

Closing for now, as there was no activity for the last 60 days after being marked stale. Let us know if you need this to be reopened! 🤗

stale bot closed this as completed Jul 30, 2022
4 participants