@jorgee
Contributor

@jorgee jorgee commented Oct 2, 2025

Closes #6436

This PR improves error handling in the K8s task handler by prioritizing the exit code from the Kubernetes API over the .exitcode file created by Nextflow.

Rationale:
In case of errors like OOMKilled or pod eviction, the container may terminate abruptly before the exit file is written. The exit code from the K8s API (via the container's terminated state) is more reliable in these scenarios.

Implementation:

  • First check the K8s container terminated state for the exit code
  • If the K8s exit code is 0 (successful) or missing, fall back to reading from the .exitcode file
  • This ensures proper error detection and reporting for abrupt container terminations
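
A rough illustration of the fallback order listed above, written as a Python sketch rather than the actual Groovy code in the task handler. The function name `resolve_exit_code` and its parameters are hypothetical, and returning the K8s value when the file is missing or unparsable is an assumption made only for this example:

```python
from pathlib import Path
from typing import Optional


def resolve_exit_code(k8s_exit_code: Optional[int], exit_file: Path) -> Optional[int]:
    """Pick the exit code to report for a finished task (illustrative only).

    k8s_exit_code: value taken from the container's terminated state,
                   or None if the API did not report one.
    exit_file:     path to the .exitcode file in the task work dir.
    """
    # A non-zero code from the K8s API wins: it is still available when the
    # container was killed before the exit file could be written.
    if k8s_exit_code is not None and k8s_exit_code != 0:
        return k8s_exit_code

    # K8s code is 0 or missing: fall back to the .exitcode file.
    try:
        return int(exit_file.read_text().strip())
    except (FileNotFoundError, ValueError):
        # Assumption for this sketch: with no usable file, keep whatever
        # the API reported (0 or None).
        return k8s_exit_code
```

With this ordering, an OOM-killed container that never wrote its exit file is still reported as 137, while a normal run keeps using the file as before.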

Reference:
https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.30/#containerstateterminated-v1-core

Signed-off-by: jorgee <jorge.ejarque@seqera.io>
@jorgee jorgee self-assigned this Oct 2, 2025
@pditommaso
Member

Any tests? More context?

@jorgee
Contributor Author

jorgee commented Oct 2, 2025

A customer reported that OOM retries triggered by exit code 137 stopped working after moving from AWS Batch to K8s.
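
For context on the 137 value: by the usual shell convention, a process killed by signal n is reported with exit status 128 + n, so 137 corresponds to signal 9 (SIGKILL on Linux), which is what the kernel OOM killer sends. A minimal Python sketch of that arithmetic, with `signal_from_exit_code` as a hypothetical helper name:

```python
from typing import Optional


def signal_from_exit_code(code: int) -> Optional[int]:
    """Return the signal number encoded in an exit status, or None.

    Follows the 128 + n convention; statuses of 128 or below are
    treated as ordinary exit codes, not signals.
    """
    if 128 < code <= 255:
        return code - 128
    return None


print(signal_from_exit_code(137))  # 9, i.e. SIGKILL on Linux
```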

@jorgee
Contributor Author

jorgee commented Oct 3, 2025

It seems to work with a local k3d deployment:

nextflow kuberun -head-image jorgeejarquea/nextflow:25.08.0-edge-3 robsyme/nf-test -latest -r mem-testing -v nextflow-pvc:/mnt/data/launch --memory "3 GB"
Pod started: cheesy-galileo
N E X T F L O W  ~  version 25.08.0-edge
Pulling robsyme/nf-test ...
 Already-up-to-date
Launching `https://github.com/robsyme/nf-test` [cheesy-galileo] DSL2 - revision: 1ccbd5b66f [mem-testing]
[ee/a6f156] Submitted process > UseMem
ERROR ~ Error executing process > 'UseMem'

Caused by:
  Process `UseMem` terminated with an error exit status (137)


Command executed:

  allocate.py 3 1

Command exit status:
  137

Command output:
  (empty)

Work dir:
  /mnt/data/launch/jorgee/work/ee/a6f15618a16163f45bd7a00860ea51

Container:
  docker.io/python:3.10.14

Tip: you can try to figure out what's wrong by changing to the process work dir and showing the script file named `.command.sh`

 -- Check '.nextflow.log' file for details

Signed-off-by: jorgee <jorge.ejarque@seqera.io>
@jorgee jorgee marked this pull request as ready for review October 3, 2025 11:46
jorgee and others added 3 commits October 3, 2025 16:17
Signed-off-by: jorgee <jorge.ejarque@seqera.io>
Signed-off-by: Paolo Di Tommaso <paolo.ditommaso@gmail.com>
@pditommaso pditommaso merged commit f258a75 into master Oct 3, 2025
11 checks passed
@pditommaso pditommaso deleted the fix_oom_k8s branch October 3, 2025 14:48
pditommaso added a commit that referenced this pull request Oct 6, 2025
Signed-off-by: jorgee <jorge.ejarque@seqera.io>
Signed-off-by: Paolo Di Tommaso <paolo.ditommaso@gmail.com>
Co-authored-by: Paolo Di Tommaso <paolo.ditommaso@gmail.com>
Development

Successfully merging this pull request may close these issues.

OOM do not return 137 exit code in K8s executor