Celery worker pods are terminated and task logs are lost when scale-down happens during HPA autoscaling #41163
Labels
area:helm-chart, area:logging, area:providers, kind:bug, provider:cncf-kubernetes
Official Helm Chart version
1.15.0 (latest released)
Apache Airflow version
2.7.1
Kubernetes Version
1.27.2
Helm Chart configuration
Since the official Helm chart now supports HPA, these are my HPA settings.
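The values below are a representative sketch of the chart's workers.hpa block (assuming the 1.15.0 key names), not my exact production numbers:

```yaml
workers:
  # plain Kubernetes HPA for the Celery workers (KEDA disabled)
  hpa:
    enabled: true
    minReplicaCount: 1
    maxReplicaCount: 10
    # scale on average CPU utilization across the worker pods
    metrics:
      - type: Resource
        resource:
          name: cpu
          target:
            type: Utilization
            averageUtilization: 80
```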
Docker Image customizations
Nothing very special; I just added openjdk, the mongo client, and a few other Python-related dependencies.
What happened
When I enabled HPA in the Airflow cluster, some long-running tasks (which can take over 15 hours) failed during scale-down and their logs were lost. More details: the worker pod's terminationGracePeriodSeconds was being exceeded, which caused the tasks to be marked as failed even though they were actually still waiting for the final response status from their Dataproc jobs.
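The grace period itself is set through the chart values; a minimal sketch (assuming the workers.terminationGracePeriodSeconds key, illustrative rather than my exact file):

```yaml
workers:
  # give in-flight tasks up to 1 hour after SIGTERM before the pod is force-killed
  terminationGracePeriodSeconds: 3600
```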
I have it set to 1 hour right now, but because these tasks run for so long I am not sure how the HPA settings can support them.
What you think should happen instead
During HPA scale-down, either the Celery worker pod should shut down gracefully only after all of its running tasks have completed, or Airflow should provide a config option (or some other mechanism) to kill the task process before the final termination deadline, finish the cleanup, and upload the logs to the configured remote store such as a GCS bucket.
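The only lever I can see on the HPA side today is the scale-down behavior, which can delay the problem but not prevent it; a sketch, assuming the chart passes workers.hpa.behavior through to the HPA spec unchanged:

```yaml
workers:
  hpa:
    behavior:
      scaleDown:
        # require 30 minutes of sustained low load before removing workers,
        # and remove at most one worker every 10 minutes; this only postpones
        # the scale-down, it is not aware of tasks still running on the pod
        stabilizationWindowSeconds: 1800
        policies:
          - type: Pods
            value: 1
            periodSeconds: 600
```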
How to reproduce
Enable HPA for the Celery workers and set terminationGracePeriodSeconds to a small value, then run a long-running task so that a scale-down terminates the worker pod before the task finishes.
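For example (illustrative value, same chart key as above):

```yaml
workers:
  # a deliberately short grace period makes the failure easy to trigger on scale-down
  terminationGracePeriodSeconds: 60
```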
Anything else
No response
Are you willing to submit PR?
Code of Conduct