Celery worker pods are terminated and task logs are lost when scale-down happens during HPA autoscaling #41163

Open
2 tasks done
andycai-sift opened this issue Jul 31, 2024 · 2 comments
Assignees
Labels
area:helm-chart Airflow Helm Chart area:logging area:providers kind:bug This is a clearly a bug provider:cncf-kubernetes Kubernetes provider related issues

Comments

@andycai-sift

Official Helm Chart version

1.15.0 (latest released)

Apache Airflow version

2.7.1

Kubernetes Version

1.27.2

Helm Chart configuration

Since the official Helm chart now supports HPA, here are my HPA settings:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: "airflow-worker"
  namespace: airflow
  labels:
    app: airflow
    component: worker
    release: airflow
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: StatefulSet
    name: "airflow-worker"
  minReplicas: 2
  maxReplicas: 16
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 80
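
For completeness, the autoscaling/v2 API also accepts a spec.behavior.scaleDown block (not set in my configuration above) that controls how quickly pods are removed. A minimal sketch with illustrative values:

  behavior:
    scaleDown:
      stabilizationWindowSeconds: 600   # wait 10 minutes before acting on lower metrics
      policies:
        - type: Pods
          value: 1                      # remove at most one worker pod...
          periodSeconds: 300            # ...every 5 minutes

This only slows scale-down; it does not by itself protect tasks that run longer than the termination grace period.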

Docker Image customizations

Nothing too special: just added OpenJDK, the MongoDB client, and a few other Python dependencies.

What happened

After I enabled HPA in the Airflow cluster, some long-running tasks (which can run for over 15 hours) failed during scale-down and lost their logs. More details:

  • A typical long-running task calls a Python API to spin up a Google Dataproc cluster and then waits for a job that runs for more than 15 hours. When the HPA started scaling the pods down, these tasks failed, apparently because the pods were terminated once terminationGracePeriodSeconds was exceeded. The tasks were marked as failed even though they were still waiting for the final status of their Dataproc jobs.
  • Also, after the pod was terminated, the tasks' logs were not uploaded to the GCS bucket, even though remote logging to GCS is configured. The logs were still on the persistent disk, and I had to spin up the same pod again to get them uploaded to the GCS bucket.
  • I currently set terminationGracePeriodSeconds to 1 hour (roughly as in the sketch below), but given how long these tasks run, I am not sure how the HPA settings can accommodate them.
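
For reference, this is roughly how the grace period is set in my values file. This is a sketch: I'm assuming the chart exposes a workers.terminationGracePeriodSeconds value, which may differ between chart versions.

workers:
  terminationGracePeriodSeconds: 3600   # 1 hour, still far shorter than a 15h task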

What you think should happen instead

During an HPA scale-down, either the Celery worker pod should shut down gracefully and wait until all of its tasks are completed, or Airflow should provide a config option (or a similar mechanism) to stop the task processes before the termination deadline, finish cleanup, and upload the logs to the configured remote destination (e.g. a GCS bucket).

How to reproduce

  • Deploy Airflow to Kubernetes using the official Helm chart, with HPA enabled on CPU/memory metrics.
  • Set up remote logging (see also the environment-variable form after this list):
logging:
    delete_local_logs: 'False'
    remote_base_log_folder: "gs://xxx"
    remote_logging: 'True'
  • Run a long-running task (30 minutes to a few hours is enough), and make sure terminationGracePeriodSeconds is set to a small value so the pod is killed before the task finishes.
  • Trigger an HPA scale-down of the worker pods; in the UI you should then see the task marked as failed, with no logs available.
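
The same logging settings can also be expressed as environment variables on the worker pods. The connection ID below is an assumption; use whatever GCP connection is configured in your deployment.

AIRFLOW__LOGGING__REMOTE_LOGGING: "True"
AIRFLOW__LOGGING__DELETE_LOCAL_LOGS: "False"
AIRFLOW__LOGGING__REMOTE_BASE_LOG_FOLDER: "gs://xxx"
AIRFLOW__LOGGING__REMOTE_LOG_CONN_ID: "google_cloud_default"   # assumption: default GCP connection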

Anything else

No response

Are you willing to submit PR?

  • Yes I am willing to submit a PR!

Code of Conduct

@andycai-sift andycai-sift added area:helm-chart Airflow Helm Chart kind:bug This is a clearly a bug needs-triage label for new issues that we didn't triage yet labels Jul 31, 2024

boring-cyborg bot commented Jul 31, 2024

Thanks for opening your first issue here! Be sure to follow the issue template! If you are willing to raise PR to address this issue please do so, no need to wait for approval.

@dosubot dosubot bot added area:logging area:providers provider:cncf-kubernetes Kubernetes provider related issues labels Jul 31, 2024
@shahar1 shahar1 removed the needs-triage label for new issues that we didn't triage yet label Aug 2, 2024
@dimon222
Contributor

dimon222 commented Aug 10, 2024

Regarding the first proposed solution (HPA waiting for work to complete):
I'm afraid this is not fixable by the Airflow project, as it is a limitation of Kubernetes: during a scale-down you cannot ask it to "wait for completion of work" for a dynamic amount of time. The graceful-shutdown timing has to be configured when the container starts; you can't change it while the Celery worker container is already running (the resource spec is immutable, so any change requires a deploy-style operation that bounces the container).

I have been waiting for this for years, but the K8s development community isn't working on it currently.
kubernetes/enhancements#2255 (comment)

I'm not sure about the second proposed solution, though. I'm not certain whether Kubernetes sends an upfront signal to warn that the final shutdown signal will arrive once the grace period elapses.
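
The closest workaround I can think of is a preStop hook that blocks until the Celery worker has no active tasks, but it is still hard-capped by the terminationGracePeriodSeconds fixed at pod creation. A rough, untested sketch; the Celery app path, the JSON parsing, and the numbers are all assumptions on my side:

# worker pod spec (illustrative only)
terminationGracePeriodSeconds: 3600
lifecycle:
  preStop:
    exec:
      command:
        - /bin/bash
        - -c
        - |
          # Poll this worker's active task count and return once it reaches zero.
          # Assumes the Celery app lives at airflow.executors.celery_executor and
          # that `celery inspect active --json` returns {worker_name: [tasks...]}.
          while true; do
            active=$(celery --app airflow.executors.celery_executor inspect active \
                       --destination "celery@${HOSTNAME}" --json 2>/dev/null \
                     | python3 -c 'import sys, json; d = json.load(sys.stdin); print(sum(len(v) for v in d.values()))' \
                     || echo 1)
            [ "$active" = "0" ] && exit 0
            sleep 30
          done

Even with something like this, Kubernetes still sends SIGKILL when the grace period expires, so it only helps for tasks shorter than the grace period.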
