Apache Airflow version
2.8.1
If "Other Airflow 2 version" selected, which one?
No response
What happened?
With logs persistence enabled (logs.persistence.enabled), all Airflow components store their logs on a shared volume.
The default behavior is to clean up logs older than a configured number of days.
When two or more containers attempt to clean the same logs, the log-groomer containers crash with messages like the following:
find: ‘/opt/airflow/logs/dag_id=findings_sync/run_id=scheduled__2024-01-19T00:00:00+00:00’: No such file or directory
rm: cannot remove '/opt/airflow/logs/dag_id=findings_sync_clean/run_id=scheduled__2024-01-19T00:00:00+00:00/task_id=start_sync_clean/attempt=1.log': Device or resource busy
This is very similar to the bug fixed by this pull request: https://github.com/apache/airflow/pull/36050/files
This time the issue arises in the preceding command, when either the find or the rm command fails.
The "No such file or directory" error indicates that another container has already removed the file.
The "Device or resource busy" error indicates that another container is operating on the file at the same time.
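To illustrate, the log-groomer sidecars periodically delete old files from the shared volume with a find/rm style pass. The sketch below is only an approximation of that pattern, not the actual clean-logs script shipped in the image; the variables, paths, interval and use of set -e are assumptions.

```bash
#!/usr/bin/env bash
# Rough sketch of a groomer-style cleanup loop (illustrative only, not the
# actual clean-logs script). RETENTION_DAYS, LOG_DIR, the interval and the
# use of `set -e` are assumptions for this sketch.
set -euo pipefail

RETENTION_DAYS="${RETENTION_DAYS:-15}"
LOG_DIR="${LOG_DIR:-/opt/airflow/logs}"

while true; do
  # Every groomer replica scans the same shared volume, so two replicas can
  # race: one removes a file or directory while another is still traversing
  # or deleting it, producing find/rm errors like the ones shown above.
  # Under `set -e` / `pipefail`, the non-zero exit status kills the container.
  find "${LOG_DIR}" -type f -mtime +"${RETENTION_DAYS}" -name '*.log' -print0 \
    | xargs -0 -r rm
  find "${LOG_DIR}" -type d -empty -delete
  sleep $((24 * 60 * 60))   # run once per day
done
```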
What you think should happen instead?
Failures of both the find and rm commands can be safely ignored, since the cleanup has already been done by another container.
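A minimal sketch of how that could look, assuming a find/rm pipeline similar to the one sketched above (the actual fix may well take a different shape, for example along the lines of the linked PR):

```bash
# Sketch: tolerate races between concurrent groomer replicas by not letting
# find/rm failures abort the groomer. Variables are the same illustrative
# assumptions as in the sketch above.
while true; do
  # "No such file or directory" / "Device or resource busy" here just mean
  # another replica already removed (or is currently handling) the same
  # file, so the non-zero exit status is ignored instead of crashing.
  find "${LOG_DIR}" -type f -mtime +"${RETENTION_DAYS}" -name '*.log' -print0 \
    | xargs -0 -r rm -f || true
  find "${LOG_DIR}" -type d -empty -delete || true
  sleep $((24 * 60 * 60))
done
```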
How to reproduce
Install Airflow via the official Helm chart and set logs.persistence.enabled to true, for example with a command like the one below.
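The command below is only an example; the release name and namespace are placeholders, and every value other than logs.persistence.enabled is left at the chart defaults.

```bash
# Add the official Apache Airflow chart repository and install with
# log persistence enabled (placeholder release name and namespace).
helm repo add apache-airflow https://airflow.apache.org
helm repo update
helm upgrade --install airflow apache-airflow/airflow \
  --namespace airflow --create-namespace \
  --set logs.persistence.enabled=true
```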
Then it is just a matter of waiting a few days while logs are generated (tasks running) until the race condition appears.
In my environment, with multiple replicas per component, 5 DAGs and 72 tasks, this happens randomly every 1 or 2 days.

I expect this to happen less often with a single replica and fewer tasks.
Operating System
Debian GNU/Linux 12 (bookworm)
Versions of Apache Airflow Providers
No response
Deployment
Official Apache Airflow Helm Chart
Deployment details
No response
Anything else?
No response
Are you willing to submit PR?
- Yes I am willing to submit a PR!
Code of Conduct
- I agree to follow this project's Code of Conduct