log-groomer container crashing with .Values.logs.persistence enabled  #37220

@arovira

Apache Airflow version

2.8.1

If "Other Airflow 2 version" selected, which one?

No response

What happened?

With logs persistence enabled (logs.persistence.enabled), all Airflow components store their logs on a shared volume.

By default, old logs are cleaned up after a configured number of days.
When two or more log-groomer containers try to clean the same logs at the same time, they crash with the following messages:

find: ‘/opt/airflow/logs/dag_id=findings_sync/run_id=scheduled__2024-01-19T00:00:00+00:00’: No such file or directory
rm: cannot remove '/opt/airflow/logs/dag_id=findings_sync_clean/run_id=scheduled__2024-01-19T00:00:00+00:00/task_id=start_sync_clean/attempt=1.log': Device or resource busy

This is very similar to the bug fixed by this pull request: https://github.com/apache/airflow/pull/36050/files

Here the failure occurs one command earlier than the one addressed by that pull request, when either the find or the rm command fails.
A "No such file or directory" error indicates that another container has already removed the file.
A "Device or resource busy" error indicates that another container is operating on the file at that moment.
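
To make the "No such file or directory" case concrete, here is a minimal Python sketch that interleaves two cleanup passes by hand. It is not the chart's cleanup script: the helper names (`list_logs`, `remove_strict`, `demo_race`) and the example directory layout are hypothetical and only model the scan-then-delete pattern.

```python
import os
import tempfile
from pathlib import Path


def list_logs(root: Path) -> list[Path]:
    """Step 1 of a cleanup pass: snapshot the log files currently on the volume."""
    return sorted(root.rglob("*.log"))


def remove_strict(paths: list[Path]) -> None:
    """Step 2: delete every path in the snapshot, failing on the first error."""
    for path in paths:
        os.remove(path)


def demo_race() -> str:
    """Interleave two groomers by hand and return the error class the loser sees."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        # Hypothetical layout, loosely modelled on the paths in the error messages.
        log = root / "dag_id=example" / "run_id=r1" / "task_id=t1" / "attempt=1.log"
        log.parent.mkdir(parents=True)
        log.write_text("old log line\n")

        snapshot_a = list_logs(root)   # groomer A scans the shared volume
        snapshot_b = list_logs(root)   # groomer B scans the same volume
        remove_strict(snapshot_b)      # groomer B deletes first
        try:
            remove_strict(snapshot_a)  # groomer A acts on a stale snapshot
        except FileNotFoundError as exc:
            return type(exc).__name__
        return "no error"


if __name__ == "__main__":
    print(demo_race())  # prints "FileNotFoundError"
```

In the real deployment the interleaving is decided by timing across containers rather than by hand, which would explain why the crash shows up only every day or two.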

What you think should happen instead?

Failures of both the find and rm commands can be safely ignored here, since the cleanup has already been done by another container.
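
As an illustration of that behavior, here is a hedged Python sketch of a cleanup pass that tolerates concurrent groomers. The function name `groom_logs`, the `retention_days` parameter, and the choice of which errno values to ignore are assumptions of this sketch. The actual log-groomer is a shell script, and a fix there would likely look different.

```python
import errno
import os
import time
from pathlib import Path

# errno values treated as "another groomer got there first" (an assumption of this sketch).
IGNORABLE = {errno.ENOENT, errno.EBUSY}


def groom_logs(root: Path, retention_days: int) -> int:
    """Delete *.log files older than retention_days; return how many this call removed."""
    cutoff = time.time() - retention_days * 86400
    removed = 0
    # os.walk ignores directory-listing errors by default, so a directory that
    # vanishes mid-walk does not abort the pass.
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if not name.endswith(".log"):
                continue
            path = os.path.join(dirpath, name)
            try:
                if os.stat(path).st_mtime < cutoff:
                    os.remove(path)
                    removed += 1
            except OSError as exc:
                # A missing or busy file means another container handled it;
                # anything else (for example a permission error) still surfaces.
                if exc.errno not in IGNORABLE:
                    raise
    return removed
```

Only the two error conditions named in this report are swallowed, so unrelated failures would still crash the container and stay visible.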

How to reproduce

Install Airflow via the official Helm chart and set logs.persistence.enabled to true.

Then it is just a matter of waiting a few days while tasks run and generate logs, until the race condition appears.
In my environment, with multiple replicas per component, 5 DAGs and 72 tasks, it happens randomly every 1 or 2 days.

I guess it will happen less often with a single replica and fewer tasks.

Operating System

Debian GNU/Linux 12 (bookworm)

Versions of Apache Airflow Providers

No response

Deployment

Official Apache Airflow Helm Chart

Deployment details

No response

Anything else?

No response

Are you willing to submit PR?

  • Yes I am willing to submit a PR!

Metadata

    Labels

    area:core; kind:bug (This is a clearly a bug); needs-triage (label for new issues that we didn't triage yet)
