Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

EcsRunTaskOperator log output in UI does not contain all events #40875

Closed
1 of 2 tasks
yaningz opened this issue Jul 18, 2024 · 9 comments · Fixed by #41515
Closed
1 of 2 tasks

EcsRunTaskOperator log output in UI does not contain all events #40875

yaningz opened this issue Jul 18, 2024 · 9 comments · Fixed by #41515
Assignees
Labels
area:logging area:providers kind:bug This is a clearly a bug provider:amazon-aws AWS/Amazon - related issues

Comments

@yaningz
Copy link

yaningz commented Jul 18, 2024

Apache Airflow Provider(s)

amazon

Versions of Apache Airflow Providers

apache-airflow==2.8.1
apache-airflow-providers-airbyte==3.5.1
apache-airflow-providers-amazon==8.16.0
apache-airflow-providers-databricks==6.0.0
apache-airflow-providers-docker==3.9.1
apache-airflow-providers-slack==8.5.1
apache-airflow-providers-snowflake==5.2.1

Apache Airflow version

2.8.1

Operating System

Amazon Linux

Deployment

Amazon (AWS) MWAA

Deployment details

Our MWAA environments are managed through Terraform. We install 3 providers outside of those available in the public providers.

What happened

The logs tab in the UI for EcsRunTaskOperator tasks sometimes misses events. We noted this behavior once we upgraded from version 2.5 to version 2.7 (and it has remained in version 2.8). We did not upgrade to 2.6 at any point, and newer releases are not yet available through MWAA.

We use the awslogs_group argument for the EcsRunTaskOperator to specify the logs group that the UI should pull from. The log streams in that ECS log group contains all events, but not all events in the log stream appear in the MWAA Task log group.

The following is an example from log streams that should share all logs (i.e. the MWAA Task logs should include all ECS logs). In the MWAA Task logs toward the bottom, events containing remaining DBT output logs are missing. The first several lines are fetched, and the rest of the logs that stream in later to the log stream are missed.

ECS logs:

mwaa-task-logstream.csv

MWAA Task logs:

ecs-task-logstream.csv

Our organization reached out to AWS MWAA for support relating to this issue and this is what we were told by their support regarding why this may be happening.

As per my colleague on the CloudWatch case, when you are using CloudWatch logs, other AWS services (E.g. MWAA/ECS) are actually using the PutLogEvents API to send their log events to the corresponding log group. Now when it comes to the Airflow service, the frequency as to when PutLogEvents is run for the logs to be pushed through to the specific log groups has likely been altered within the newer versions of Airflow.

This was not reproduceable within a local Airflow setup.

This is the call made to the EcsRunTaskOperator:

super().__init__(
    task_definition=dbt_ecs_name,
    cluster=ECS_CLUSTER,
    overrides={
        "containerOverrides": [
            {
                "name": "dbt-om1-task",
                "command": command_list,
            },
        ],
    },
    launch_type="FARGATE",
    network_configuration={
        "awsvpcConfiguration": {
            "subnets": [
                os.environ.get("AIRFLOW__VAR__PRIMARY_SUBNET_ID"),
                os.environ.get("AIRFLOW__VAR__SECONDARY_SUBNET_ID"),
            ],
            "securityGroups": [
                os.environ.get("AIRFLOW__VAR__SECURITY_GROUP_ID")
            ],
            "assignPublicIp": "DISABLED",
        },
    },
    awslogs_group="/ecs/airflow2/airflow2",
    awslogs_region="us-east-1",
    awslogs_stream_prefix="airflow2/dbt-om1-task",
    awslogs_fetch_interval=timedelta(seconds=30),
    propagate_tags="TASK_DEFINITION",
    **kwargs,
)

What you think should happen instead

All log events in the ECS log group should be pulled forward by the MWAA Task log group using the task_log_fetcher.

How to reproduce

We are using a large (~2GB) custom DBT image inside our taskdef. This has happened with every execution of the EcsRunTaskOperator against our custom DBT image.

After triggering a task using the EcsRunTaskOperator, the DBT output logs should stream directly from the ECS log stream to the MWAA Task log stream.

Anything else

This has been observed in our MWAA environment against both apache-airflow==2.7.2 and apache-airflow==2.8.1. When upgrading, we did not upgrade to a 2.6 version. This issue may be related to or caused by the same root cause as #39323.

Are you willing to submit PR?

  • Yes I am willing to submit a PR!

Code of Conduct

@yaningz yaningz added area:providers kind:bug This is a clearly a bug needs-triage label for new issues that we didn't triage yet labels Jul 18, 2024
Copy link

boring-cyborg bot commented Jul 18, 2024

Thanks for opening your first issue here! Be sure to follow the issue template! If you are willing to raise PR to address this issue please do so, no need to wait for approval.

@dosubot dosubot bot added the provider:amazon-aws AWS/Amazon - related issues label Jul 18, 2024
@shahar1 shahar1 added area:logging and removed needs-triage label for new issues that we didn't triage yet labels Jul 19, 2024
@vincbeck
Copy link
Contributor

This was not reproduceable within a local Airflow setup.

It seems then it is an issue on MWAA side?

@yaningz
Copy link
Author

yaningz commented Aug 13, 2024

It seems then it is an issue on MWAA side?

We worked with AWS support, who reached out to MWAA internally, and they believe that the issue is because of how the AWS provider is pushing logs to Cloudwatch for the ECS log group. Logs within the same logstream for our ECS task are being written to the same timestamp regardless of their execution time, possibly because they are triggered by the same call inside the task.

This was not previously the case (e.g. with Airflow 2.5), and is the suspected cause of the error we're seeing.

@vincbeck
Copy link
Contributor

Thanks for the context, I'll look into it

@vincbeck vincbeck self-assigned this Aug 13, 2024
@vincbeck
Copy link
Contributor

After investigation here is what I think. The logs after a certain point are not in the MWAA task logs because the task stopped before all the logs could have been pushed to the MWAA task logs. Basically the task created a lot of logs in the ECS task logs and stopped right after. In that case, the Airflow task stops as well and dont have time to push logs from ECS to MWAA

@yaningz
Copy link
Author

yaningz commented Aug 15, 2024

@vincbeck What is the expected cadence for logs getting pushed to the ECS log group in Cloudwatch, and the frequency for logs getting pulled in MWAA? I can see ECS tasks that run for a few minutes according to the logs for the ECS task, but are missing logs in the mWAA log group after the first couple messages. I've attached an example to this comment of a new pair of logs for ECS vs MWAA-Task. The ECS task logs include timestamps from inside the container and show that the task ran for ~4 minutes, though all log timestamps are the same value.

20240814_ecs_task_log.csv
20240814_mwaa_task_log.csv

@vincbeck
Copy link
Contributor

You are right, the issue might be logs with the same timestamp have trouble being sent to the MWAA task logs. Keep investigating

@vincbeck
Copy link
Contributor

Yep, that's the root cause ... Somehow when the timestamp is the same, it does not log

@vincbeck
Copy link
Contributor

Fix #41515

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area:logging area:providers kind:bug This is a clearly a bug provider:amazon-aws AWS/Amazon - related issues
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants