[BUG] Datadog Agent causing massive iowait. Near-daily system failures over the past 6 months #33804

Open
@alexandergunnarson

Description

Agent Environment

public.ecr.aws/datadog/agent:latest-jmx. Because we use the latest tag, the version updates to the newest release every time the agent is booted, which happens daily or more often.

Describe what happened:

Yesterday we found that Datadog Agent has been responsible for our near-daily system failures over the past 6 months, costing us untold amounts of engineering time and certainly losing us a large number of customers, along with their trust.

We’ve repeatedly observed that seemingly random iowait spikes spell certain death for our user-facing containers. First the ECS containers lock up, and then the underlying EC2 machine locks up, often requiring manual termination.

Before yesterday, we could not isolate the cause. We naively assumed it was our code, because we had no reason to suspect Datadog Agent and, furthermore, we had no process-level visibility. We incorrectly assumed that our comprehensive host and JVM dashboards, along with logs and traces, would tell us all we needed to know.

Over the past few months we’ve worked to eliminate all possible causes of iowait within our user-facing containers, including nearly all disk usage. We transitioned from gp2 to gp3 disks and upgraded them to 500 MiB/s throughput and 5000 IOPS (far exceeding the previous configuration). The iowait problem continued to happen and our site continued to go down.

Once we turned on process-level visibility via Datadog Agent configuration yesterday, we realized that the iowait was caused by sudden, massive (>6000 IOPS) disk reads by Datadog Agent itself. We raised the provisioned IOPS to 10000 this morning, and even this ceiling is not high enough.
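
For anyone who wants to cross-check per-process read activity independently of the Agent, here is a minimal sketch in Python. It assumes a Linux host where `/proc/<pid>/io` is readable (usually root or the process owner). The function names `parse_proc_io`, `read_rates`, and `sample` are hypothetical helpers, not part of any Datadog tooling. Note that `syscr` counts read syscalls, which is only a rough proxy for disk IOPS; `read_bytes` counts bytes actually fetched from the storage layer.

```python
import sys
import time


def parse_proc_io(text):
    """Parse the contents of /proc/<pid>/io into a dict of integer counters."""
    counters = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        value = value.strip()
        if sep and value.isdigit():
            counters[key.strip()] = int(value)
    return counters


def read_rates(before, after, interval_s):
    """Return (read syscalls per second, storage bytes read per second)."""
    if interval_s <= 0:
        raise ValueError("interval_s must be positive")
    syscalls_per_s = (after["syscr"] - before["syscr"]) / interval_s
    bytes_per_s = (after["read_bytes"] - before["read_bytes"]) / interval_s
    return syscalls_per_s, bytes_per_s


def sample(pid):
    """Read the current I/O counters for one process (Linux only)."""
    with open(f"/proc/{pid}/io") as f:
        return parse_proc_io(f.read())


if __name__ == "__main__":
    # Usage: python proc_io.py <pid> [interval_seconds]
    pid = int(sys.argv[1])
    interval = float(sys.argv[2]) if len(sys.argv) > 2 else 5.0
    first = sample(pid)
    time.sleep(interval)
    second = sample(pid)
    syscalls, nbytes = read_rates(first, second, interval)
    print(f"pid {pid}: {syscalls:.1f} read syscalls/s, {nbytes:.0f} bytes/s from storage")
```

Running this against the agent's process IDs during an iowait spike should show whether the reads line up with what the process-level dashboard reports.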

While Datadog Agent has been tremendously helpful to us, we consider this iowait issue a serious defect.

How can we resolve this issue?

Thanks for your time.

Describe what you expected:

Steps to reproduce the issue:

Additional environment details (Operating System, Cloud provider, etc):

ECS-optimized AMI on EC2
