[BUG] Applications can be scheduled onto a new node before dd socket is ready. #52365

Description

@AkselAllas

Agent version

7.79.0

Bug Report

When a new node is created, nothing forces the Datadog agent DaemonSet to be up before application workloads start on that node. As a result, the DD client inside an application can initialize before the DaemonSet has created the APM socket, and the connection is never retried afterwards. This can lead to multi-hour losses in metrics and traces.

I think the system-node-critical priority class should be set on the DaemonSet, so that the socket is guaranteed to exist before applications start. That assumes a working socket is part of the pod's readiness check, which I wouldn't assume.
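A minimal sketch of the proposed change, assuming the agent runs as a DaemonSet whose pod template can be patched directly. The script only builds a strategic-merge patch that sets `priorityClassName`; the DaemonSet name and namespace in the usage comment are hypothetical. Pod priority influences scheduling order and preemption rather than container startup, so this alone may not close the gap, as noted above.

```python
import json


def priority_patch(priority_class: str = "system-node-critical") -> str:
    """Return a strategic-merge patch that sets the pod priority class."""
    patch = {
        "spec": {
            "template": {
                "spec": {
                    # Built-in Kubernetes priority class named in the report.
                    "priorityClassName": priority_class,
                }
            }
        }
    }
    return json.dumps(patch)


if __name__ == "__main__":
    # Hypothetical usage; DaemonSet name and namespace are assumptions:
    #   kubectl -n datadog patch daemonset datadog-agent --patch "$(python patch.py)"
    print(priority_patch())
```

Installations managed through Helm or an operator would more likely set the same field through their own values, so the patch is best read as an illustration of which field is involved.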

Reproduction Steps

:30:00 CronJob fires → pod created
:30:00 FailedScheduling: "0/8 nodes: 8 Insufficient cpu" (no room!)
:30:30 auto-scaler adds a brand new 9th node
:31:20 pod gets scheduled onto the new node, starts running
:31:20 DD tracer tries /var/run/datadog/apm.socket → NO SUCH FILE
(DD agent DaemonSet is still pulling images on the new node)
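Until ordering is enforced on the cluster side, an application could guard against the `:31:20` step above by waiting for the socket before it initializes tracing. The following is a rough application-side workaround sketch, not something the Datadog libraries are known to provide; `wait_for_socket` and its parameters are hypothetical names, and only the socket path comes from the report.

```python
import os
import stat
import time

APM_SOCKET = "/var/run/datadog/apm.socket"  # path taken from the report


def wait_for_socket(path: str = APM_SOCKET,
                    timeout: float = 120.0,
                    interval: float = 1.0) -> bool:
    """Poll until `path` exists and is a Unix socket, or until `timeout` elapses."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            if stat.S_ISSOCK(os.stat(path).st_mode):
                return True
        except OSError:
            # Socket (or its directory) is not there yet; keep polling.
            pass
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
```

A service would call `wait_for_socket()` at startup and decide what to do when it returns `False`, for example start without tracing and log a warning. The check only confirms that the socket file exists, not that the agent is already accepting connections on it.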

Labels

oss/0 (External contributions priority 0), team/agent-apm (trace-agent)
