[BUG] Applications can be scheduled onto a new node before dd socket is ready. #52365

### Agent version

7.79.0

### Bug Report


When a new node is created, nothing forces the Datadog agent daemonset to be up before application workloads are scheduled onto that node. As a result, an application's DD tracer can initialize before the daemonset has created the APM socket, and the connection is never retried. This can lead to multi-hour losses in metrics and traces.

I think the daemonset should be given `system-node-critical` priority, so that the socket is guaranteed to exist before applications start. That only holds if a working socket is part of the pod's readiness check, which I wouldn't assume.

### Reproduction Steps

  :30:00  CronJob fires → pod created
  :30:00  FailedScheduling: "0/8 nodes: 8 Insufficient cpu" (no room!)
  :30:30  auto-scaler adds a brand new 9th node
  :31:20  pod gets scheduled onto the new node, starts running
  :31:20  DD tracer tries /var/run/datadog/apm.socket → NO SUCH FILE
          (DD agent DaemonSet is still pulling images on the new node)
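
Until ordering is enforced on the node, the race in the timeline above could be narrowed from the application side by waiting for the socket before the tracer is initialized. The following is only a minimal sketch of that idea: `wait_for_socket` is a hypothetical helper, not part of any Datadog library, and the timeout and polling interval are arbitrary assumptions.

```python
import os
import stat
import time


def wait_for_socket(path, timeout=120.0, interval=1.0):
    """Poll until `path` exists and is a Unix socket.

    Hypothetical helper (not a Datadog API). Returns True as soon as the
    socket file is present, or False if `timeout` seconds pass first.
    """
    deadline = time.monotonic() + timeout
    while True:
        try:
            # os.stat raises OSError while the file (or its directory) is missing.
            if stat.S_ISSOCK(os.stat(path).st_mode):
                return True
        except OSError:
            pass  # socket not created yet, e.g. the daemonset pod is still starting
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
```

An application entrypoint could call `wait_for_socket("/var/run/datadog/apm.socket")` before initializing the tracer and decide what to do if it returns `False`. Note that the socket file existing does not prove the agent is already accepting connections, so this reduces the window rather than closing it.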

