Agent version
7.79.0
Bug Report
When a new node is created, nothing forces the agent DaemonSet pod to be up before application workloads start on that node. As a result, an application's DD tracer can initialize before the DaemonSet has created the APM socket, and the connection is never retried. This can lead to multi-hour losses of metrics and traces.
I think the system-node-critical priority class should be set on the DaemonSet, so that the socket is guaranteed to exist before applications start. That only holds if a working socket is part of the pod's readiness check, which I wouldn't assume.
Reproduction Steps
:30:00 CronJob fires → pod created
:30:00 FailedScheduling: "0/8 nodes: 8 Insufficient cpu" (no room!)
:30:30 auto-scaler adds a brand new 9th node
:31:20 pod gets scheduled onto the new node, starts running
:31:20 DD tracer tries /var/run/datadog/apm.socket → NO SUCH FILE
(DD agent DaemonSet is still pulling images on the new node)
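Until startup ordering is fixed, one possible application-side workaround for the race in the timeline above is to wait for the socket before initializing the tracer. The following is only a minimal sketch: `wait_for_socket` is a hypothetical helper, not part of any Datadog library, and the timeout and polling interval are arbitrary assumptions.

```python
import os
import stat
import time


def wait_for_socket(path, timeout=120.0, interval=2.0):
    """Poll until `path` exists and is a Unix socket.

    Hypothetical helper (not a Datadog API). Returns True once the socket
    is present, or False if `timeout` seconds pass without it appearing.
    """
    deadline = time.monotonic() + timeout
    while True:
        try:
            if stat.S_ISSOCK(os.stat(path).st_mode):
                return True
        except FileNotFoundError:
            pass  # the DaemonSet has not created the socket (or its directory) yet
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
```

An application would call `wait_for_socket("/var/run/datadog/apm.socket")` before initializing the tracer and decide what to do if it returns False. This only works around the missing retry on the application side and does not address the ordering problem itself.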