Description
sorry can't paste agent info here as it's a fargate sidecar and i can't ssh into it.
Describe what happened:
under reporting of count metrics is observed when using the agent as a sidecar in aws fargate, with metrics submitted via DogStatsD and multiple task instances per service. when using a Service Type of `replica` and a Number of tasks > 1, count metrics are under reported by a factor of 1/<Number of tasks>. this only occurs for Number of tasks > 1.
this happens as a result of two behaviors.
- metrics of type count only accept one count per sample interval for a single source. any additional counts received from that source within the sample interval are considered duplicates and dropped. this is normal behavior.
- aws fargate assigns each task instance running for a single service and task definition the same `hostname` parameter value. this is the current aws fargate behavior.
they seem to get a hostname of the format `fargate_task:arn:aws:ecs:<region>:<account>:task/prod/<task identifier>`, but the `<task identifier>` is not set to be unique.
as a result the count metrics from each of the service's task instances are considered as coming from the same source (`hostname`), so only one count metric is processed per sample interval and the rest are discarded. this reduces the summed count per interval to a single count rather than the sum of the counts from all task instances (e.g. with 3 task instances each reporting a count of 5, the recorded total is 5 instead of 15). if each of the service's task instances had a unique hostname set by aws fargate then all the count metrics would be processed and summed together as the total count for that sample interval.
while `hostname` is not set uniquely per task instance for a service, there is a parameter that is: the TaskARN, and it's available to the container via the Task Metadata Endpoint https://docs.aws.amazon.com/AmazonECS/latest/developerguide/task-metadata-endpoint-v2.html .
so incorporating something like the below, which leverages the uniqueness of the TaskARN, into the ecs entrypoint for the datadog agent https://github.com/DataDog/datadog-agent/blob/master/Dockerfiles/agent/entrypoint/50-ecs.sh#L13 would fix that by setting DD_HOSTNAME to something unique per task instance.
```
if [[ -n "${ECS_FARGATE}" ]]; then
  # take the TaskARN from the v2 task metadata endpoint and keep the segment
  # after the last "/", which is unique per task instance
  taskid=$(curl 169.254.170.2/v2/metadata | grep TaskARN | awk -F/ '{print $NF}' | awk -F\" '{print $1}')
  export DD_HOSTNAME=$taskid
fi
```
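as an aside, if `jq` happens to be available in the image the TaskARN could be parsed out of the JSON directly instead of with grep/awk; just a sketch, since `jq` is not necessarily present in the agent image:

```
if [[ -n "${ECS_FARGATE}" ]]; then
  # read the TaskARN field from the v2 metadata JSON, then keep the segment after the last "/"
  taskid=$(curl -s 169.254.170.2/v2/metadata | jq -r '.TaskARN' | awk -F/ '{print $NF}')
  export DD_HOSTNAME=$taskid
fi
```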
this is based off of #2288 (comment) and https://github.com/aws/amazon-ecs-agent/issues/3#issuecomment-437643239 and we have confirmed the grep/awk approach is working by setting our dockerfile to:
```
FROM datadog/agent:6.10.1
COPY entrypoint.sh /entrypoint.sh
RUN chmod +x /entrypoint.sh
ENTRYPOINT ["/entrypoint.sh"]
```
and our `/entrypoint.sh` to:
```
#!/bin/bash
if [[ -n "${ECS_FARGATE}" ]]; then
  # set a hostname that is unique per task instance, derived from the TaskARN
  taskid=$(curl 169.254.170.2/v2/metadata | grep TaskARN | awk -F/ '{print $NF}' | awk -F\" '{print $1}')
  export DD_HOSTNAME=$taskid
fi
# hand off to the agent image's normal init
/init
```
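for completeness, the patched image just needs to be built, pushed to a registry the task can pull from, and referenced in the task definition in place of `datadog/agent:6.10.1` (registry and tag below are placeholders):

```
docker build -t <registry>/datadog-agent-fargate:6.10.1 .
docker push <registry>/datadog-agent-fargate:6.10.1
```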
Describe what you expected:
count metrics should be counted from each task instance running in fargate and summed together.
Steps to reproduce the issue:
- setup an aws fargate service that uses a Service Type of `replica` and a Number of tasks > 1, and runs the datadog container as a sidecar per https://www.datadoghq.com/blog/monitor-aws-fargate/
- have that service's container produce a count metric and upload it to datadog via the DogStatsD interface (a minimal way to do this from a shell is sketched below)
- that count metric will be under reported
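a quick way to emit such a count from a shell in the application container, assuming bash and the agent's DogStatsD port reachable on localhost within the task (the metric name is just a placeholder):

```
# send one count sample per second to the agent sidecar's DogStatsD UDP port (8125 by default);
# containers in the same fargate task share a network namespace, so localhost reaches the sidecar
while true; do
  echo -n "example.requests:1|c" > /dev/udp/127.0.0.1/8125
  sleep 1
done
```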
Additional environment details (Operating System, Cloud provider, etc):
AWS Fargate
DataDog agents datadog/agent:6.5.2 and datadog/agent:6.10.1 from dockerhub