
under reporting of count metrics when using a sidecar in aws fargate with metrics using DogStatsD and multiple tasks per service #3159

Open

Description

sorry can't paste agent info here as it's a fargate sidecar and i can't ssh into it.

Describe what happened:
count metrics are under reported when using a sidecar in aws fargate with metrics sent via DogStatsD and multiple task instances per service. when using a Service Type of replica and Number of tasks > 1, the reported count metrics come out at roughly 1/<Number of tasks> of their true value. this only occurs for Number of tasks > 1.

this happens as a result of two behaviors.

  1. metrics of type count only accept one count per sample interval for a single source. any additional counts received for that sample interval are considered duplicates and dropped. this is normal behavior.
  2. aws fargate assigns the same hostname parameter value to each task instance running for a single service and task definition. this is the current aws fargate behavior.
they seem to get a hostname of the format:
`fargate_task:arn:aws:ecs:<region>:<account>:task/prod/<task identifier>`

but the `<task identifier>` is not set to be unique.

as a result the count metrics from each of the service's task instances are considered as coming from the same source (hostname), so only one count metric is processed for each sample interval and the rest are discarded. this reduces the summed count per interval to a single count metric rather than the sum of multiple counts. if each of the service's task instances had a unique hostname set by aws fargate, then all the count metrics would be processed and summed together as the summed count for that sample interval.
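to make the arithmetic concrete, here is a toy model of the behavior described above. it is only a sketch of the explanation in this issue, not the agent's or backend's actual aggregation code. the helper name `sum_first_per_host` and the hostnames `task-a`, `task-b`, `task-c` are made up for illustration.

```shell
# toy model (assumption): each input line is "<hostname> <count>" for one
# sample interval. only the first count seen per hostname is kept and the
# rest are dropped, mirroring the behavior described in this issue.
sum_first_per_host() {
  awk '!seen[$1]++ { total += $2 } END { print total + 0 }'
}

# three task instances sharing one hostname, each reporting a count of 10
shared=$(printf '%s\n' 'task-a 10' 'task-a 10' 'task-a 10' | sum_first_per_host)

# the same three task instances, each with its own hostname
unique=$(printf '%s\n' 'task-a 10' 'task-b 10' 'task-c 10' | sum_first_per_host)

echo "shared hostname: $shared, unique hostnames: $unique"
# prints "shared hostname: 10, unique hostnames: 30"
```

with 3 tasks behind one hostname the model reports 10 instead of 30, i.e. 1/3 of the true count.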

while hostname is not set uniquely per task instance for a service, there is a parameter that is: the TaskARN, which is available to the container via the Task Metadata Endpoint https://docs.aws.amazon.com/AmazonECS/latest/developerguide/task-metadata-endpoint-v2.html .

so incorporating something like the snippet below, which leverages the uniqueness of the TaskARN, into the ecs entrypoint for the datadog agent https://github.com/DataDog/datadog-agent/blob/master/Dockerfiles/agent/entrypoint/50-ecs.sh#L13 would fix this by setting DD_HOSTNAME to something unique per task instance.

if [[ -n "${ECS_FARGATE}" ]]; then
  taskid=$(curl 169.254.170.2/v2/metadata | grep TaskARN | awk -F/ '{print $NF}' | awk  -F\" '{print $1}')
  export DD_HOSTNAME=$taskid
fi

this is based off of #2288 (comment) and https://github.com/aws/amazon-ecs-agent/issues/3#issuecomment-437643239 and we have confirmed this works after setting our dockerfile to:

FROM datadog/agent:6.10.1

COPY entrypoint.sh /entrypoint.sh
RUN chmod +x /entrypoint.sh

ENTRYPOINT ["/entrypoint.sh"]

and our /entrypoint.sh to

#!/bin/bash

if [[ -n "${ECS_FARGATE}" ]]; then
  taskid=$(curl 169.254.170.2/v2/metadata | grep TaskARN | awk -F/ '{print $NF}' | awk  -F\" '{print $1}')
  export DD_HOSTNAME=$taskid
fi

/init
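a slightly more defensive variant of the extraction step, as a sketch only. it assumes the metadata response may arrive as a single line of json, in which case taking the last `/`-separated field of the whole line would not isolate the task id. the helper name `task_id_from_metadata` and the sample arn are made up for illustration, and this has not been tested against a live fargate task.

```shell
# hypothetical helper: reads task metadata json on stdin and prints the
# last path component of the TaskARN value (prints nothing if none is found).
task_id_from_metadata() {
  sed -n 's/.*"TaskARN"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p' \
    | head -n 1 \
    | awk -F/ '{ print $NF }'
}

# offline example with a made-up arn, all on one line:
printf '%s\n' '{"Cluster":"prod","TaskARN":"arn:aws:ecs:us-east-1:111122223333:task/prod/0123abcd","Family":"web"}' \
  | task_id_from_metadata
# prints "0123abcd"

# in the entrypoint it could be used like this (export only when non-empty):
# taskid=$(curl -s 169.254.170.2/v2/metadata | task_id_from_metadata)
# [ -n "$taskid" ] && export DD_HOSTNAME="$taskid"
```

only exporting DD_HOSTNAME when the id is non-empty avoids overriding the hostname with an empty string if the metadata endpoint is unreachable.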

Describe what you expected:
count metrics should be counted from each task instance run in fargate.

Steps to reproduce the issue:

  1. set up an aws fargate service that uses a Service Type of replica and Number of tasks > 1 and runs the datadog container as a sidecar per https://www.datadoghq.com/blog/monitor-aws-fargate/
  2. have that service's container produce a count metric and have it upload that to datadog via the DogStatsD interface.
  3. observe that the count metric is under reported.
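for step 2, a minimal sketch of emitting a count over DogStatsD from a shell. it assumes the sidecar listens on the default DogStatsD udp port 8125 on localhost and that `nc` is available in the application container. the helper name `dogstatsd_count` and the metric name `fargate.repro.requests` are made up for illustration.

```shell
# hypothetical helper: prints a DogStatsD count datagram of the form
# <metric>:<value>|c with an optional |#<tags> suffix.
# $1 = metric name, $2 = increment, $3 = optional comma-separated tags
dogstatsd_count() {
  if [ -n "${3:-}" ]; then
    printf '%s:%s|c|#%s' "$1" "$2" "$3"
  else
    printf '%s:%s|c' "$1" "$2"
  fi
}

# show the datagram that would be sent
dogstatsd_count fargate.repro.requests 1
echo

# to actually send it to the sidecar (assumes localhost:8125 and nc):
# dogstatsd_count fargate.repro.requests 1 | nc -u -w1 127.0.0.1 8125
```

running the send line in a loop from every task instance and comparing the summed count in datadog against the number of datagrams sent should show the gap.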

Additional environment details (Operating System, Cloud provider, etc):
AWS Fargate
DataDog agents datadog/agent:6.5.2 and datadog/agent:6.10.1 from dockerhub
