Telegraf stops publishing metrics via wavefront output, all plugins take too long to collect - no meaningful error message #3710

@pberlowski

Bug report

After some time under a load test (usually 5-10 minutes), the telegraf agent stops sending metrics onward via the wavefront output. In addition, the log fills up with "took too long to collect" messages from all plugins.

No relevant error message is available in the logs.

This issue seems similar to #3629; however, no aggregation plugin is in use.
I suspect the output may be too slow, even though the buffer doesn't seem to overfill and drop metrics.

Load test description

Measurements are submitted via http_listener from a load-generator host that replays a snapshot of data at selectable throughput. At 5000pps (fields), the error happens within minutes of starting the agent.
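
For anyone trying to reproduce this, below is a minimal sketch of a paced replay loop of the kind described above. It is not the actual load generator used in the test. The helper names (`make_line`, `replay`, `http_post`), the batch size, and the URL in the usage comment are assumptions for illustration; the URL assumes http_listener is on its default `:8186` address with the `/write` endpoint, so adjust it to match your telegraf.conf.

```python
import time
import urllib.request


def make_line(measurement, tags, fields, ts_ns):
    """Format one point as InfluxDB line protocol (sketch: no escaping)."""
    tag_str = "".join(f",{k}={v}" for k, v in sorted(tags.items()))
    field_str = ",".join(f"{k}={v}" for k, v in fields.items())
    return f"{measurement}{tag_str} {field_str} {ts_ns}"


def batches(lines, batch_size):
    """Yield consecutive slices of at most batch_size lines."""
    for i in range(0, len(lines), batch_size):
        yield lines[i:i + batch_size]


def http_post(url, body):
    """POST a line-protocol payload; raises on connection or HTTP errors."""
    req = urllib.request.Request(url, data=body, method="POST")
    with urllib.request.urlopen(req, timeout=5) as resp:
        resp.read()


def replay(lines, url, fields_per_line, target_fields_per_sec,
           batch_size=100, post=http_post):
    """Send lines in batches, pacing to roughly the target field rate.

    Returns the number of lines sent. `post` is injectable so the pacing
    logic can be exercised without a running agent.
    """
    # Seconds between batches so that fields/sec approximates the target.
    interval = batch_size * fields_per_line / target_fields_per_sec
    sent = 0
    next_send = time.monotonic()
    for batch in batches(lines, batch_size):
        delay = next_send - time.monotonic()
        if delay > 0:
            time.sleep(delay)
        post(url, "\n".join(batch).encode())
        sent += len(batch)
        next_send += interval
    return sent


# Hypothetical usage (assumed default http_listener address):
# snapshot = [make_line("cpu", {"host": "a"}, {"usage": 0.5}, time.time_ns())
#             for _ in range(50000)]
# replay(snapshot, "http://localhost:8186/write", 1, 5000)
```

Pacing against a monotonic schedule rather than sleeping a fixed amount after each request keeps the average rate near the target even when individual POSTs are slow.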

Relevant telegraf.conf:

http-proxy.conf.txt
telegraf.conf.txt

System info:

Telegraf version: 1.4.5
OS: CentOS 7.3

Expected behavior:

Telegraf logs a meaningful error describing the reason for the overload and keeps buffering metrics.

Actual behavior:

Telegraf ceases operations silently and then starts logging failures to collect on time.

Additional info:

Stacktrace:
telegraf.stacktrace.txt

Hardware and software metrics screenshots for the failure window:
telegraf failure - hardware
telegraf failure - metrics
telegraf failure - network


Labels: bug (unexpected problem or unintended behavior)
