Description
I have a process that posts measurements to telegraf, each with a timestamp tied to the data being processed. Because it usually works on (almost) realtime data, the timestamps are more or less current. Occasionally, however, it is asked to reprocess old data, and the measurements are sent again with their original timestamps.
If there is a default retention policy on the database, then while older data is being reprocessed, all the metrics in the Chronograf dashboard are delayed by a few minutes. (How much seems to vary between environments.) When the process stops emitting past events, the dashboard still lags for a while before returning to normal.
Setting the retention policy is critical to reproduce the problem. It causes partial writes at the InfluxDB level; telegraf seems a bit confused by this and appears to hold other measurements hostage, even those issued by another input plugin. However, I have not seen any missing measurement values once it gets back to normal.
Environment setup using docker on linux:
- Use the attached docker-compose.yml (a shameless rip-off of https://github.com/influxdata/TICK-docker/tree/master/1.2 with updated versions, minor adjustments, and http_listener enabled)
- Use the attached telegraf configuration
- Start the environment using:
docker-compose up
- Create a default retention policy on the telegraf database:
docker-compose run influxdb-cli -execute 'CREATE RETENTION POLICY realtime ON telegraf DURATION 4w REPLICATION 1 DEFAULT;'
- Open Chronograf in a browser (localhost:8888), go to the host list, and open the "system" dashboard for your host.
- Set the refresh to "Every 10s" and the time range to "Past 15 minutes".
- Wait a few minutes so that some data points are collected.
- Validate that the collected data is up to date (for example, use the tooltip on the CPU usage measurements to check the time).
Then begin to reproduce:
- Post a few events with timestamps older than the retention policy allows:
curl -i -XPOST "http://localhost:8186/write?db=telegraf&precision=ns" --data-binary "@test.txt"
- Wait 1 or 2 minutes and confirm that most of the measurements no longer reach the dashboard. You should see a gap on almost all charts (at least those that refresh their X axis).
If it does not work for you, try posting several times (5 times, 1 or 2 seconds apart, seems to be enough for me).
The telegraf logs should reveal something along the lines of:
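For anyone reproducing this without the attached test.txt, here is a minimal sketch of how a comparable payload could be generated. It assumes the 4w retention policy created above and nanosecond precision, as in the curl command. The measurement, tag, and field names (`reprocessed`, `source`, `value`) are made up for illustration and are not the contents of my actual file.

```python
import time

NS_PER_SECOND = 1_000_000_000
RETENTION_SECONDS = 4 * 7 * 24 * 3600  # matches the 4w policy created above


def old_points(now_ns, count=5, extra_age_seconds=3600):
    """Return line-protocol lines with timestamps older than the retention window."""
    lines = []
    for i in range(count):
        # Each point is at least one hour older than the retention limit.
        age_ns = (RETENTION_SECONDS + extra_age_seconds + i) * NS_PER_SECOND
        # "reprocessed", "source" and "value" are hypothetical names.
        lines.append(f"reprocessed,source=replay value={i}i {now_ns - age_ns}")
    return lines


if __name__ == "__main__":
    # Print the payload; redirect stdout to a file to use it with curl.
    print("\n".join(old_points(time.time_ns())))
```

Redirecting the output to test.txt should give a file that can be posted with the curl command above.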
E! InfluxDB Output Error: Response Error: Status Code [400], expected [204], [partial write: points beyond retention policy dropped=xx]
E! Error writing to output [influxdb]: Could not write to any InfluxDB server in cluster
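To get a feel for how many points InfluxDB rejects, the `dropped=` counter can be summed over the telegraf log. The following is only a sketch that assumes the message format shown above; `dropped_points` is a helper name I made up, and the log lines would have to be collected separately (for example from the container logs).

```python
import re

# Matches the partial-write message quoted above and captures the counter.
DROPPED_RE = re.compile(
    r"partial write: points beyond retention policy dropped=(\d+)"
)


def dropped_points(log_lines):
    """Sum the dropped-point counters found in an iterable of log lines."""
    total = 0
    for line in log_lines:
        match = DROPPED_RE.search(line)
        if match:
            total += int(match.group(1))
    return total
```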
The expected behaviour would be to have no delay at all (or close to none) in unrelated metrics, especially if coming from other plugins.
The actual behaviour: no new metric values are available for a (variable) period of time (at least 4-5 minutes, sometimes much longer).
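The delay could also be measured outside of Chronograf by asking InfluxDB for the newest point of an unrelated measurement and comparing its timestamp with the current time. The sketch below only does the comparison. It assumes the JSON comes from the InfluxDB 1.x `/query` endpoint called with `epoch=ns` (so times are integers) and a query such as `SELECT usage_idle FROM cpu ORDER BY time DESC LIMIT 1`; the measurement and field are assumptions based on the default telegraf cpu plugin, and `lag_seconds` is a hypothetical helper name.

```python
NS_PER_SECOND = 1_000_000_000


def lag_seconds(response, now_ns):
    """Seconds between now_ns and the newest point in a /query JSON response.

    Returns None when the response contains no series at all.
    """
    series = response["results"][0].get("series")
    if not series:
        return None
    columns = series[0]["columns"]
    # With ORDER BY time DESC LIMIT 1 the first row is the newest point.
    newest_row = series[0]["values"][0]
    newest_ns = newest_row[columns.index("time")]
    return (now_ns - newest_ns) / NS_PER_SECOND
```

A lag that stays near the collection interval would match the expected behaviour; a lag that climbs to several minutes after posting old points would match what I observe.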