Improve how we record file size metric to avoid a race in our file lagging alert #1194

slim-bean · 2019-10-24T13:16:16Z

Currently we read file size by looking at the matched list of files from the glob pattern and then for each doing a stat on the file. We also report the number of bytes read in a file from the tailer.

Together these two metrics are used to create an alert if a tailer is lagging too far behind the size of the file (was helpful when we had issues in the early day with files stopping being tailed)

The problem is, when a file rolls, the sync() code will lookup the list of files to follow and see the new file, which we then report the size on (which will be small). Meanwhile the number of bytes read are reported from the existing file (which will be large).

The alert takes an abs() on the difference of these two metrics (because if tailing a file was stuck/broken the new file size would continuously be changing (and often less than the current file size), so it was necessary to look at the overall difference between their two sizes)

The alert does require a tailing diff to exist for a certain duration, but when log files roll very quickly it's not uncommon for this race to happen frequent enough to fire the alert as a false positive.

This PR updates the forked hpcloud/tail library to report the size based on the file handle and not the file name, and it also updates the sync() functions check on file size to use this new method if the file is currently being tailed.

This change has the upside of removing the race between reading the current file size by file name and how fast files are rolled and the bytes read size is reported by the tailer.

It does however introduce a new problem, where a file tailer could stop reading a file within a few bytes of the size of the file, and the alert would not fire because it doesn't exceed the threshold (although this situation would still be visible through metrics)

I think the tradeoffs here are worth it to hopefully remove the false positives we see on very quickly rolling log files.

pkg/promtail/targets/tailer.go

vendor/github.com/hpcloud/tail/tail.go

cyriltovena

LGTM with two minor nitpick I'd like to see changed.

…iler reports for size (or read the file size from the file name if no tailer exists yet) Update vendored (and forked) hpcloud tail library which includes a Size() function which reports size from the file handle and not file name

cyriltovena reviewed Oct 24, 2019

View reviewed changes

pkg/promtail/targets/tailer.go Outdated Show resolved Hide resolved

cyriltovena reviewed Oct 24, 2019

View reviewed changes

vendor/github.com/hpcloud/tail/tail.go Show resolved Hide resolved

cyriltovena approved these changes Oct 24, 2019

View reviewed changes

slim-bean added 3 commits October 24, 2019 10:40

needed to tidy?

6aef2c6

addressing code review comments

f2ee711

slim-bean force-pushed the improve-file-size-metric branch from 6aaefde to f2ee711 Compare October 24, 2019 14:43

slim-bean merged commit 94b9c4b into master Oct 24, 2019

rfratto deleted the improve-file-size-metric branch November 1, 2019 12:10

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Improve how we record file size metric to avoid a race in our file lagging alert #1194

Improve how we record file size metric to avoid a race in our file lagging alert #1194

slim-bean commented Oct 24, 2019

cyriltovena left a comment

Improve how we record file size metric to avoid a race in our file lagging alert #1194

Improve how we record file size metric to avoid a race in our file lagging alert #1194

Conversation

slim-bean commented Oct 24, 2019

cyriltovena left a comment

Choose a reason for hiding this comment