Improve how we record file size metric to avoid a race in our file lagging alert #1194
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
Currently we read file size by looking at the matched list of files from the glob pattern and then for each doing a stat on the file. We also report the number of bytes read in a file from the tailer.
Together these two metrics are used to create an alert if a tailer is lagging too far behind the size of the file (was helpful when we had issues in the early day with files stopping being tailed)
The problem is, when a file rolls, the sync() code will lookup the list of files to follow and see the new file, which we then report the size on (which will be small). Meanwhile the number of bytes read are reported from the existing file (which will be large).
The alert takes an abs() on the difference of these two metrics (because if tailing a file was stuck/broken the new file size would continuously be changing (and often less than the current file size), so it was necessary to look at the overall difference between their two sizes)
The alert does require a tailing diff to exist for a certain duration, but when log files roll very quickly it's not uncommon for this race to happen frequent enough to fire the alert as a false positive.
This PR updates the forked hpcloud/tail library to report the size based on the file handle and not the file name, and it also updates the sync() functions check on file size to use this new method if the file is currently being tailed.
This change has the upside of removing the race between reading the current file size by file name and how fast files are rolled and the bytes read size is reported by the tailer.
It does however introduce a new problem, where a file tailer could stop reading a file within a few bytes of the size of the file, and the alert would not fire because it doesn't exceed the threshold (although this situation would still be visible through metrics)
I think the tradeoffs here are worth it to hopefully remove the false positives we see on very quickly rolling log files.