Promtail: Add metrics for journal target #6105
Conversation
./tools/diff_coverage.sh ../loki-main/test_results.txt test_results.txt ingester,distributor,querier,querier/queryrange,iter,storage,chunkenc,logql,loki

Change in test coverage per package. Green indicates 0 or positive change, red indicates that test coverage for a package fell.
+ ingester 0%
+ distributor 0%
+ querier 0%
+ querier/queryrange 0%
+ iter 0%
+ storage 0%
+ chunkenc 0%
+ logql 0%
+ loki 0%
There are no metrics for the journal target yet, this adds two basic ones:
- Total number of lines processed
- Total number of errors processing lines (with a reason)

Because of the way the journal works, tailing one file with many possible processes writing to it, I chose not to expose counts per unit.
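A minimal sketch of how two such counters could look with prometheus/client_golang; the metric names, help text, and field names below are illustrative, not necessarily the ones used in this PR:

```go
// Sketch only: illustrative metric definitions for a journal target.
package journal

import "github.com/prometheus/client_golang/prometheus"

type Metrics struct {
	journalLines  prometheus.Counter
	journalErrors *prometheus.CounterVec
}

func NewMetrics(reg prometheus.Registerer) *Metrics {
	m := &Metrics{
		// Total number of journal lines processed.
		journalLines: prometheus.NewCounter(prometheus.CounterOpts{
			Namespace: "promtail",
			Name:      "journal_target_lines_total",
			Help:      "Total number of successful journal lines read.",
		}),
		// Total number of errors while processing lines, partitioned by reason.
		journalErrors: prometheus.NewCounterVec(prometheus.CounterOpts{
			Namespace: "promtail",
			Name:      "journal_target_parsing_errors_total",
			Help:      "Total number of parsing errors while reading journal messages.",
		}, []string{"error"}),
	}
	if reg != nil {
		reg.MustRegister(m.journalLines, m.journalErrors)
	}
	return m
}
```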
bc2e632 to ec6a0e2
Thanks for the contribution! Curious about the decision to not include unit as a label though, could you elaborate on that a bit? I feel like if I was getting errors, I would want to know which unit was emitting these bad logs so I could try to fix it, or remove it from my scrape targets. What value do the error metrics have without that? Are there other ways to investigate what's wrong?
@@ -264,6 +269,7 @@ func (t *JournalTarget) formatter(entry *sdjournal.JournalEntry) (string, error)
 	bb, err := json.Marshal(entry.Fields)
 	if err != nil {
 		level.Error(t.logger).Log("msg", "could not marshal journal fields to JSON", "err", err)
+		t.metrics.journalErrors.WithLabelValues("cannot_marshal").Inc()
do we want to move these reasons to constants so someone can quickly see how many there are? we want to make sure not to explode cardinality too much by adding too many of them, but 3 seems pretty reasonable for now.
nit: Yea, would prefer some journalErrorType constants with values cannot_marshal, no_message and empty_labels.
I extracted it to string constants in the metrics file, similar to what is done in the validate package of Loki. If you had something else in mind, like really making a new type, I could also do that.
I did find the use of literal strings in the filetarget of promtail, so there is some inconsistency.
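For reference, a sketch of the string-constant approach described above; the exact declarations in the PR's metrics file may differ:

```go
package journal

// Reasons recorded on the journal error counter.
const (
	reasonCannotMarshal = "cannot_marshal"
	reasonNoMessage     = "no_message"
	reasonEmptyLabels   = "empty_labels"
)

// Usage at an error site would then look like:
//   t.metrics.journalErrors.WithLabelValues(reasonNoMessage).Inc()
```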
Sure, I was mainly thinking about the number of units: this can easily be in the hundreds, and I'm not sure we should create a label for each of them. It goes towards analyzing the content of the data, rather than providing metrics on whether the software is working. I personally like providing as much insight into the data as possible, but should this be a concern of promtail? This might be better done in Grafana by querying the Loki data itself. I do get your use case, and I'm happy to add them to all metrics; a compromise could be to add them just to the errors.
I buy your argument for not including the unit in the metric labels. How would you feel about emitting a log line with the unit though for the purposes of investigating the problem?
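Something along these lines, assuming the sdjournal entry exposes the systemd unit in its Fields map; the field name and surrounding code are illustrative, not from the PR:

```go
// Hypothetical sketch: include the originating unit in the error log line so
// problems can be traced back to a unit without adding a per-unit metric label.
unit := entry.Fields["_SYSTEMD_UNIT"]
level.Error(t.logger).Log(
	"msg", "could not marshal journal fields to JSON",
	"unit", unit,
	"err", err,
)
```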
LGTM. Thanks @RutgerKe :)
Just a few minor nits. Let me know once you fix those and I can merge.
./tools/diff_coverage.sh ../loki-main/test_results.txt test_results.txt ingester,distributor,querier,querier/queryrange,iter,storage,chunkenc,logql,loki

Change in test coverage per package. Green indicates 0 or positive change, red indicates that test coverage for a package fell.
+ ingester 0%
+ distributor 0.3%
+ querier 0%
+ querier/queryrange 0%
+ iter 0%
+ storage 0%
+ chunkenc 0%
+ logql 0%
+ loki 0%
Thanks for the reviews, I've included your comments. Turns out I couldn't really trigger the error returned from the marshal function, so I deleted that counter. (Though if you have any idea how to trigger it, or in what situations it occurs, let me know and I can add it back in.)
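For context, a small standalone illustration of why that error path is hard to reach: entry.Fields in sdjournal is a plain map[string]string, and json.Marshal of such a map does not fail in practice. This example is mine, not code from the PR:

```go
package main

import (
	"encoding/json"
	"fmt"
)

func main() {
	// json.Marshal only errors on unsupported types (channels, funcs, cyclic
	// values) or failing json.Marshaler implementations, none of which apply
	// to a map[string]string of journal fields.
	fields := map[string]string{"MESSAGE": "hello", "_SYSTEMD_UNIT": "nginx.service"}
	bb, err := json.Marshal(fields)
	fmt.Println(string(bb), err) // err is always nil here
}
```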
Given that field is
LGTM 👍
What this PR does / why we need it:
There are no metrics for the journal target yet, this adds two basic ones:
- Total number of lines processed
- Total number of errors processing lines (with a reason)

Because of the way the journal works, tailing one file with many possible processes writing to it, I chose not to expose counts per unit.
Which issue(s) this PR fixes:
Fixes #5554
Special notes for your reviewer:
Doing this PR mostly to learn, not because I really need this feature. So all feedback/opinions are welcome!
I looked at the other targets and did not see any tests for the metrics they produce, so there was no pattern to copy here, but I'd be happy to figure out how to add them if requested.
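If tests were wanted, one possible shape, assuming a metrics constructor like the sketch earlier and client_golang's testutil package (metric names and helpers here are illustrative):

```go
package journal

import (
	"strings"
	"testing"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/testutil"
)

func TestJournalErrorCounter(t *testing.T) {
	reg := prometheus.NewRegistry()
	m := NewMetrics(reg)

	// Simulate two entries that failed because they carried no message.
	m.journalErrors.WithLabelValues("no_message").Inc()
	m.journalErrors.WithLabelValues("no_message").Inc()

	expected := `
# HELP promtail_journal_target_parsing_errors_total Total number of parsing errors while reading journal messages.
# TYPE promtail_journal_target_parsing_errors_total counter
promtail_journal_target_parsing_errors_total{error="no_message"} 2
`
	if err := testutil.CollectAndCompare(m.journalErrors, strings.NewReader(expected),
		"promtail_journal_target_parsing_errors_total"); err != nil {
		t.Fatal(err)
	}
}
```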
Checklist
- CHANGELOG.md
- docs/sources/upgrading/_index.md