Promtail: Add metrics for journal target #6105

Merged 4 commits into main on Jun 11, 2022

Conversation

@RutgerKe (Member) commented May 5, 2022

What this PR does / why we need it:

There are no metrics for the journal target yet, so this adds two basic ones:

  • Total number of lines processed
  • Total number of errors processing lines (with a reason in a label)

Because of the way the journal works (tailing a single stream that many different processes may write to), I chose not to expose counts per unit.
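
For reference, here is a minimal sketch of what the two counters could look like in Go. The metric and type names below are illustrative only and may differ from what this PR actually registers:

```go
package journal

import "github.com/prometheus/client_golang/prometheus"

// Metrics holds the Prometheus counters for the journal target.
// Names here are placeholders; see the PR diff for the real ones.
type Metrics struct {
	journalLines  prometheus.Counter
	journalErrors *prometheus.CounterVec
}

// NewMetrics creates the counters and registers them on the given registerer.
func NewMetrics(reg prometheus.Registerer) *Metrics {
	var m Metrics
	m.journalLines = prometheus.NewCounter(prometheus.CounterOpts{
		Namespace: "promtail",
		Name:      "journal_target_lines_total",
		Help:      "Total number of journal lines processed.",
	})
	m.journalErrors = prometheus.NewCounterVec(prometheus.CounterOpts{
		Namespace: "promtail",
		Name:      "journal_target_parsing_errors_total",
		Help:      "Total number of errors while processing journal messages, by reason.",
	}, []string{"reason"})
	if reg != nil {
		reg.MustRegister(m.journalLines, m.journalErrors)
	}
	return &m
}
```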

Which issue(s) this PR fixes:
Fixes #5554

Special notes for your reviewer:
I'm doing this PR mostly to learn, not because I really need this feature, so all feedback and opinions are welcome!
I looked at the other targets and did not see any tests for the metrics they produce, so there was no pattern to copy here, but I'd be happy to figure out how to add them if requested.

Checklist

  • Documentation added
  • Tests updated
  • Is this an important fix or new feature? Add an entry in the CHANGELOG.md.
  • Changes that require user attention or interaction to upgrade are documented in docs/sources/upgrading/_index.md

@CLAassistant commented May 5, 2022

CLA assistant check
All committers have signed the CLA.

@RutgerKe requested a review from mem on May 23, 2022 14:56
@grafanabot (Collaborator) commented:
./tools/diff_coverage.sh ../loki-main/test_results.txt test_results.txt ingester,distributor,querier,querier/queryrange,iter,storage,chunkenc,logql,loki

Change in test coverage per package. Green indicates 0 or positive change, red indicates that test coverage for a package fell.

+           ingester	0%
+        distributor	0%
+            querier	0%
+ querier/queryrange	0%
+               iter	0%
+            storage	0%
+           chunkenc	0%
+              logql	0%
+               loki	0%

@RutgerKe changed the title from "Add metrics for journal target" to "Promtail: Add metrics for journal target" on Jun 2, 2022

@RutgerKe marked this pull request as ready for review and requested a review from a team as a code owner on June 2, 2022 12:24

@trevorwhitney (Collaborator) left a comment:

Thanks for the contribution! Curious about the decision not to include unit as a label though, could you elaborate on that a bit? I feel like if I were getting errors, I would want to know which unit was emitting the bad logs so I could try to fix it, or remove it from my scrape targets. What value do the error metrics have without that? Are there other ways to investigate what's wrong?

@@ -264,6 +269,7 @@ func (t *JournalTarget) formatter(entry *sdjournal.JournalEntry) (string, error)
bb, err := json.Marshal(entry.Fields)
if err != nil {
level.Error(t.logger).Log("msg", "could not marshal journal fields to JSON", "err", err)
t.metrics.journalErrors.WithLabelValues("cannot_marshal").Inc()
Collaborator:

Do we want to move these reasons to constants so someone can quickly see how many there are? We want to make sure not to explode cardinality by adding too many of them, but 3 seems pretty reasonable for now.

Contributor:

nit: Yeah, I would prefer some journalErrorType constants with the values cannot_marshal, no_message and empty_labels.

Member, Author (@RutgerKe):

I extracted it to string constants in the metrics file, similar to what is done in the validate package of Loki. If you had something else in mind, like actually defining a new type, I could do that as well.

I did find literal strings used in the filetarget of Promtail, so there is some inconsistency.
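
For illustration, the constants could look something like the sketch below; the exact names and the file they live in may differ from the PR:

```go
package journal

// Illustrative error-reason constants used as values for the reason label
// on the journal error counter; the real names in the PR may differ.
const (
	noMessageError   = "no_message"
	emptyLabelsError = "empty_labels"
	cannotMarshal    = "cannot_marshal"
)
```

At the call site this would then read, for example, t.metrics.journalErrors.WithLabelValues(noMessageError).Inc().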

@RutgerKe (Member, Author) commented Jun 7, 2022

> Thanks for the contribution! Curious about the decision to not include unit as a label though, could you elaborate on that a bit? I feel like if I was getting errors, I would want to know which unit was emitting these bad logs so I could try to fix it, or remove it from my scrape targets. What value are the error metrics without that, are there other ways to investigate what's wrong?

Sure. I was mainly thinking about the number of units: this can easily be in the hundreds, and I'm not sure we should create a label for each of them. It also leans toward analyzing the content of the data rather than providing metrics on whether the software is working. I personally like providing as much insight into the data as possible, but should that be a concern of Promtail? It might be better done in Grafana by querying the Loki data itself.

I do get your use case, and I'm happy to add the unit to all metrics; a compromise could be to add it just to the errors.

@trevorwhitney (Collaborator) left a comment:

I buy your argument for not including the unit in the metric labels. How would you feel about emitting a log line with the unit though for the purposes of investigating the problem?
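
As a rough sketch of that suggestion (not what the PR necessarily implements), the error path could keep the counter unit-free but still log the offending unit. The helper name, the log wording, and the go-systemd import path (which may need a /v22 suffix depending on the module version) are assumptions:

```go
package journal

import (
	"github.com/coreos/go-systemd/sdjournal"
	"github.com/go-kit/log"
	"github.com/go-kit/log/level"
	"github.com/prometheus/client_golang/prometheus"
)

// logAndCountSkippedEntry is a hypothetical helper: it bumps the (unit-less)
// error counter and emits a log line naming the offending unit, so operators
// can still find the misbehaving service without a unit label on the metric.
func logAndCountSkippedEntry(logger log.Logger, errs *prometheus.CounterVec, entry *sdjournal.JournalEntry, reason string) {
	// _SYSTEMD_UNIT is the standard journal field identifying the unit that wrote the entry.
	unit := entry.Fields["_SYSTEMD_UNIT"]
	level.Warn(logger).Log("msg", "skipping journal entry", "reason", reason, "unit", unit)
	errs.WithLabelValues(reason).Inc()
}
```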

Three review threads on clients/pkg/promtail/targets/journal/journaltarget.go were marked outdated and resolved.
@kavirajk (Contributor) left a comment:

LGTM. Thanks @RutgerKe :)

Just a few minor nits. Let me know once you fix those and I can merge.

Review thread on clients/cmd/promtail/promtail-journal.yaml (outdated, resolved).
Review thread on clients/pkg/promtail/targets/journal/journaltarget_test.go (outdated, resolved).
The pull-request-size bot added the size/L label and removed size/M on Jun 9, 2022.
@grafanabot (Collaborator) commented:
./tools/diff_coverage.sh ../loki-main/test_results.txt test_results.txt ingester,distributor,querier,querier/queryrange,iter,storage,chunkenc,logql,loki

Change in test coverage per package. Green indicates 0 or positive change, red indicates that test coverage for a package fell.

+           ingester	0%
+        distributor	0.3%
+            querier	0%
+ querier/queryrange	0%
+               iter	0%
+            storage	0%
+           chunkenc	0%
+              logql	0%
+               loki	0%

@RutgerKe (Member, Author) commented Jun 9, 2022

Thanks for the reviews; I've incorporated your comments. It turns out I couldn't really trigger the error returned from the marshal function, so I deleted that counter. (Though if you have any idea how to trigger it, or in what situations it occurs, let me know and I can add it back in.)

@kavirajk (Contributor) commented Jun 9, 2022

> Turns out I couldn't really trigger the error returned from the marshal function, so I deleted that counter. (Though: If you have any idea how to trigger it / in what situations this occurs, let me know and I can add it back in).

Given that Fields is a map[string]string, I think it will always marshal well into JSON. So I also think we may not need that error count.
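
A quick standalone check (not from the PR) that backs this up:

```go
package main

import (
	"encoding/json"
	"fmt"
)

func main() {
	// Journal entry fields are a map[string]string; json.Marshal cannot fail on
	// string keys and values (even invalid UTF-8 is coerced to U+FFFD, not an error).
	fields := map[string]string{"MESSAGE": "hello", "_SYSTEMD_UNIT": "nginx.service"}
	bb, err := json.Marshal(fields)
	fmt.Println(string(bb), err)
	// Output: {"MESSAGE":"hello","_SYSTEMD_UNIT":"nginx.service"} <nil>
}
```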

@kavirajk (Contributor) left a comment:

LGTM 👍

@kavirajk enabled auto-merge (squash) on June 10, 2022 07:36
@kavirajk merged commit 1794a76 into main on Jun 11, 2022
@kavirajk deleted the rk/metrics-journald branch on June 11, 2022 09:57
Development: successfully merging this pull request may close the issue "Promtail missing internal telemetry about systemd-journal target".
5 participants