app health monitoring #1006

Open
@daveoy

has anyone thought of adding internal app metrics to show whether the problem daemons themselves are having any issues?

following on from #1003, i have added a few internal log events at various points inside the kmsg watcher so that i can track how often watch loops start and how often watchers are revived

simple things like adding

	k.logCh <- &logtypes.Log{
		Message:   "[npd-internal] Entering watch loop",
		Timestamp: time.Now(),
	}

when we start the watch loop, or

	k.logCh <- &logtypes.Log{
		Message:   "[npd-internal] Reviving kmsg parser",
		Timestamp: time.Now(),
	}

whenever we revive the kmsg parser from inside the watcher. paired with config like:

{
  "plugin": "kmsg",
  "pluginConfig": {
    "revive": "true"
  },
  "logPath": "/dev/kmsg",
  "lookback": "5m",
  "bufferSize": 1000,
  "source": "kernel-monitor",
  "conditions": [
   ...
   ...
   ...
  ],
  "rules": [
    {
      "type": "temporary",
      "reason": "WatchLoopStarted",
      "pattern": "\\[npd-internal\\] Entering watch loop.*"
    },
    {
      "type": "temporary",
      "reason": "ParserRevived",
      "pattern": "\\[npd-internal\\] Reviving.*parser.*"
    },
   ...
   ...
   ...
  ]
}
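as a quick sanity check on the two rules above, here is a minimal sketch that runs the same patterns (with the JSON escaping removed) against the messages emitted by the watcher. it assumes the patterns are evaluated with Go's standard `regexp` package; the `rule` type and `matchReason` function are hypothetical names made up for this example, not part of node-problem-detector.

```go
package main

import (
	"fmt"
	"regexp"
)

// rule is a hypothetical stand-in that mirrors the "reason" and "pattern"
// fields of the temporary rules in the config above.
type rule struct {
	Reason  string
	Pattern *regexp.Regexp
}

// rules holds the two patterns from the config. In JSON the backslashes are
// doubled; here they appear as the regex engine sees them.
var rules = []rule{
	{Reason: "WatchLoopStarted", Pattern: regexp.MustCompile(`\[npd-internal\] Entering watch loop.*`)},
	{Reason: "ParserRevived", Pattern: regexp.MustCompile(`\[npd-internal\] Reviving.*parser.*`)},
}

// matchReason returns the reason of the first rule whose pattern matches msg,
// or an empty string if no rule matches.
func matchReason(msg string) string {
	for _, r := range rules {
		if r.Pattern.MatchString(msg) {
			return r.Reason
		}
	}
	return ""
}

func main() {
	msgs := []string{
		"[npd-internal] Entering watch loop",
		"[npd-internal] Reviving kmsg parser",
		"some unrelated kernel message",
	}
	for _, msg := range msgs {
		fmt.Printf("%q -> %q\n", msg, matchReason(msg))
	}
}
```

the `[npd-internal]` prefix is what keeps these rules from matching ordinary kernel messages, so it is worth keeping it distinctive.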

with the Prometheus exporter enabled (the default), we get metrics that look like:

# HELP problem_counter Number of times a specific type of problem have occurred.
# TYPE problem_counter counter
   ...
   ...
   ...
problem_counter{reason="ParserRevived"} 1
   ...
   ...
   ...
problem_counter{reason="WatchLoopStarted"} 2
   ...
   ...
   ...
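the two log events at the top could share one small helper so that every internal event gets the same prefix. this is only a sketch: `Log` is a local stand-in for `logtypes.Log` with just the two fields used above, and `emitInternal` and `internalPrefix` are hypothetical names. unlike the snippets above, which block on the channel send, this version drops the event when the channel is full, which is an assumption about what is desirable rather than what the watcher does today.

```go
package main

import (
	"fmt"
	"time"
)

// Log is a local stand-in for logtypes.Log, limited to the fields used above.
type Log struct {
	Message   string
	Timestamp time.Time
}

// internalPrefix is the tag that the rules in the config match on.
const internalPrefix = "[npd-internal] "

// emitInternal is a hypothetical helper that sends a prefixed internal event
// on logCh. The send is non-blocking: if the channel is full the event is
// dropped and false is returned, so the watcher is never stalled by its own
// health events.
func emitInternal(logCh chan<- *Log, msg string) bool {
	select {
	case logCh <- &Log{Message: internalPrefix + msg, Timestamp: time.Now()}:
		return true
	default:
		return false
	}
}

func main() {
	logCh := make(chan *Log, 2)
	emitInternal(logCh, "Entering watch loop")
	emitInternal(logCh, "Reviving kmsg parser")
	close(logCh)
	for l := range logCh {
		fmt.Println(l.Message)
	}
}
```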
