Metrics module #1850
Conversation
Added some config options allowing the metrics package to be configured.
Added an IsConnected method to the network provider interface to allow computing some network metrics.
It might be helpful to expose https://golang.org/pkg/net/http/pprof/ as well, maybe just if run in debug mode. And it was brought up in Discord that being able to get the node id would be good, too. I can also open an issue instead, if that's better.
Yeah it would.
@knarz I think this is better as a separate issue (though I strongly agree we need it) - want to open one and link back?
Good idea! But let's do it as a separate PR. The scope of this PR is adding an endpoint providing some application-level metrics intended for use with tools like Prometheus, and I think we shouldn't mix it with a Go-specific profiling data source.
Do you have a list with metrics you want to add or are you still looking for input?
This PR is just a starting point so I'd appreciate all ideas for useful app-level metrics. |
@lukasz-zimnoch This is great - it would be helpful if you kept a table of metrics in the description of the PR. An example invocation would also be helpful.
This is a non-blocking comment, really mostly thoughts that bubbled up after reading what the metric does.
pkg/metrics/metrics.go
// ObserveConnectedBootstrapPercentage triggers an observation process of the
// connected_bootstrap_percentage metric.
func ObserveConnectedBootstrapPercentage(
This is a neat metric. What stands out is that an operator must know how many peers are in their Peers list to reasonably assert the meaning of the value reported here.
Simple example: person A has 2 peers in their list and can't reach one of them; this metric will report 50% of bootstraps connected. Person B has 20 peers in their list and can't reach half of them; this metric will also report 50% of bootstraps connected. It's reasonable to say that in this scenario person A should take action to mitigate losing the network in case they reboot, whereas person B can still sip their coffee in comfort.
It's reasonable to assume that an operator should know what their configurations contain - but I could see this being a possible point of confusion for "actionable data". Maybe "actionable data" isn't the point here?
Digging a bit closer to what I'm feeling here, what is it that we want to surface with these metrics (the scope)? It could be as simple as "surface an operator's state in relation to its configuration and the current state of the network". What is the intent of surfacing these metrics?
From an operator perspective, I'm of the opinion that each metric surfaced should represent a potentially actionable state, and we should take care to articulate that state in the description for each metric. "Actionable" is highly opinionated; however, using this metric as an example, we can say "if this reads 0, you cannot connect to the network if your machine restarts".
Once we make these statements it's easier to see if there is overlap between metrics, etc.
To be clear, I'm 100% ok leaving this as-is, mostly wanting to flesh out the motivation.
Regarding those Connected* metrics, my intentions were as follows:

ConnectedPeersCount: I realized we often check this parameter in our logs, treating it as a general clue about network health and the magnitude of the possible load. Apart from that, if something goes wrong with the node, this parameter often falls suddenly. Hence, I think this is a good candidate for monitoring.

ConnectedBootstrapPercentage: This one is here to give a quick answer to the question "what's the condition of my configured network entry point?" and to raise an alert if the condition is rather bad. If someone configures only two bootstraps, this is their own risk. This metric just shows the situation relative to the LibP2P.Peers property. I think we must be relative here because we can't easily say how many bootstraps on the list are enough, as it depends on the network size. Second, we use percentage instead of count because I think we can have some trouble defining alerts if we use the latter. For example, if we decide to alert when the bootstrap count is less than 5, it may be a true alert for a node with 100 configured peers but not quite for a node with 10 configured peers. Defining an alert using percentage is easier and means the same for every node.
Nevertheless, as I mentioned above, this PR is just a starting point and its main scope is to provide a metrics "framework" and some experimental metrics to give a good development base towards a full-fledged metrics system.
"what's the condition of my configured network entry point?"
I like this - and did not position it that way when working through the code initially.
Second thing, we use percentage instead of count because I think we can have some troubles when defining alerts if we use the latter. For example, if we decide to alert when the bootstrap count is less than 5, it may be a true alert for a node with 100 configured peers but not quite for a node with 10 configured peers. Defining an alert using percentage is easier and means the same for every node.
I do agree with everything else but I don't think I agree with this single statement. 😆
If we want to answer the question of "what's the condition of my configured entry point", I think we should be very explicit and expect the operator to define their comfort level themselves. Just as @sthompson22 noted, 50% of 4 is not the same as 50% of 200.
I like that we have a separate metric for all connected peers and for bootstrap peers, but I do think the bootstrap peers metric should be a number, not a percentage, and the operator should define the alert knowing their configuration and the number of bootstrap peers defined there.
Ok, we can change connected_bootstrap_percentage to connected_bootstrap_count. Having it as a percentage seemed handier at the beginning, but after summarizing all the points, making it an explicit count appears to be the better option now.
cmd/start.go
if tick := config.Metrics.Tick; tick != 0 {
	observationTick = time.Duration(tick) * time.Second
} else {
	observationTick = 60 * time.Second
}
I like a 60 sec tick for network metrics but I don't like it for ETH connectivity. I think we should have a more conservative tick for this one, so as not to affect available request counts in case someone uses a 3rd-party provider for Ethereum. Maybe every 10 minutes is enough?
Also, I'd put those defaults as public constants in the metrics package and use them from here.
We can change it this way, but 10 minutes also means a huge delay regarding alerts. Aren't eth node connectivity problems something we want to know about quickly?
Really depends on the setup. The majority of stakers will use a 3rd-party provider for Ethereum. I expect they will be comfortable with 10 or even 30 minute ticks. The ones running their own Ethereum clients will probably want to have 1 minute ticks. As long as those values are configurable in the toml, we are fine.
I've not looked, but if this allows alerting by predefined tick OR first failure, it seems like that'd cover most cases?
Here we just expose metrics; alerts are an external process. A typical situation regarding this metric will be as follows:

1. The eth_connectivity metric is checked every 10 minutes internally by the client.
2. An external monitoring tool (e.g. Prometheus) will call the metrics endpoint, for example, every minute, but it will record the same eth_connectivity value most of the time.
3. An alert (e.g. defined in Prometheus) will be raised if the metric value drops below an arbitrary threshold for a defined amount of time, for example, 1 minute.

So, because of the above, we may detect connectivity problems even 10 minutes after they actually occurred, and then receive an alert after the time mentioned in point 3.
But, as Piotr said, it is configurable so we can adjust those values according to our needs.
Works as advertised.
Depends on keep-network/keep-common#40
Summary
Here we introduce some basic metrics, gathering them and exposing them through an HTTP endpoint.
Metrics table
Example response from the /metrics endpoint
The presented output follows the Prometheus text-based exposition format.