Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Metrics module #1850

Merged
merged 14 commits into from
Jul 3, 2020
Merged

Metrics module #1850

merged 14 commits into from
Jul 3, 2020

Conversation

lukasz-zimnoch
Copy link
Member

@lukasz-zimnoch lukasz-zimnoch commented Jun 10, 2020

Depends on keep-network/keep-common#40

Summary

Here we introduce some basic metrics gathering and exposing them through an HTTP endpoint.

Metrics table

Metric Description
connected_peers_count Presents the count of all connected peers
connected_bootstrap_count Presents the count of connected bootstraps from the list of bootstraps defined in the configuration
eth_connectivity Presents whether the connection with ETH node is up (1 - yes, 0 - no)
libp2p_info Presents some libp2p specific data

Example response from the /metrics endpoint

Presented output follows the Prometheus text-based exposition format

# TYPE connected_peers_count gauge
connected_peers_count 62 1592993341396

# TYPE connected_bootstrap_count gauge
connected_bootstrap_count 10 1592993341396

# TYPE eth_connectivity gauge
eth_connectivity 1 1592993341421

libp2p_info{id="16Uiu2HAkuTUKNh6HkfvWBEkftZbqZHPHi3Kak5ZUygAxvsdQ2UgG"} 1

Added some configs options allowing
to configure the metrics package.
Added IsConnected method to the
network provider interface to allow
make some network metrics.
@knarz
Copy link

knarz commented Jun 11, 2020

It might be helpful to expose https://golang.org/pkg/net/http/pprof/ as well, maybe just if ran in debug mode.

And it was brought up in discord that being able to get the node id would be good, too.

I can also open an issue instead, if that's better.

@mhluongo
Copy link
Member

It might be helpful to expose https://golang.org/pkg/net/http/pprof/ as well, maybe just if ran in debug mode.

Yeah it would.

And it was brought up in discord that being able to get the node id would be good, too.
I can also open an issue instead, if that's better.

@knarz I think this is better as a separate issue (though I strongly agree we need it) - want to open one and link back?

@knarz knarz mentioned this pull request Jun 11, 2020
@lukasz-zimnoch
Copy link
Member Author

It might be helpful to expose https://golang.org/pkg/net/http/pprof/ as well, maybe just if ran in debug mode.

Good idea! But, let's do it as a separate PR. The scope of this PR is adding an endpoint providing some application-level metrics intended to use with tools like Prometheus and I think we shouldn't mix it with Go-specific profiling data source for pprof.

@knarz
Copy link

knarz commented Jun 12, 2020

Do you have a list with metrics you want to add or are you still looking for input?

@lukasz-zimnoch
Copy link
Member Author

Do you have a list with metrics you want to add or are you still looking for input?

This PR is just a starting point so I'd appreciate all ideas for useful app-level metrics.

@sthompson22
Copy link
Contributor

@lukasz-zimnoch This is great - it would be helpful if you kept a table of metrics in the description of the PR. An example invocation would also be helpful.

Copy link
Contributor

@sthompson22 sthompson22 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a non-blocking comment, really mostly thoughts that bubbled up after reading what the metric does.


// ObserveConnectedBootstrapPercentage triggers an observation process of the
// connected_bootstrap_percentage metric.
func ObserveConnectedBootstrapPercentage(
Copy link
Contributor

@sthompson22 sthompson22 Jun 16, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a neat metric -

What stands out is that an operator must know how many peers are in their Peers list to reasonably assert the meaning of the value reported here.

Simple example: Person A has 2 peers in their list and they can't reach one of them, this metric will report 50% bootstraps connected. Person B has 20 peers in their list and you can't reach half of them this metric will report 50% bootstraps connected. It's reasonable to say In this scenario person A should take action to mitigate loosing the network in case they reboot, whereas person B can still sip their coffee in comfort.

It's reasonable to assume that an operator should know what their configurations contain - but I could see this being a possible point of confusion for "actionable data". Maybe "actionable data" isn't the point here?

Digging a bit closer to what I'm feeling here, what is it that we want to surface with these metrics (the scope)? It could be as simple as "surface an operators state in relation to its configuration and current state of the network". What is the intent of surfacing these metrics?

From an operator perspective I'm of the opinion that each metric surfaced should represent a potentially actionable state. We should take care to articulate that state in the description for each metric. "actionable" is highly opinionated however using this metric as an example we can say "If this reads 0, you cannot connect to the network if your machine restarts"

Once we make these statements it's easier to see if there is overlap between metrics, etc.

To be clear, I'm 100% ok leaving this as-is, mostly wanting to flesh out the motivation.

Copy link
Member Author

@lukasz-zimnoch lukasz-zimnoch Jun 24, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Regarding those Connected* metrics my intentions were as follows:

  • ConnectedPeersCount: I realized we often check this parameter in our logs treating it as a general clue about the network health and the magnitude of the possible load. Apart from that, if something goes wrong with the node, this parameter often falls suddenly. Hence, I think this is a good candidate for monitoring.
  • ConnectedBootstrapPercentage: This one is here to give a quick response to the question: "what's the condition of my configured network entry point?" and raise an alert if the condition is rather bad. If someone configures only two bootstraps, this is their own risk. This metric just shows the situation relative to the LibP2P.Peers property. I think we must be relative here because we can't easily say how much bootstraps on the list are enough as it depends on the network size. Second thing, we use percentage instead of count because I think we can have some troubles when defining alerts if we use the latter. For example, if we decide to alert when the bootstrap count is less than 5, it may be a true alert for a node with100 configured peers but not quite for a node with 10 configured peers. Defining an alert using percentage is easier and means the same for every node.

Nevertheless, as I mentioned above, this PR is just a starting point and its main scope is to provide metrics "framework" and some experimental metrics to give a good development base towards a full-fledged metrics system.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"what's the condition of my configured network entry point?"

I like this - and did not position it that way when working through the code initially.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Second thing, we use percentage instead of count because I think we can have some troubles when defining alerts if we use the latter. For example, if we decide to alert when the bootstrap count is less than 5, it may be a true alert for a node with100 configured peers but not quite for a node with 10 configured peers. Defining an alert using percentage is easier and means the same for every node.

I do agree with everything else but I don't think I agree with this single statement. 😆

If we want to answer the question of "what's the condition of my configured entry point", I think we should be very explicit and expect the operator to define their comfort level themselves. Just like @sthompson22's noted, 50% of 4 is something else than 50% of 200.

I like we have a separate metric for all connected peers and for bootstrap peers but I do think bootstrap peers metric should be a number, not a percentage and the operator should define the alert knowing their configuration and number of bootstrap peers defined there.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, we can change connected_bootstrap_percentage to connected_bootstrap_count. Having it as percents seemed handier at the beginning but after summarizing all points, making it as explicit count appears as a better option now.

@lukasz-zimnoch lukasz-zimnoch marked this pull request as ready for review July 1, 2020 10:10
cmd/start.go Show resolved Hide resolved
cmd/start.go Outdated
if tick := config.Metrics.Tick; tick != 0 {
observationTick = time.Duration(tick) * time.Second
} else {
observationTick = 60 * time.Second
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like 60sec tick for network metrics but I don't like it for ETH connectivity. I think we should have a more conservative tick for this one to do not affect available requests counts in case someone uses 3rd party provider for Ethereum. Maybe every 10 minutes is enough?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, I'd put those defaults as public constants in metrics package and use them from here.

Copy link
Member Author

@lukasz-zimnoch lukasz-zimnoch Jul 3, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can change it this way but 10 minutes means also a huge delay regarding alerts. Isn't eth node connectivity problems something we want to know quickly?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Really depends on the setup. The majority of stakers will use a 3rd party provider for Ethereum. I expect they will be comfortable with 10 or even 30 minutes ticks. The ones running their own Ethereum clients will probably want to have 1 minute ticks. As long as those values are configurable in toml we are fine.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've not looked, but if this allows alerting by predefined tick OR first failure, it seems like that'd cover most cases?

Copy link
Member Author

@lukasz-zimnoch lukasz-zimnoch Jul 3, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here we just expose metrics, alerts are an external process. A typical situation regarding this metric will be as follow:

  1. eth_connectivity metric is checked every 10 minutes internally by the client
  2. An external monitoring tool (e.g. Prometheus) will call the metrics endpoint, for example, every minute but it will record the same eth_connectivity value most of the time.
  3. An alert (e.g. defined in Prometheus) will be raised if the metric value drops below an arbitrary threshold for a defined amount of time, for example, 1 minute.

So, because of the above, we may have a situation when we detect connectivity problems even 10 minutes after they actually occurred, and then we receive an alert after the time mentioned in point 3.

But, as Piotr said, it is configurable so we can adjust those values according to our needs.

config/config.go Outdated Show resolved Hide resolved
pkg/metrics/metrics.go Show resolved Hide resolved
pkg/metrics/metrics.go Outdated Show resolved Hide resolved
@pdyraga
Copy link
Member

pdyraga commented Jul 3, 2020

Works as advertised.

➜  ~ curl -D - http://localhost:8080/metrics
HTTP/1.1 200 OK
Date: Fri, 03 Jul 2020 13:04:00 GMT
Content-Length: 293
Content-Type: text/plain; charset=utf-8

# TYPE connected_peers_count gauge
connected_peers_count 2 1593781430137

# TYPE connected_bootstrap_count gauge
connected_bootstrap_count 1 1593781430137

# TYPE eth_connectivity gauge
eth_connectivity 1 1593781010132

libp2p_info{id="16Uiu2HAm82kFx5PMHWUARfKhPhg9gQVdasiySjZaPPsNUjuE59TV"}

@pdyraga pdyraga merged commit 766689b into master Jul 3, 2020
@pdyraga pdyraga deleted the metrics branch July 3, 2020 13:47
@pdyraga pdyraga added this to the v1.3.0 milestone Jul 27, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants