Metrics module #1850
Conversation
Added some config options allowing the metrics package to be configured.
Added an IsConnected method to the network provider interface to allow computing some network metrics.
It might be helpful to expose https://golang.org/pkg/net/http/pprof/ as well, maybe just if run in debug mode. And it was brought up in Discord that being able to get the node id would be good, too. I can also open an issue instead, if that's better.
Yeah it would.
@knarz I think this is better as a separate issue (though I strongly agree we need it) - want to open one and link back?
Good idea! But let's do it as a separate PR. The scope of this PR is adding an endpoint providing some application-level metrics intended for use with tools like Prometheus, and I think we shouldn't mix it with a Go-specific profiling data source.
Do you have a list with metrics you want to add or are you still looking for input?
This PR is just a starting point so I'd appreciate all ideas for useful app-level metrics. |
@lukasz-zimnoch This is great - it would be helpful if you kept a table of metrics in the description of the PR. An example invocation would also be helpful.
This is a non-blocking comment, really mostly thoughts that bubbled up after reading what the metric does.
pkg/metrics/metrics.go
// ObserveConnectedBootstrapPercentage triggers an observation process of the
// connected_bootstrap_percentage metric.
func ObserveConnectedBootstrapPercentage(
This is a neat metric. What stands out is that an operator must know how many peers are in their Peers list to reasonably assert the meaning of the value reported here.
Simple example: person A has 2 peers in their list and can't reach one of them; this metric will report 50% of bootstraps connected. Person B has 20 peers in their list and can't reach half of them; this metric will also report 50% of bootstraps connected. It's reasonable to say that in this scenario person A should take action to mitigate losing the network in case they reboot, whereas person B can still sip their coffee in comfort.
It's reasonable to assume that an operator should know what their configurations contain - but I could see this being a possible point of confusion for "actionable data". Maybe "actionable data" isn't the point here?
Digging a bit closer to what I'm feeling here, what is it that we want to surface with these metrics (the scope)? It could be as simple as "surface an operator's state in relation to its configuration and the current state of the network". What is the intent of surfacing these metrics?
From an operator perspective, I'm of the opinion that each metric surfaced should represent a potentially actionable state, and we should take care to articulate that state in the description for each metric. "Actionable" is highly opinionated; however, using this metric as an example, we can say "if this reads 0, you cannot connect to the network if your machine restarts".
Once we make these statements it's easier to see if there is overlap between metrics, etc.
To be clear, I'm 100% ok leaving this as-is, mostly wanting to flesh out the motivation.
Regarding those Connected* metrics, my intentions were as follows:

ConnectedPeersCount: I realized we often check this parameter in our logs, treating it as a general clue about network health and the magnitude of the possible load. Apart from that, if something goes wrong with the node, this parameter often falls suddenly. Hence, I think this is a good candidate for monitoring.

ConnectedBootstrapPercentage: This one is here to give a quick answer to the question "what's the condition of my configured network entry point?" and to raise an alert if the condition is rather bad. If someone configures only two bootstraps, this is their own risk. This metric just shows the situation relative to the LibP2P.Peers property. I think we must be relative here because we can't easily say how many bootstraps on the list are enough, as it depends on the network size. Second, we use percentage instead of count because I think we can have some trouble defining alerts if we use the latter. For example, if we decide to alert when the bootstrap count is less than 5, it may be a true alert for a node with 100 configured peers but not quite for a node with 10 configured peers. Defining an alert using percentage is easier and means the same for every node.
Nevertheless, as I mentioned above, this PR is just a starting point and its main scope is to provide a metrics "framework" and some experimental metrics to give a good development base towards a full-fledged metrics system.
"what's the condition of my configured network entry point?"
I like this - and did not position it that way when working through the code initially.
Second thing, we use percentage instead of count because I think we can have some troubles when defining alerts if we use the latter. For example, if we decide to alert when the bootstrap count is less than 5, it may be a true alert for a node with 100 configured peers but not quite for a node with 10 configured peers. Defining an alert using percentage is easier and means the same for every node.
I do agree with everything else but I don't think I agree with this single statement. 😆
If we want to answer the question of "what's the condition of my configured entry point", I think we should be very explicit and expect the operator to define their comfort level themselves. Just as @sthompson22 noted, 50% of 4 is not the same as 50% of 200.
I like that we have a separate metric for all connected peers and for bootstrap peers, but I do think the bootstrap peers metric should be a number, not a percentage, and the operator should define the alert knowing their configuration and the number of bootstrap peers defined there.
Ok, we can change connected_bootstrap_percentage to connected_bootstrap_count. Having it as a percentage seemed handier at the beginning, but after summarizing all the points, making it an explicit count appears to be the better option now.
cmd/start.go
if tick := config.Metrics.Tick; tick != 0 {
	observationTick = time.Duration(tick) * time.Second
} else {
	observationTick = 60 * time.Second
}
I like a 60 sec tick for network metrics but I don't like it for ETH connectivity. I think we should have a more conservative tick for this one, so as not to affect available request counts in case someone uses a 3rd-party provider for Ethereum. Maybe every 10 minutes is enough?
Also, I'd put those defaults as public constants in the metrics package and use them from here.
We can change it this way, but 10 minutes also means a huge delay regarding alerts. Aren't eth node connectivity problems something we want to know about quickly?
Really depends on the setup. The majority of stakers will use a 3rd-party provider for Ethereum. I expect they will be comfortable with 10 or even 30 minute ticks. The ones running their own Ethereum clients will probably want to have 1 minute ticks. As long as those values are configurable in the toml, we are fine.
I've not looked, but if this allows alerting by predefined tick OR first failure, it seems like that'd cover most cases?
Here we just expose metrics; alerts are an external process. A typical situation regarding this metric will be as follows:

1. The eth_connectivity metric is checked every 10 minutes internally by the client.
2. An external monitoring tool (e.g. Prometheus) will call the metrics endpoint, for example, every minute, but it will record the same eth_connectivity value most of the time.
3. An alert (e.g. defined in Prometheus) will be raised if the metric value drops below an arbitrary threshold for a defined amount of time, for example, 1 minute.

So, because of the above, we may detect connectivity problems even 10 minutes after they actually occurred, and then receive an alert after the time mentioned in point 3.
But, as Piotr said, it is configurable so we can adjust those values according to our needs.
Works as advertised.
Depends on keep-network/keep-common#40
Summary
Here we introduce some basic metrics, gathering them and exposing them through an HTTP endpoint.
Metrics table
Example response from the /metrics endpoint
The presented output follows the Prometheus text-based exposition format.