Metrics framework #926

cody-littley · 2024-11-21T22:46:17Z

Why are these changes needed?

This is a small wrapper around Prometheus that will significantly reduce boilerplate code when dealing with metrics. As an added perk, it can also automatically generate a Markdown file documenting all metrics the system is using.

Checks

I've made sure the tests are passing. Note that there might be a few flaky tests, in that case, please comment that they are not relevant.
I've checked the new test coverage and the coverage percentage didn't drop.
Testing Strategy
- Unit tests
- Integration tests
- This PR is not tested :(

Signed-off-by: Cody Littley <cody@eigenlabs.org>

dmanc

Would be good to validate if we can migrate an existing component to use the metrics framework. Or is the plan for only the newer V2 components to utilize it?

dmanc · 2024-11-22T03:33:27Z

common/metrics/config.go

+	// for this list, use the format "metricName:metricLabel" if the metric has a label, or just "metricLabel"
+	// if the metric does not have a label. Any fully qualified metric name that matches exactly with an entry
+	// in this list will be blacklisted (i.e. it will not be reported).
+	MetricsBlacklist []string


Should metrics filtering be something that is taken care of by the application or the component that is collecting the metrics?

When we were preparing for the (since abandoned) traffic generator, this seemed like a useful concept. I had added a large number of metrics, but there was concern that some of them were low bang for the buck (since it costs us $$$ to store them). We were planning on disabling a bunch of metrics with configuration changes, and turning them on in the future if we ever had an issue where they would be useful for debugging.

That being said, if people don't think this is a useful feature, it would be fairly straightforward to remove. What do others think?

Yeah, all I'm saying is that it's probably possible to filter metrics and logs in a similar manner with the Grafana agent.

This is pretty useful now though, because we haven't figured out how to do that in the Grafana agent. Regardless, if we want to save more money, we would need to learn how to do it in the Grafana agent, because we're not always running applications that allow metrics filtering in this manner.

Makes sense. I have removed the metrics blacklisting feature from this framework.

cody-littley · 2024-11-22T14:07:37Z

@dmanc

Would be good to validate if we can migrate an existing component to use the metrics framework. Or is the plan for only the newer V2 components to utilize it?

My plan was to leave existing non-v2 code alone. This doesn't really change the core framework and backend, it's mostly just a convenience wrapper. Are there any existing things that use metrics that will persist after v2 is enabled and v1 is deprecated?

Signed-off-by: Cody Littley <cody@eigenlabs.org>

dmanc · 2024-11-22T20:42:13Z

The churner would be one example

Signed-off-by: Cody Littley <cody@eigenlabs.org>

cody-littley · 2024-11-25T19:39:38Z

@dmanc I don't think modifying the churner belongs as a part of this PR. Here's a draft PR that contains the changes to the churner. If we merge this PR, I will open another with just the changes to the churner's metrics: #934

The only difference between the existing churner metrics and the new churner metrics, after this change, will be that the metric eigenda_churner_requests will have a name change to eigenda_churner_request_count. This is because the new metric framework enforces a unit on each metric, and that unit gets appended to the name of the metric in Prometheus.
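A minimal sketch of the naming rule described above, assuming the fully qualified name is the namespace, the metric name, and the unit joined with underscores. The helper `qualifiedName` and the namespace value are hypothetical and are not taken from the framework.

```go
package main

import (
	"fmt"
	"strings"
)

// qualifiedName joins namespace, metric name, and unit with underscores.
// Hypothetical helper: the framework's real naming code may differ.
func qualifiedName(namespace, name, unit string) string {
	return strings.Join([]string{namespace, name, unit}, "_")
}

func main() {
	// Assumed inputs chosen to reproduce the renamed churner metric.
	fmt.Println(qualifiedName("eigenda_churner", "request", "count"))
}
```

Under these assumptions, the unit `count` is what turns the old name into eigenda_churner_request_count.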

Note that there may be some deviations in the metrics framework code between this PR and the churner PR I link to above (since these branches are being worked on in parallel). Please treat this PR as the source of truth for the metrics framework code.

Signed-off-by: Cody Littley <cody@eigenlabs.org>

ian-shim · 2024-11-26T06:05:45Z

common/metrics/label_maker.go

+	v := reflect.ValueOf(labelTemplate)
+	t := v.Type()
+	labeler.templateType = t
+	for i := 0; i < t.NumField(); i++ {


This panics if labelTemplate is not a struct
We should check if v.Kind() == reflect.Struct

Good idea, done.

	if v.Kind() != reflect.Struct {
		return nil, fmt.Errorf("label template must be a struct")
	}

As an aside, add the reflection library to the list of things that make me annoyed at the people who designed golang. One of the core principles is that things should return errors, not panic. This is exactly the sort of situation where returning an error would be way better than panicking.
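For illustration, here is a self-contained sketch of a constructor that guards the reflection calls, assuming one label key per struct field. The `requestLabel` type and the nil-template handling are assumptions made for this example, and the real newLabelMaker in this PR may differ.

```go
package main

import (
	"fmt"
	"reflect"
)

// labelMaker holds the label keys derived from a template struct.
type labelMaker struct {
	keys []string
}

// newLabelMaker derives one label key per field of the template struct.
// A nil template yields a labelMaker with no keys (assumed behavior).
func newLabelMaker(labelTemplate any) (*labelMaker, error) {
	if labelTemplate == nil {
		return &labelMaker{}, nil
	}
	v := reflect.ValueOf(labelTemplate)
	// Guard before calling NumField, which panics on non-struct types.
	if v.Kind() != reflect.Struct {
		return nil, fmt.Errorf("label template must be a struct, got %s", v.Kind())
	}
	t := v.Type()
	keys := make([]string, 0, t.NumField())
	for i := 0; i < t.NumField(); i++ {
		keys = append(keys, t.Field(i).Name)
	}
	return &labelMaker{keys: keys}, nil
}

// requestLabel is a hypothetical template used only for this example.
type requestLabel struct {
	Method string
	Status string
}

func main() {
	lm, err := newLabelMaker(requestLabel{})
	fmt.Println(lm.keys, err)

	_, err = newLabelMaker("not a struct")
	fmt.Println(err)
}
```

With the Kind check in place, a bad template surfaces as an ordinary error at construction time instead of a panic inside the reflect package.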

ian-shim · 2024-11-26T06:24:30Z

common/metrics/metrics_server.go

+	}
+
+	if !m.isAlive.Load() {
+		return errors.New("metrics server already stopped")


should it still call m.server.Close()?

Unless I'm missing something, if we trigger this statement the server will have already been closed.
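One way to make "already stopped" and "closed" mean the same thing is to flip the flag and close the server in a single guarded step. This is only a sketch: the PR's code uses isAlive.Load(), while CompareAndSwap is swapped in here, and `metricsServer` and `countingCloser` are hypothetical stand-ins.

```go
package main

import (
	"errors"
	"fmt"
	"io"
	"sync/atomic"
)

// metricsServer is a hypothetical stand-in for the type under review.
type metricsServer struct {
	isAlive atomic.Bool
	server  io.Closer
}

// Stop closes the underlying server at most once. CompareAndSwap flips
// isAlive from true to false atomically, so only one caller reaches Close.
func (m *metricsServer) Stop() error {
	if !m.isAlive.CompareAndSwap(true, false) {
		return errors.New("metrics server already stopped")
	}
	return m.server.Close()
}

// countingCloser records how many times Close is called.
type countingCloser struct{ calls int }

func (c *countingCloser) Close() error {
	c.calls++
	return nil
}

func main() {
	closer := &countingCloser{}
	m := &metricsServer{server: closer}
	m.isAlive.Store(true)

	fmt.Println(m.Stop())     // first call closes the server
	fmt.Println(m.Stop())     // second call reports it was already stopped
	fmt.Println(closer.calls) // number of Close calls observed
}
```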

ian-shim · 2024-11-26T06:29:05Z

common/metrics/label_maker.go

+// labelMaker encapsulates logic for creating labels for metrics.
+type labelMaker struct {
+	keys         []string
+	emptyValues  []string


how is emptyValues used?

If a metric is set up with a non-nil label template, but no label is provided at runtime, then this emptyValues list is passed to Prometheus. Prometheus returns an error if you don't pass in the expected number of label values.

	func (l *labelMaker) extractValues(label any) ([]string, error) {
		// ...
		if label == nil {
			return l.emptyValues, nil
		}
		// ...
	}

We could create a new empty list each time, but I thought it would be more resource efficient to just reuse the same empty list over and over.
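A rough sketch of that behavior, assuming all template fields are strings. The `requestLabel` type and the `exampleLabelMaker` helper are hypothetical, and the field-by-field extraction is a guess at the approach rather than the code in this PR.

```go
package main

import (
	"fmt"
	"reflect"
)

// requestLabel is a hypothetical label template used only in this example.
type requestLabel struct {
	Method string
	Status string
}

// labelMaker mirrors the fields discussed in the review.
type labelMaker struct {
	keys         []string
	emptyValues  []string
	templateType reflect.Type
}

// exampleLabelMaker builds a labelMaker for requestLabel by hand so the
// example can focus on extractValues. Hypothetical helper.
func exampleLabelMaker() *labelMaker {
	keys := []string{"Method", "Status"}
	return &labelMaker{
		keys:         keys,
		emptyValues:  make([]string, len(keys)), // allocated once, reused for every nil label
		templateType: reflect.TypeOf(requestLabel{}),
	}
}

// extractValues returns one value per key, in key order.
func (l *labelMaker) extractValues(label any) ([]string, error) {
	if label == nil {
		// No label supplied: return the shared slice of empty strings.
		return l.emptyValues, nil
	}
	v := reflect.ValueOf(label)
	if v.Type() != l.templateType {
		return nil, fmt.Errorf("label type %s does not match template type %s", v.Type(), l.templateType)
	}
	values := make([]string, 0, len(l.keys))
	for i := 0; i < v.NumField(); i++ {
		values = append(values, v.Field(i).String())
	}
	return values, nil
}

func main() {
	l := exampleLabelMaker()
	fmt.Println(l.extractValues(nil))
	fmt.Println(l.extractValues(requestLabel{Method: "GET", Status: "200"}))
	fmt.Println(l.extractValues("wrong type"))
}
```

Reusing one slice avoids an allocation per call, at the cost that callers must treat the returned slice as read-only.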

ian-shim · 2024-11-26T06:37:27Z

common/metrics/count_metric.go

+	description string,
+	labelTemplate any) (CountMetric, error) {
+
+	labeler, err := newLabelMaker(labelTemplate)


should we only create labeler when labelTemplate is not nil?

The labeler becomes a no-op when the label template is nil. The purpose of using this pattern was to simplify the business logic a little. Instead of wrapping each use of the labeler in an if statement depending on whether the labeler is enabled or not, we can instead use the labeler in the same way regardless of whether or not we have a non-nil template.

This being said, if you don't like this pattern, let me know and I'll make the suggested change.

ian-shim · 2024-11-26T06:38:09Z

common/metrics/count_metric.go

+		l = label[0]
+	}
+
+	values, err := m.labeler.extractValues(l)


what if metric has no labels?

The labeler handles this edge case.

When l is nil, m.labeler.extractValues(l) returns a list of empty strings with length equal to the number of fields in the template.

If the template is nil, m.labeler.extractValues(l) returns an empty list.

I see. Does m.vec.WithLabelValues(values...) handle empty values gracefully?

m.vec.WithLabelValues() requires that the number of provided values be exactly equal to the number of registered keys. It's ok if a value is an empty string, but the number of strings must match.

ian-shim · 2024-11-26T06:41:58Z

common/metrics/latency_metric.go

+	nanoseconds := float64(latency.Nanoseconds())
+	milliseconds := nanoseconds / 1e6


why don't we use Milliseconds()?

converted to float64(time.Millisecond)
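For context, Duration.Milliseconds() returns an int64, so it drops the fractional part, while dividing by float64(time.Millisecond) keeps it. The sketch below shows the difference. The helper name `toMilliseconds` is an assumption, since the exact expression used in the PR is not shown in this thread.

```go
package main

import (
	"fmt"
	"time"
)

// toMilliseconds converts a duration to fractional milliseconds.
// Hypothetical helper; it is not taken from the PR.
func toMilliseconds(d time.Duration) float64 {
	return float64(d) / float64(time.Millisecond)
}

func main() {
	latency := 1500 * time.Microsecond

	fmt.Println(latency.Milliseconds())  // integer milliseconds, fraction dropped
	fmt.Println(toMilliseconds(latency)) // fractional milliseconds
}
```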

ian-shim · 2024-11-26T06:45:12Z

common/metrics/metrics.go

+	//
+	// The label parameter accepts zero or one label. If the label type does not match the template label type provided
+	// when creating the metric, an error will be returned.
+	Increment(label ...any) error


Ideally, metrics methods don't return any errors. Applications shouldn't need to consider errors from methods like this

I've removed the error, it will now log when it encounters a problem. Is that ok? Or should we instead panic?

That's probably sufficient. We shouldn't panic over logging issues
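A small sketch of the log-instead-of-return pattern agreed on here. The `countMetric` type below is a simplified stand-in backed by a map rather than Prometheus, and the injected `logf` field and `requestLabel` type are assumptions made for the example.

```go
package main

import (
	"fmt"
	"log"
)

// requestLabel is a hypothetical label type for this example.
type requestLabel struct{ Method string }

// countMetric is a simplified stand-in for the framework's count metric.
type countMetric struct {
	counts map[string]int
	logf   func(format string, args ...any)
}

// Increment accepts zero or one label. Problems are logged rather than
// returned, so callers never have to handle a metrics error.
func (m *countMetric) Increment(label ...any) {
	if len(label) > 1 {
		m.logf("expected at most one label, got %d", len(label))
		return
	}
	key := ""
	if len(label) == 1 {
		l, ok := label[0].(requestLabel)
		if !ok {
			m.logf("unexpected label type %T", label[0])
			return
		}
		key = l.Method
	}
	m.counts[key]++
}

func main() {
	m := &countMetric{counts: map[string]int{}, logf: log.Printf}
	m.Increment(requestLabel{Method: "GET"})
	m.Increment("wrong type") // logged; the caller has nothing to handle
	fmt.Println(m.counts["GET"])
}
```

The trade-off is that a mislabeled metric fails quietly, so the log line is the only signal that a call site is wrong.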

Signed-off-by: Cody Littley <cody@eigenlabs.org>

ian-shim · 2024-11-26T20:09:07Z

common/metrics/count_metric.go

+		l = label[0]
+	}
+
+	values, err := m.labeler.extractValues(l)


I see. Does m.vec.WithLabelValues(values...) handle empty values gracefully?

ian-shim · 2024-11-26T20:13:06Z

common/metrics/metrics.go

+	//
+	// The label parameter accepts zero or one label. If the label type does not match the template label type provided
+	// when creating the metric, an error will be returned.
+	Increment(label ...any) error


That's probably sufficient. We shouldn't panic over logging issues

cody-littley added 8 commits November 21, 2024 14:03

Created new metrics framework in common.

07f0cee

Signed-off-by: Cody Littley <cody@eigenlabs.org>

Got latency metrics working.

e444849

Signed-off-by: Cody Littley <cody@eigenlabs.org>

Added counter.

538470c

Signed-off-by: Cody Littley <cody@eigenlabs.org>

All metric types working.

1083c94

Signed-off-by: Cody Littley <cody@eigenlabs.org>

Added auto-gauge.

31c4a43

Signed-off-by: Cody Littley <cody@eigenlabs.org>

Add mock metrics.

7d826a8

Signed-off-by: Cody Littley <cody@eigenlabs.org>

Auto-generate metrics docs.

49b11e1

Signed-off-by: Cody Littley <cody@eigenlabs.org>

Pass unit as part of promethious metadata.

2d6c116

Signed-off-by: Cody Littley <cody@eigenlabs.org>

cody-littley requested review from jianoaix and ian-shim November 21, 2024 22:46

cody-littley self-assigned this Nov 21, 2024

dmanc reviewed Nov 22, 2024

Use ticker instead of sleeping.

fbc7bc1

Signed-off-by: Cody Littley <cody@eigenlabs.org>

cody-littley added 8 commits November 25, 2024 09:52

Made suggested changes.

02e710d

Signed-off-by: Cody Littley <cody@eigenlabs.org>

Improve documentation.

0217da7

Signed-off-by: Cody Littley <cody@eigenlabs.org>

lint

b0fd072

Signed-off-by: Cody Littley <cody@eigenlabs.org>

incremental progress

72834e3

Signed-off-by: Cody Littley <cody@eigenlabs.org>

Incremental progress.

8f28ec0

Signed-off-by: Cody Littley <cody@eigenlabs.org>

Add labels to counts

bcdd296

Signed-off-by: Cody Littley <cody@eigenlabs.org>

Finish new label system.

1a36247

Signed-off-by: Cody Littley <cody@eigenlabs.org>

Improve documentation for labels.

b28e075

Signed-off-by: Cody Littley <cody@eigenlabs.org>

Cleanup.

f9df03e

Signed-off-by: Cody Littley <cody@eigenlabs.org>

ian-shim reviewed Nov 26, 2024

cody-littley added 2 commits November 26, 2024 08:44

Made suggested changes.

04df0d0

Signed-off-by: Cody Littley <cody@eigenlabs.org>

Made suggested changes.

afe3120

Signed-off-by: Cody Littley <cody@eigenlabs.org>

ian-shim approved these changes Nov 26, 2024

cody-littley merged commit c28f42f into Layr-Labs:master Nov 26, 2024
6 checks passed
