Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Metrics framework #926

Merged
merged 20 commits into from
Nov 26, 2024
Merged

Conversation

cody-littley
Copy link
Contributor

Why are these changes needed?

This is a small wrapper around promethius that will significantly reduce boilerplate code when dealing with metrics. As an added perk, it is also capable of automatically generating a markdown containing documentation on all metrics the system is using.

Checks

  • I've made sure the tests are passing. Note that there might be a few flaky tests, in that case, please comment that they are not relevant.
  • I've checked the new test coverage and the coverage percentage didn't drop.
  • Testing Strategy
    • Unit tests
    • Integration tests
    • This PR is not tested :(

Signed-off-by: Cody Littley <cody@eigenlabs.org>
Signed-off-by: Cody Littley <cody@eigenlabs.org>
Signed-off-by: Cody Littley <cody@eigenlabs.org>
Signed-off-by: Cody Littley <cody@eigenlabs.org>
Signed-off-by: Cody Littley <cody@eigenlabs.org>
Signed-off-by: Cody Littley <cody@eigenlabs.org>
Signed-off-by: Cody Littley <cody@eigenlabs.org>
Signed-off-by: Cody Littley <cody@eigenlabs.org>
@cody-littley cody-littley self-assigned this Nov 21, 2024
Copy link
Contributor

@dmanc dmanc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would be good to validate if we can migrate an existing component to use the metrics framework. Or is the plan for only the newer V2 components to utilize it?

// for this list, use the format "metricName:metricLabel" if the metric has a label, or just "metricLabel"
// if the metric does not have a label. Any fully qualified metric name that matches exactly with an entry
// in this list will be blacklisted (i.e. it will not be reported).
MetricsBlacklist []string
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should metrics filtering be something that is taken care of by the application or the component that is collecting the metrics?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When we were preparing for the (since abandoned) traffic generator, this seemed like a useful concept. I had added a large number of metrics, but there was concern that some of them were low bang for the buck (since it costs us $$$ to store them). We were planning on disabling a bunch of metrics with configuration changes, and turning them on in the future if we ever had an issue where they would be useful for debugging.

That being said, if people don't think this is a useful feature, it would be fairly straight forward to remove. What do others think?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah all I'm saying is that it's probably possible to filter the metrics, logs in a similar manner with the grafana agent.

This is pretty useful now though because we haven't figure out how to do that in the grafana agent. Regardless, if we want to save more money we would need to learn how to do it in the grafana agent because we're not always running applications that allow metrics filtering in this manner.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense. I have removed the metrics blacklisting feature from this framework.

@cody-littley
Copy link
Contributor Author

@dmanc

Would be good to validate if we can migrate an existing component to use the metrics framework. Or is the plan for only the newer V2 components to utilize it?

My plan was to leave existing non-v2 code alone. This doesn't really change the core framework and backend, it's mostly just a convenience wrapper. Are there any existing things that use metrics that will persist after v2 is enabled and v1 is deprecated?

Signed-off-by: Cody Littley <cody@eigenlabs.org>
@dmanc
Copy link
Contributor

dmanc commented Nov 22, 2024

@dmanc

Would be good to validate if we can migrate an existing component to use the metrics framework. Or is the plan for only the newer V2 components to utilize it?

My plan was to leave existing non-v2 code alone. This doesn't really change the core framework and backend, it's mostly just a convenience wrapper. Are there any existing things that use metrics that will persist after v2 is enabled and v1 is deprecated?

The churner would be one example

Signed-off-by: Cody Littley <cody@eigenlabs.org>
Signed-off-by: Cody Littley <cody@eigenlabs.org>
Signed-off-by: Cody Littley <cody@eigenlabs.org>
Signed-off-by: Cody Littley <cody@eigenlabs.org>
Signed-off-by: Cody Littley <cody@eigenlabs.org>
Signed-off-by: Cody Littley <cody@eigenlabs.org>
Signed-off-by: Cody Littley <cody@eigenlabs.org>
Signed-off-by: Cody Littley <cody@eigenlabs.org>
@cody-littley
Copy link
Contributor Author

cody-littley commented Nov 25, 2024

@dmanc I don't think modifying the churner belongs as a part of this PR. Here's a draft PR that contains the changes to the churner. If we merge this PR, I will open another with just the changes to the churner's metrics: #934

The only difference between the existing churner metrics and the new churner metrics, after this change, will be that the metric eigenda_churner_requests will have a name change to eigenda_churner_request_count. This is because the new metric framework enforces a unit on each metric, and that unit gets appended to the name of the metric in Prometheus.

Note that there may be some deviations in the metrics framework code between this PR and the churner PR I link to above (since these branches are being worked in parallel). Please treat this PR as the source of truth for the metrics framework code.

Signed-off-by: Cody Littley <cody@eigenlabs.org>
v := reflect.ValueOf(labelTemplate)
t := v.Type()
labeler.templateType = t
for i := 0; i < t.NumField(); i++ {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This panics if labelTemplate is not a struct
We should check if v.Kind() == reflect.Struct

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good idea, done.

if v.Kind() != reflect.Struct {
    return nil, fmt.Errorf("label template must be a struct")
}

As an aside, add the reflection library to the list of things that make me annoyed at the people who designed golang. One of the core principals is that things should return errors, not panic. This is exactly the sort of situation where returning errors would be way better than panicking.

}

if !m.isAlive.Load() {
return errors.New("metrics server already stopped")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should it still call m.server.Close()?

Copy link
Contributor Author

@cody-littley cody-littley Nov 26, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unless I'm missing something, if we trigger this statement the server will have already been closed.

// labelMaker encapsulates logic for creating labels for metrics.
type labelMaker struct {
keys []string
emptyValues []string
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how is emptyValues used?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If a label is set up with a non-null template, but no labels are provided at runtime, then this emptyValues list is passed go prometheus. Prometheus returns an error if you don't pass in the expected number of flags.

func (l *labelMaker) extractValues(label any) ([]string, error) {
	// ...

	if label == nil {
		return l.emptyValues, nil
	}

We could create a new empty list each time, but I thought it would be more resource efficient to just reuse the same empty list over and over.

description string,
labelTemplate any) (CountMetric, error) {

labeler, err := newLabelMaker(labelTemplate)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should we only create labeler when labelTemplate is not nil?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The labeler becomes a no-op when the label template is nil. The purpose of using this pattern was to simplify the business logic a little. Instead of wrapping each use of the labeler in an if statement depending on whether the labeler is enabled or not, we can instead use the labeler in the same way regardless of whether or not we have a non-nil template.

This being said, if you don't like this pattern, let me know and I'll make the suggested change.

l = label[0]
}

values, err := m.labeler.extractValues(l)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what if metric has no labels?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The labeler handles this edge case.

  • When l is nil, m.labeler.extractValues(l) returns a list of empty strings with length equal to the number of flags in the template.
  • If the template is nil, m.labeler.extractValues(l) returns an empty list.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see. Does m.vec.WithLabelValues(values...) handle empty values gracefully?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

m.vec.WithLabelValues() requires that the number of provided values be exactly equal to the number of registered keys. It's ok if a value is an empty string, but the number of strings must match.

Comment on lines 97 to 98
nanoseconds := float64(latency.Nanoseconds())
milliseconds := nanoseconds / 1e6
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why don't we use Milliseconds()?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

converted to float64(time.Millisecond)

//
// The label parameter accepts zero or one label. If the label type does not match the template label type provided
// when creating the metric, an error will be returned.
Increment(label ...any) error
Copy link
Contributor

@ian-shim ian-shim Nov 26, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ideally, metrics methods don't return any errors. Applications shouldn't need to consider errors from methods like this

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've removed the error, it will now log when it encounters a problem. Is that ok? Or should we instead panic?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's probably sufficient. We shouldn't panic over logging issues

Signed-off-by: Cody Littley <cody@eigenlabs.org>
Signed-off-by: Cody Littley <cody@eigenlabs.org>
l = label[0]
}

values, err := m.labeler.extractValues(l)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see. Does m.vec.WithLabelValues(values...) handle empty values gracefully?

//
// The label parameter accepts zero or one label. If the label type does not match the template label type provided
// when creating the metric, an error will be returned.
Increment(label ...any) error
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's probably sufficient. We shouldn't panic over logging issues

@cody-littley cody-littley merged commit c28f42f into Layr-Labs:master Nov 26, 2024
6 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants