Metrics SDK work-in-progress #172
Conversation
If grouping is done in the exporter, then all exporters would have to implement it. Alternatively, it could be done by an optional metric processor that can be shared by exporters.
@rghetia Grouping requires configuration; it's effectively a "views" API to configure which groupings you want to compute. It's fine to call this a metrics processor too, but it's really a metrics views processor. It should export grouped aggregates and be an optional part of an export pipeline. To export full cardinality, it goes SDK -> Exporter. To export group-by with custom dimensions, it goes SDK -> Metrics Views processor -> Exporter.
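For illustration, a minimal hypothetical sketch of the two pipelines; the Record, Batcher, metricsdk.New, and NewViewsProcessor names are assumptions, not part of this PR:

import "context"

// Record stands in for one exported (descriptor, label set, aggregate) entry.
type Record struct {
	Name      string
	Labels    map[string]string
	Aggregate float64
}

// Batcher is the shared stage interface: both the views processor and the
// exporter implement it, so grouping becomes an optional pipeline stage.
type Batcher interface {
	Process(ctx context.Context, records []Record)
}

// Full cardinality:  SDK -> Exporter
//   sdk := metricsdk.New(exporter)
// Grouped views:     SDK -> Metrics Views processor -> Exporter
//   sdk := metricsdk.New(NewViewsProcessor([]string{"service"}, exporter))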
I am working on restoring this, sorry for the confusing close.
This SDK is not a complete metrics exporter, but it implements counters and gauges and a "MaxSumCount" aggregation which I intend to demonstrate for measure metrics.
This first point deserves attention: there are no locks here, which means race conditions can cause the system to create multiple records for the same instrument and label set. This is harmless; it just means the exporter has to be aware that the inputs are not de-duplicated. @rghetia @bogdandrutu please take a look.
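As a sketch of what that awareness could look like on the exporter side (reusing the hypothetical Record shape from the sketch above; the sum-style merge is also an assumption):

import "sort"

// mergeDuplicates combines records that the lock-free SDK may have split
// across several entries for the same instrument and label set.
func mergeDuplicates(records []Record) []Record {
	byKey := map[string]Record{}
	for _, r := range records {
		// Build a canonical key from the instrument name and sorted labels.
		keys := make([]string, 0, len(r.Labels))
		for k := range r.Labels {
			keys = append(keys, k)
		}
		sort.Strings(keys)
		key := r.Name
		for _, k := range keys {
			key += "|" + k + "=" + r.Labels[k]
		}
		if prev, ok := byKey[key]; ok {
			r.Aggregate += prev.Aggregate // merge rather than overwrite
		}
		byKey[key] = r
	}
	out := make([]Record, 0, len(byKey))
	for _, r := range byKey {
		out = append(out, r)
	}
	return out
}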
The new test is too large for CI. I'll adjust next week.
Did a first review round.
for {
	gd := (*gaugeData)(atomic.LoadPointer(&g.live))
	// ...
	if descriptor.ValueKind() == api.Int64ValueKind {
I think that if you add the exported NewMeasurementValueFromRaw (it is currently unexported as newFromRaw), or just use MeasurementValue inside gaugeData, then you could avoid the duplication with:
for {
	gd := (*gaugeData)(atomic.LoadPointer(&g.live))
	// ngd was left undeclared in the original suggestion; it is the
	// replacement gaugeData, so its value is set unconditionally.
	ngd := &gaugeData{}
	if gd != nil {
		old := api.NewMeasurementValueFromRaw(gd.value)
		if old.RawCompare(value.AsRaw(), descriptor.ValueKind()) > 0 {
			// TODO: warn
		}
	}
	ngd.value = value.AsRaw()
	if atomic.CompareAndSwapPointer(&g.live, unsafe.Pointer(gd), unsafe.Pointer(ngd)) {
		return
	}
}
Nice suggestion.
atomic.AddUint64(&c.live.count, 1)
// ...
if descriptor.ValueKind() == api.Int64ValueKind {
How about:
atomic.AddUint64(&c.live.count, 1)
if descriptor.ValueKind() == api.Int64ValueKind {
	internal.NewAtomicInt64(&c.live.sum).Add(value.AsInt64())
} else {
	internal.NewAtomicFloat64(&c.live.sum).Add(value.AsFloat64())
}
for {
	var current api.MeasurementValue
	if descriptor.ValueKind() == api.Int64ValueKind {
		cur := internal.NewAtomicInt64(&c.live.max).Load()
		current = api.NewInt64MeasurementValue(cur)
	} else {
		cur := internal.NewAtomicFloat64(&c.live.max).Load()
		// The float branch needs the float constructor; the original
		// suggestion used NewInt64MeasurementValue here by mistake.
		current = api.NewFloat64MeasurementValue(cur)
	}
	if value.RawCompare(current.AsRaw(), descriptor.ValueKind()) <= 0 {
		break
	}
	if atomic.CompareAndSwapUint64(&c.live.max, current.AsRaw(), value.AsRaw()) {
		break
	}
}
But in general I have the impression that we should either merge metric.MeasurementValue and the atomic helpers into something like core.Value, or just bite the bullet and have only float64 values. In the first case the code could look like:
// count, max and sum in c.live are of the hypothetical core.Value type
c.live.count.AddAtomic(core.Int64ValueKind, 1)
c.live.sum.AddAtomic(descriptor.ValueKind(), value)
for {
	currentRaw := c.live.max.LoadAtomic()
	if value.RawCompare(currentRaw, descriptor.ValueKind()) <= 0 {
		break
	}
	// or: if c.live.max.CompareAndSwap(currentRaw, value.AsRaw())
	if atomic.CompareAndSwapUint64(c.live.max.AsRawPtr(), currentRaw, value.AsRaw()) {
		break
	}
}
I like the idea of merging the atomic helpers into MeasurementValue.
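For concreteness, a sketch of what the merged type might look like; the names and the single-raw-uint64 layout are assumptions following the discussion above, not an agreed design:

package core

import (
	"math"
	"sync/atomic"
)

type ValueKind int

const (
	Int64ValueKind ValueKind = iota
	Float64ValueKind
)

// Value merges MeasurementValue with the atomic helpers: one raw uint64
// holding either an int64 or a float64 bit pattern.
type Value struct{ raw uint64 }

func (v *Value) AsRawPtr() *uint64  { return &v.raw }
func (v *Value) LoadAtomic() uint64 { return atomic.LoadUint64(&v.raw) }

// AddAtomic adds delta (raw bits) to v, interpreting both according to kind.
func (v *Value) AddAtomic(kind ValueKind, delta uint64) {
	for {
		old := atomic.LoadUint64(&v.raw)
		var next uint64
		if kind == Int64ValueKind {
			next = uint64(int64(old) + int64(delta))
		} else {
			next = math.Float64bits(math.Float64frombits(old) + math.Float64frombits(delta))
		}
		if atomic.CompareAndSwapUint64(&v.raw, old, next) {
			return
		}
	}
}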
sdk/metrics/sdk.go (outdated)
}
sorted = sorted[0:oi]
// ...
// Serialize.
I wonder if this step could be forwarded to the exporter, so it can prepare the labels in its own way? For example, the prometheus exporter would likely do some validation/transformation of the labels and their values, so they can be quickly passed to CounterVec/GaugeVec.
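One way to sketch that hand-off; the LabelEncoder hook and every name here are hypothetical, not part of this PR:

import "strings"

// KeyValue is a stand-in for the API's label pair type.
type KeyValue struct {
	Key, Value string
}

// LabelEncoder lets each exporter serialize a sorted label set its own way.
type LabelEncoder interface {
	Encode(sorted []KeyValue) string
}

// promEncoder sketches Prometheus-style preparation: label names are
// sanitized up front so handles map cheaply onto CounterVec/GaugeVec.
type promEncoder struct{}

func (promEncoder) Encode(sorted []KeyValue) string {
	var b strings.Builder
	for i, kv := range sorted {
		if i > 0 {
			b.WriteByte(',')
		}
		b.WriteString(sanitizeLabelName(kv.Key))
		b.WriteByte('=')
		b.WriteString(kv.Value)
	}
	return b.String()
}

// sanitizeLabelName rewrites characters Prometheus rejects to '_'.
func sanitizeLabelName(s string) string {
	return strings.Map(func(r rune) rune {
		switch {
		case r >= 'a' && r <= 'z', r >= 'A' && r <= 'Z', r >= '0' && r <= '9', r == '_':
			return r
		}
		return '_'
	}, s)
}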
Right. I was definitely thinking that the exporter would want to configure this. I'll mention it as a TODO. Again, it only works when there is one exporter.
That said, I'm not sure we should use the Prometheus client libraries directly to implement a Prometheus exporter; it might not be very efficient. I've begun talking with an engineer at chronosphere.io who is interested in contributing a Prometheus exporter and has a lot of relevant experience. I will introduce this person in the Thursday SIG meeting.
sdk/metrics/sdk.go (outdated)
// DeleteHandle removes a reference on the underlying record. This
// method checks for a race with `Collect()`, which may have (tried to)
// reclaim an idle recorder, and restores the situation.
func (m *SDK) DeleteHandle(r api.Handle) {
I think that the exporter should be told about the handle being deleted; it might need to do some cleanup of its own.
The prometheus libraries support a Delete() on their handles, but it's not meant to be used in this way. Prometheus essentially expects you never to remove handles from an exporter, so this is a grey area. I think exporters should take care of this problem on their own. A push exporter probably won't have to worry about it.
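For reference, the shape of that call in client_golang; whether an exporter should wire this to DeleteHandle is the open question, and onDeleteHandle is just a sketch:

import "github.com/prometheus/client_golang/prometheus"

// onDeleteHandle shows the client-library call an exporter could make if
// it chose to propagate handle deletion; Prometheus itself does not
// really expect series to disappear.
func onDeleteHandle(vec *prometheus.CounterVec, labels prometheus.Labels) {
	_ = vec.Delete(labels) // false means no series with these labels existed
}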
// MetricBatcher is responsible for deciding which kind of aggregation
// to use and gathering results during the SDK Collect().
type MetricBatcher interface {
In general we may have two modes of collecting the metrics: the push model and the pull model. I'm trying to imagine how to support both.
For the push model, I suppose the application can set up some goroutine that will call SDK.Collect() at some interval. That would result in metrics being pushed somewhere at those intervals.
In the pull model, I imagine that the collection should be controlled by the exporter, because the exporter knows when the request for the metrics came. So I'm not sure if this needs to be reflected in the API (like in the MetricBatcher interface) or can be done in an exporter-specific way like:
exporter := somepkg.NewPushExporter(someConfig)
sdk := metricsdk.New(exporter)
exporter.OnRequest(func() {
sdk.Collect(ctx)
})
What do you think?
I think as long as there is only one exporter, it should be up to the exporter to configure collection. If push, the exporter can use a timer to establish a period. If pull, the incoming request will dictate the period.
I don't think it is practically very important to support multiple simultaneous metrics exporters, and I admit this leaves a bit of a hole since surely someone will invent a reason.
I'm comfortable leaving a bit of ambiguity here, as these questions are mostly driven by existing statsd and prometheus systems, and figuring out how often to call Collect() is relatively easy compared with the other open questions.
The bigger question is how we will configure the metrics to export: which dimensions to use, which format to expose (e.g., histogram vs summary). There is a desire to move towards the OpenMetrics format for pushing data, and I want to push for a protocol for requesting a current push configuration. Ideally, we would implement exporters in the OpenTelemetry collector.
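A minimal sketch of the push case under those assumptions; the Collector interface just stands in for whatever exposes the SDK's Collect:

import (
	"context"
	"time"
)

// Collector stands in for the SDK's Collect entry point.
type Collector interface {
	Collect(ctx context.Context)
}

// runPushLoop is what a push exporter might start internally: a ticker
// establishes the period, and each tick drives one collection/export.
func runPushLoop(ctx context.Context, sdk Collector, period time.Duration) {
	ticker := time.NewTicker(period)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			sdk.Collect(ctx)
		}
	}
}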
> I think as long as there is only one exporter, it should be up to the exporter to configure collection. If push, the exporter can use a timer to establish a period. If pull, the incoming request will dictate the period.

I think I saw a comment somewhere that the application should call the SDK.Collect() function. This will need changing then.
> I don't think it is practically very important to support multiple simultaneous metrics exporters, and I admit this leaves a bit of a hole since surely someone will invent a reason.

That arguably could be supported by writing an exporter that accepts some "subexporters". Something like (strawman follows):
pushExporters := []super.PushExporter{someSuperPushExporterInterfaceImplementation()}
pullExporters := []super.PullExporter{someSuperPullExporterInterfaceImplementation()}
ex := super.NewExporter(pushExporters, pullExporters)
sdk := metric.New(ex)
That would maybe allow an optimization of doing the metric accounting work once for all the push exporters, but every pull exporter would need to do the accounting on their own.
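Continuing the strawman, the push side of such a super-exporter could be a plain fan-out; Exporter and Record here are the assumed shapes from the earlier sketches:

import "context"

// Exporter is the assumed single-exporter interface.
type Exporter interface {
	Export(ctx context.Context, records []Record)
}

// multiExporter fans one Collect result out to every push sub-exporter,
// so the SDK-side accounting happens only once.
type multiExporter struct {
	push []Exporter
}

func (m *multiExporter) Export(ctx context.Context, records []Record) {
	for _, e := range m.push {
		e.Export(ctx, records)
	}
}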
> The bigger question is how we will configure the metrics to export: which dimensions to use, which format to expose (e.g., histogram vs summary). There is a desire to move towards the OpenMetrics format for pushing data, and I want to push for a protocol for requesting a current push configuration. Ideally, we would implement exporters in the OpenTelemetry collector.

That feels like something an exporter should worry about, not the API or SDK, right? Or are you pushing for some single unified SDK code for dealing with it?
> That would maybe allow an optimization of doing the metric accounting work once for all the push exporters, but every pull exporter would need to do the accounting on their own.

Separate accounting is only required if an exporter requires deltas (between two pulls). If the accounting is all cumulative, then a single accounting works for both push and pull.
If the accounting is common, then the configuration also has to be common, not associated with any one exporter.
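A sketch of that exporter-side cumulative accounting; Record as in the earlier sketches, and keying by name only plus plain summing are simplifications:

// cumulative turns the SDK's delta records into running totals, which
// serves push and pull exporters alike.
type cumulative struct {
	sums map[string]float64
}

func (c *cumulative) apply(records []Record) []Record {
	out := make([]Record, 0, len(records))
	for _, r := range records {
		key := r.Name // a real implementation would include encoded labels
		c.sums[key] += r.Aggregate
		r.Aggregate = c.sums[key]
		out = append(out, r)
	}
	return out
}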
I do not think we will end up with a common configuration, after consulting several people on the topic. I don't think people really want to run multiple metrics exporters; they would prefer to export to the otel-collector and let it export to multiple destinations.
I think we should come up with a configuration protocol, but it would be used specifically for the client-to-OTel-collector connection.
From the prometheus point of view (after writing an initial prometheus exporter), the SDK already does some work that the prometheus client library does too. I suppose the prometheus exporter would not use the client library, to avoid the duplication of work. Instead the exporter could try to write a reply to the scrape request directly.
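A sketch of that direct reply, writing the Prometheus text exposition format by hand; Record as in the earlier sketches, with labels and non-counter types omitted for brevity:

import (
	"fmt"
	"net/http"
)

// writeScrape answers a scrape without the client library by emitting
// the text exposition format directly.
func writeScrape(w http.ResponseWriter, records []Record) {
	w.Header().Set("Content-Type", "text/plain; version=0.0.4")
	for _, r := range records {
		fmt.Fprintf(w, "# TYPE %s counter\n", r.Name)
		fmt.Fprintf(w, "%s %v\n", r.Name, r.Aggregate)
	}
}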
I'm responding w/o pushing any code changes at this time. I completely agree with your remarks about MeasurementValue. The remaining open questions are going to take more work to resolve.
I'm working on some benchmarks to establish the performance of this approach and compare it with other libraries.
These are the current benchmark results:
@krnowak Note there are 3 allocs per GetHandle in the benchmarks above.
I think it's too expensive :)
Mostly small nits I can fix myself quickly. But something in the aggregators struck me as odd.
What kind of accounting do the aggregators perform?
Counters/MSC seem to accumulate between pulls/collections/exports and then reset the counters/sums/maxes to zero, which feels like delta accounting, right?
The gauge aggregator also resets its value to zero on collect, which makes me think it is possible to have a monotonic gauge, set it to 200, wait for a collect (so it is reset to 0), and then set it to 100, which results in no warning from the aggregator despite the new value being lower than the old one.
It's likely some misunderstanding on my side of how a metrics exporter should work and what the agents that consume the data from exporters expect.
Pushed two commits to fix the nits I've raised.
@krnowak Your confusion about the aggregators is justified. My goal has been to allow the SDK to "forget" entries eventually; I consider this more important than enforcing the semantics of monotonic gauges. The gauge aggregator makes a best effort to implement the monotonic property, but at some level this is only a semantic declaration that the metrics consumer will use. I would add a comment saying that in order to get 100% enforcement of monotonicity, the caller must use a metric handle, which ensures the aggregator is pinned in memory and will correctly enforce monotonicity.
I also have a concern about resetting it to zero. If you have multiple exporters/collectors then this will not work. Granted, we don't support multiple exporters at this point, so it is not an issue. The other thing I tried was to write a stdout exporter based off of this PR. I could not write a simple stdout exporter like the one we have for trace. I think we need a common aggregator that outputs aggregated metrics in some metric data model. (I mentioned this in an earlier comment but for some reason I can't find it.)
@rghetia My position is that the SDK should be allowed to be stateless. Exporters that want to retain memory and compute sums are able to do so. The SDK is entirely delta-oriented. Multiple exporters can be implemented by some kind of multiplexing exporter; I'd like to call that out-of-scope. There are changes pending in the next PR, currently here: jmacd#4, which will make a stdout metrics exporter easy to implement.
When I think about a stdout exporter, I would be inclined to print one event per metric event. I don't think that's what you're asking for? It sounds like you'd like the stdout exporter to maintain state and print sums? That's do-able, but wasteful IMO.
Even if you use the metric handle to pin the aggregator in memory, the value of the gauge is still being reset to zero on every collection, so it is still possible to break the monotonicity of the gauge with something like:

exporter := someExporter()
sdk := sdkmetric.New(exporter)
gauge := sdk.NewInt64Gauge("gauge", metrics.WithMonotonic(true))
labels := sdk.Labels()
handle := gauge.AcquireHandle(labels)
handle.Set(200)
sdk.Collect() // or call exporter.ForceCollection() or something
handle.Set(100)
@krnowak Ah, right. I'll fix this.
I couldn't understand everything, but I wanted to read this PR to learn about the metrics SDK. I've asked one question and left some nits.
Btw, this non-blocking algorithm seems really cool :)
"unsafe" | ||
) | ||
|
||
func (l *sortedLabels) Len() int { |
Why are these methods not in the same file where their types are declared?
I feel like they were cluttering up the main logic. This file is full of trivial list manipulations. If you still think I should move these, happy to.
Hmm, I see. Would moving these types to other files be an option? It would be a little strange, since these types are specifically used in the SDK, but maybe better than a huge sdk file.
I'm fine with you splitting these or keeping it this way.
@jmacd: The fix looks good. Thanks.
Other reviewers pointed out some nits in the docs. With them fixed, I think this PR is good to be merged.
Out-of-scope for this PR? If yes, I agree.
I'll take a look.
I was thinking of a stateful SDK where aggregation is common across multiple exporters, so one could swap in a stdout exporter to quickly try things out. But if the SDK is going to be stateless, then I agree that stdout should simply print the events.
This PR branches off of #100 and adds a metrics SDK.
It's missing a lot, but I thought it would be best to post this now, by way of assisting with #100 as well as the associated spec change open-telemetry/opentelemetry-specification#250, and it directly addresses part of spec issue open-telemetry/opentelemetry-specification#259, which is that it's difficult to fully understand the API specification without seeing a real SDK.
This is very incomplete; for example, it needs:
More changes will be needed to accommodate measures. Before the work in open-telemetry/opentelemetry-specification#259 can be finished, I'd like to complete this prototype with:
There is one issue worth considering (@bogdandrutu). This SDK aggregates metrics by (DescriptorID, LabelSet) in the SDK itself but does not "fold" for the group-by operation; i.e., it does not apply the group-by or support selecting a subset of aggregation keys directly. My idea is to implement group-by in the exporter. It's simpler this way, since it's not on the critical path for callers. This design could be extended with further complexity to address the group-by up front, but I would like to be convinced it's worthwhile.
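To make the proposed division of labor concrete, a sketch of the exporter-side fold; Record as in the earlier sketches, with summing standing in for real aggregate merging:

// groupBy drops label keys outside the configured set and merges
// aggregates that collapse onto the same (name, kept-labels) key.
func groupBy(keep []string, records []Record) []Record {
	grouped := map[string]*Record{}
	for _, r := range records {
		kept := map[string]string{}
		key := r.Name
		for _, k := range keep {
			if v, ok := r.Labels[k]; ok {
				kept[k] = v
				key += "|" + k + "=" + v
			}
		}
		if g, ok := grouped[key]; ok {
			g.Aggregate += r.Aggregate
		} else {
			grouped[key] = &Record{Name: r.Name, Labels: kept, Aggregate: r.Aggregate}
		}
	}
	out := make([]Record, 0, len(grouped))
	for _, g := range grouped {
		out = append(out, *g)
	}
	return out
}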