Metrics export pipeline + metrics stdout exporter #341
Conversation
Codecov Report
@@ Coverage Diff @@
## master #341 +/- ##
=======================================
Coverage 85.85% 85.85%
=======================================
Files 41 41
Lines 2078 2078
Branches 242 242
=======================================
Hits 1784 1784
Misses 223 223
Partials 71 71
Continue to review full report at Codecov.
As this is built on a WIP spec I don't see much of a reason to not approve it.
A couple thoughts:
- it would be really nice if there were some clear, concise examples. The unit tests seem to cover different pieces without giving me a full example of how I configure a meter in an app to actually record and export a value to console. That would be very helpful if the goal is to evaluate the API
- to that end, some sort of document that helps outline how to work on more and more complex cases would be helpful, even before implementing any more code. For example:
a. start with configuring a single, no dimensionality metric and exporting to console
b. single metric with dimensions (introduce label sets)
c. multiple metrics that need to be aggregated (aggregator).
I've left some thoughts, but it's very hard to evaluate the interface and usage without a common use case. Thoughts on making the next PR a full example of collecting application metrics (CPU / memory, thread count) so we can see how it'll actually work in practice?
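To make that concrete, here is roughly the shape of example I have in mind. This is a minimal sketch only: the module paths, the `ConsoleMetricsExporter` name, and the exact call signatures are my assumptions, while `Meter`, `PushController`, `create_metric`, `get_label_set`, and handles are names used in this PR.

```python
# Minimal sketch only: module paths, ConsoleMetricsExporter, and the exact
# call signatures are assumptions; Meter, PushController, create_metric,
# get_label_set and handles are the names used in this PR.
from opentelemetry.sdk.metrics import Counter, Meter
from opentelemetry.sdk.metrics.export import ConsoleMetricsExporter
from opentelemetry.sdk.metrics.export.controller import PushController

meter = Meter()
exporter = ConsoleMetricsExporter()
# Collect from the meter and push to stdout every 5 seconds.
controller = PushController(meter, exporter, 5)

requests_counter = meter.create_metric(
    "requests",            # name
    "number of requests",  # description
    "1",                   # unit
    int,                   # value type
    Counter,               # metric type
    label_keys=("environment",),
)

staging_labels = meter.get_label_set({"environment": "staging"})
# Record through a handle bound to the label set.
requests_counter.get_handle(staging_labels).add(25)
```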
@@ -220,7 +220,7 @@ def create_metric(
    metric_type: Type[MetricT],
    label_keys: Sequence[str] = (),
    enabled: bool = True,
-   monotonic: bool = False,
+   alternate: bool = False,
IMO "alternate" doesn't really correspond to the behavior. is this still in flux? what's the best way to give feedback if this is a spec-level decision?
You're correct. There are new terms for expressing default behaviour here.
@@ -298,9 +373,6 @@ def get_label_set(self, labels: Dict[str, str]):
    # Use simple encoding for now until encoding API is implemented
    encoded = tuple(sorted(labels.items()))
what's the purpose behind the label encodings? is this similar to the previous PR?
There isn't much of a performance improvement here, since the creation of the encoded cache key to even check against is quite expensive (iterate dict -> sort -> tuple).
Thinking through it, there's some level of merit in not recreating the same label set over and over again, but constructing the labels dictionary is itself quite expensive.
Correct, I believe the optimization is so that the same label set does not need to be calculated over and over again. Currently, the encoding used is a default placeholder encoding type. There has been talk about implementing a configurable encoder in the SDK, so users can configure the type of encoding appropriate to the exporter they want to send metrics to. There is no spec for it yet, but the Go implementation has created one. I think once we have this, vendors can pass in their own encoders with however they want to encode, so the iterate dict -> sort -> tuple algorithm does not always apply.
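For illustration, a configurable encoder might look something like the sketch below. This is purely hypothetical (there is no spec yet), loosely modeled on the Go SDK's label encoder idea.

```python
import abc
from typing import Dict


class LabelEncoder(abc.ABC):
    """Hypothetical pluggable encoder: turns a label dict into a hashable key."""

    @abc.abstractmethod
    def encode(self, labels: Dict[str, str]) -> str:
        """Return a stable, deterministic encoding of the labels."""


class DefaultLabelEncoder(LabelEncoder):
    """Same idea as the current placeholder: deterministic ordering by key."""

    def encode(self, labels: Dict[str, str]) -> str:
        return ",".join("{}={}".format(k, v) for k, v in sorted(labels.items()))


# A vendor could pass in its own encoder (e.g. one matching its exporter's
# wire format), so the sort-based default above is not a fixed cost for everyone.
```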
def __init__(self):
    self.current = None
    self.check_point = None
    self._lock = threading.Lock()
is a lock the right thing to put in the base class? I understand most implementations will want that, but it's not particularly related to the interface, and we may be needlessly locking if we have an async aggregator or use atomic integers and doubles.
I'm not sure if an async aggregator makes sense, as we would want to export metric values at a certain point in time. With that being said, all implementations would need a way to handle concurrent updates, so I would think putting the lock in the base class makes sense.
More generally, I think there's a good reason for keeping implementation out of ABCs even though it's technically allowed. We addressed (but didn't resolve) this in #311 (comment).
In this particular case I still don't think it makes sense to instantiate the lock here. This (half-abstract) implementation doesn't use it, and it's not part of the class' API.
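For example (a sketch of the alternative, reusing the attribute names from the diff, not the code in this PR): the ABC declares the interface only, and the concrete aggregator decides that a lock is its concurrency strategy.

```python
import abc
import threading


class Aggregator(abc.ABC):
    """Interface only: no lock or other concurrency machinery baked in."""

    @abc.abstractmethod
    def update(self, value):
        """Record a new value."""

    @abc.abstractmethod
    def checkpoint(self):
        """Snapshot the accumulated value for export."""


class CounterAggregator(Aggregator):
    """Concrete aggregator that happens to choose locking."""

    def __init__(self):
        self.current = 0
        self.check_point = 0
        self._lock = threading.Lock()  # implementation detail, not part of the ABC

    def update(self, value):
        with self._lock:
            self.current += value

    def checkpoint(self):
        with self._lock:
            self.check_point = self.current
            self.current = 0
```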
self.current += value

def checkpoint(self):
    # TODO: Implement lock-free algorithm for concurrency
to completely contradict myself earlier: is it worth implementing a lock-free algorithm here? checkpoint shouldn't be called particularly often, and lock-free algorithms are easy to get wrong and can often lead to worse performance.
There is no spec for how to implement `checkpoint` to be concurrent. There has been talk that using a lock would be too slow, but I agree that lock-free algorithms are prone to error and might not guarantee better performance. I think I might change this to a question, rather than a TODO.
self.current = 0

def merge(self, other):
    self.check_point += other.check_point
should current be updated as well? seems like this addition would be cleared immediately by the next call to checkpoint, which will match what exists in current.
`merge` is usually called right after `checkpoint()` in the `process` step in a `Batcher`. This handles the race condition where an update to the metric happens at the same time as `process`. `merge` updates the `check_point` value to prepare for exporting, while `current` is left untouched (but changed by the update separately, to be exported later on).
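A small timeline of that sequence, with locking omitted for brevity (the attribute names follow the diff; the surrounding calls are just illustrative):

```python
class CounterAggregator:
    """Simplified illustration only; locking omitted."""

    def __init__(self):
        self.current = 0
        self.check_point = 0

    def update(self, value):
        # Called by application threads; only touches `current`.
        self.current += value

    def checkpoint(self):
        # Move the accumulated value into the snapshot that will be exported.
        self.check_point = self.current
        self.current = 0

    def merge(self, other):
        # Fold another aggregator's snapshot into this one; `current` is left
        # alone, so a racing update is simply picked up in the next interval.
        self.check_point += other.check_point


handle_agg = CounterAggregator()   # lives on a metric handle
batch_agg = CounterAggregator()    # lives in the batcher's batch map

handle_agg.update(3)
handle_agg.checkpoint()            # check_point == 3, current reset to 0
handle_agg.update(2)               # racing update lands in `current`
batch_agg.merge(handle_agg)        # batch_agg.check_point == 3; the 2 exports next cycle
```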
# Collect all of the meter's metrics to be exported
self.meter.collect()
# Export the given metrics in the batcher
self.exporter.export(self.meter.batcher.check_point_set())
is there a design concern with calling batcher methods directly for things like setting checkpoints and declaring the collection as finished, but calling the more general meter on collect?
I guess to be clearer: are there operations that the general meter should perform, on top of just the batcher?
The architecture is designed so that the `meter` handles the "metric related events" such as construction, updates and label sets, while the `batcher` is responsible for collection and aggregation. Yes, the meter could technically perform all the instructions handled by the batcher, but this separation of responsibilities seems like a good logical pattern. What design concerns are you thinking of specifically?
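So a single collection cycle in the controller ends up looking roughly like the sketch below; `collect()` and `check_point_set()` are from the snippet above, while `finished_collection()` is only a placeholder name for the "collection is finished" step.

```python
class PushController:
    """Sketch of one collection cycle; finished_collection() is a placeholder
    name for the "collection is finished" notification discussed above."""

    def __init__(self, meter, exporter):
        self.meter = meter
        self.exporter = exporter

    def tick(self):
        # Meter's responsibility: walk its metrics/handles ("metric events")
        # and hand their aggregators to the batcher.
        self.meter.collect()
        # Batcher's responsibility: expose the aggregated checkpoint for export.
        self.exporter.export(self.meter.batcher.check_point_set())
        # Tell the batcher the cycle is over so it can reset per-interval state.
        self.meter.batcher.finished_collection()
```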
aggregator.update(1.0)
label_set = {}
batch_map = {}
batch_map[(metric, "")] = (aggregator, label_set)
it looks like this block sets a very specific expectation around the format of the batch_map. In general, I feel like this is a bit error-prone as assigning the value requires knowing the right pair of tuples to set as the key and the value.
Is there some refactoring that can be done to not require direct manipulation of the batch_map, providing helper methods instead? that way, there's clear documentation on the structure of the key-value pairs, or hopefully one doesn't have to set the key-value pairs at all.
Good point. I think adding a helper method to assign values to the batch map would be good.
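Something along these lines, perhaps; the helper names are hypothetical, and `encoded` refers to the field described in the PR summary where `LabelSet` stores its encoded value.

```python
class UngroupedBatcher:
    """Sketch only: the helper names here are hypothetical."""

    def __init__(self):
        # (metric, encoded label set) -> (aggregator, label_set)
        self._batch_map = {}

    def put_aggregator(self, metric, label_set, aggregator):
        # Callers never need to know the key/value tuple layout.
        self._batch_map[(metric, label_set.encoded)] = (aggregator, label_set)

    def get_aggregator(self, metric, label_set):
        entry = self._batch_map.get((metric, label_set.encoded))
        return entry[0] if entry else None
```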
EDIT: There seem to be some problems with circular imports (since `__init__.py` depends on `batcher.py`, and `batcher.py` would have to depend on `__init__.py` for typing). Sphinx also complains since the forward refs aren't able to be found. Any ideas on how to fix this?
Where do you see the circular imports? In general this can be solved by factoring out the circular component into its own module, and having the circular importees import that instead.
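If the dependency is only needed for type annotations, another common option is `typing.TYPE_CHECKING` with string annotations; a sketch (the annotated signature here is just an example):

```python
# batcher.py (sketch): the import only happens during static type checking,
# never at runtime, so the runtime circular import goes away.
from typing import TYPE_CHECKING

if TYPE_CHECKING:
    from opentelemetry.sdk.metrics import Metric


class UngroupedBatcher:
    def process(self, metric: "Metric") -> None:
        ...
```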
@toumorokoshi
I've added some more examples under
It's a partial review, I haven't finished but I don't want to risk losing my comments.
So far I am concerned about two points:
- I don't see usage of locks / atomic types to handle concurrency.
- Handles are not being released.
A few nits, nothing to worry about.
I think it is a good base to move ahead. It would be nice if we could have a list of tasks to be done after this is merged, like implementing other aggregators, etc.
self.tick()

def shutdown(self):
    self.finished.set()
Should the controller perform the last `tick()` before shutting down?
Good question. There isn't clearly defined behaviour yet for flushing metrics on application close. I think this is a discussion for the spec.
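For reference, the suggested flush-on-shutdown would roughly be (a sketch using the attribute names from the diff above):

```python
# Sketch of the suggested flush-on-shutdown, using the attribute names from
# the diff above; whether a final tick is required is still a spec question.
def shutdown(self):
    self.finished.set()
    # One last collect/export so values recorded since the previous
    # interval are not silently dropped.
    self.tick()
```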
@mauriciovasquezbernal Thanks so much for reviewing! Can you approve if everything else looks okay?
Last point: if `tick()` is not added in #341 (comment), then a sleep has to be added to the record example, otherwise it won't print anything. Besides that, LGTM!
Some drive-by comments on units in the examples.
I opened lzchen#7 to fix the default meter type. A better long term fix might be to pass both the abstract and default types to the loader.
Make default meter a DefaultMeter
No more blocking comments from me, I'll follow up with a comment about my lingering design concerns.
Initial implementation of the end-to-end metrics pipeline.
This completes an end-to-end metrics export pipeline. It uses a lot of elements from the Go implementation as well as the WIP spec.

The new files added relate to the export pipeline: `batcher.py`, `aggregate.py` and `controller.py`, which contain the implementations for the `Batcher`, `Aggregator` and `Controller` classes (see this diagram for an overview of how the pieces work together). Currently, only the `UngroupedBatcher` and `CounterAggregator` are implemented. This also means that there are no `AggregationSelector`s, since aggregations for `Measure` have not been implemented. A `PushController` is also included, which is a worker thread that continuously initiates collection of metrics and calls the exporter. `metrics_example.py` in the examples folder shows how to use these components.

Some SDK changes:
- `Metric`s now have a reference to the `Meter` that created them. This is used for constructing an aggregator for the specific metric type when creating a handle.
- `LabelSet` has a field for its encoded value.