
Various FluxMetric suggested improvements #1242

Closed
smaldini opened this issue Jun 15, 2018 · 4 comments

smaldini commented Jun 15, 2018

I had a chance to stress-test the metrics story at greater length, and I would like to propose a few improvements:

  • Produce latency buckets, maybe with distribution support via MeterFilter; this might need a Reactor-scoped MeterRegistry added into the global registry. It should be available as an option, like in Spring Boot, in order to macro-monitor the latency distribution of a whole sequence (Mono/Flux) or of onNext signals (Flux). It looks like this in Boot (note the "le" tag):
    http_client_requests_seconds_bucket{clientName="localhost",method="GET",status="IO_ERROR",uri="/details/{ticker}",le="12.884901886",} 254.0

  • I think the request metrics are not correct, as we use totals (requestCounter). They should be computed per Subscriber and then added to the distribution, e.g. by keeping track of the interval between two requests, with the requested volume as a tag, so a metrics query can compute requested data per second.

  • Add arguments, or any other means, to exclude some calculations (not too fine-grained)

  • Document/provide a JSON template to get started, e.g. like https://github.com/micrometer-metrics/micrometer/tree/master/scripts/spring-dash

  • Mono might not need all the metrics; in fact it only needs one: mono_duration. The malformed-source metric can be useful too, separately, but there is no need for an onNext delay.

  • It would be interesting to tag errors like Spring Boot does, with exception=ClassName

  • Rework the labels used to remove the reactor_* prefix. See below for an example Prometheus trace:

reactor_subscribe_to_terminate_seconds_count{reactor_sequence_name="http://localhost:8082/details/{ticker}",reactor_sequence_type="Mono",reactor_termination_type="cancel",ticker="GOOG",} 0.0

Could be

mono_duration_seconds_count{flow="http://localhost:8082/details/{ticker}",status="cancelled",ticker="GOOG",} 0.0
@smaldini smaldini added the type/enhancement A general enhancement label Jun 15, 2018
@simonbasle simonbasle added this to the Backlog milestone Jun 15, 2018
simonbasle commented Jun 15, 2018

@jkschneider I'd be interested in your feedback on the last point. Are prefixes like that generally a good or a bad idea? Does using very generic tag names, like status, put us at risk of colliding unrelated data together?

@simonbasle simonbasle self-assigned this Jun 15, 2018
simonbasle commented Jun 15, 2018

  • Produce latency buckets, maybe with distribution support via MeterFilter; this might need a Reactor-scoped MeterRegistry added into the global registry. It should be available as an option, like in Spring Boot, in order to macro-monitor the latency distribution of a whole sequence (Mono/Flux) or of onNext signals (Flux). It looks like this in Boot (note the "le" tag):

My interpretation is that MeterFilter and the activation of histograms are an external concern, not one that an instrumented library like Reactor should decide on. @jkschneider any insight?

  • I think the request metrics are not correct, as we use totals (requestCounter). They should be computed per Subscriber and then added to the distribution, e.g. by keeping track of the interval between two requests, with the requested volume as a tag, so a metrics query can compute requested data per second.

We use a DistributionSummary, to which we provide the regular request counts; the DistributionSummary also records the timings, and therefore the rate at which these requests come in (a rough sketch of that usage follows).
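
A minimal sketch of that usage, assuming a plain Micrometer registry (the meter name and "flow" tag are illustrative, not Reactor's actual internals):

```java
import io.micrometer.core.instrument.DistributionSummary;
import io.micrometer.core.instrument.MeterRegistry;
import io.micrometer.core.instrument.simple.SimpleMeterRegistry;

public class RequestRateSketch {
    public static void main(String[] args) {
        MeterRegistry registry = new SimpleMeterRegistry();

        // one summary per named sequence; "reactor.requested" is hypothetical
        DistributionSummary requested = DistributionSummary
                .builder("reactor.requested")
                .tag("flow", "http://localhost:8082/details/{ticker}")
                .register(registry);

        // each record() call adds one request's demand to the distribution;
        // a backend can then derive a requested-amount-per-second rate
        requested.record(32);
        requested.record(Long.MAX_VALUE); // an unbounded request dwarfs the rest

        System.out.println(requested.count() + " requests, max demand " + requested.max());
    }
}
```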

My original thought against implementing this metric for unnamed sequences was that there would already be a lot of noise. Furthermore, with any sequence that gets a Long.MAX_VALUE request, most other requests would look like mere blips.

The problem I see with attempting to record this on a per-subscriber basis is that we would need a differentiating tag per subscriber, and we don't really have anything meaningful to provide for that. The Subscriber instance, maybe? Or, for named sequences, an incrementing number? Not sure what to do here...

  • Add arguments, or any other means, to exclude some calculations (not too fine-grained)

MeterFilters are perfect for that, I think: they replace meters with NO-OP instances if the filter denies them (see the sketch below). Let the user choose 👍
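
A minimal sketch of that kind of exclusion, assuming a plain Micrometer registry ("reactor.onNext.delay" is an illustrative meter name, not necessarily Reactor's):

```java
import io.micrometer.core.instrument.MeterRegistry;
import io.micrometer.core.instrument.config.MeterFilter;
import io.micrometer.core.instrument.simple.SimpleMeterRegistry;

public class ExcludeMetricsSketch {
    public static void main(String[] args) {
        MeterRegistry registry = new SimpleMeterRegistry();

        // deny a specific meter by name; denied meters are replaced by
        // NO-OP instances, so recording to them costs next to nothing
        registry.config().meterFilter(
                MeterFilter.deny(id -> "reactor.onNext.delay".equals(id.getName())));
    }
}
```
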
  • Document/provide a JSON template to get started, e.g. like micrometer-metrics/micrometer:scripts/spring-dash@master

I'd like someone with experience with these tools to provide something, but I definitely don't feel at ease enough with Grafana to come up with something like that. Needs a contribution 😄

  • Mono might not need all the metrics; in fact it only needs one: mono_duration. The malformed-source metric can be useful too, separately, but there is no need for an onNext delay.

Doable, but it implies duplicating all the code (including the fuseable versions) for Mono.

  • It would be interesting to tag errors like Spring Boot does, with exception=ClassName

Good idea!

  • Rework the labels used to remove the reactor_* prefix. See below for an example Prometheus trace:

For the meter names, I'm not sure (see my previous comment above). For the tags, I think it indeed makes more sense 👍

simonbasle added a commit that referenced this issue Jun 15, 2018
 - split Mono implementation from Flux implementation
 - Mono doesn't record onNext delays nor requests
 - rename meter 'reactor.subscribe.to.terminate' to 'reactor.flow.duration'
 - remove 'reactor' prefix from tag names and rename some:
   - 'reactor.termination.type' to 'status'
   - 'reactor.sequence.name' to 'flow'
   - 'reactor.sequence.type' to 'type'
 - duration meter tagged with `status == ERROR` is lazily created to
   also be tagged with the Exception's class name
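
Under that renaming, the Prometheus trace from the issue description would come out roughly as follows (a hypothetical rendering derived from the commit message above, not captured output):

reactor_flow_duration_seconds_count{flow="http://localhost:8082/details/{ticker}",status="cancel",type="Mono",ticker="GOOG",} 0.0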

jkschneider commented Jun 15, 2018

My 2 cents:

Produce latency buckets...

I really try to encourage core libraries to just use a vanilla Timer and not be any more specific than that. The usability of latency buckets varies substantially from monitoring system to monitoring system, and there is a cost to them. Allow the end user to enable latency buckets programmatically through a MeterFilter (see the sketch below) or declaratively in Spring via management.metrics.distribution.percentiles-histogram.reactor.
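
For the programmatic route, a minimal sketch of such a MeterFilter, assuming a plain Micrometer registry (the "reactor" name prefix follows the suggestion above; everything else is illustrative):

```java
import io.micrometer.core.instrument.Meter;
import io.micrometer.core.instrument.MeterRegistry;
import io.micrometer.core.instrument.config.MeterFilter;
import io.micrometer.core.instrument.distribution.DistributionStatisticConfig;
import io.micrometer.core.instrument.simple.SimpleMeterRegistry;

public class HistogramSketch {
    public static void main(String[] args) {
        MeterRegistry registry = new SimpleMeterRegistry();
        registry.config().meterFilter(new MeterFilter() {
            @Override
            public DistributionStatisticConfig configure(Meter.Id id, DistributionStatisticConfig config) {
                if (id.getName().startsWith("reactor")) {
                    // publish histogram buckets (the "le" tag in Prometheus)
                    return DistributionStatisticConfig.builder()
                            .percentilesHistogram(true)
                            .build()
                            .merge(config);
                }
                return config;
            }
        });
    }
}
```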

Removing reactor_* prefix.

I agree you should remove the prefix from tag keys. I do think it is important to consistently apply reactor. on the front of all metric names, however. We tend to prefix most OOTB metrics with the technology because the implied name taxonomy provides an easy way for folks to influence the set of all reactor metrics at once (e.g. add a tag, disable them, whitelist them).

Examples:

  1. Spring configuration to disable all reactor metrics: management.metrics.enable.reactor=false
  2. Controlling bucket tag cardinality when the user knows something about the max time of their reactor interactions (default is 2 minutes): MeterFilter.maxExpected("reactor", Duration.ofSeconds(1))
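
As a sketch, both examples can also be registered programmatically like this (shown together only for illustration; either filter stands on its own):

```java
import java.time.Duration;

import io.micrometer.core.instrument.MeterRegistry;
import io.micrometer.core.instrument.config.MeterFilter;
import io.micrometer.core.instrument.simple.SimpleMeterRegistry;

public class ReactorPrefixSketch {
    public static void main(String[] args) {
        MeterRegistry registry = new SimpleMeterRegistry();

        // 1. programmatic equivalent of management.metrics.enable.reactor=false:
        //    drop every meter whose name starts with "reactor"
        registry.config().meterFilter(MeterFilter.denyNameStartsWith("reactor"));

        // 2. cap the expected max of all "reactor" timers at 1 second,
        //    which shrinks the published bucket set
        registry.config().meterFilter(MeterFilter.maxExpected("reactor", Duration.ofSeconds(1)));
    }
}
```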

@simonbasle simonbasle modified the milestones: Backlog, 3.2.0.M3 Jun 20, 2018
@simonbasle simonbasle reopened this Jun 20, 2018

simonbasle commented Jun 20, 2018

@smaldini regarding e013678, in the fused case shouldn't we capture the negotiated request mode and do actual.onNext(null) in the ASYNC case in onNext?
(that is, have the fuseable versions override onNext and add the ASYNC case to this line in Flux and this line in Mono — see the sketch below)

edit: see PR #1250
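
For context, a rough, hypothetical sketch of the suggested fix (not Reactor's actual source): when ASYNC fusion was negotiated, onNext(null) is merely a "data is ready" signal, so the metrics subscriber must forward it untouched rather than try to time a value that is not there.

```java
import org.reactivestreams.Subscriber;
import reactor.core.Fuseable;

// names and structure are illustrative; see PR #1250 for the real change
final class MetricsFuseableSketch<T> {
    final Subscriber<? super T> actual;
    final int sourceMode; // negotiated fusion mode, captured in onSubscribe

    MetricsFuseableSketch(Subscriber<? super T> actual, int sourceMode) {
        this.actual = actual;
        this.sourceMode = sourceMode;
    }

    public void onNext(T t) {
        if (sourceMode == Fuseable.ASYNC) {
            // ASYNC fusion: propagate the drain signal as-is,
            // downstream will poll() the actual values from the queue
            actual.onNext(null);
            return;
        }
        // NONE / SYNC path: record the onNext timing here, then pass it on
        actual.onNext(t);
    }
}
```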
