
How to plot latency and requests per second with OpenTelemetry's Histogram type? (Kind: Cumulative) #528

Closed
liufuyang opened this issue Nov 4, 2022 · 16 comments
Labels: priority: p1, question (Further information is requested)

Comments

liufuyang commented Nov 4, 2022

As you may know, this change has recently been merged on the opentelemetry-go-contrib side to start reporting rpc.server.duration, with the instrument created as

`c.meter.SyncInt64().Histogram("rpc.server.duration", instrument.WithUnit(unit.Milliseconds))`

On our backend, we have a similar implementation, but when the data is exported to Google Cloud Monitoring, we cannot seem to find a good way to plot a latency graph.

The generated metric has Kind: CUMULATIVE, as shown in picture 1 below, whereas a built-in Google Cloud Run latency graph has data with Kind: DELTA, as shown in picture 2.

So is it expected that the kind is CUMULATIVE when GoogleCloudPlatform/opentelemetry-operations-go is used? And if so, how can I plot a latency graph in Google Cloud Monitoring?

Thank you :)


Extra info:

I am not sure what the aligner really means here, but when I choose our metric exported via this package, there is only a single option, delta, to choose.
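As a side note, one possible way to chart a latency percentile directly from this cumulative distribution metric in MQL might be something like the sketch below. This is only a sketch based on the metric and label names used later in this thread; the align delta / aggregate / percentile_from combination and the value_duration_aggregate and p99_latency aliases are assumptions, not a verified answer.

# Sketch only (assumptions noted above): approximate p99 latency per
# RPC service from the cumulative distribution metric.
fetch generic_task
| metric 'workload.googleapis.com/rpc.server.duration'
| align delta(1m)
| every 1m
| group_by [metric.rpc_service],
    [value_duration_aggregate: aggregate(val())]
| value [p99_latency: percentile_from(value_duration_aggregate, 99)]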

dashpole added the question and priority: p1 labels on Nov 4, 2022
damemi (Contributor) commented Nov 4, 2022

It looks like delta aggregation for int64 values isn't permitted for custom metrics (https://cloud.google.com/monitoring/api/v3/kinds-and-types#kind-type-combos). @dashpole, do you have any context on that?

dashpole (Contributor) commented Nov 4, 2022

> @dashpole do you have any context on that?

I don't.

@liufuyang what options are available for aggregation?

liufuyang (Author) commented Nov 4, 2022

Do you mean which options I see in the aggregator field?


Also, I noticed that the UI looks different if I try to draw a graph on the Dashboard, with the same metric selected and on the same "advanced" tab.

liufuyang (Author) commented:
Hey there, sorry for the inconvenience. I think we fixed our issue by updating to the newest version: einride/cloudrunner-go#340

Thank you for the help; I will just close this for now. We can reopen it if we find other problems related to using the OpenTelemetry exporter with Histogram metrics.

dashpole (Contributor) commented Nov 4, 2022

Ah, glad to hear that resolved things.

liufuyang (Author) commented:
Thanks. By the way, since you are on top of this now, do you know how to use MQL to draw or derive the request rate from the CUMULATIVE duration Histogram data?

I know that in PromQL something like this would do it:

rate(workload_googleapis_com:rpc_server_duration_count{monitored_resource="generic_task" }[1m])

But on the MQL side, I am not sure how to do it. Thank you :)

dashpole (Contributor) commented Nov 4, 2022

Try the count aggregator?

liufuyang (Author) commented:
Aha, nice, thank you very much :D

liufuyang (Author) commented:
Hmmm... it does not seem to work when I use the UI tool to set the aggregator as count? 🤔

[screenshot]

dashpole (Contributor) commented Nov 4, 2022

Based on https://cloud.google.com/monitoring/charts/charting-distribution-metrics, it seems like maybe sum is what you want (but I would've expected sum to be the total time taken by requests). I may be mistaken.

Alternatively, you can actually use PromQL to query these metrics if you want: https://cloud.google.com/stackdriver/docs/managed-prometheus/promql

dashpole (Contributor) commented Nov 4, 2022

(but sum doesn't seem to do what I want either)

dashpole (Contributor) commented Nov 4, 2022

Actually, I think I found it. count_from seems to give the number of events in the distribution.

fetch gce_instance
| metric 'networking.googleapis.com/vm_flow/rtt'
| count_from
| rate
| every 1m

Runs for me

liufuyang (Author) commented:
Aha, thank you very much. I did it like that and it indeed works:

fetch generic_task
| metric 'workload.googleapis.com/rpc.server.duration'
| count_from
| rate
| group_by [metric.rpc_service, metric.rpc_grpc_code, resource.location],
    [value_duration_aggregate: aggregate(value_duration_count_from)]
| every 1m 

It plots the same graph (right) compared with one counted from our other custom metric rpc_count (left), which is requests per minute:
[screenshot]

liufuyang (Author) commented:
@dashpole Sorry to bother you again; I think I need one last bit of help here so we can use these metrics nicely in production. The question I have is: how do I plot the ratio between two groups' request rates?

As shown above, by using count_from and rate we can view the request rate; now we would very much like to plot the error ratio.

I've tried it like this:

fetch generic_task
| metric 'workload.googleapis.com/rpc.server.duration'
| count_from
| rate
| filter_ratio_by [metric.rpc_service, resource.location], metric.rpc_grpc_code != 'OK'
| group_by sliding(5m), sum(val())
| condition val() > .05 '10^2.%'

But it gives a quite wrong-looking graph. What we need is basically: within a time window, let's say 5 minutes, what percentage of requests have an rpc_grpc_code other than OK?

It would be very much appreciated if you could give us a hand with this. I've tried to read the docs but could not understand MQL well, and I also asked on Stack Overflow, but I'm afraid not many know the answer.

Thank you in advance.

dashpole (Contributor) commented Nov 21, 2022

@liufuyang I'm quite a bit out of my MQL depth, but I think you might want to do your group_by before your filter_ratio_by. If all of your ratios are 10% but you have 5 streams, sum(val()) will output 50%. If you switch the order, you sum the rates and errors first (e.g., 10 req/sec + 10 req/sec and 1 err/sec + 1 err/sec = 20 req/sec and 2 err/sec), and then compute the ratio.

When I tried your query on the rtt metric above:

fetch gce_instance
| metric 'networking.googleapis.com/vm_flow/rtt'
| count_from
| rate
| filter_ratio_by [resource.instance_id], metric.remote_zone != 'us-central1-a'
| group_by sliding(5m), sum(val())

It gave me a graph with values between 0 and 5.

But if I changed it to:

fetch gce_instance
| metric 'networking.googleapis.com/vm_flow/rtt'
| count_from
| rate
| group_by sliding(5m), sum(val())
| filter_ratio_by [resource.instance_id], metric.remote_zone != 'us-central1-a'

It gave a graph with values between 0 and 1, which is what I expected to see from a ratio.

liufuyang (Author) commented:
Aha, thank you so much @dashpole; switching the group_by and filter_ratio_by indeed gives us correct-looking results 👍

I super appreciate your help on this 🙏
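For reference, the final query is not shown in the thread, but applying the suggested reordering to the earlier rpc.server.duration attempt would presumably look like the sketch below. It is simply the original query with the group_by and filter_ratio_by stages swapped, as discussed above (the alerting condition line from the original attempt is omitted here).

# Sketch only: error ratio over a sliding 5m window, with group_by
# applied before filter_ratio_by as suggested above.
fetch generic_task
| metric 'workload.googleapis.com/rpc.server.duration'
| count_from
| rate
| group_by sliding(5m), sum(val())
| filter_ratio_by [metric.rpc_service, resource.location], metric.rpc_grpc_code != 'OK'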
