Proposal: Request Durations in Load Stats Reporting #11599
I think it's totally reasonable to add this to LRS. I'd recommend trying to leverage what we have in stats (or plumbing new ones), since LRS basically builds on these, e.g. https://github.com/envoyproxy/envoy/blob/master/source/common/upstream/load_stats_reporter.cc#L85. The main issue you'll encounter is that you want stats such as average request time to cover only the reporting period, which is different from the aggregate histograms. So you probably need to plumb some new bookkeeping for this, following the existing path where these stats are updated.
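The period-scoped bookkeeping suggested here could be modeled with a sketch like the following. This is not Envoy's actual code; `PeriodLatchedTimer` and its methods are hypothetical names, used only to show the latch-and-subtract pattern that turns process-lifetime totals into per-reporting-period averages.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical sketch (not an Envoy class): track a cumulative request-time
// sum and count, and latch them each reporting period so an LRS report can
// carry per-period averages rather than process-lifetime ones.
struct PeriodLatchedTimer {
  uint64_t cumulative_sum_ms_ = 0;
  uint64_t cumulative_count_ = 0;
  uint64_t latched_sum_ms_ = 0;
  uint64_t latched_count_ = 0;

  void recordRequestMs(uint64_t ms) {
    cumulative_sum_ms_ += ms;
    ++cumulative_count_;
  }

  // Called when the load-report timer fires: returns the average for the
  // period since the previous latch, then advances the latch point.
  double latchPeriodAverageMs() {
    const uint64_t sum = cumulative_sum_ms_ - latched_sum_ms_;
    const uint64_t count = cumulative_count_ - latched_count_;
    latched_sum_ms_ = cumulative_sum_ms_;
    latched_count_ = cumulative_count_;
    return count == 0 ? 0.0 : static_cast<double>(sum) / count;
  }
};
```

The cumulative counters are never reset; each report simply subtracts the previously latched values, which avoids the clearing question raised later in the thread.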
@htuch do you think it's feasible to send something like the p50, p75, and p95 request times? I see no way to […]
Yeah, you'll have to reconcile the histogram on the main thread, since this isn't an atomic integer that can be latched. @jmarantz do you know the best place to snapshot some percentile integers for histograms on the main thread for periodic reporting purposes?
@ramaraochavali might know better, but I think you can get this today by polling 127.0.0.1:ADMIN_PORT/stats?filter=HISTOGRAM_NAME and then parsing the output. This can also be done internally via the adminRequest API, though you could probably plumb through a more convenient API that doesn't require serializing an ad-hoc format to a string and then parsing it.
The idea is to integrate this into the LRS stream, so polling admin is not the right solution.
I'm not familiar with the LRS stream, but the way histograms work, the data will be collected across multiple threads and the values sent to a callback on the main thread. The adminRequest API packages that for you. HTTP requests do not need to be involved if you are doing this from within the process.
So it seems like the most correct behavior would be: when the load reporting timer goes off, we should dispatch an event to the main thread to take a snapshot of all histograms (and clear them), then dispatch an event (with some data) so a worker thread can take the frozen histograms and process the rest of the data. Along those lines, we probably need to add a method to […]. If this is correct, let me know. I was browsing the source to find a good example to use, and I think we should be using […]
IDK about clearing them. If you want to capture deltas, can you just keep your own state for previous values and compute deltas? From a flow perspective, start by looking at source/exe/main_common.h, line 65 (at ad9245b). Then, via the admin stats infrastructure, that ultimately calls source/server/admin/stats_handler.cc, line 130 (at ad9245b). I think you want to mirror that flow from a threading perspective, but rather than passing in an admin URL path and getting the stringified quantile summary from include/envoy/stats/histogram.h, line 24 (at ad9245b), you might want to get this data in a structured way, e.g. from include/envoy/stats/histogram.h, line 53 (at ad9245b).
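The structured quantile data mentioned here (e.g. p50/p75/p95 for the LRS report) can be illustrated with a self-contained sketch. This is not Envoy's histogram API, which computes quantiles from its internal sketch structure; `nearestRankQuantile` is a hypothetical helper that models the same idea over raw samples using the nearest-rank method.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Illustrative only: compute a quantile over raw samples with the
// nearest-rank method (rank = ceil(q * N), 1-based, clamped to [1, N]).
double nearestRankQuantile(std::vector<double> samples, double q) {
  if (samples.empty()) {
    return 0.0;
  }
  std::sort(samples.begin(), samples.end());
  size_t rank = static_cast<size_t>(std::ceil(q * samples.size()));
  rank = std::max<size_t>(1, std::min(rank, samples.size()));
  return samples[rank - 1];
}
```

A structured API in this spirit would return a small set of (quantile, value) pairs per histogram per reporting period, rather than a serialized summary string.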
Also, you did find the right […]
The LRS reporter is running on the main thread, so you should not have to cross threads here. You should rely on the histograms that have been aggregated there. I think the admin interface is a bit strange; this isn't an admin query but rather the goal is to get at the stats aggregate. But you can follow the logic for how the admin handler works.
Nothing in histograms is aggregated on the main thread until that's triggered by a stats sink or a /stats admin request. So you need to trigger it. The adminRequest call stack can be duplicated and tightened up so you can initiate the cross-thread aggregation.
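The merge-then-report flow being described (worker-local histogram data folded into a main-thread view before the report callback runs) can be pictured with a toy model. Every name here is hypothetical; Envoy's real mechanism is the adminRequest call stack referenced above, and the real merge crosses threads via the dispatcher rather than a simple loop.

```cpp
#include <cassert>
#include <functional>
#include <vector>

// Toy model (not Envoy code): worker-local histogram buffers are merged on
// the "main thread" before a post-merge callback reads the result.
struct ToyStore {
  std::vector<std::vector<double>> worker_buffers_;
  std::vector<double> merged_;

  // Simulates posting a merge to each worker, then running the post-merge
  // callback once every buffer has been folded in.
  void mergeHistograms(std::function<void()> post_merge_cb) {
    for (auto& buf : worker_buffers_) {
      merged_.insert(merged_.end(), buf.begin(), buf.end());
      buf.clear();  // worker-side interval data is consumed by the merge
    }
    post_merge_cb();
  }
};
```

The key property the thread is after: the callback only runs once the merged view is complete, so an LRS report built inside it sees a consistent snapshot.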
Thanks for the help. An interesting snag while implementing this is that the load reporting stats are an instance of […]
So I'm not sure if this belongs as something to mention here, or if it should be a separate issue/feature request. I know having latency information would be very valuable, including endpoint-specific information to know if a particular endpoint in a cluster is degrading in performance. That being said, I feel like some combination of TTFB in addition to histograms for request time and size (or at least averages) would be beneficial.
I'm pretty new to Envoy, so apologies in advance if this is a silly question.
Envoy's load stats reporting currently reports useful metrics like total_successful_requests and total_requests_in_progress. I'm looking to implement a server that can estimate the long-term average number of requests in our system via Little's Law. For that, I think we'd need total_requests_in_progress and some notion of the average processing time of each request, broken down by cluster. Envoy currently reports statistics that include the histogram upstream_rq_time, which is precisely what we'd need.
I wonder if this metric, or simply an "average request time" per cluster, should be threaded through to the load reporting stats. If yes, any pointers on a good way to implement it would be appreciated, and I can send a PR. Otherwise, the alternate path I see is making our server a full-fledged gRPC metrics sink, which would work but seems like a hack.
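For reference, Little's Law relates the quantities mentioned above as L = λ · W, where L is the average number of in-flight requests, λ the request arrival rate, and W the average time a request spends in the system. A trivial sketch (function name ours, not Envoy's), assuming λ is derived from the per-period request counters and W from an average request time stat:

```cpp
#include <cassert>
#include <cmath>

// Little's Law: average in-flight requests L = lambda * W.
//   requests_per_sec:     completed requests per second over the period.
//   avg_request_time_sec: average request time in seconds over the period.
double littlesLawInFlight(double requests_per_sec, double avg_request_time_sec) {
  return requests_per_sec * avg_request_time_sec;
}
```

For example, 50 requests/sec with a 200 ms average request time implies about 10 requests in flight on average, which could be cross-checked against the reported total_requests_in_progress.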
I also noticed some work around Orca and I wonder if the recommendation is to wait until that work is complete.