Proposal: Request Durations in Load Stats Reporting #11599

utsav-dbx · 2020-06-16T05:42:45Z

I'm pretty new to Envoy, so apologies in advance if this is a silly question.

Envoy's load stats reporting currently reports useful metrics like total_successful_requests and total_requests_in_progress. I'm looking to implement a server that can estimate the long term average number of requests in our system via Little's Law. For that, I think we'd need total_requests_in_progress, and some notion of the average processing time of each request, broken down by cluster.

Envoy currently reports statistics that include the histogram upstream_rq_time, which is precisely what we'd need.

I wonder if this metric, or simply an "average request time" per cluster, should be threaded through to the load reporting stats. If yes, any pointers on a good way to implement it would be appreciated, and I can send a PR. Otherwise, the alternate path I see is that we need our server to be a full fledged gRPC metrics sink, which would work, but seems like a hack.

I also noticed some work around Orca and I wonder if the recommendation is to wait until that work is complete.

The text was updated successfully, but these errors were encountered:

htuch · 2020-06-16T13:32:25Z

I think it's totally reasonable to add this to LRS, I'd recommend trying to leverage what we have in stats (or plumbing new ones), since LRS basically builds on this, e.g. https://github.com/envoyproxy/envoy/blob/master/source/common/upstream/load_stats_reporter.cc#L85.

The main issue you'll encounter is that you want the stats such as average request time to be only for the reporting period, which is different to the aggregate histograms. So, you probably need to plumb some new bookkeeping for this, which follows the existing path of where these stats are updated.

utsav-dbx · 2020-06-16T16:03:04Z

@htuch do you think it's feasible to send something like the p50, p75, and p95 request times? I see no way to latch a histogram value, but might be missing something.

htuch · 2020-06-16T19:52:43Z

Yeah, you'll have to reconcile the histogram on the main thread, since this isn't an atomic integer that can be latched. @jmarantz do you know the best place to snapshot some percentile integers for histograms on the main thread for periodic reporting purposes?

jmarantz · 2020-06-16T20:34:09Z

@ramaraochavali might know better, but I think you can get this today by polling the 127.0.0.1:ADMIN_PORT/stats?filter=HISTOGRAM_NAME and then parsing the output.

This can also be done internally via the adminRequest API, though you could probably plumb through a more convenient API that didn't require serializing an ad-hoc format to a string and then parsing it.

htuch · 2020-06-16T21:29:32Z

The idea is to integrate this into the LRS stream, so polling admin is not the right solution.

jmarantz · 2020-06-17T00:58:07Z

I'm not familiar with the LRS stream, but the way histograms work the data will be collected across multiple threads and the values sent to a callback on the main thread. The adminRequest API packages that for you. Http requests do not need to be involved if you are doing this from within the process.

…

On Tue, Jun 16, 2020, 5:29 PM htuch ***@***.***> wrote: The idea is to integrate this into the LRS stream, so polling admin is not the right solution. — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub <#11599 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/AAO2IPJULDFHVI4QR22RR5DRW7P4TANCNFSM4N7IXCZA> .

utsav-dbx · 2020-06-17T02:34:36Z

So it seems like the most correct behavior would be:

When the load reporting timer goes off, we should dispatch an event to the main thread to take a snapshot of all histograms (and clear them), then dispatch an event (with some data) so a worker thread can take the frozen histograms, and process rest of the data. On those lines, we probably need to add a method to upstream_impl.h since it has the load_report_stats_store_ with the right histogram, and that method can take in a callback.

If this is correct, let me know. I was browsing the source to find a good example to use, and I think we should be using Event::Dispatcher.post() to freeze stats, but not sure how to invoke a callback on a worker thread.

jmarantz · 2020-06-17T13:20:34Z

IDK about clearing them. If you want to capture deltas can you just keep your own state for previous values and compute deltas?

From a flow perspective, start by looking at

envoy/source/exe/main_common.h

Line 65 in ad9245b

void adminRequest(absl::string_view path_and_query, absl::string_view method,

Then via the admin stats infrastructure, that ultimately calls:

envoy/source/server/admin/stats_handler.cc

Line 130 in ad9245b

all_histograms.emplace(histogram->name(), histogram->quantileSummary());

I think you want to mirror that flow from a threading perspective, but rather than passing in an admin URL path and getting the stringified quantile summary from

envoy/include/envoy/stats/histogram.h

Line 24 in ad9245b

virtual std::string quantileSummary() const PURE;

You might want to get this data in a structured way, e.g. from

envoy/include/envoy/stats/histogram.h

Line 53 in ad9245b

virtual const std::vector<uint64_t>& computedBuckets() const PURE;

jmarantz · 2020-06-17T13:24:24Z

Also, you did find the right post() API if you need to then get the data accessible on a particular worker thread. But I'm nervous about the attempt to 'freeze' stats. That sound difficult. I'd think instead about grabbing a reasonably coherent snapshot to the best extent possible using the threading flow I gave above.

htuch · 2020-06-17T17:29:55Z

The LRS reporter is running on the main thread, so you should not have to cross-threads here. You should rely on the histograms that have been aggregated there. I think the admin interface is a bit strange, this isn't an admin query but instead the goal is to get at the stats aggregate. But you can follow the logic for how the admin handler works.

jmarantz · 2020-06-17T17:45:22Z

Nothing in histograms is aggregated in the main thread until that's triggered by a stats-sink or /stats admin request. So you need to trigger it. The adminRequest call-stack can be duplicated and tightened up so you can initiate the cross-thread aggregation.

…

On Wed, Jun 17, 2020 at 1:30 PM htuch ***@***.***> wrote: The LRS reporter is running on the main thread, so you should not have to cross-threads here. You should rely on the histograms that have been aggregated there. I think the admin interface is a bit strange, this isn't an admin query but instead the goal is to get at the stats aggregate. But you can follow the logic for how the admin handler works. — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub <#11599 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/AAO2IPN3X7BTYRJXWDTOH6DRXD4SFANCNFSM4N7IXCZA> .

utsav-dbx · 2020-06-18T00:21:37Z

Thanks for the help. An interesting snag while implementing this is that the load reporting stats are an instance of IsolatedStoreImpl which do not return a ParentHistogramSharedPtr, since it doesn't have any ParentHistograms. Still digging into it a little more to see if that can be implemented.

jtway · 2022-06-02T13:20:15Z

So not sure if this belongs as soon thing to mention here, or if it should be a separate issue/feature request.

I know having latency information would be very valuable, including endpoint specific information to know if a particular endpoint in a cluster's performance is degrading. That being said, I feel like some combination of TTFB in addition to histograms for request time and size (or at least avgs) would be beneficial.

htuch added enhancement Feature requests. Not bugs or questions. help wanted Needs help! labels Jun 16, 2020

utsav-dbx mentioned this issue Jun 19, 2020

load reporter: best effort report request latencies #11650

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Proposal: Request Durations in Load Stats Reporting #11599

Proposal: Request Durations in Load Stats Reporting #11599

utsav-dbx commented Jun 16, 2020

htuch commented Jun 16, 2020

utsav-dbx commented Jun 16, 2020

htuch commented Jun 16, 2020

jmarantz commented Jun 16, 2020

htuch commented Jun 16, 2020

jmarantz commented Jun 17, 2020 via email

utsav-dbx commented Jun 17, 2020

jmarantz commented Jun 17, 2020 •

edited

Loading

jmarantz commented Jun 17, 2020

htuch commented Jun 17, 2020

jmarantz commented Jun 17, 2020 via email

utsav-dbx commented Jun 18, 2020 •

edited

Loading

jtway commented Jun 2, 2022

Proposal: Request Durations in Load Stats Reporting #11599

Proposal: Request Durations in Load Stats Reporting #11599

Comments

utsav-dbx commented Jun 16, 2020

htuch commented Jun 16, 2020

utsav-dbx commented Jun 16, 2020

htuch commented Jun 16, 2020

jmarantz commented Jun 16, 2020

htuch commented Jun 16, 2020

jmarantz commented Jun 17, 2020 via email

utsav-dbx commented Jun 17, 2020

jmarantz commented Jun 17, 2020 • edited Loading

jmarantz commented Jun 17, 2020

htuch commented Jun 17, 2020

jmarantz commented Jun 17, 2020 via email

utsav-dbx commented Jun 18, 2020 • edited Loading

jtway commented Jun 2, 2022

jmarantz commented Jun 17, 2020 •

edited

Loading

utsav-dbx commented Jun 18, 2020 •

edited

Loading