Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Proposal: Request Durations in Load Stats Reporting #11599

Open
utsav-dbx opened this issue Jun 16, 2020 · 13 comments
Open

Proposal: Request Durations in Load Stats Reporting #11599

utsav-dbx opened this issue Jun 16, 2020 · 13 comments
Labels
enhancement Feature requests. Not bugs or questions. help wanted Needs help!

Comments

@utsav-dbx
Copy link

I'm pretty new to Envoy, so apologies in advance if this is a silly question.

Envoy's load stats reporting currently reports useful metrics like total_successful_requests and total_requests_in_progress. I'm looking to implement a server that can estimate the long term average number of requests in our system via Little's Law. For that, I think we'd need total_requests_in_progress, and some notion of the average processing time of each request, broken down by cluster.

Envoy currently reports statistics that include the histogram upstream_rq_time, which is precisely what we'd need.

I wonder if this metric, or simply an "average request time" per cluster, should be threaded through to the load reporting stats. If yes, any pointers on a good way to implement it would be appreciated, and I can send a PR. Otherwise, the alternate path I see is that we need our server to be a full fledged gRPC metrics sink, which would work, but seems like a hack.

I also noticed some work around Orca and I wonder if the recommendation is to wait until that work is complete.

@htuch
Copy link
Member

htuch commented Jun 16, 2020

I think it's totally reasonable to add this to LRS, I'd recommend trying to leverage what we have in stats (or plumbing new ones), since LRS basically builds on this, e.g. https://github.com/envoyproxy/envoy/blob/master/source/common/upstream/load_stats_reporter.cc#L85.

The main issue you'll encounter is that you want the stats such as average request time to be only for the reporting period, which is different to the aggregate histograms. So, you probably need to plumb some new bookkeeping for this, which follows the existing path of where these stats are updated.

@htuch htuch added enhancement Feature requests. Not bugs or questions. help wanted Needs help! labels Jun 16, 2020
@utsav-dbx
Copy link
Author

@htuch do you think it's feasible to send something like the p50, p75, and p95 request times? I see no way to latch a histogram value, but might be missing something.

@htuch
Copy link
Member

htuch commented Jun 16, 2020

Yeah, you'll have to reconcile the histogram on the main thread, since this isn't an atomic integer that can be latched. @jmarantz do you know the best place to snapshot some percentile integers for histograms on the main thread for periodic reporting purposes?

@jmarantz
Copy link
Contributor

@ramaraochavali might know better, but I think you can get this today by polling the 127.0.0.1:ADMIN_PORT/stats?filter=HISTOGRAM_NAME and then parsing the output.

This can also be done internally via the adminRequest API, though you could probably plumb through a more convenient API that didn't require serializing an ad-hoc format to a string and then parsing it.

@htuch
Copy link
Member

htuch commented Jun 16, 2020

The idea is to integrate this into the LRS stream, so polling admin is not the right solution.

@jmarantz
Copy link
Contributor

jmarantz commented Jun 17, 2020 via email

@utsav-dbx
Copy link
Author

So it seems like the most correct behavior would be:

When the load reporting timer goes off, we should dispatch an event to the main thread to take a snapshot of all histograms (and clear them), then dispatch an event (with some data) so a worker thread can take the frozen histograms, and process rest of the data. On those lines, we probably need to add a method to upstream_impl.h since it has the load_report_stats_store_ with the right histogram, and that method can take in a callback.

If this is correct, let me know. I was browsing the source to find a good example to use, and I think we should be using Event::Dispatcher.post() to freeze stats, but not sure how to invoke a callback on a worker thread.

@jmarantz
Copy link
Contributor

jmarantz commented Jun 17, 2020

IDK about clearing them. If you want to capture deltas can you just keep your own state for previous values and compute deltas?

From a flow perspective, start by looking at

void adminRequest(absl::string_view path_and_query, absl::string_view method,

Then via the admin stats infrastructure, that ultimately calls:

all_histograms.emplace(histogram->name(), histogram->quantileSummary());

I think you want to mirror that flow from a threading perspective, but rather than passing in an admin URL path and getting the stringified quantile summary from

virtual std::string quantileSummary() const PURE;

You might want to get this data in a structured way, e.g. from

virtual const std::vector<uint64_t>& computedBuckets() const PURE;

@jmarantz
Copy link
Contributor

Also, you did find the right post() API if you need to then get the data accessible on a particular worker thread. But I'm nervous about the attempt to 'freeze' stats. That sound difficult. I'd think instead about grabbing a reasonably coherent snapshot to the best extent possible using the threading flow I gave above.

@htuch
Copy link
Member

htuch commented Jun 17, 2020

The LRS reporter is running on the main thread, so you should not have to cross-threads here. You should rely on the histograms that have been aggregated there. I think the admin interface is a bit strange, this isn't an admin query but instead the goal is to get at the stats aggregate. But you can follow the logic for how the admin handler works.

@jmarantz
Copy link
Contributor

jmarantz commented Jun 17, 2020 via email

@utsav-dbx
Copy link
Author

utsav-dbx commented Jun 18, 2020

Thanks for the help. An interesting snag while implementing this is that the load reporting stats are an instance of IsolatedStoreImpl which do not return a ParentHistogramSharedPtr, since it doesn't have any ParentHistograms. Still digging into it a little more to see if that can be implemented.

@jtway
Copy link
Contributor

jtway commented Jun 2, 2022

So not sure if this belongs as soon thing to mention here, or if it should be a separate issue/feature request.

I know having latency information would be very valuable, including endpoint specific information to know if a particular endpoint in a cluster's performance is degrading. That being said, I feel like some combination of TTFB in addition to histograms for request time and size (or at least avgs) would be beneficial.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement Feature requests. Not bugs or questions. help wanted Needs help!
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants