Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

upstream: implement UpstreamLocalityStats.loadMetricStats #18534

Merged
merged 8 commits into from
Jan 14, 2022

Conversation

pianiststickman
Copy link
Contributor

@pianiststickman pianiststickman commented Oct 8, 2021

Implements a simple API allowing extensions to populate UpstreamLocalityStats.loadMetricStats in load reports, stored as a flat hash map of keys and values that may vary at runtime. Individual stats are accumulated by calling add(), which combines stats with the same name; the aggregated stats are retrieved by calling latch(), which also clears the current load metrics. The load metrics are latched and read by the load stats reporter.

Signed-off-by: Eugene Chan eugenechan@google.com

Commit Message: upstream: implement UpstreamLocalityStats.loadMetricStats
Additional Description: Implements a simple API allowing extensions to populate UpstreamLocalityStats.loadMetricStats in load reports, stored as a flat hash map of keys and values that may vary at runtime. Individual stats are accumulated by calling add(), which combines stats with the same name; the aggregated stats are retrieved by calling latch(), which also clears the current load metrics. The load metrics are latched and read by the load stats reporter.

Note: this implementation puts a copy of the stat name into every host that receives a copy of that metric. This can be improved by putting a single copy of the stat name into a thread-local key->index map and using the index as the key to the stat map instead.
Risk Level: low
Testing: CI
Docs Changes: n/a
Release Notes: n/a
Platform Specific Features: n/a

UpstreamLocalityStats.loadMetricStats in load reports

Signed-off-by: Eugene Chan <eugenechan@google.com>
@pianiststickman
Copy link
Contributor Author

cc @htuch @jmarantz

Copy link
Contributor

@jmarantz jmarantz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I'm missing some context as I was not aware of the load-stats-reporter infrastructure.

Where do these stats get consumed?

envoy/upstream/host_description.h Outdated Show resolved Hide resolved
source/common/upstream/load_stats_reporter.cc Outdated Show resolved Hide resolved
envoy/upstream/host_description.h Outdated Show resolved Hide resolved
source/common/upstream/load_stats_reporter.cc Outdated Show resolved Hide resolved
Signed-off-by: Eugene Chan <eugenechan@google.com>
@pianiststickman
Copy link
Contributor Author

I think I'm missing some context as I was not aware of the load-stats-reporter infrastructure.

Where do these stats get consumed?

These stats are sent as part of LoadStatsRequest to the management server. (See https://www.envoyproxy.io/docs/envoy/latest/api-v3/config/endpoint/v3/load_report.proto#envoy-v3-api-msg-config-endpoint-v3-upstreamlocalitystats)

@pianiststickman
Copy link
Contributor Author

/retest

@repokitteh-read-only
Copy link

Retrying Azure Pipelines:
Retried failed jobs in: envoy-presubmit

🐱

Caused by: a #18534 (comment) was created by @pianiststickman.

see: more, trace.

@snowp
Copy link
Contributor

snowp commented Oct 27, 2021

@jmarantz @htuch ping on this since it seems to have stalled. Who owns LRS nowadays?

@htuch
Copy link
Member

htuch commented Oct 29, 2021

I think maybe @adisuissa can take this one, since LRS is a pretty important part of control plane.

Copy link
Contributor

@adisuissa adisuissa left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for working on this.
Left a few comments with some questions.

envoy/upstream/host_description.h Outdated Show resolved Hide resolved
envoy/upstream/host_description.h Show resolved Hide resolved
source/common/upstream/load_stats_reporter.cc Outdated Show resolved Hide resolved
envoy/upstream/host_description.h Outdated Show resolved Hide resolved
envoy/upstream/host_description.h Outdated Show resolved Hide resolved
Signed-off-by: Eugene Chan <eugenechan@google.com>
Signed-off-by: Eugene Chan <eugenechan@google.com>
@pianiststickman
Copy link
Contributor Author

Thanks, PTAL

/retest

@repokitteh-read-only
Copy link

Retrying Azure Pipelines:
Retried failed jobs in: envoy-presubmit

🐱

Caused by: a #18534 (comment) was created by @pianiststickman.

see: more, trace.

Copy link
Contributor

@adisuissa adisuissa left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code changes LGTM modulo the design comment.

@yanavlasov
Copy link
Contributor

@pianiststickman can you clarify comments please so we can submit it?

@yanavlasov
Copy link
Contributor

/wait

Signed-off-by: Eugene Chan <eugenechan@google.com>
Copy link
Contributor

@jmarantz jmarantz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So this PR has no effect on Envoy other than to make the HostDescription object 8 bytes bigger.

Can you put a pointer to a doc explaining how this can be used, and its scaling limitations, into C++ comments and the PR description?

Thank you!

source/common/upstream/upstream_impl.h Outdated Show resolved Hide resolved
source/common/upstream/upstream_impl.h Outdated Show resolved Hide resolved
source/common/upstream/upstream_impl.h Outdated Show resolved Hide resolved
@rojkov
Copy link
Member

rojkov commented Nov 16, 2021

/wait

@jmarantz
Copy link
Contributor

/wait

@github-actions
Copy link

This pull request has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in 7 days if no further activity occurs. Please feel free to give a status update now, ping for review, or re-open when it's ready. Thank you for your contributions!

@github-actions github-actions bot added the stale stalebot believes this issue/PR has not been touched recently label Dec 30, 2021
Signed-off-by: Eugene Chan <eugenechan@google.com>
@pianiststickman
Copy link
Contributor Author

pianiststickman commented Jan 4, 2022

PTAL, and apologies for the delay on this. I've added comments explaining the performance considerations of the current implementation. I think since this is purely an implementation detail (putting key names in thread-local storage, eliminating contention on the locks used to hold the stat maps) I'd prefer to take a pragmatic approach and dispense with that when it becomes necessary. Let me know what you think.

@github-actions github-actions bot removed the stale stalebot believes this issue/PR has not been touched recently label Jan 5, 2022
Copy link
Contributor

@jmarantz jmarantz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if you merge main, maybe the coverage test will pass. Otherwise you might have to dive into the CI report and see what's wrong. Sometimes coverage fails because we have added lines of code that are not covered with the new tests, sometimes it's just a flake or a consequence of non-hermetic tests, since the coverage tests run all in one process.

source/common/upstream/upstream_impl.h Outdated Show resolved Hide resolved
source/common/upstream/upstream_impl.h Outdated Show resolved Hide resolved
envoy/upstream/host_description.h Outdated Show resolved Hide resolved
source/common/upstream/upstream_impl.h Outdated Show resolved Hide resolved
* Implementation of LoadMetricStats.
*
* TODO(pianiststickman): this implementation puts a copy of the stat name into every host that
* receives a copy of that metric. This can be improved by putting a single copy of the stat name
Copy link
Contributor

@jmarantz jmarantz Jan 5, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also note here that it takes a lock on the hot-path.

I really think this should not be deployed as is, but I understand why you want to add the interface and iterate on the perf.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

envoy/upstream/host_description.h Outdated Show resolved Hide resolved
test/common/upstream/load_stats_reporter_test.cc Outdated Show resolved Hide resolved
@jmarantz
Copy link
Contributor

jmarantz commented Jan 5, 2022

/wait

Signed-off-by: Eugene Chan <eugenechan@google.com>
@pianiststickman
Copy link
Contributor Author

/retest

@repokitteh-read-only
Copy link

Retrying Azure Pipelines:
Retried failed jobs in: envoy-presubmit

🐱

Caused by: a #18534 (comment) was created by @pianiststickman.

see: more, trace.

Copy link
Contributor

@jmarantz jmarantz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/assign-from envoyproxy/senior-maintainers

ON_CALL(cm_, clusters()).WillByDefault(Return(cluster_info));
deliverLoadStatsResponse({"foo"});
// First stats report on timer tick.
time_system_.setMonotonicTime(std::chrono::microseconds(4));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

consider using advanceTimeWait with a delta, rather than just setting the time to an absolute time.

@repokitteh-read-only
Copy link

envoyproxy/senior-maintainers assignee is @htuch

🐱

Caused by: a #18534 (review) was submitted by @jmarantz.

see: more, trace.

@@ -218,6 +218,27 @@ HostVector filterHosts(const absl::node_hash_set<HostSharedPtr>& hosts,

} // namespace

// TODO(pianiststickman): this implementation takes a lock on the hot path and puts a copy of the
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just want to call out this warning so the senior-maintainer sees it.

We have more detailed notes on how to avoid taking a lock on every request for this use-case, but @pianiststickman wants to merge this before building out the thread-local impl to establish the API even with a potential perf bottleneck and then iterate.

Note that this code can't be reached in production without adding new C++ APIs, so it will not impact existing Envoy deployments.

@htuch
Copy link
Member

htuch commented Jan 13, 2022

We need another senior maintainer, since both Josh and PR author are Googlers :)

@htuch
Copy link
Member

htuch commented Jan 13, 2022

/assign-from envoyproxy/senior-maintainers

@repokitteh-read-only
Copy link

envoyproxy/senior-maintainers assignee is @htuch

🐱

Caused by: a #18534 (comment) was created by @htuch.

see: more, trace.

@htuch htuch removed their assignment Jan 13, 2022
@htuch
Copy link
Member

htuch commented Jan 13, 2022

/assign-from envoyproxy/senior-maintainers

@repokitteh-read-only
Copy link

envoyproxy/senior-maintainers assignee is @zuercher

🐱

Caused by: a #18534 (comment) was created by @htuch.

see: more, trace.

Copy link
Member

@zuercher zuercher left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks fine to me given the proviso that the performance issue addressed before this becomes available for general use.

@htuch htuch merged commit fe9dc99 into envoyproxy:main Jan 14, 2022
joshperry pushed a commit to joshperry/envoy that referenced this pull request Feb 13, 2022
…#18534)

Implements a simple API allowing extensions to populate UpstreamLocalityStats.loadMetricStats in load reports, stored as a flat hash map of keys and values that may vary at runtime. Individual stats are accumulated by calling add(), which combines stats with the same name; the aggregated stats are retrieved by calling latch(), which also clears the current load metrics. The load metrics are latched and read by the load stats reporter.

Note: this implementation puts a copy of the stat name into every host that receives a copy of that metric. This can be improved by putting a single copy of the stat name into a thread-local key->index map and using the index as the key to the stat map instead.

Risk Level: low
Testing: CI
Docs Changes: n/a
Release Notes: n/a
Platform Specific Features: n/a

Signed-off-by: Eugene Chan <eugenechan@google.com>
Signed-off-by: Josh Perry <josh.perry@mx.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

9 participants