
[apm] standardize peer tag aggregation #20550

Merged
merged 33 commits into main from apm/standardize-peer-tag-aggregation on Nov 8, 2023

Conversation

@jdgumz (Contributor) commented Oct 31, 2023

What does this PR do?

Builds on previous work by standardizing on a set of default peer tags over which to aggregate.
The tags themselves are an approved list of fields that are then converted to peer.* tags on the backend for use with trace metrics.

The previous peer_tags configuration can still be used to supply supplementary tags if necessary, but we intend this flag to be used only in exceptional cases. It cannot be used to supply arbitrary tags, since all tags are ultimately vetted in the backend.
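
For illustration, a minimal datadog.yaml sketch of the intended usage (the supplementary tag shown is a hypothetical placeholder, not part of the default list introduced here):

apm_config:
  # Aggregate trace stats over the approved default set of peer tags.
  peer_tags_aggregation: true
  # Supplementary tags, for exceptional cases only; arbitrary tags are
  # not honored, since all tags are ultimately vetted in the backend.
  peer_tags: ["custom.peer.tag"]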

Motivation

To make it easier for customers to adopt peer tags.

Additional Notes

Possible Drawbacks / Trade-offs

Describe how to test/QA your changes

Please ask me for the testing doc.

Reviewer's Checklist

  • If known, an appropriate milestone has been selected; otherwise the Triage milestone is set.
  • Use the major_change label if your change has a major impact on the code base, impacts multiple teams, or changes important, well-established internals of the Agent. This label will be used during QA to make sure each team pays extra attention to the changed behavior. For any customer-facing change, use a release note.
  • A release note has been added or the changelog/no-changelog label has been applied.
  • Changed code has automated tests for its functionality.
  • Adequate QA/testing plan information is provided if the qa/skip-qa label is not applied.
  • At least one team/.. label has been applied, indicating the team(s) that should QA this change.
  • If applicable, docs team has been notified or an issue has been opened on the documentation repo.
  • If applicable, the need-change/operator and need-change/helm labels have been applied.
  • If applicable, the k8s/<min-version> label has been applied, indicating the lowest Kubernetes version compatible with this feature.
  • If applicable, the config template has been updated.

@jdgumz requested review from a team as code owners on October 31, 2023 16:54
@maycmlee (Contributor) left a comment:

Left a couple of suggestions

pkg/config/config_template.yaml (4 suggestions; outdated, resolved)
jdgumz and others added 4 commits October 31, 2023 13:10
Co-authored-by: May Lee <mayl@alumni.cmu.edu>
Co-authored-by: May Lee <mayl@alumni.cmu.edu>
Co-authored-by: May Lee <mayl@alumni.cmu.edu>
Co-authored-by: May Lee <mayl@alumni.cmu.edu>
@jdgumz (Contributor, Author) commented Oct 31, 2023

> Left a couple of suggestions

Thank you May! I have committed your suggestions.

pr-commenter bot commented Nov 2, 2023

Bloop Bleep... Dogbot Here

Regression Detector Results

Run ID: 6decd9ed-9188-4a29-bd96-d3a4746cd3dc
Baseline: 1bf8059
Comparison: 0bcc362
Total datadog-agent CPUs: 7

Explanation

A regression test is an integrated performance test for datadog-agent in a repeatable rig, with varying configurations of datadog-agent. What follows is a statistical summary of a brief datadog-agent run for each configuration across the SHAs given above. The goal of these tests is to determine quickly whether, and to what degree, datadog-agent performance is changed by a pull request.

Because a target's optimization goal performance in each experiment will vary somewhat each time it is run, we can only estimate mean differences in optimization goal relative to the baseline target. We express these differences as a percentage change relative to the baseline target, denoted "Δ mean %". These estimates are made to a precision that balances accuracy and cost control. We represent this precision as a 90.00% confidence interval denoted "Δ mean % CI": there is a 90.00% chance that the true value of "Δ mean %" is in that interval.

We decide whether a change in performance is a "regression" -- a change worth investigating further -- if both of the following two criteria are true:

  1. The estimated |Δ mean %| ≥ 5.00%. This criterion intends to answer the question "Does the estimated change in mean optimization goal performance have a meaningful impact on your customers?". We assume that when |Δ mean %| < 5.00%, the impact on your customers is not meaningful. We also assume that a performance change in optimization goal is worth investigating whether it is an increase or decrease, so long as the magnitude of the change is sufficiently large.

  2. Zero is not in the 90.00% confidence interval "Δ mean % CI" about "Δ mean %". This statement is equivalent to saying that there is at least a 90.00% chance that the mean difference in optimization goal is not zero. This criterion intends to answer the question, "Is there a statistically significant difference in mean optimization goal performance?". It also means there is no more than a 10.00% chance this criterion reports a statistically significant difference when the true difference in mean optimization goal is zero -- a "false positive". We assume you are willing to accept a 10.00% chance of inaccurately detecting a change in performance when no true difference exists.

The table below, if present, lists those experiments that have experienced a statistically significant change in mean optimization goal performance between baseline and comparison SHAs with 90.00% confidence OR have been detected as newly erratic. Negative values of "Δ mean %" mean that baseline is faster, whereas positive values of "Δ mean %" mean that comparison is faster. Results that do not exhibit more than a ±5.00% change in their mean optimization goal are discarded. An experiment is erratic if its coefficient of variation is greater than 0.1. The abbreviated table will be omitted if no interesting change is observed.
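
Restated compactly, the decision rule described above amounts to the following (a sketch in our own notation; the detector's exact estimator is not shown in this PR):

\Delta\ \text{mean}\ \% = 100 \cdot \frac{\bar{x}_{\text{comparison}} - \bar{x}_{\text{baseline}}}{\bar{x}_{\text{baseline}}}

\text{regression} \iff |\Delta\ \text{mean}\ \%| \ge 5.00 \ \text{and}\ 0 \notin \mathrm{CI}_{90\%}(\Delta\ \text{mean}\ \%)

\text{erratic} \iff \sigma / \mu > 0.1 \quad \text{(coefficient of variation)}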

No interesting changes in experiment optimization goals with confidence ≥ 90.00% and |Δ mean %| ≥ 5.00%.

Fine details of change detection per experiment.
experiment goal Δ mean % Δ mean % CI confidence
tcp_syslog_to_blackhole ingress throughput +1.92 [+1.78, +2.05] 100.00%
process_agent_real_time_mode egress throughput +0.82 [-1.70, +3.35] 40.90%
otel_to_otel_logs ingress throughput +0.33 [-1.26, +1.91] 26.58%
file_to_blackhole egress throughput +0.16 [-0.28, +0.60] 44.33%
trace_agent_json ingress throughput +0.03 [-0.10, +0.16] 32.80%
trace_agent_msgpack ingress throughput +0.03 [-0.09, +0.16] 32.98%
file_tree egress throughput +0.03 [-1.83, +1.89] 2.04%
dogstatsd_string_interner_8MiB_100k ingress throughput +0.02 [-0.02, +0.06] 52.05%
uds_dogstatsd_to_api ingress throughput +0.01 [-0.16, +0.19] 10.88%
dogstatsd_string_interner_8MiB_100 ingress throughput +0.01 [-0.12, +0.14] 10.93%
tcp_dd_logs_filter_exclude ingress throughput +0.01 [-0.05, +0.06] 16.52%
dogstatsd_string_interner_64MiB_1k ingress throughput +0.00 [-0.13, +0.13] 2.09%
dogstatsd_string_interner_128MiB_1k ingress throughput +0.00 [-0.14, +0.14] 0.68%
dogstatsd_string_interner_8MiB_50k ingress throughput +0.00 [-0.04, +0.04] 0.00%
dogstatsd_string_interner_64MiB_100 ingress throughput -0.00 [-0.14, +0.14] 0.18%
dogstatsd_string_interner_128MiB_100 ingress throughput -0.00 [-0.14, +0.14] 0.31%
dogstatsd_string_interner_8MiB_1k ingress throughput -0.00 [-0.10, +0.10] 0.69%
idle egress throughput -0.01 [-2.47, +2.46] 0.33%
dogstatsd_string_interner_8MiB_10k ingress throughput -0.02 [-0.08, +0.04] 44.75%
process_agent_standard_check egress throughput -0.20 [-3.73, +3.32] 7.48%
process_agent_standard_check_with_stats egress throughput -0.42 [-2.43, +1.59] 26.73%

@jdgumz modified the milestones: 7.51.0, 7.50.0 on Nov 3, 2023
@jdgumz added the team/agent-apm trace-agent label on Nov 3, 2023
@jeremy-hanna (Contributor) left a comment:

✅ for agent-shared-component owned files

## @env DD_APM_PEER_TAGS_AGGREGATION - bool - default: false
## [BETA] Enables aggregation of peer related tags (e.g., `peer.service`, `db.instance`, etc.) in the Agent.
## If disabled, aggregated trace stats will not include these tags as dimensions on trace metrics.
## For the best experience, Datadog also recommends enabling `compute_stats_by_span_kind`.
Comment from a Contributor:
Presumably we recommend enabling compute_stats_by_span_kind if peer_tags_aggregation is enabled?

Do we recommend enabling peer_tags_aggregation?

Will people reading this know what it means?

Reply from @jdgumz (Contributor, Author):
I'll rephrase. The idea here is that if you're using peer_tags_aggregation, you will likely also want the span kind flag enabled too.
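
For context, a sketch of that pairing in datadog.yaml (the peer_tags_aggregation key is the file equivalent of the DD_APM_PEER_TAGS_AGGREGATION env var quoted above; compute_stats_by_span_kind is the flag the reply refers to):

apm_config:
  # Aggregate trace stats over peer tags (e.g., peer.service, db.instance).
  peer_tags_aggregation: true
  # Recommended alongside peer tag aggregation; computes trace stats
  # based on span kind (client, producer, etc.).
  compute_stats_by_span_kind: true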

## If disabled, aggregated trace stats will not include these tags as dimensions on trace metrics.
## For the best experience, Datadog also recommends enabling `compute_stats_by_span_kind`.
## If enabling both causes the Agent to consume too many resources, try disabling `compute_stats_by_span_kind` first.
## If the overhead remains high, it will be due to a high cardinality of peer tags from the traces. You may need to check your instrumentation.
Comment from a Contributor:
What should they be looking for in their instrumentation? Can we link them to any documentation here?

Reply from @jdgumz (Contributor, Author):
I can provide some more guidance in this comment. We do not have clear documentation that would speak to this specific concern.

@jdgumz merged commit e43ee12 into main on Nov 8, 2023
135 checks passed
@jdgumz deleted the apm/standardize-peer-tag-aggregation branch on November 8, 2023 17:22
mx-psi pushed a commit to open-telemetry/opentelemetry-collector-contrib that referenced this pull request Nov 14, 2023
…r_tags_aggregation (#29089)

Description:
Deprecate peer_service_aggregation in favor of peer_tags_aggregation.
Counterpart of DataDog/datadog-agent#20550.
RoryCrispin pushed a commit to ClickHouse/opentelemetry-collector-contrib that referenced this pull request Nov 24, 2023
…r_tags_aggregation (open-telemetry#29089)

Description:
Deprecate peer_service_aggregation in favor of peer_tags_aggregation.
Counterpart of DataDog/datadog-agent#20550.
@ahmed-mez modified the milestones: 7.51.0, 7.50.0 on Feb 29, 2024
Labels
team/agent-apm trace-agent
6 participants