feat(outputs.datadog): Add support for submitting alongside dd-agent #15702

jdheyburn · 2024-08-05T12:10:57Z

Summary

Telegraf's outputs.datadog plugin is incompatible with submitting statsd metrics alongside the Datadog agent, as the dd-agent submits counter metrics as rates instead. Telegraf does not do this in order to reflect raw data more accurately. See #10979 for that change.

This PR enables a new config should_rate_counts which will convert count metrics to rates via the defined rate_interval (defaults to 10s - the same default as the Datadog agent) if those metrics are eligible. See below for eligibility.

What does the Datadog agent do?

Counts
- Source
- Logic
- rate
Histogram
- Source
- Logic
- values
  - avg
    - gauge
  - count
    - rate
  - median
    - gauge
  - 95percentile
    - gauge
  - max
    - gauge

Histograms and Timings are synonymous.

Testing

I fired some test StatsD UDP packets to all these endpoints:

Datadog v7.48.1
Telegraf 1.31.2
Telegraf 1.32.0 (including this PR)

Count

# sent via datadog
while true; do
  echo "test.metric.datadog:1|c" | nc -w 1 -u localhost 8125
  sleep 1;
done

## sent via telegraf

while true; do
  echo "test.metric.telegraf:1|c" | nc -w 1 -u localhost 8125
  sleep 1;
done

Submitted to datadog agent

Submitted to telegraf 1.32.0 (including this PR)

Timings

# sent via datadog
while true; do
  echo "test.timing.datadog:100|ms" | nc -w 1 -u localhost 8125
  sleep 1;
done

## sent via telegraf

while true; do
  echo "test.timing.telegraf:100|ms|#region:us-west-2" | nc -w 1 -u localhost 8125
  sleep 1;
done

Submitted to Datadog agent

test.timing.datadog.95percentile
  - type: Gauge, interval: 10
test.timing.datadog.avg
   - type: Gauge, interval: 10
test.timing.datadog.count
   - type: Rate, interval: 10
test.timing.datadog.max
   - type: Gauge, interval: 10
test.timing.datadog.median
   - type: Gauge, interval: 10

Submitted to telegraf 1.32.0 (including this PR)

test.timing.telegraf.95percentile
  - type: Not assigned
test.timing.telegraf.avg
   - type: Not assigned
test.timing.telegraf.count
   - type: Rate, interval: 10
test.timing.telegraf.max
    - type: Not assigned 
test.timing.telegraf.median
     - type: Not assigned

Histogram

# sent via datadog
while true; do
  echo "test.histogram.datadog:100|ms" | nc -w 1 -u localhost 8125
  sleep 1;
done

## sent via telegraf

while true; do
  echo "test.histogram.telegraf:100|ms|#region:us-west-2" | nc -w 1 -u localhost 8125
  sleep 1;
done

Submitted to Datadog agent

test.histogram.datadog.95percentile
  - type: Gauge, interval: 10
test.histogram.datadog.avg
   - type: Gauge, interval: 10
test.histogram.datadog.count
   - type: Rate, interval: 10
test.histogram.datadog.max
   - type: Gauge, interval: 10
test.histogram.datadog.median
   - type: Gauge, interval: 10

Submitted to telegraf fork

test.histogram.telegraf.95percentile
  - type: Not assigned
test.histogram.telegraf.avg
   - type: Not assigned
test.histogram.telegraf.count
   - type: Rate, interval: 10
test.histogram.telegraf.max
    - type: Not assigned 
test.histogram.telegraf.median
     - type: Not assigned

Histogram status quo in 1.31.2

test.histogram.telegraf.original.95percentile
  - type: Not assigned
test.histogram.telegraf.original.avg
   - type: Not assigned
test.histogram.telegraf.original.count
    - type: Not assigned
test.histogram.telegraf.original.max
    - type: Not assigned 
test.histogram.telegraf.original.median
     - type: Not assigned

Questions

Should timings/histograms also submit the same metric type and interval as Datadog (gauge, 10) when should_rate_counts is enabled?

Other points

I extracted most of Write into convertToDatadogMetric so that I have a way to add unit tests
I wonder if I need a rate_interval config, and whether I can just use the interval defined at the root level
- My thinking was to make it as configurable as possible so that the user can override it
I have only tested inputs.statsd, since I assume this to be the most common approach for inputting metrics that supports datadog's dogstatsd library
- inputs.statsd creates timing and histograms as telegraf.Untyped

Checklist

No AI generated code was used in this PR

Related issues

resolves #13561

powersj

Hi,

Thanks for the PR! I like the approach of making this opt-in. I do wonder if we can avoid two options and instead enable this option when the interval is non-zero?

Should timings/histograms also submit the same metric type and interval as Datadog (gauge, 10) when should_rate_counts is enabled?

This is a good question. I was wondering if I enable the rate interval, and I didn't want a rate, but wanted the count for something. Or does that use-case even make sense?

Some initial high-level comments below as well.

powersj · 2024-08-05T18:28:04Z

plugins/outputs/datadog/README.md

+  ## Convert counts to rates
+  ## Use this to be able to submit metrics from telegraf alongside Datadog agent
+  # should_rate_counts = true
+
+  ## When should_rate_counts is enabled, this overrides the
+  ## default (10s) rate interval used to divide count metrics by
+  # rate_interval = 20


The example config and readme should show the default values. In this case false and 10s. That way you don't have to document the default in the help text.

I do think we should get rid of should_rate_counts, and use a non-zero rate_interval, or like we do in stackdriver output, we have a metric_counter = [] which means any metric matching that name will be marked as a counter. Should we do the same here and mark any metric matching that name as a rate?

If not getting rid of it, I would rename it to convert_count_to_rate or something. Using "should" in a config setting looks weird to me.

Appreciate the feedback.

we have a metric_counter = [] which means any metric matching that name will be marked as a counter. Should we do the same here and mark any metric matching that name as a rate?

It's a good call out, but that would not scale in larger deployments such as the one I am working with that has 1000s of metrics.

I am happy with exploring having rate_interval default to 0, and if it is > 0 then enable conversion of counts to rates. My concern would be if users would know to set it to 10 - which is the default rate interval used by the Datadog agent - I can explicitly state this in the README though.

If it were to stay then I have no objections to changing should_rate_counts to convert_count_to_rate.

Let me know your thoughts 👍🏻

I'd still like to see the single option and you can document the suggested interval would be 10 seconds as it is default used by Datadog. Sound good?

SGTM - I've made the change to use single option in c808463. I also added a test in there for the status quo when rate_interval=0 👍🏻

powersj · 2024-08-05T18:35:46Z

plugins/outputs/datadog/datadog.go

@@ -220,6 +238,20 @@ func verifyValue(v interface{}) bool {
 	return true
 }

+func isRateable(metricType string, fieldName string) bool {


We have the use of "metricType" in a couple places, can you add a comment to this function that explains or specifies that we are looking for metric type from statsd tag for specific cases?

I think explicitly naming the variable statsDMetricType would help too.

// Retrieve the metric_type tag created by inputs.statsd statsDMetricType, _ := m.GetTag("metric_type") // ... if d.ShouldRateCounts && isRateable(statsDMetricType, fieldName) { // ... func isRateable(statsDMetricType string, fieldName string) bool { switch statsDMetricType { case "counter": return true case "timing", "histogram": return fieldName == "count" default: return false } }

I can then maybe put some better documentation in the README to explicitly say only inputs.statsd is supported. How does that sound?

That looks great and makes it much clearer, thank you!

Done in c808463

plugins/outputs/datadog/README.md

Hipska · 2024-08-06T08:24:04Z

plugins/outputs/datadog/README.md

+  ## Convert counts to rates
+  ## Use this to be able to submit metrics from telegraf alongside Datadog agent
+  # should_rate_counts = true
+
+  ## When should_rate_counts is enabled, this overrides the
+  ## default (10s) rate interval used to divide count metrics by
+  # rate_interval = 20


If not getting rid of it, I would rename it to convert_count_to_rate or something. Using "should" in a config setting looks weird to me.

plugins/outputs/datadog/datadog_test.go

jdheyburn · 2024-08-06T13:03:44Z

@powersj

Should timings/histograms also submit the same metric type and interval as Datadog (gauge, 10) when should_rate_counts is enabled?

I am adverse to changing the status quo too much - I believe this is purely cosmetic on the Datadog UI anyway (see my screenshots earlier) and has little impact on the outcome.

I was wondering if I enable the rate interval, and I didn't want a rate, but wanted the count for something. Or does that use-case even make sense?

Would you mind expanding on this a bit more as I don't think I've understood the scenario.

plugins/outputs/datadog/datadog.go

jdheyburn · 2024-08-07T11:05:46Z

Note that once reviewers are happy I'll rebase and squash commits to make the checks pass.

Hipska · 2024-08-07T11:08:32Z

The checks aren't passing because the files aren't correctly formatted. Run make fmt to fix this.

Hipska · 2024-08-07T13:11:56Z

plugins/outputs/datadog/datadog.go

-						dogM[1] = dogM[1] / float64(interval)
+						rateIntervalSeconds := time.Duration(d.RateInterval).Seconds()
+						interval = int64(rateIntervalSeconds)
+						dogM[1] = dogM[1] / float64(rateIntervalSeconds)


Isn't rateIntervalSeconds already a float64? https://pkg.go.dev/time#Duration.Seconds

Suggested change

dogM[1] = dogM[1] / float64(rateIntervalSeconds)

dogM[1] = dogM[1] / rateIntervalSeconds

@Hipska 🤦🏻 Yes that's correct thank you, I've updated that now.

Hipska · 2024-08-07T13:13:43Z

PS; You can mark the conversations as resolved when applying the suggestions. (I cannot do this)

jdheyburn · 2024-08-07T13:17:05Z

PS; You can mark the conversations as resolved when applying the suggestions. (I cannot do this)

Thanks - I didn't want to pre-emptively resolve conversations that are not resolved :)

telegraf-tiger · 2024-08-07T13:41:38Z

Download PR build artifacts for linux_amd64.tar.gz, darwin_arm64.tar.gz, and windows_amd64.zip.
Downloads for additional architectures and packages are available below.

☺️ This pull request doesn't significantly change the Telegraf binary size (less than 1%)

📦 Click here to get additional PR build artifacts

Artifact URLs

DEB	RPM	TAR GZ	ZIP
amd64.deb	aarch64.rpm	darwin_amd64.tar.gz	windows_amd64.zip
arm64.deb	armel.rpm	darwin_arm64.tar.gz	windows_arm64.zip
armel.deb	armv6hl.rpm	freebsd_amd64.tar.gz	windows_i386.zip
armhf.deb	i386.rpm	freebsd_armv7.tar.gz
i386.deb	ppc64le.rpm	freebsd_i386.tar.gz
mips.deb	riscv64.rpm	linux_amd64.tar.gz
mipsel.deb	s390x.rpm	linux_arm64.tar.gz
ppc64el.deb	x86_64.rpm	linux_armel.tar.gz
riscv64.deb		linux_armhf.tar.gz
s390x.deb		linux_i386.tar.gz
		linux_mips.tar.gz
		linux_mipsel.tar.gz
		linux_ppc64le.tar.gz
		linux_riscv64.tar.gz
		linux_s390x.tar.gz

powersj

Thanks for working on this!

srebhan

Thanks @jdheyburn for your contribution!

jdheyburn · 2024-08-07T15:14:15Z

Thanks @Hipska and @powersj for your fast feedback - it makes contributing a lot easier!

telegraf-tiger bot added feat Improvement on an existing feature such as adding a new setting/mode to an existing plugin plugin/output 1. Request for new output plugins 2. Issues/PRs that are related to out plugins labels Aug 5, 2024

jdheyburn force-pushed the jdheyburn/outputs.datadog/support-counts-as-rate branch 2 times, most recently from d4d86ed to 7051a9c Compare August 5, 2024 12:41

feat(outputs.datadog): Add support for submitting alongside dd-agent

8d0f0d9

jdheyburn force-pushed the jdheyburn/outputs.datadog/support-counts-as-rate branch from 7051a9c to 8d0f0d9 Compare August 5, 2024 14:22

jdheyburn marked this pull request as ready for review August 5, 2024 14:22

powersj reviewed Aug 5, 2024

View reviewed changes

powersj self-assigned this Aug 5, 2024

Hipska suggested changes Aug 6, 2024

View reviewed changes

jdheyburn added 3 commits August 6, 2024 13:40

Run tests in their own namespace

00aaf14

Refactor tests

57ab41e

State requirements of rating counts

ab1fa87

Use non-zero rate_interval to enable rate logic

c808463

jdheyburn requested review from Hipska and powersj August 7, 2024 09:50

Hipska reviewed Aug 7, 2024

View reviewed changes

plugins/outputs/datadog/datadog.go Outdated Show resolved Hide resolved

plugins/outputs/datadog/datadog.go Outdated Show resolved Hide resolved

jdheyburn added 3 commits August 7, 2024 11:53

Switch RateInterval to config.Duration

d4e1381

Update docs

abe5452

Remove redundant metricCounter

9954a5f

jdheyburn added 2 commits August 7, 2024 13:44

Remove redundant metricCounter

4e217d1

Switch RateInterval to config.Duration

9bda584

Hipska reviewed Aug 7, 2024

View reviewed changes

Switch RateInterval to config.Duration

f569e8e

Hipska approved these changes Aug 7, 2024

View reviewed changes

powersj approved these changes Aug 7, 2024

View reviewed changes

powersj added the ready for final review This pull request has been reviewed and/or tested by multiple users and is ready for a final review. label Aug 7, 2024

powersj assigned srebhan and DStrand1 and unassigned powersj Aug 7, 2024

srebhan approved these changes Aug 7, 2024

View reviewed changes

srebhan added the area/statsd label Aug 7, 2024

srebhan merged commit 66a042f into influxdata:master Aug 7, 2024
27 checks passed

github-actions bot added this to the v1.32.0 milestone Aug 7, 2024

jdheyburn mentioned this pull request Sep 3, 2024

datadogexporter should have an option to submit counter metric as rate to match datadog-agent open-telemetry/opentelemetry-collector-contrib#17483

Open

BrewTestBot mentioned this pull request Sep 9, 2024

telegraf 1.32.0 Homebrew/homebrew-core#184076

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(outputs.datadog): Add support for submitting alongside dd-agent #15702

feat(outputs.datadog): Add support for submitting alongside dd-agent #15702

jdheyburn commented Aug 5, 2024 •

edited

Loading

powersj left a comment

powersj Aug 5, 2024

Hipska Aug 6, 2024

jdheyburn Aug 6, 2024

powersj Aug 6, 2024

jdheyburn Aug 7, 2024 •

edited

Loading

powersj Aug 5, 2024

jdheyburn Aug 6, 2024 •

edited

Loading

powersj Aug 6, 2024

jdheyburn Aug 7, 2024

Hipska Aug 6, 2024

jdheyburn commented Aug 6, 2024

jdheyburn commented Aug 7, 2024

Hipska commented Aug 7, 2024

Hipska Aug 7, 2024

jdheyburn Aug 7, 2024

Hipska commented Aug 7, 2024

jdheyburn commented Aug 7, 2024 •

edited

Loading

telegraf-tiger bot commented Aug 7, 2024

Artifact URLs

powersj left a comment

srebhan left a comment

jdheyburn commented Aug 7, 2024

	dogM[1] = dogM[1] / float64(rateIntervalSeconds)
	dogM[1] = dogM[1] / rateIntervalSeconds

feat(outputs.datadog): Add support for submitting alongside dd-agent #15702

feat(outputs.datadog): Add support for submitting alongside dd-agent #15702

Conversation

jdheyburn commented Aug 5, 2024 • edited Loading

Summary

What does the Datadog agent do?

Testing

Count

Timings

Histogram

Questions

Other points

Checklist

Related issues

powersj left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

jdheyburn Aug 7, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

jdheyburn Aug 6, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

jdheyburn commented Aug 6, 2024

jdheyburn commented Aug 7, 2024

Hipska commented Aug 7, 2024

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Hipska commented Aug 7, 2024

jdheyburn commented Aug 7, 2024 • edited Loading

telegraf-tiger bot commented Aug 7, 2024

Artifact URLs

powersj left a comment

Choose a reason for hiding this comment

srebhan left a comment

Choose a reason for hiding this comment

jdheyburn commented Aug 7, 2024

jdheyburn commented Aug 5, 2024 •

edited

Loading

jdheyburn Aug 7, 2024 •

edited

Loading

jdheyburn Aug 6, 2024 •

edited

Loading

jdheyburn commented Aug 7, 2024 •

edited

Loading