Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(outputs.datadog): Add support for submitting alongside dd-agent #15702

Conversation

jdheyburn
Copy link
Contributor

@jdheyburn jdheyburn commented Aug 5, 2024

Summary

Telegraf's outputs.datadog plugin is incompatible with submitting statsd metrics alongside the Datadog agent, as the dd-agent submits counter metrics as rates instead. Telegraf does not do this in order to reflect raw data more accurately. See #10979 for that change.

This PR enables a new config should_rate_counts which will convert count metrics to rates via the defined rate_interval (defaults to 10s - the same default as the Datadog agent) if those metrics are eligible. See below for eligibility.

What does the Datadog agent do?

  • Counts
  • Histogram
    • Source
    • Logic
    • values
      • avg
        • gauge
      • count
        • rate
      • median
        • gauge
      • 95percentile
        • gauge
      • max
        • gauge

Histograms and Timings are synonymous.

Testing

I fired some test StatsD UDP packets to all these endpoints:

  • Datadog v7.48.1
  • Telegraf 1.31.2
  • Telegraf 1.32.0 (including this PR)

Count

# sent via datadog
while true; do
  echo "test.metric.datadog:1|c" | nc -w 1 -u localhost 8125
  sleep 1;
done

## sent via telegraf

while true; do
  echo "test.metric.telegraf:1|c" | nc -w 1 -u localhost 8125
  sleep 1;
done

Submitted to datadog agent

image

Submitted to telegraf 1.32.0 (including this PR)

image

Timings

# sent via datadog
while true; do
  echo "test.timing.datadog:100|ms" | nc -w 1 -u localhost 8125
  sleep 1;
done

## sent via telegraf

while true; do
  echo "test.timing.telegraf:100|ms|#region:us-west-2" | nc -w 1 -u localhost 8125
  sleep 1;
done

Submitted to Datadog agent

test.timing.datadog.95percentile
  - type: Gauge, interval: 10
test.timing.datadog.avg
   - type: Gauge, interval: 10
test.timing.datadog.count
   - type: Rate, interval: 10
test.timing.datadog.max
   - type: Gauge, interval: 10
test.timing.datadog.median
   - type: Gauge, interval: 10

image

image

Submitted to telegraf 1.32.0 (including this PR)

test.timing.telegraf.95percentile
  - type: Not assigned
test.timing.telegraf.avg
   - type: Not assigned
test.timing.telegraf.count
   - type: Rate, interval: 10
test.timing.telegraf.max
    - type: Not assigned 
test.timing.telegraf.median
     - type: Not assigned

image

image

Histogram

# sent via datadog
while true; do
  echo "test.histogram.datadog:100|ms" | nc -w 1 -u localhost 8125
  sleep 1;
done

## sent via telegraf

while true; do
  echo "test.histogram.telegraf:100|ms|#region:us-west-2" | nc -w 1 -u localhost 8125
  sleep 1;
done

Submitted to Datadog agent

test.histogram.datadog.95percentile
  - type: Gauge, interval: 10
test.histogram.datadog.avg
   - type: Gauge, interval: 10
test.histogram.datadog.count
   - type: Rate, interval: 10
test.histogram.datadog.max
   - type: Gauge, interval: 10
test.histogram.datadog.median
   - type: Gauge, interval: 10

Submitted to telegraf fork

test.histogram.telegraf.95percentile
  - type: Not assigned
test.histogram.telegraf.avg
   - type: Not assigned
test.histogram.telegraf.count
   - type: Rate, interval: 10
test.histogram.telegraf.max
    - type: Not assigned 
test.histogram.telegraf.median
     - type: Not assigned

image

image

Histogram status quo in 1.31.2

test.histogram.telegraf.original.95percentile
  - type: Not assigned
test.histogram.telegraf.original.avg
   - type: Not assigned
test.histogram.telegraf.original.count
    - type: Not assigned
test.histogram.telegraf.original.max
    - type: Not assigned 
test.histogram.telegraf.original.median
     - type: Not assigned

Questions

  • Should timings/histograms also submit the same metric type and interval as Datadog (gauge, 10) when should_rate_counts is enabled?

Other points

  • I extracted most of Write into convertToDatadogMetric so that I have a way to add unit tests
  • I wonder if I need a rate_interval config, and whether I can just use the interval defined at the root level
    • My thinking was to make it as configurable as possible so that the user can override it
  • I have only tested inputs.statsd, since I assume this to be the most common approach for inputting metrics that supports datadog's dogstatsd library
    • inputs.statsd creates timing and histograms as telegraf.Untyped

Checklist

  • No AI generated code was used in this PR

Related issues

resolves #13561

@telegraf-tiger telegraf-tiger bot added feat Improvement on an existing feature such as adding a new setting/mode to an existing plugin plugin/output 1. Request for new output plugins 2. Issues/PRs that are related to out plugins labels Aug 5, 2024
@jdheyburn jdheyburn force-pushed the jdheyburn/outputs.datadog/support-counts-as-rate branch 2 times, most recently from d4d86ed to 7051a9c Compare August 5, 2024 12:41
@jdheyburn jdheyburn force-pushed the jdheyburn/outputs.datadog/support-counts-as-rate branch from 7051a9c to 8d0f0d9 Compare August 5, 2024 14:22
@jdheyburn jdheyburn marked this pull request as ready for review August 5, 2024 14:22
Copy link
Contributor

@powersj powersj left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi,

Thanks for the PR! I like the approach of making this opt-in. I do wonder if we can avoid two options and instead enable this option when the interval is non-zero?

Should timings/histograms also submit the same metric type and interval as Datadog (gauge, 10) when should_rate_counts is enabled?

This is a good question. I was wondering if I enable the rate interval, and I didn't want a rate, but wanted the count for something. Or does that use-case even make sense?

Some initial high-level comments below as well.

Comment on lines 40 to 46
## Convert counts to rates
## Use this to be able to submit metrics from telegraf alongside Datadog agent
# should_rate_counts = true

## When should_rate_counts is enabled, this overrides the
## default (10s) rate interval used to divide count metrics by
# rate_interval = 20
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The example config and readme should show the default values. In this case false and 10s. That way you don't have to document the default in the help text.

I do think we should get rid of should_rate_counts, and use a non-zero rate_interval, or like we do in stackdriver output, we have a metric_counter = [] which means any metric matching that name will be marked as a counter. Should we do the same here and mark any metric matching that name as a rate?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If not getting rid of it, I would rename it to convert_count_to_rate or something. Using "should" in a config setting looks weird to me.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Appreciate the feedback.

we have a metric_counter = [] which means any metric matching that name will be marked as a counter. Should we do the same here and mark any metric matching that name as a rate?

It's a good call out, but that would not scale in larger deployments such as the one I am working with that has 1000s of metrics.

I am happy with exploring having rate_interval default to 0, and if it is > 0 then enable conversion of counts to rates. My concern would be if users would know to set it to 10 - which is the default rate interval used by the Datadog agent - I can explicitly state this in the README though.

If it were to stay then I have no objections to changing should_rate_counts to convert_count_to_rate.

Let me know your thoughts 👍🏻

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd still like to see the single option and you can document the suggested interval would be 10 seconds as it is default used by Datadog. Sound good?

Copy link
Contributor Author

@jdheyburn jdheyburn Aug 7, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

SGTM - I've made the change to use single option in c808463. I also added a test in there for the status quo when rate_interval=0 👍🏻

@@ -220,6 +238,20 @@ func verifyValue(v interface{}) bool {
return true
}

func isRateable(metricType string, fieldName string) bool {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We have the use of "metricType" in a couple places, can you add a comment to this function that explains or specifies that we are looking for metric type from statsd tag for specific cases?

Copy link
Contributor Author

@jdheyburn jdheyburn Aug 6, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think explicitly naming the variable statsDMetricType would help too.

// Retrieve the metric_type tag created by inputs.statsd
statsDMetricType, _ := m.GetTag("metric_type")
// ...
if d.ShouldRateCounts && isRateable(statsDMetricType, fieldName) {
// ...

func isRateable(statsDMetricType string, fieldName string) bool {
	switch statsDMetricType {
	case
		"counter":
		return true
	case
		"timing",
		"histogram":
		return fieldName == "count"
	default:
		return false
	}
}

I can then maybe put some better documentation in the README to explicitly say only inputs.statsd is supported. How does that sound?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That looks great and makes it much clearer, thank you!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done in c808463

plugins/outputs/datadog/README.md Outdated Show resolved Hide resolved
@powersj powersj self-assigned this Aug 5, 2024
Comment on lines 40 to 46
## Convert counts to rates
## Use this to be able to submit metrics from telegraf alongside Datadog agent
# should_rate_counts = true

## When should_rate_counts is enabled, this overrides the
## default (10s) rate interval used to divide count metrics by
# rate_interval = 20
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If not getting rid of it, I would rename it to convert_count_to_rate or something. Using "should" in a config setting looks weird to me.

plugins/outputs/datadog/datadog_test.go Show resolved Hide resolved
plugins/outputs/datadog/datadog_test.go Outdated Show resolved Hide resolved
plugins/outputs/datadog/datadog_test.go Outdated Show resolved Hide resolved
@jdheyburn
Copy link
Contributor Author

@powersj

Should timings/histograms also submit the same metric type and interval as Datadog (gauge, 10) when should_rate_counts is enabled?

I am adverse to changing the status quo too much - I believe this is purely cosmetic on the Datadog UI anyway (see my screenshots earlier) and has little impact on the outcome.

I was wondering if I enable the rate interval, and I didn't want a rate, but wanted the count for something. Or does that use-case even make sense?

Would you mind expanding on this a bit more as I don't think I've understood the scenario.

plugins/outputs/datadog/datadog.go Outdated Show resolved Hide resolved
plugins/outputs/datadog/datadog.go Outdated Show resolved Hide resolved
@jdheyburn
Copy link
Contributor Author

Note that once reviewers are happy I'll rebase and squash commits to make the checks pass.

@Hipska
Copy link
Contributor

Hipska commented Aug 7, 2024

The checks aren't passing because the files aren't correctly formatted. Run make fmt to fix this.

dogM[1] = dogM[1] / float64(interval)
rateIntervalSeconds := time.Duration(d.RateInterval).Seconds()
interval = int64(rateIntervalSeconds)
dogM[1] = dogM[1] / float64(rateIntervalSeconds)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't rateIntervalSeconds already a float64? https://pkg.go.dev/time#Duration.Seconds

Suggested change
dogM[1] = dogM[1] / float64(rateIntervalSeconds)
dogM[1] = dogM[1] / rateIntervalSeconds

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Hipska 🤦🏻 Yes that's correct thank you, I've updated that now.

@Hipska
Copy link
Contributor

Hipska commented Aug 7, 2024

PS; You can mark the conversations as resolved when applying the suggestions. (I cannot do this)

@jdheyburn
Copy link
Contributor Author

jdheyburn commented Aug 7, 2024

PS; You can mark the conversations as resolved when applying the suggestions. (I cannot do this)

Thanks - I didn't want to pre-emptively resolve conversations that are not resolved :)

@telegraf-tiger
Copy link
Contributor

telegraf-tiger bot commented Aug 7, 2024

Copy link
Contributor

@powersj powersj left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for working on this!

@powersj powersj added the ready for final review This pull request has been reviewed and/or tested by multiple users and is ready for a final review. label Aug 7, 2024
@powersj powersj assigned srebhan and DStrand1 and unassigned powersj Aug 7, 2024
Copy link
Member

@srebhan srebhan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @jdheyburn for your contribution!

@srebhan srebhan merged commit 66a042f into influxdata:master Aug 7, 2024
27 checks passed
@github-actions github-actions bot added this to the v1.32.0 milestone Aug 7, 2024
@jdheyburn
Copy link
Contributor Author

Thanks @Hipska and @powersj for your fast feedback - it makes contributing a lot easier!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/statsd feat Improvement on an existing feature such as adding a new setting/mode to an existing plugin plugin/output 1. Request for new output plugins 2. Issues/PRs that are related to out plugins ready for final review This pull request has been reviewed and/or tested by multiple users and is ready for a final review.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

feat: [outputs.datadog] to support submitting metrics alongside Datadog Agents
5 participants