-
Notifications
You must be signed in to change notification settings - Fork 5.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat(outputs.datadog): Add support for submitting alongside dd-agent #15702
feat(outputs.datadog): Add support for submitting alongside dd-agent #15702
Conversation
d4d86ed
to
7051a9c
Compare
7051a9c
to
8d0f0d9
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi,
Thanks for the PR! I like the approach of making this opt-in. I do wonder if we can avoid two options and instead enable this option when the interval is non-zero?
Should timings/histograms also submit the same metric type and interval as Datadog (gauge, 10) when should_rate_counts is enabled?
This is a good question. I was wondering if I enable the rate interval, and I didn't want a rate, but wanted the count for something. Or does that use-case even make sense?
Some initial high-level comments below as well.
plugins/outputs/datadog/README.md
Outdated
## Convert counts to rates | ||
## Use this to be able to submit metrics from telegraf alongside Datadog agent | ||
# should_rate_counts = true | ||
|
||
## When should_rate_counts is enabled, this overrides the | ||
## default (10s) rate interval used to divide count metrics by | ||
# rate_interval = 20 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The example config and readme should show the default values. In this case false
and 10s
. That way you don't have to document the default in the help text.
I do think we should get rid of should_rate_counts
, and use a non-zero rate_interval, or like we do in stackdriver output, we have a metric_counter = []
which means any metric matching that name will be marked as a counter. Should we do the same here and mark any metric matching that name as a rate?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If not getting rid of it, I would rename it to convert_count_to_rate
or something. Using "should" in a config setting looks weird to me.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Appreciate the feedback.
we have a metric_counter = [] which means any metric matching that name will be marked as a counter. Should we do the same here and mark any metric matching that name as a rate?
It's a good call out, but that would not scale in larger deployments such as the one I am working with that has 1000s of metrics.
I am happy with exploring having rate_interval
default to 0, and if it is > 0 then enable conversion of counts to rates. My concern would be if users would know to set it to 10 - which is the default rate interval used by the Datadog agent - I can explicitly state this in the README though.
If it were to stay then I have no objections to changing should_rate_counts
to convert_count_to_rate
.
Let me know your thoughts 👍🏻
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd still like to see the single option and you can document the suggested interval would be 10 seconds as it is default used by Datadog. Sound good?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
SGTM - I've made the change to use single option in c808463. I also added a test in there for the status quo when rate_interval=0
👍🏻
plugins/outputs/datadog/datadog.go
Outdated
@@ -220,6 +238,20 @@ func verifyValue(v interface{}) bool { | |||
return true | |||
} | |||
|
|||
func isRateable(metricType string, fieldName string) bool { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We have the use of "metricType" in a couple places, can you add a comment to this function that explains or specifies that we are looking for metric type from statsd tag for specific cases?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think explicitly naming the variable statsDMetricType
would help too.
// Retrieve the metric_type tag created by inputs.statsd
statsDMetricType, _ := m.GetTag("metric_type")
// ...
if d.ShouldRateCounts && isRateable(statsDMetricType, fieldName) {
// ...
func isRateable(statsDMetricType string, fieldName string) bool {
switch statsDMetricType {
case
"counter":
return true
case
"timing",
"histogram":
return fieldName == "count"
default:
return false
}
}
I can then maybe put some better documentation in the README to explicitly say only inputs.statsd
is supported. How does that sound?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That looks great and makes it much clearer, thank you!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done in c808463
plugins/outputs/datadog/README.md
Outdated
## Convert counts to rates | ||
## Use this to be able to submit metrics from telegraf alongside Datadog agent | ||
# should_rate_counts = true | ||
|
||
## When should_rate_counts is enabled, this overrides the | ||
## default (10s) rate interval used to divide count metrics by | ||
# rate_interval = 20 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If not getting rid of it, I would rename it to convert_count_to_rate
or something. Using "should" in a config setting looks weird to me.
I am adverse to changing the status quo too much - I believe this is purely cosmetic on the Datadog UI anyway (see my screenshots earlier) and has little impact on the outcome.
Would you mind expanding on this a bit more as I don't think I've understood the scenario. |
Note that once reviewers are happy I'll rebase and squash commits to make the checks pass. |
The checks aren't passing because the files aren't correctly formatted. Run |
plugins/outputs/datadog/datadog.go
Outdated
dogM[1] = dogM[1] / float64(interval) | ||
rateIntervalSeconds := time.Duration(d.RateInterval).Seconds() | ||
interval = int64(rateIntervalSeconds) | ||
dogM[1] = dogM[1] / float64(rateIntervalSeconds) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Isn't rateIntervalSeconds
already a float64? https://pkg.go.dev/time#Duration.Seconds
dogM[1] = dogM[1] / float64(rateIntervalSeconds) | |
dogM[1] = dogM[1] / rateIntervalSeconds |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@Hipska 🤦🏻 Yes that's correct thank you, I've updated that now.
PS; You can mark the conversations as resolved when applying the suggestions. (I cannot do this) |
Thanks - I didn't want to pre-emptively resolve conversations that are not resolved :) |
Download PR build artifacts for linux_amd64.tar.gz, darwin_arm64.tar.gz, and windows_amd64.zip. 📦 Click here to get additional PR build artifactsArtifact URLs |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for working on this!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @jdheyburn for your contribution!
Summary
Telegraf's outputs.datadog plugin is incompatible with submitting statsd metrics alongside the Datadog agent, as the dd-agent submits counter metrics as rates instead. Telegraf does not do this in order to reflect raw data more accurately. See #10979 for that change.
This PR enables a new config
should_rate_counts
which will convert count metrics to rates via the definedrate_interval
(defaults to 10s - the same default as the Datadog agent) if those metrics are eligible. See below for eligibility.What does the Datadog agent do?
Histograms and Timings are synonymous.
Testing
I fired some test StatsD UDP packets to all these endpoints:
Count
Submitted to datadog agent
Submitted to telegraf 1.32.0 (including this PR)
Timings
Submitted to Datadog agent
Submitted to telegraf 1.32.0 (including this PR)
Histogram
Submitted to Datadog agent
Submitted to telegraf fork
Histogram status quo in 1.31.2
Questions
should_rate_counts
is enabled?Other points
Write
intoconvertToDatadogMetric
so that I have a way to add unit testsrate_interval
config, and whether I can just use the interval defined at the root levelinputs.statsd
, since I assume this to be the most common approach for inputting metrics that supports datadog's dogstatsd librarytelegraf.Untyped
Checklist
Related issues
resolves #13561