
Add the histogram bucket bridge #3937

Merged: 16 commits, Aug 23, 2019

Conversation

@mfpierre (Contributor) commented Jul 25, 2019

What does this PR do?

  • Add the histogram bucket bridge
  • Add sketch support to the check sampler

Motivation

Send Prometheus/OpenMetrics histograms from checks so they can be submitted as distribution metrics.

Additional Notes

Performance on the vanilla linear interpolation implementation:

BenchmarkAddBucket1-4          	 1000000	      1394 ns/op	     850 B/op	      10 allocs/op
BenchmarkAddBucket100-4        	  100000	     11483 ns/op	    1217 B/op	      12 allocs/op
BenchmarkAddBucket10000-4      	    2000	    702976 ns/op	   42537 B/op	     224 allocs/op
BenchmarkAddBucket1000000-4    	      20	  66036465 ns/op	 4174570 B/op	   21495 allocs/op
BenchmarkAddBucket10000000-4   	       2	 679800800 ns/op	41738508 B/op	  214864 allocs/op

With an interpolation granularity of 1000 using insertN:

BenchmarkAddBucket1-4          	 1000000	      1652 ns/op	     874 B/op	      11 allocs/op
BenchmarkAddBucket10-4         	 1000000	      2461 ns/op	     909 B/op	      11 allocs/op
BenchmarkAddBucket100-4        	  100000	     12185 ns/op	    1241 B/op	      13 allocs/op
BenchmarkAddBucket1000-4       	   20000	     98839 ns/op	    4997 B/op	      32 allocs/op
BenchmarkAddBucket10000-4      	    5000	    397713 ns/op	   81300 B/op	     241 allocs/op
BenchmarkAddBucket1000000-4    	      50	  25824508 ns/op	 4185665 B/op	   12011 allocs/op
BenchmarkAddBucket10000000-4   	       5	 283451953 ns/op	96373139 B/op	   21017 allocs/op

This speeds up the Summary computation, but the sparseStore still has values inserted one by one.
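The linear interpolation being benchmarked above can be pictured with a small sketch: given a bucket's bounds and count, synthesize evenly spaced values to feed into the sketch. The function name and shape here are illustrative only, not the agent's actual code.

```go
package main

import "fmt"

// interpolateBucket spreads `count` synthetic samples evenly across
// [lower, upper], placing each sample at the midpoint of its sub-interval.
// Illustrative sketch of the linear-interpolation idea, not the agent's API.
func interpolateBucket(lower, upper float64, count int) []float64 {
	if count <= 0 {
		return nil
	}
	samples := make([]float64, 0, count)
	step := (upper - lower) / float64(count)
	for i := 0; i < count; i++ {
		samples = append(samples, lower+step*(float64(i)+0.5))
	}
	return samples
}

func main() {
	fmt.Println(interpolateBucket(0, 10, 4)) // prints [1.25 3.75 6.25 8.75]
}
```

The "granularity of 1000" variant above caps how many synthetic points get generated per bucket and relies on InsertN to insert each point with a weight, which is why its allocations scale much better at high counts.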

Integrations-core PR: DataDog/integrations-core#4321

@mfpierre mfpierre added this to the 6.14.0 milestone Jul 25, 2019
@mfpierre mfpierre force-pushed the mfpierre/histogram-bucket-bridge branch 3 times, most recently from 6277c80 to eebc57c Compare July 26, 2019 11:57
@mfpierre mfpierre force-pushed the mfpierre/histogram-bucket-bridge branch 2 times, most recently from 35b45d8 to d4ae324 Compare August 9, 2019 09:53

// simple linear interpolation, TODO: optimize
if math.IsInf(bucket.UpperBound, 1) {
// Arbitrarily double the lower bucket value for interpolation over infinity bucket
Contributor:
This is likely going to require some documentation, is there any guideline or standard practice around this in the community?

Contributor Author @mfpierre, Aug 16, 2019:

I wasn't able to find any guideline beyond what is currently done in the histogram_quantile function, which requires the infinity bucket but returns the closest bucket if the quantile falls into it:

if the quantile falls into the highest bucket, the upper bound of the 2nd highest bucket is returned

ref https://github.com/prometheus/prometheus/blob/41151ca8dc069448515f48893b8631b9a3ad8df8/promql/quantile.go#L49-L70

Contributor:

Ah. That's confusing and unfortunate, but at least it's deterministic. I'll never understand why they don't just add a max metric. But anyway, for now we should mimic this, I guess?

Contributor Author:

@jbarciauskas I'm not sure how to emulate this behavior with the sketch. Should we not interpolate anything for the infinity bucket? (That would mess up the summary.) Or interpolate everything close to or at "the upper bound of the 2nd highest"? (Close to what I'm currently doing.)

Contributor:

Yeah, I mean report the count of values in the infinity bucket as the last defined bucket value, essentially as if they were all = max. Either choice (2nd to last bucket = max or 2*2nd to last bucket = max) is synthesizing data that doesn't exist. The first one means that query results should be similar at least.

Contributor Author:

Just changed the logic to insert everything at the lower_bound of the infinity bucket.


Choosing the lower_bound makes sense to me, but this needs to be very loudly documented since max is ~always going to be wrong in fairly surprising ways.

@mfpierre mfpierre force-pushed the mfpierre/histogram-bucket-bridge branch from 767da5f to 6121adf Compare August 16, 2019 15:26
@mfpierre mfpierre force-pushed the mfpierre/histogram-bucket-bridge branch from 6121adf to b39d8ad Compare August 16, 2019 16:29
@mfpierre mfpierre marked this pull request as ready for review August 16, 2019 17:36
@mfpierre mfpierre requested review from a team as code owners August 16, 2019 17:36
@mfpierre mfpierre requested a review from a team August 16, 2019 17:36
@mfpierre mfpierre force-pushed the mfpierre/histogram-bucket-bridge branch from b39d8ad to 42789bd Compare August 16, 2019 17:55
Name: ctx.Name,
Tags: ctx.Tags,
Host: ctx.Host,
// Interval: TODO: investigate
Contributor Author:

@jbarciauskas I'm not sure how to handle this field in the context of checks?

Contributor:

Do checks run at a standard interval? I think it's only important for counts

Contributor Author:

@jbarciauskas they usually do (every 15 seconds), but a custom check interval can be defined per check.

codecov bot commented Aug 20, 2019

Codecov Report

❗ No coverage uploaded for pull request base (master@a654ff7).
The diff coverage is 70.3%.


@@            Coverage Diff            @@
##             master    #3937   +/-   ##
=========================================
  Coverage          ?   52.11%           
=========================================
  Files             ?      631           
  Lines             ?    45673           
  Branches          ?        0           
=========================================
  Hits              ?    23801           
  Misses            ?    20478           
  Partials          ?     1394
Flag Coverage Δ
#linux 56.62% <71.62%> (?)
#windows 52.51% <70.3%> (?)
Impacted Files Coverage Δ
pkg/collector/python/init.go 3.87% <ø> (ø)
pkg/quantile/agent.go 72.22% <0%> (ø)
pkg/aggregator/mocksender/asserts.go 75.86% <0%> (ø)
pkg/metrics/histogram_bucket.go 0% <0%> (ø)
pkg/aggregator/mocksender/mocked_methods.go 13.79% <0%> (ø)
pkg/aggregator/mocksender/mocksender.go 95% <100%> (ø)
pkg/aggregator/context_resolver.go 92.3% <100%> (ø)
pkg/collector/python/test_aggregator.go 100% <100%> (ø)
pkg/aggregator/aggregator.go 63.59% <15%> (ø)
pkg/metrics/metric_sample.go 17.64% <40%> (ø)
... and 4 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update a654ff7...1b5b872.

release.json (outdated, resolved)
Member @hkaj left a comment:

Reviewed everything except the rt_loader part. Could you ask agent-core to have a look? I'm not sure my review would be useful there. Also interested in your opinion on insertN, @jbarciauskas, or someone else familiar with the sketch implementation.

pkg/aggregator/check_sampler.go (outdated, resolved)
pkg/aggregator/check_sampler.go (outdated, resolved)
pkg/aggregator/sender.go (resolved)
pkg/collector/python/init.go (outdated, resolved)
a.Sketch.Basic.InsertN(v, n)

for i := 0; i < int(n); i++ {
a.Buf = append(a.Buf, agentConfig.key(v))
Contributor Author:

This part would benefit from a bulk insert in the sparse store:

func (s *sparseStore) insert(c *Config, keys []Key) {
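A bulk insert along those lines could group duplicate keys into (key, count) runs, so the store merges each run once instead of appending key by key. A rough sketch under assumed types; keyCount and groupKeys are illustrative stand-ins, not the sparseStore API, and plain ints stand in for the sketch's Key type.

```go
package main

import (
	"fmt"
	"sort"
)

// keyCount is a (key, occurrence count) pair, a stand-in for whatever
// run representation a real bulk insert would use.
type keyCount struct {
	Key   int
	Count int
}

// groupKeys sorts the keys and collapses them into runs of equal keys,
// so each distinct key is handed to the store once with its multiplicity.
// Illustrative sketch of the bulk-insert idea, not the agent's sparseStore.
func groupKeys(keys []int) []keyCount {
	if len(keys) == 0 {
		return nil
	}
	sort.Ints(keys)
	out := []keyCount{{Key: keys[0], Count: 1}}
	for _, k := range keys[1:] {
		if last := &out[len(out)-1]; last.Key == k {
			last.Count++
		} else {
			out = append(out, keyCount{Key: k, Count: 1})
		}
	}
	return out
}

func main() {
	fmt.Println(groupKeys([]int{3, 1, 3, 3, 2})) // prints [{1 1} {2 1} {3 3}]
}
```

Since interpolated bucket samples produce long runs of identical keys, run-length grouping like this is where most of the one-by-one insertion cost could be recovered.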

Member @arbll left a comment:

Looks good, only a few nits here and there

pkg/aggregator/check_sampler.go (outdated, resolved)
rtloader/test/aggregator/aggregator.go (resolved)
---
features:
- |
[preview] Checks can now send histogram buckets to the agent to be sent as distribution metrics.
Member:

Will this feature get enabled by default for people using wildcards in Prometheus checks?

Contributor Author:

No, it'll be disabled by default, so nothing should change unless you enable the send_distribution_buckets flag on purpose; see DataDog/integrations-core#4321.

Contributor @jbarciauskas left a comment:

Summarizing feedback from @Daniel-B-Smith and me:

- The efficiency is poor and scales unintuitively (with counts), but substantial improvements are at least 1-2 weeks of effort.
- There are potential accuracy improvements that will come along with that, but they will be incremental.

If we're OK with those caveats, we're fine moving ahead with this and iterating on it.

5 participants