Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix prom client output #3337

Merged
merged 1 commit into from
Oct 24, 2017
Merged

Conversation

jdoupe
Copy link
Contributor

@jdoupe jdoupe commented Oct 13, 2017

Required for all PRs:

  • [X ] Signed CLA.
  • [X ] Associated README.md updated.
  • [X ] Has appropriate unit tests.

UPDATE: After the merge of #3351, this PR now only addresses the handling of history and summary types.


This contains a couple fixes for the prometheus_client output.

  • The prometheus.http module is deprecated. Updated to use prometheus/promhttp.
  • The plugin renamed metrics by appending field names, even when that didn't make sense.
    For example, when using the prometheus input plugin, field names of "counter" and "gauge" were created (instead of just "value"), and then the metrics would have "counter" and "gauge" appended to them on the output.
  • When the prometheus input was used, the histogram and summary types were not handled gracefully. For histograms, bucket names were appended to the metrics and lost all meaning. For summarys, quantile names were appended to the metrics.

Upon de-mangling the metric names, it became apparent that when using the prometheus input in association with the prometheus output that the metrics created by the prometheus client library were in direct conflict with the same metrics from other scraped prometheus clients. (go_* and process_*). The conflict is not from the metric name itself, but the combination of the metric and the "HELP" text. The internal prometheus client library would generate the metrics with its default (and appropriate) HELP text, and then when a prometheus target was scraped and output by the prometheus client library in the plugin, it gives those same metric names a generic HELP text (because that text isn't saved anywhere in the telegraf data structures). Since the original HELP text doesn't match the generic HELP text for the same metric, the prometheus client library rejects it.

The fix presented is to disable to the collection of the internal client library metrics. They would be a bit conflated with the main telegraf process anyway, and not sure if they'd be directly meaningful.

However, I've put the unregistering of the metrics under the control of a configuration variable, which is false by default so as not to adversely affect existing installations any more than necessary.

@danielnelson
Copy link
Contributor

To prevent rejection, does the HELP string need to be the same or would it be enough if the TYPE string was equal?

@jdoupe
Copy link
Contributor Author

jdoupe commented Oct 14, 2017

@danielnelson
Copy link
Contributor

@jdoupe I used your ideas and opened an alternative pull request #3351, it does both more and less than this pull request, but is more in line with how I want to address the issue (using the existing metric value types). There is more info over on that pull request, can you take a look and tell me what you think?

@jdoupe
Copy link
Contributor Author

jdoupe commented Oct 18, 2017

I agree with your approach, and amazingly I did the same thing yesterday (make the prom input not drop the counter and gauge types) - you beat me to the punch. However, my main goal was to make the histogram and summary types "work" as they are my current problem.

@danielnelson
Copy link
Contributor

If you want you can force push your new PR here, or just close this and open a new one.

@jdoupe
Copy link
Contributor Author

jdoupe commented Oct 18, 2017

Force pushed... yup, needs some clean up. I'll try to get some more time on it later or tomorrow.

@jdoupe jdoupe force-pushed the fix-prom-client-output branch 4 times, most recently from 03c734c to 771832f Compare October 19, 2017 14:20
@jdoupe
Copy link
Contributor Author

jdoupe commented Oct 19, 2017

Rebased after merge of #3350...

accumulator.go Outdated
@@ -28,6 +28,18 @@ type Accumulator interface {
tags map[string]string,
t ...time.Time)

// AddCounter is the same as AddFields, but will add the metric as a "Counter" type
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comment needs updated

accumulator.go Outdated
tags map[string]string,
t ...time.Time)

// AddCounter is the same as AddFields, but will add the metric as a "Counter" type
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comment needs updated

@@ -226,6 +254,36 @@ func CreateSampleID(tags map[string]string) SampleID {
return SampleID(strings.Join(pairs, ","))
}

func (p *PrometheusClient) createFamily(pvt prometheus.ValueType, point telegraf.Metric, sample *Sample, mname string, sampleID SampleID) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This function doesn't always do what it says it does "createFamily", and its does a lot more. I think it probably needs to be something like func addMetricFamily(name string) MetricFamily and only do the first half and then another function to addSample(.

@@ -39,7 +45,10 @@ type MetricFamily struct {
// Samples are the Sample belonging to this MetricFamily.
Samples map[SampleID]*Sample
// Type of the Value.
ValueType prometheus.ValueType
PromValueType prometheus.ValueType
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we can remove the prometheus.ValueType then?

@@ -242,86 +300,123 @@ func (p *PrometheusClient) Write(metrics []telegraf.Metric) error {
labels[sanitize(k)] = v
}

// Prometheus doesn't have a string value type, so convert string
// fields to labels.
for fn, fv := range point.Fields() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this loop can stay, since it shouldn't interact with the different types and its theoretically possible to have a string added to a summary.

Side note: This is a symptom of how the ValueType was added later to our metrics, and how the types really should be per field not per metric. We will probably try to redo the Metrics class in the next year but for now we have to deal with it.

for fn, fv := range point.Fields() {
switch fn {
case "sum":
sum = fv.(float64)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need to guard against this because we don't have full control that it will always be true, and in the other spots where we type assert.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This one really got me. Type assertions suck, and switching on .(type) seemed to be the only solution, but as you noted, we don't have control what the type would be which would result in a pretty long set of cases. After rolling it around awhile (and not being a real in depth go guy), would doing something like strconv.ParseFloat(fmt.Sprint(fv), 64) be too silly or performance degrading? (Convert the value to a string because you don't need to know the type for that, and then convert the string to a float64. If it's not a number, presumably we've already converted it to a label.)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wouldn't do the string conversion, but you could do the two value return form: f, ok := i.(float64). I guess we need to decide what we are going to do if a field is not a float (do we just drop it?).

You could also try switching over the type instead of the field name, this might end up reading better.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The current functionality is to drop anything but an int64 or float64. (https://github.com/influxdata/telegraf/blob/master/plugins/outputs/prometheus_client/prometheus_client.go#L254)

That seems to have worked thus far, perhaps we just keep going with it until someone finds otherwise. I've gone ahead and made the summary and histogram value handling the same as the gauge and counter handling.

default:
metric, err = prometheus.NewConstMetric(desc, family.PromValueType, sample.Value, labels...)
if err != nil {
log.Printf("E! Error creating prometheus metric, "+
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could share the error handling and log message after the switch

@jdoupe
Copy link
Contributor Author

jdoupe commented Oct 24, 2017

Tests failed. Wasn't me!
Executing 'make docker-run-circle'
docker run --name aerospike -p "3000:3000" -d aerospike/aerospike-server:3.9.0
Unable to find image 'aerospike/aerospike-server:3.9.0' locally
3.9.0: Pulling from aerospike/aerospike-server

@jdoupe
Copy link
Contributor Author

jdoupe commented Oct 24, 2017

Working on fixing / adding tests...

@jdoupe jdoupe force-pushed the fix-prom-client-output branch 2 times, most recently from 64c7f7b to cad142c Compare October 24, 2017 19:30
@jdoupe
Copy link
Contributor Author

jdoupe commented Oct 24, 2017

Ready for review once again!

@danielnelson danielnelson added this to the 1.5.0 milestone Oct 24, 2017
@danielnelson danielnelson merged commit a6797a4 into influxdata:master Oct 24, 2017
@jdoupe jdoupe deleted the fix-prom-client-output branch October 25, 2017 01:22
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants