Fixing consul with multiple health checks per service #1994
Conversation
When a service has more than one check, sending data for both would overwrite each other, resulting in only one check being written (the last one). Adding check_id as a tag ensures we will get info for all unique checks per service.
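To see why the old behavior lost data: InfluxDB identifies a point by its measurement, tag set, and timestamp; fields are not part of that identity, so two checks that share those three keys overwrite one another. A minimal, self-contained Go sketch (the keying logic and names are illustrative, not the plugin's actual code) showing how promoting check_id from a field to a tag yields distinct series keys:

```go
package main

import "fmt"

// seriesKey mimics an InfluxDB point identity: measurement + tag set + timestamp.
// Fields are deliberately excluded, which is why two checks with identical tags
// and timestamp collapse into a single stored point (last write wins).
func seriesKey(measurement string, tags map[string]string, ts int64) string {
	key := fmt.Sprintf("%s,service_name=%s", measurement, tags["service_name"])
	if id, ok := tags["check_id"]; ok {
		key += ",check_id=" + id
	}
	return fmt.Sprintf("%s @%d", key, ts)
}

func main() {
	store := map[string]string{} // series key -> last written field value

	// Before the fix: check_id was a field, so both checks share one key.
	store[seriesKey("consul_health_checks", map[string]string{"service_name": "web"}, 1000)] = "check-a passing"
	store[seriesKey("consul_health_checks", map[string]string{"service_name": "web"}, 1000)] = "check-b critical"
	fmt.Println("without check_id tag:", len(store), "point(s)") // 1 point survives

	// After the fix: check_id is a tag, so each check gets its own series key.
	store = map[string]string{}
	store[seriesKey("consul_health_checks", map[string]string{"service_name": "web", "check_id": "check-a"}, 1000)] = "passing"
	store[seriesKey("consul_health_checks", map[string]string{"service_name": "web", "check_id": "check-b"}, 1000)] = "critical"
	fmt.Println("with check_id tag:", len(store), "point(s)") // both points survive
}
```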
this will fix #1825?
@sparrc Yes. From my testing it seems to solve this issue. It would be nice if the issue author could verify this.
@harnash I'm concerned that adding check_id as a tag is going to lead to cardinality problems. Can you comment on how many unique values check_id can have? Please try to provide links to documentation which explain what check_id is and how consul determines new values to use.
@sparrc Regarding this and our discussion around the mesos plugin, I think we can't avoid those series becoming quite large, and my initial approach to dealing with them was not optimal. In the end we decided to keep all telegraf service metrics in a short DBRP (i.e. 24 hours), run kapacitor against them, and use Continuous Queries to aggregate them (significantly reducing the time series count) and store the results in a longer DBRP. We also talked with other companies that approach this in a similar way, so we will give it a try soon.
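The retention-plus-downsampling setup described above can be sketched in InfluxQL. This is a hypothetical fragment, not the poster's actual configuration; the database, policy, and measurement names are illustrative:

```
-- short retention policy holding raw telegraf data for 24 hours
CREATE RETENTION POLICY "short" ON "telegraf" DURATION 24h REPLICATION 1 DEFAULT

-- longer retention policy for downsampled aggregates
CREATE RETENTION POLICY "long" ON "telegraf" DURATION 52w REPLICATION 1

-- continuous query that aggregates per-check data into the long policy,
-- collapsing many raw points into one aggregate per 5-minute window
CREATE CONTINUOUS QUERY "cq_consul_checks" ON "telegraf"
BEGIN
  SELECT count("status") AS checks
  INTO "long"."consul_health_checks"
  FROM "short"."consul_health_checks"
  GROUP BY time(5m), "service_name"
END
```

Grouping only by service_name (and not check_id) in the aggregate is what reduces the series count in the long-lived policy.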
In that case, are you OK with closing this PR? If I understood correctly, this change may create measurements with large cardinality.
@sparrc I do not believe that this change will, by default, create high cardinality problems in Influx. For this to happen a user would need a large quantity of fast-changing checks and a long DBRP. We can expect the same number of series as with any other metric of services running in a cloud environment (mesos, k8s, etc.). I've checked our numbers: our production generates around 2000 series (most services have only 1 health check). With this change we would probably get around 2500-3000 series, which I think is reasonable. P.S. I'm taking care of my sick daughter, so I may respond irregularly.
OK, I'm going to merge this, but may revert if we receive reports that this has led to cardinality problems for users. |
@sparrc thank you! |
* plugins/input/consul: moved check_id from regular fields to tags. When a service has more than one check, sending data for both would overwrite each other, resulting in only one check being written (the last one). Adding check_id as a tag ensures we will get info for all unique checks per service. * plugins/inputs/consul: updated tests
Required for all PRs:
This one is related to #1825.
In the case where we have more than one check, the plugin will try to write the following points:
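The original point listing did not survive formatting; a hypothetical reconstruction in InfluxDB line protocol (measurement, tag, and field names are illustrative) of two colliding points, with check_id still a field, might look like:

```
consul_health_checks,node=node-1,service_name=web check_id="web-http",check_name="HTTP check",status="passing" 1464844800000000000
consul_health_checks,node=node-1,service_name=web check_id="web-tcp",check_name="TCP check",status="critical" 1464844800000000000
```

Both lines share the same measurement, tag set, and timestamp, so the second overwrites the first.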
This will end up as one point being written, since the timestamp and tags are the same.
I've decided to add check_id as a tag, which ensures we properly write out info for all unique checks per service.
After this change points will be written like this:
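The post-change listing was also lost; continuing the same hypothetical example, with check_id promoted to a tag the two points now have distinct series keys and both are stored:

```
consul_health_checks,node=node-1,service_name=web,check_id=web-http check_name="HTTP check",status="passing" 1464844800000000000
consul_health_checks,node=node-1,service_name=web,check_id=web-tcp check_name="TCP check",status="critical" 1464844800000000000
```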
ping: @bondido