
Fixing consul with multiple health checks per service #1994

Merged
merged 3 commits into from Dec 20, 2016

Conversation

@harnash (Contributor) commented Nov 4, 2016

Required for all PRs:

  • CHANGELOG.md updated (we recommend not updating this until the PR has been approved by a maintainer)
  • Sign CLA (if not already signed)
  • README.md updated (if adding a new plugin)

This one is related to #1825.

In the case where we have more than one check, the plugin will try to write the following points:

consul_health_checks,host=wolfpit,node=consul-server-node check_id="serfHealth",check_name="Serf Health Status",service_id="",status="passing" 1464698464486439902
consul_health_checks,host=wolfpit,node=consul-server-node check_id="serfHealth2",check_name="Serf Health Status",service_id="",status="passing" 1464698464486439902

This will end up as only one point being written, since the timestamp and tags are the same.
I've decided to add check_id as a tag, which ensures we properly write out info for all unique checks per service.

After this change, points will be written like this:

consul_health_checks,host=wolfpit,node=consul-server-node,check_id="serfHealth" check_name="Serf Health Status",service_id="",status="passing" 1464698464486439902
consul_health_checks,host=wolfpit,node=consul-server-node,check_id="serfHealth2" check_name="Serf Health Status",service_id="",status="passing" 1464698464486439902
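The effect of moving check_id from a field to a tag can be sketched outside of telegraf. This is a hypothetical Python simulation (not telegraf or InfluxDB code) of how a time-series store that identifies a point by measurement + tag set + timestamp collapses the two checks:

```python
# Hypothetical simulation: a point is keyed by (measurement, tags, timestamp);
# a later write with the same key overwrites the earlier one.
# Field values do not contribute to the key.

def write_points(points):
    """Collapse points the way a store keyed on (measurement, tags, ts) would."""
    stored = {}
    for measurement, tags, fields, ts in points:
        key = (measurement, frozenset(tags.items()), ts)
        stored[key] = fields  # same key -> overwrite
    return stored

ts = 1464698464486439902

# Before the fix: check_id is a field, so both checks share one key.
before = write_points([
    ("consul_health_checks", {"node": "consul-server-node"},
     {"check_id": "serfHealth", "status": "passing"}, ts),
    ("consul_health_checks", {"node": "consul-server-node"},
     {"check_id": "serfHealth2", "status": "passing"}, ts),
])
print(len(before))  # 1 -- the second point overwrote the first

# After the fix: check_id is a tag, so the keys differ.
after = write_points([
    ("consul_health_checks",
     {"node": "consul-server-node", "check_id": "serfHealth"},
     {"status": "passing"}, ts),
    ("consul_health_checks",
     {"node": "consul-server-node", "check_id": "serfHealth2"},
     {"status": "passing"}, ts),
])
print(len(after))  # 2 -- both checks survive
```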

ping: @bondido

When a service has more than one check, sending data for both would overwrite each other,
resulting in only one check being written (the last one). Adding check_id as a tag
ensures we will get info for all unique checks per service.
@sparrc (Contributor) commented Nov 4, 2016

this will fix #1825?

@sparrc added this to the 1.2.0 milestone Nov 4, 2016
@harnash (Contributor, Author) commented Nov 4, 2016

@sparrc Yes. From my testing it seems to solve the issue. It would be nice if the issue author could verify this.

@sparrc (Contributor) left a comment

@harnash I'm concerned that adding check_id as a tag is going to lead to cardinality problems. Can you comment on how many unique values check_id can have? Please try to provide links to documentation which explain what check_id is and how consul determines new values to use.

@harnash (Contributor, Author) commented Dec 18, 2016

@sparrc In Consul, check_id defaults to the name of the check if it is not provided (see: https://www.consul.io/docs/agent/checks.html). The only requirement is that checks need to be unique per Consul node. So when a service has multiple checks, this is the best (if not the only) way to differentiate between them. In practice it is up to the user how they name them, and in some cases it may produce a large number of series (many services + many checks), or service checks may be defined automatically by a synchronizing tool (such as marathon-consul).
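For illustration, a Consul check definition might look like the following sketch (field names follow the linked docs for that era of Consul; the `id` key is optional and defaults to `name`, and all values here are made up):

```json
{
  "check": {
    "id": "mem-util",
    "name": "Memory utilization",
    "script": "/usr/local/bin/check_mem.py",
    "interval": "10s"
  }
}
```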

Regarding this and our discussion around the mesos plugin: I think we can't avoid those series becoming quite large, and my initial approach to dealing with them was not optimal. In the end we decided to keep all telegraf service metrics in a DBRP with a short retention (i.e. 24 hrs), run kapacitor against them, and use Continuous Queries to aggregate them (significantly reducing the time series count) before storing them in a longer DBRP. We also talked with other companies that approach this in a similar way, so we will give it a try soon.
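The downsampling approach described above could be expressed as an InfluxQL continuous query along these lines (purely illustrative; the database, retention policy, and target measurement names are made up):

```sql
-- Hypothetical: roll up raw check samples from a short retention policy
-- ("day") into hourly aggregates stored in a longer one ("month").
CREATE CONTINUOUS QUERY "cq_consul_checks_hourly" ON "telegraf"
BEGIN
  SELECT count("status") AS "samples"
  INTO "telegraf"."month"."consul_health_checks_hourly"
  FROM "telegraf"."day"."consul_health_checks"
  GROUP BY time(1h), "node"
END
```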

@sparrc (Contributor) commented Dec 18, 2016

In that case, are you OK with closing this PR? If I understood correctly, this change may create measurements with large cardinality.

@harnash (Contributor, Author) commented Dec 18, 2016

@sparrc I do not believe this change will create high-cardinality problems in InfluxDB by default. For that to happen, a user would need a large quantity of fast-changing checks and a long retention policy. We can expect the same number of series as with any other metric of services running in a cloud environment (mesos, k8s, etc.). I've checked our numbers, and our production generates around 2000 series (most services have only 1 health check). With this change we would probably get around 2500-3000 series. I think that is reasonable.

P.S. I'm taking care of my sick daughter so I may respond irregularly.

@sparrc (Contributor) commented Dec 20, 2016

OK, I'm going to merge this, but may revert if we receive reports that this has led to cardinality problems for users.

@harnash (Contributor, Author) commented Dec 20, 2016

@sparrc thank you!

njwhite pushed a commit to njwhite/telegraf that referenced this pull request Jan 31, 2017
* plugins/input/consul: moved check_id from regular fields to tags.

When a service has more than one check, sending data for both would overwrite each other,
resulting in only one check being written (the last one). Adding check_id as a tag
ensures we will get info for all unique checks per service.

* plugins/inputs/consul: updated tests
maxunt pushed a commit that referenced this pull request Jun 26, 2018
* plugins/input/consul: moved check_id from regular fields to tags.

When a service has more than one check, sending data for both would overwrite each other,
resulting in only one check being written (the last one). Adding check_id as a tag
ensures we will get info for all unique checks per service.

* plugins/inputs/consul: updated tests