Fixing consul with multiple health checks per service #1994
Conversation
When a service has more than one check, sending data for both would overwrite each other, resulting in only one check being written (the last one). Adding check_id as a tag ensures we will get info for all unique checks per service.
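To see why the old behavior lost data: InfluxDB identifies a point by its measurement, tag set, and timestamp; fields are not part of that identity, so two checks that share those three keys overwrite one another. A minimal, self-contained Go sketch (the keying logic and names are illustrative, not the plugin's actual code) showing how promoting check_id from a field to a tag yields distinct series keys:

```go
package main

import "fmt"

// seriesKey mimics an InfluxDB point identity: measurement + tag set + timestamp.
// Fields are deliberately excluded, which is why two checks with identical tags
// and timestamp collapse into a single stored point (last write wins).
func seriesKey(measurement string, tags map[string]string, ts int64) string {
	key := fmt.Sprintf("%s,service_name=%s", measurement, tags["service_name"])
	if id, ok := tags["check_id"]; ok {
		key += ",check_id=" + id
	}
	return fmt.Sprintf("%s @%d", key, ts)
}

func main() {
	store := map[string]string{} // series key -> last written field value

	// Before the fix: check_id was a field, so both checks share one key.
	store[seriesKey("consul_health_checks", map[string]string{"service_name": "web"}, 1000)] = "check-a passing"
	store[seriesKey("consul_health_checks", map[string]string{"service_name": "web"}, 1000)] = "check-b critical"
	fmt.Println("without check_id tag:", len(store), "point(s)") // 1 point survives

	// After the fix: check_id is a tag, so each check gets its own series key.
	store = map[string]string{}
	store[seriesKey("consul_health_checks", map[string]string{"service_name": "web", "check_id": "check-a"}, 1000)] = "passing"
	store[seriesKey("consul_health_checks", map[string]string{"service_name": "web", "check_id": "check-b"}, 1000)] = "critical"
	fmt.Println("with check_id tag:", len(store), "point(s)") // both points survive
}
```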
this will fix #1825?
@sparrc Yes. From my testing it seems to solve this issue. It would be nice if the issue author could verify this.
@harnash I'm concerned that adding check_id as a tag is going to lead to cardinality problems. Can you comment on how many unique values check_id can have? Please try to provide links to documentation which explain what check_id is and how consul determines new values to use.
@sparrc Regarding this and our discussion around the mesos plugin, I think we can't avoid those series becoming quite large, and my initial approach to dealing with them was not optimal. In the end we decided to keep all telegraf service metrics in a short DBRP (i.e. 24 hours), run kapacitor against them, and use Continuous Queries to aggregate them (significantly reducing the time series count) and store the results in a longer DBRP. We also talked with other companies that approach this in a similar way, so we will give it a try soon.
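The retention-plus-downsampling setup described above can be sketched in InfluxQL. This is a hypothetical fragment, not the poster's actual configuration; the database, policy, and measurement names are illustrative:

```
-- short retention policy holding raw telegraf data for 24 hours
CREATE RETENTION POLICY "short" ON "telegraf" DURATION 24h REPLICATION 1 DEFAULT

-- longer retention policy for downsampled aggregates
CREATE RETENTION POLICY "long" ON "telegraf" DURATION 52w REPLICATION 1

-- continuous query that aggregates per-check data into the long policy,
-- collapsing many raw points into one aggregate per 5-minute window
CREATE CONTINUOUS QUERY "cq_consul_checks" ON "telegraf"
BEGIN
  SELECT count("status") AS checks
  INTO "long"."consul_health_checks"
  FROM "short"."consul_health_checks"
  GROUP BY time(5m), "service_name"
END
```

Grouping only by service_name (and not check_id) in the aggregate is what reduces the series count in the long-lived policy.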
In that case, are you OK with closing this PR? If I understood correctly, this change may create measurements with large cardinality.
@sparrc I do not believe that this change will, by default, create high cardinality problems in Influx. For this to happen a user would need a large quantity of fast-changing checks and a long DBRP. We can expect the same number of series as with any other metric of services running in a cloud environment (mesos, k8s, etc.). I've checked our numbers: our production generates around 2000 series (most services have only 1 health check). With this change we would probably get around 2500-3000 series, which I think is reasonable. P.S. I'm taking care of my sick daughter, so I may respond irregularly.
OK, I'm going to merge this, but may revert if we receive reports that this has led to cardinality problems for users. |
@sparrc thank you! |
* plugins/input/consul: moved check_id from regular fields to tags. When a service has more than one check, sending data for both would overwrite each other, resulting in only one check being written (the last one). Adding check_id as a tag ensures we will get info for all unique checks per service. * plugins/inputs/consul: updated tests
Required for all PRs:
This one is related to #1825.
In the case where we have more than one check, the plugin will try to write the following points:
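The original point listing did not survive formatting; a hypothetical reconstruction in InfluxDB line protocol (measurement, tag, and field names are illustrative) of two colliding points, with check_id still a field, might look like:

```
consul_health_checks,node=node-1,service_name=web check_id="web-http",check_name="HTTP check",status="passing" 1464844800000000000
consul_health_checks,node=node-1,service_name=web check_id="web-tcp",check_name="TCP check",status="critical" 1464844800000000000
```

Both lines share the same measurement, tag set, and timestamp, so the second overwrites the first.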
This will end up as one point being written, since the timestamp and tags are the same.
I've decided to add check_id as a tag, which ensures we properly write out info for all unique checks per service.
After this change points will be written like this:
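The post-change listing was also lost; continuing the same hypothetical example, with check_id promoted to a tag the two points now have distinct series keys and both are stored:

```
consul_health_checks,node=node-1,service_name=web,check_id=web-http check_name="HTTP check",status="passing" 1464844800000000000
consul_health_checks,node=node-1,service_name=web,check_id=web-tcp check_name="TCP check",status="critical" 1464844800000000000
```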
ping: @bondido