[QUESTION] relation between scraping-interval, length, period #865
Comments
I've had the very same questions myself. After some investigation, I created some slides to use internally in my organization. I'm happy to share them here, and if appreciated, I can contribute PRs to the documentation to include these explanations. I'll wait for some feedback from the maintainers. Let me know if any of the info above is incorrect and I'll fix it; if it's correct, I can open the PRs to get this into the docs 😄
I'm not a maintainer, but I've been working with YACE/CloudWatch for a bit. @PerGon, that's a great breakdown of the parameters, and those tips and tricks are spot on! There's an extra piece that YACE supports: […]
Since it's incredibly unlikely that your scrapes will align with this recommendation, you can set […]
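For context, here is a minimal sketch of how these knobs typically appear in a YACE discovery job. The keys follow YACE's config layout, but the service, region, metric name, and every value below are illustrative assumptions, not recommendations:

```yaml
# Hypothetical discovery job; values are examples only.
apiVersion: v1alpha1
discovery:
  jobs:
    - type: AWS/SQS
      regions:
        - us-east-1
      period: 300    # width of each CloudWatch datapoint, in seconds
      length: 300    # how far back each scrape looks, in seconds
      delay: 300     # shift the query window back to skip datapoints
                     # CloudWatch has not finalized yet
      metrics:
        - name: NumberOfMessagesSent
          statistics:
            - Sum
```

Setting `delay` larger than CloudWatch's ingestion lag for the service is what keeps the window from overlapping the still-incomplete datapoints discussed above.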
@PerGon, I love your slides, they're really useful. I opened this issue/improvement today (#985), which is kind of related to this. My understanding, at least when scraping normal metrics (including averages), is that YACE will always pick the latest one; see: […]
@PerGon, thanks for sharing your slides! Could you provide the link for the "CloudWatch API has a delay in exposing metrics" line?
(Sorry for the delay, I was on vacation.)
Sure, that link points to this Datadog documentation: […]
The "multiple data points per scrape" behavior seems wrong to me. It appears to make Quoting the screenshots above, emphasis mine:
Firstly, what is the point of […]? Secondly, if YACE uses the most recent value and discards all others, doesn't that result in data loss? If a "sum" metric counts how many of a given thing occur in a given time interval, then we must sum all datapoints to aggregate smaller time intervals into a larger one. For example, suppose I have this config: […]
YACE will, every 30s, retrieve the past 30s of datapoints at a 10s resolution (3 unique datapoints). I expect/want YACE to sum all three. But based on what I'm reading, YACE actually exports only the last one. This means that anything occurring in seconds 0 through 20 will be ignored; it will only export the sum of things which happened from second 20 to second 30 of the interval: the last datapoint.
A good resource on how to configure the delay, period and length: prometheus-community/yet-another-cloudwatch-exporter#865 (comment)
Found this while searching for info on why the metric values are unexpected. I don't think it simply takes the last data point... Also, if only the last datapoint were considered, then max/min would always be the same, no?
I wish you were correct, but I'm worried because the slides shared above directly contradict that. And in this code, I see that it drops all datapoints except the most recent one. I certainly hope I'm misunderstanding the code, but so far it seems like it drops all but the last datapoint.
Another concerning tidbit from the code: there is code to aggregate datapoints for […]
Thanks for your reply. I've never used Go, but if it's anything like C, I think it's getting the address of the first element in the array, no? `mappedResult.Datapoint = &metricDataResult.Values[0]` — so it depends on what the caller does with `mappedResult`. I'll try to find some time over the weekend to take a closer look.
Hi, I am trying to make sense of how best to use these options to monitor my infrastructure.
scraping-interval seems the most obvious: it's the frequency at which you refresh all metric data, i.e. hit the AWS CloudWatch API.
period and length confuse me a bit more and seem to make sense only for metrics with averages.
If you're not looking at average metrics, however, it seems you'll be limited to the scraping interval for overall accuracy.
Can someone explain this a bit more? If I can be made to understand it, I'll put up a PR for the docs!
The relation between the 'job-default' and 'metric-default' will likely make sense once this is explained; I imagine the more specific, user-defined value takes precedence.
Please and thank you