
[QUESTION] relation between scraping-interval, length, period #865

Open
jdix531 opened this issue Mar 27, 2023 · 10 comments

Comments

@jdix531

jdix531 commented Mar 27, 2023

Hi, I am trying to make sense of how best to use these options to monitor my infrastructure as effectively as possible.

scraping-interval seems the most obvious: it's the frequency at which you'll refresh all metric data, i.e., hit the AWS CloudWatch API for data.

Period and length confuse me a bit more and seem to make sense only for metrics with averages.

  • Period seems to be the requested granularity of a metric, i.e., over the last how-many-seconds should this metric be aggregated.
  • Length would be the total window of time over which periods are considered (and it should be a multiple of the period).

If you're not looking at average metrics, however, it seems that you'll be limited to the scraping interval for overall accuracy.
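To make the question concrete, here's a hypothetical YACE job config showing where two of the three settings live (key names per the YACE README; the values and the SQS metric are made up for illustration):

```yaml
apiVersion: v1alpha1
discovery:
  jobs:
    - type: AWS/SQS
      regions: [us-east-1]
      period: 300        # job-level default: each CloudWatch datapoint covers 300s
      length: 300        # job-level default: how far back each scrape queries
      metrics:
        - name: NumberOfMessagesSent
          statistics: [Sum]
          period: 60     # metric-level value, presumably overriding the job default
          length: 300
```

scraping-interval itself is not in this file; if I understand correctly, it's the `-scraping-interval` flag (in seconds) passed to the exporter binary.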

Can someone explain this a bit more? If I can be made to understand it I'll put a PR for the docs!

The relation between the 'job-default' and 'metric-default' will likely make sense when this is explained where I imagine the more atomic user-defined value is accepted.

Please and thank you

@PerGon
Contributor

PerGon commented Apr 21, 2023

I've had the very same questions myself. After some investigation, I've created some slides to use internally in my organization. I'm happy to share them here - and if appreciated, I can contribute with PRs to the documentation to include these explanations.

[Six slides explaining scraping-interval, period, and length, plus tips & tricks — images not reproduced here]

I'll wait for some feedback from the maintainers. Let me know if some of the info above is incorrect and I'll fix it. Once the info is confirmed correct, I can create the PRs to include this in the docs, if desirable 😄

@kgeckhart
Contributor

kgeckhart commented May 4, 2023

I'm not a maintainer but have been working with YACE/CloudWatch for a bit. @PerGon that's a great breakdown of the parameters and those tips and tricks are spot on!

There's an extra piece that YACE supports, roundingPeriod, which is good to consider. The GetMetricData API docs have a recommendation for performance:

For better performance, specify StartTime and EndTime values that align with the value of the metric's Period and sync up with the beginning and end of an hour. For example, if the Period of a metric is 5 minutes, specifying 12:05 or 12:30 as EndTime can get a faster response from CloudWatch than setting 12:07 or 12:29 as the EndTime.

Since it's incredibly unlikely that your scrapes are going to align with this recommendation you can set roundingPeriod = period. This will ensure that if you scrape metrics at 7:07PM with a length and roundingPeriod of 5m the requested interval will be 7:00-7:05PM. The other added benefit of this approach is that data exported by YACE should be able to be reproduced in CloudWatch. CloudWatch will only graph data according to these best practices so if you graph a metric with a period of 5m in CloudWatch you are going to see data points on the 5m mark.
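To illustrate the alignment, here's a small Python sketch (my own hypothetical helper, not YACE's actual code) of how snapping the query window to a roundingPeriod boundary could work:

```python
from datetime import datetime, timezone

def rounded_window(now: datetime, rounding_period_s: int, length_s: int):
    """Hypothetical sketch: align the query end time down to the nearest
    roundingPeriod boundary, then go back `length` seconds for the start."""
    epoch = int(now.timestamp())
    end = (epoch // rounding_period_s) * rounding_period_s  # snap down to boundary
    start = end - length_s
    return (datetime.fromtimestamp(start, tz=timezone.utc),
            datetime.fromtimestamp(end, tz=timezone.utc))

# Scrape fires at 19:07 UTC with roundingPeriod = length = 5m (300s)
start, end = rounded_window(datetime(2024, 1, 1, 19, 7, tzinfo=timezone.utc), 300, 300)
print(start.strftime("%H:%M"), "->", end.strftime("%H:%M"))  # 19:00 -> 19:05
```

This matches the 7:07 PM example above: the requested interval becomes 7:00–7:05 PM rather than 7:02–7:07 PM.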

@luismy
Contributor

luismy commented Jun 10, 2023

@PerGon, I love your slides, they're really useful. I've opened an issue/improvement today (#985) which is kind of related to this one. My understanding, at least when scraping normal metrics (including average), is that YACE will always pick the latest one, see:
https://github.com/nerdswords/yet-another-cloudwatch-exporter/blob/c7807a770bb427f8ddb2c7becac51185fb3e8230/pkg/clients/cloudwatch/v1/client.go#L124-L125

@bafulton

bafulton commented Jul 18, 2024

@PerGon, thanks for sharing your slides! Could you provide the link for the "Cloudwatch API has a delay in exposing metrics" line?

@PerGon
Contributor

PerGon commented Aug 9, 2024

(sorry for the delay, I was on vacation)

@PerGon, thanks for sharing your slides! Could you provide the link for the "Cloudwatch API has a delay in exposing metrics" line?

Sure, that link points to this Datadog documentation

@cspotcode

cspotcode commented Aug 28, 2024

The "multiple data points per scrape" behavior seems wrong to me. It appears to make period pointless, and also causes data loss for sum metrics.

Quoting the screenshots above, emphasis mine:

How YACE deals with multiple data points per AWS CloudWatch scrape

When YACE does 1 cloudwatch scrape, it can, possibly, get various points for the same metric. But YACE will only expose one single value. So this is what it does to reach the value to expose (source code):

If "statistic" is of type "Average"
YACE will calculate the average of all those data points and use that average

If any other "statistic" (Maximum, Sum, etc)
Use the value of the most recent data point

Firstly, what is the point of period if YACE is going to merge all those datapoints down to one? Why request a lower period granularity than scraping-interval if YACE can only submit a single datapoint per scraping-interval?

Secondly, if YACE uses the most recent value, and discards all others, doesn't that result in data loss? If a "sum" metric counts how many of a given thing occur in a given time interval, then we must sum all datapoints to aggregate smaller time intervals into a larger one.

For example, suppose I have this config:

  • scraping-interval: 30s
  • period: 10s
  • length: 30s
  • metric statistic: sum

YACE will, every 30s, retrieve the past 30s of datapoints with a 10s resolution (3 unique datapoints)

I expect/want YACE to do this:
Export all 3 datapoints, allowing my downstream dashboard tools to sum them into a total.

But based on what I'm reading, YACE actually does this:
Submit the last datapoint, discard the first 2.

This means that anything occurring in seconds 0 through 20 will be ignored. It will only export the sum of things which happened from seconds 20 to 30 in the interval: the last datapoint.
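If the last-datapoint reading of the code is right, the loss is easy to demonstrate with a toy Python sketch (assumed behavior per the slides above; the datapoint values are made up):

```python
# Three 10s datapoints for a "Sum" metric returned by one 30s scrape.
datapoints = [
    {"timestamp": 0,  "value": 4.0},   # events in seconds 0-10
    {"timestamp": 10, "value": 7.0},   # events in seconds 10-20
    {"timestamp": 20, "value": 2.0},   # events in seconds 20-30
]

# What the linked code appears to do: keep only the most recent datapoint.
exported = max(datapoints, key=lambda d: d["timestamp"])["value"]

# What a lossless export of a Sum statistic would need: all three points.
true_total = sum(d["value"] for d in datapoints)

print(exported)    # 2.0 -- the events from seconds 0-20 are gone
print(true_total)  # 13.0
```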

agarciamontoro added a commit to mattermost/mattermost-load-test-ng that referenced this issue Sep 9, 2024
A good resource on how to configure the delay, period and length:
prometheus-community/yet-another-cloudwatch-exporter#865 (comment)
@mtcoder24

mtcoder24 commented Sep 9, 2024

Found this while searching for some info on why the metric values are unexpected.

I don't think it simply takes the last data point... with sum, I think it sums all the datapoints.

Also, if only the last datapoint were considered, then max/min would always be the same, no?
Or am I missing something?
EDIT: maybe YACE is applying the statistic ON the returned set of data points. That would make sense, and then length would serve only to specify how far in the past to look for data points.

@cspotcode

I don't think it simply takes the last data point.... with sum, i think it sums all the datapoints.

I wish you were correct, but I'm worried because the slides shared above directly contradict that.

And in this code, I'm seeing that it drops all datapoints except the most recent one.
https://github.com/nerdswords/yet-another-cloudwatch-exporter/blob/f5ddcf4323dc97034491114d4074ae672cfc411f/pkg/clients/cloudwatch/v2/client.go#L145-L156

I certainly hope I am misunderstanding the code, but so far it seems like it's dropping all but the last data point.

@cspotcode

Another concerning tidbit from the code: there is code to aggregate datapoints for GetMetricStatistics calls, but the GetMetricData codepaths do not do that!

GetMetricStatistics codepath:
https://github.com/nerdswords/yet-another-cloudwatch-exporter/blob/f5ddcf4323dc97034491114d4074ae672cfc411f/pkg/promutil/migrate.go#L155-L186

GetMetricData codepath, single datapoint; I think this relates to the code linked in my previous comment:
https://github.com/nerdswords/yet-another-cloudwatch-exporter/blob/f5ddcf4323dc97034491114d4074ae672cfc411f/pkg/promutil/migrate.go#L149-L151

@mtcoder24

Thanks for your reply.

I've never used Go, but if it's anything like C, I think it's taking the address of the first element in the slice, no?

`mappedResult.Datapoint = &metricDataResult.Values[0]`

So it depends on what the caller does with mappedResult.

I'll try to get some time over the weekend to look at this more closely.

agarciamontoro added a commit to mattermost/mattermost-load-test-ng that referenced this issue Sep 12, 2024
agarciamontoro added a commit to mattermost/mattermost-load-test-ng that referenced this issue Sep 23, 2024