
[QUESTION] relation between scraping-interval, length, period #865

Open
jdix531 opened this issue Mar 27, 2023 · 10 comments

Comments

@jdix531

jdix531 commented Mar 27, 2023

Hi, I am trying to make sense of how best to use these options to monitor my infrastructure as effectively as possible.

scraping-interval seems the most obvious: it's the frequency at which you'll refresh all metric data, i.e., hit the AWS CloudWatch API for data.

Period and length confuse me a bit more and seem to make sense only for metrics with averages.

  • Period seems to be the requested granularity of a metric, i.e., over the last how-many-seconds should this metric be aggregated.
  • Length would be the total window of time over which periods are considered (and it should be a multiple of the period).

If you're not looking at average metrics, however, it seems that you'll be limited to the scraping interval for overall accuracy.
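To make the question concrete, here's a hypothetical YACE job config showing where two of the three settings live (key names per the YACE README; the values and the SQS metric are made up for illustration):

```yaml
apiVersion: v1alpha1
discovery:
  jobs:
    - type: AWS/SQS
      regions: [us-east-1]
      period: 300        # job-level default: each CloudWatch datapoint covers 300s
      length: 300        # job-level default: how far back each scrape queries
      metrics:
        - name: NumberOfMessagesSent
          statistics: [Sum]
          period: 60     # metric-level value, presumably overriding the job default
          length: 300
```

scraping-interval itself is not in this file; if I understand correctly, it's the `-scraping-interval` flag (in seconds) passed to the exporter binary.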

Can someone explain this a bit more? If I can be made to understand it I'll put a PR for the docs!

The relation between the 'job-default' and 'metric-default' will likely make sense when this is explained where I imagine the more atomic user-defined value is accepted.

Please and thank you

@PerGon
Contributor

PerGon commented Apr 21, 2023

I've had the very same questions myself. After some investigation, I've created some slides to use internally in my organization. I'm happy to share them here - and if appreciated, I can contribute with PRs to the documentation to include these explanations.

[Six slides explaining scraping-interval, period, and length, plus tips & tricks — images not reproduced here]

I'll wait for some feedback from the maintainers. Let me know if some of the info above is incorrect and I'll fix it. Once the info is confirmed correct, I can create the PRs to include this in the docs, if desirable 😄

@kgeckhart
Contributor

kgeckhart commented May 4, 2023

I'm not a maintainer but have been working with YACE/CloudWatch for a bit. @PerGon that's a great breakdown of the parameters and those tips and tricks are spot on!

There's an extra piece that YACE supports, roundingPeriod, which is good to consider. The GetMetricData API docs have a recommendation for performance:

For better performance, specify StartTime and EndTime values that align with the value of the metric's Period and sync up with the beginning and end of an hour. For example, if the Period of a metric is 5 minutes, specifying 12:05 or 12:30 as EndTime can get a faster response from CloudWatch than setting 12:07 or 12:29 as the EndTime.

Since it's incredibly unlikely that your scrapes are going to align with this recommendation you can set roundingPeriod = period. This will ensure that if you scrape metrics at 7:07PM with a length and roundingPeriod of 5m the requested interval will be 7:00-7:05PM. The other added benefit of this approach is that data exported by YACE should be able to be reproduced in CloudWatch. CloudWatch will only graph data according to these best practices so if you graph a metric with a period of 5m in CloudWatch you are going to see data points on the 5m mark.
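To illustrate the alignment, here's a small Python sketch (my own hypothetical helper, not YACE's actual code) of how snapping the query window to a roundingPeriod boundary could work:

```python
from datetime import datetime, timezone

def rounded_window(now: datetime, rounding_period_s: int, length_s: int):
    """Hypothetical sketch: align the query end time down to the nearest
    roundingPeriod boundary, then go back `length` seconds for the start."""
    epoch = int(now.timestamp())
    end = (epoch // rounding_period_s) * rounding_period_s  # snap down to boundary
    start = end - length_s
    return (datetime.fromtimestamp(start, tz=timezone.utc),
            datetime.fromtimestamp(end, tz=timezone.utc))

# Scrape fires at 19:07 UTC with roundingPeriod = length = 5m (300s)
start, end = rounded_window(datetime(2024, 1, 1, 19, 7, tzinfo=timezone.utc), 300, 300)
print(start.strftime("%H:%M"), "->", end.strftime("%H:%M"))  # 19:00 -> 19:05
```

This matches the 7:07 PM example above: the requested interval becomes 7:00–7:05 PM rather than 7:02–7:07 PM.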

@luismy
Contributor

luismy commented Jun 10, 2023

@PerGon, I love your slides, they're really useful. I've opened an issue/improvement today (#985) which is kind of related to this one. My understanding, at least when scraping normal metrics (including average), is that YACE will always pick the latest one, see:
https://github.com/nerdswords/yet-another-cloudwatch-exporter/blob/c7807a770bb427f8ddb2c7becac51185fb3e8230/pkg/clients/cloudwatch/v1/client.go#L124-L125

@bafulton

bafulton commented Jul 18, 2024

@PerGon, thanks for sharing your slides! Could you provide the link for the "Cloudwatch API has a delay in exposing metrics" line?

@PerGon
Contributor

PerGon commented Aug 9, 2024

(sorry for the delay, I was on vacation)

@PerGon, thanks for sharing your slides! Could you provide the link for the "Cloudwatch API has a delay in exposing metrics" line?

Sure, that link points to this Datadog documentation

@cspotcode

cspotcode commented Aug 28, 2024

The "multiple data points per scrape" behavior seems wrong to me. It appears to make period pointless, and also causes data loss for sum metrics.

Quoting the screenshots above, emphasis mine:

How YACE deals with multiple data points per AWS CloudWatch scrape

When YACE does 1 cloudwatch scrape, it can, possibly, get various points for the same metric. But YACE will only expose one single value. So this is what it does to reach the value to expose (source code):

If "statistic" is of type "Average"
YACE will calculate the average of all those data points and use that average

If any other "statistic" (Maximum, Sum, etc)
Use the value of the most recent data point

Firstly, what is the point of period if YACE is going to merge all those datapoints down to one? Why request a lower period granularity than scraping-interval if YACE can only submit a single datapoint per scraping-interval?

Secondly, if YACE uses the most recent value, and discards all others, doesn't that result in data loss? If a "sum" metric counts how many of a given thing occur in a given time interval, then we must sum all datapoints to aggregate smaller time intervals into a larger one.

For example, suppose I have this config:

  • scraping-interval: 30s
  • period: 10s
  • length: 30s
  • metric statistic: sum

YACE will, every 30s, retrieve the past 30s of datapoints with a 10s resolution (3 unique datapoints)

I expect/want YACE to do this:
Export all 3 datapoints, allowing my downstream dashboard tools to sum them into a total.

But based on what I'm reading, YACE actually does this:
Submit the last datapoint, discard the first 2.

This means that anything occurring in seconds 0 through 20 will be ignored. It will only export the sum of things which happened from seconds 20 to 30 in the interval: the last datapoint.
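If the last-datapoint reading of the code is right, the loss is easy to demonstrate with a toy Python sketch (assumed behavior per the slides above; the datapoint values are made up):

```python
# Three 10s datapoints for a "Sum" metric returned by one 30s scrape.
datapoints = [
    {"timestamp": 0,  "value": 4.0},   # events in seconds 0-10
    {"timestamp": 10, "value": 7.0},   # events in seconds 10-20
    {"timestamp": 20, "value": 2.0},   # events in seconds 20-30
]

# What the linked code appears to do: keep only the most recent datapoint.
exported = max(datapoints, key=lambda d: d["timestamp"])["value"]

# What a lossless export of a Sum statistic would need: all three points.
true_total = sum(d["value"] for d in datapoints)

print(exported)    # 2.0 -- the events from seconds 0-20 are gone
print(true_total)  # 13.0
```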

agarciamontoro added a commit to mattermost/mattermost-load-test-ng that referenced this issue Sep 9, 2024
A good resource on how to configure the delay, period and length:
prometheus-community/yet-another-cloudwatch-exporter#865 (comment)
@mtcoder24

mtcoder24 commented Sep 9, 2024

Found this while searching for some info on why the metric values are unexpected.

I don't think it simply takes the last data point... with sum, I think it sums all the datapoints.

Also, if only the last datapoint were considered, then max/min would always be the same, no?
Or am I missing something?
EDIT: maybe YACE is applying the statistic ON the returned set of data points. That would make sense, and then length would serve only to specify how far in the past to look for data points.

@cspotcode

I don't think it simply takes the last data point.... with sum, i think it sums all the datapoints.

I wish you were correct, but I'm worried because the slides shared above directly contradict that.

And in this code, I'm seeing that it drops all datapoints except the most recent one.
https://github.com/nerdswords/yet-another-cloudwatch-exporter/blob/f5ddcf4323dc97034491114d4074ae672cfc411f/pkg/clients/cloudwatch/v2/client.go#L145-L156

I certainly hope I am misunderstanding the code, but so far it seems like it's dropping all but the last data point.

@cspotcode

Another concerning tidbit from the code: there is code to aggregate datapoints for GetMetricStatistics calls, but the GetMetricData codepaths do not do that!

GetMetricStatistics codepath:
https://github.com/nerdswords/yet-another-cloudwatch-exporter/blob/f5ddcf4323dc97034491114d4074ae672cfc411f/pkg/promutil/migrate.go#L155-L186

GetMetricData codepath, single datapoint; I think this relates to the code linked in my previous comment:
https://github.com/nerdswords/yet-another-cloudwatch-exporter/blob/f5ddcf4323dc97034491114d4074ae672cfc411f/pkg/promutil/migrate.go#L149-L151

@mtcoder24

Thanks for your reply.

I've never used Go, but if it's anything like C, I think it's taking the address of the first element in the slice, no?

`mappedResult.Datapoint = &metricDataResult.Values[0]`

So it depends on what the caller does with mappedResult.

I'll try to get some time over the weekend to look at this more closely.

agarciamontoro added a commit to mattermost/mattermost-load-test-ng that referenced this issue Sep 12, 2024
agarciamontoro added a commit to mattermost/mattermost-load-test-ng that referenced this issue Sep 23, 2024