
Feature/cache expiration #31785

Merged (17 commits merged into elastic:main on Jun 17, 2022)

Conversation

gsantoro (Contributor) commented May 31, 2022

What does this PR do?

Make the internal cache PerfMetricsCache, used by various metricsets of the kubernetes module in Metricbeat, effectively never expire between scrapes. This avoids missing metrics or calculating incorrect metrics, as explained in issue #31778.

This PR dynamically changes the timeout of the internal cache PerfMetricsCache in the kubernetes module so that it is always larger than the scraping period set by users in the config. In particular, the cache timeout is always set to 2x the largest period configured for any of the metricsets.
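
As a rough illustration of the approach (not the code merged in this PR; the helper name cacheTimeoutFor and the example periods are made up), the cache timeout can be derived from the configured periods like this:

```go
package main

import (
	"fmt"
	"time"
)

// cacheTimeoutFor is a hypothetical helper: it returns twice the largest
// configured scraping period, so cached entries always outlive a scrape.
func cacheTimeoutFor(periods []time.Duration) time.Duration {
	var largest time.Duration
	for _, p := range periods {
		if p > largest {
			largest = p
		}
	}
	return 2 * largest
}

func main() {
	// Example periods for a few kubernetes metricsets (illustrative values).
	periods := []time.Duration{10 * time.Second, 3 * time.Minute, 10 * time.Minute}
	fmt.Println(cacheTimeoutFor(periods)) // 20m0s
}
```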

Why is it important?

It fixes a bug reported by a customer in an SDH that prevented some metrics like kubernetes.container.cpu.usage.limit.pct from being populated.

A customer first reported this issue after setting a period of 15 minutes in the kubernetes module settings. The bug was caused by the cache timeout being hardcoded to 2 minutes.

This PR fixes the missing metrics issue and makes it possible to set any value for period.

Checklist

  • My code follows the style guidelines of this project
  • I have commented my code, particularly in hard-to-understand areas
  • I have added an entry in CHANGELOG.next.asciidoc or CHANGELOG-developer.next.asciidoc.

How to test this PR locally

  • Configure Metricbeat so that any of the period settings in the kubernetes module is larger than 2 minutes. Observe that the following metrics are correctly computed (values >= 0 and NOT missing):
    • kubernetes.container.cpu.usage.limit.pct
    • kubernetes.container.memory.usage.limit.pct
    • kubernetes.pod.cpu.usage.limit.pct
    • kubernetes.pod.memory.usage.limit.pct

Before the code change:
Some memory metrics for kubernetes.container.memory.usage.limit.pct are computed but most of them are missing.
[screenshots]

Some metrics for kubernetes.container.cpu.usage.limit.pct are correctly computed but most of them are missing.
[screenshots]

The percentage of missing metrics depends on the period configured for the various metricsets. With the fix, the metrics should always be present with a value >= 0.

After the code changes:
[screenshots]

Related issues

Link related issues below. Insert the issue link or reference after the word "Closes" if merging this should automatically close it.

@gsantoro gsantoro added the bug, enhancement, in progress, and Team:Cloudnative-Monitoring labels May 31, 2022
@gsantoro gsantoro requested review from a team as code owners May 31, 2022 15:09
@gsantoro gsantoro self-assigned this May 31, 2022
@gsantoro gsantoro requested review from cmacknz and fearful-symmetry and removed request for a team May 31, 2022 15:09
@botelastic botelastic bot added and then removed the needs_team label (indicates that the issue/PR needs a Team:* label) May 31, 2022
@gsantoro gsantoro added the discuss (Issue needs further discussion.) and Draft labels May 31, 2022
elasticmachine (Collaborator) commented May 31, 2022

💚 Build Succeeded


Build stats

  • Start Time: 2022-06-17T12:20:42.851+0000

  • Duration: 90 min 6 sec

Test stats 🧪

Test Results
Failed 0
Passed 21685
Skipped 1798
Total 23483

💚 Flaky test report

Tests succeeded.

🤖 GitHub comments

To re-run your PR in the CI, just comment with:

  • /test : Re-trigger the build.

  • /package : Generate the packages and run the E2E tests.

  • /beats-tester : Run the installation tests with beats-tester.

  • run elasticsearch-ci/docs : Re-trigger the docs validation. (use unformatted text in the comment!)

@@ -83,7 +84,7 @@ metadata:
 data:
   system.yml: |-
     - module: system
-      period: 10s
+      period: 1m

Contributor:

Please revert those values to the original ones before merging.

gsantoro (Author):

sure

ChrsMark (Member) left a comment:

A couple of comments we already discussed offline:

  1. Can you revert the changes on the config files that are not needed for the patch itself?
  2. At a quick glance, I think we can use your logic to set the expiration's default value and also make it available as an optional setting. This way, users can simply ignore the setting and the defaults will do the trick; if someone wants to specify it explicitly, there will be an option available.

Some key code parts that might help in adding the setting (see the sketch after this list):

  1. type kubernetesConfig struct {
  2. most probably you will need to unpack the config at the metricset level, like we do inside the Enricher's implementation:
    config := ValidatedConfig(base)
  3. Check Enricher's initialisation:
    enricher: util.NewContainerMetadataEnricher(base, true),
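
A minimal sketch of what such an optional setting could look like, assuming a kubernetesConfig-style struct unpacked from the module config as in the pointers above (the field name CacheExpiration and the helper effectiveCacheTimeout are hypothetical, not the final API):

```go
package kubernetes

import "time"

// kubernetesConfig sketches the module-level settings; only the fields
// relevant here are shown, and CacheExpiration is a hypothetical optional knob.
type kubernetesConfig struct {
	Period          time.Duration `config:"period"`
	CacheExpiration time.Duration `config:"cache_expiration"` // optional; 0 means "use the default"
}

// effectiveCacheTimeout returns the user-provided expiration when set,
// otherwise derives a default from the scraping period.
func effectiveCacheTimeout(cfg kubernetesConfig) time.Duration {
	if cfg.CacheExpiration > 0 {
		return cfg.CacheExpiration
	}
	return cfg.Period * 2
}
```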

@@ -72,6 +72,12 @@ func New(base mb.BaseMetricSet) (mb.MetricSet, error) {
 	if !ok {
 		return nil, fmt.Errorf("must be child of kubernetes module")
 	}

+	newTimeout := base.Module().Config().Period * 2

MichaelKatsoulis (Contributor) commented Jun 1, 2022:

I think we should only set newTimeout to base.Module().Config().Period * 2 when base.Module().Config().Period is >= 120 seconds, and only if the user hasn't specified a timeout value in the config.
If we always set newTimeout to scraping_period * 2, then in the most common case where the period is 10s, the cache will expire too soon, which might defeat its purpose.
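
A minimal sketch of the logic being suggested here, assuming the hardcoded 120s default stays in place and an optional user-configured timeout exists (all names are illustrative):

```go
package kubernetes

import "time"

const defaultTimeout = 120 * time.Second // current hardcoded cache timeout

// chooseCacheTimeout sketches the suggestion above: honour an explicit
// user-configured timeout first, scale with the period only when the period
// is at least as long as the default, and otherwise keep the 120s default so
// short periods (e.g. 10s) don't make the cache expire too soon.
func chooseCacheTimeout(period, userTimeout time.Duration) time.Duration {
	if userTimeout > 0 {
		return userTimeout
	}
	if period >= defaultTimeout {
		return period * 2
	}
	return defaultTimeout
}
```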

gsantoro (Author):

I tried to explain the current behaviour in a PR comment at #31785 (comment). Happy to have a longer chat about this.
In short, with a 10s period and a 20s cache timeout, the cache expires every 20s but is still replaced every 10s (if all scrapes work).

I would suggest using period * 5 to have a longer cache and account for possible scraping failures.

gsantoro (Author) commented Jun 7, 2022

I put scraping.period * 2 as an initial placeholder; I wasn't sure about 2 as the final value. Maybe we should use scraping.period * 5 to be more resilient. I don't have a strong opinion on the multiplier value.
My point, though, is that this behaviour should be the same regardless of the length of the scraping period. The value in the cache is replaced at every successful scrape and the expiration is reset to a full timeout (cache period), even if the entries have not expired, as long as at least one full period has passed since the last scrape (or it is the first scrape):

    if statsCache.lastFetchTimestamp.IsZero() || now.Sub(statsCache.lastFetchTimestamp) > m.Config().Period {
        ...
        http.FetchContent()
        ...
    }

A cache expiration time of 5 * scraping.Period also prevents the cache from expiring when one or more scrapes do NOT take place.

Scenarios with the current code (before the modifications in this PR):

  • Example 1

    • scrape every 10s
    • cache expiration at 2m
    • cache is replaced every 10s (at every scrape)
    • cache expires 2m after the last scrape
    • metrics are correctly calculated since the cache is always full
  • Example 2 (failure discovered in the SDH)

    • scrape every 10m
    • cache expiration at 2m
    • cache expires 2m after the last scrape; it stays empty for 8m, so Metricbeat reads an empty cache and calculates wrong metrics
    • the cache is refilled but only read at the next period, 10m later, when it has already expired again

Scenarios with the modifications in this PR:

  • Example 1

    • scrape every 10s
    • cache expiration at 50s
    • with no scrape failures, the cache is replenished every 10s and expires 50s after the last scrape
    • if some scrapes fail to run, as long as there is at least 1 scrape every 50s the cache still expires 50s after the last scrape and the right metrics are calculated. Up to 4 out of 5 scrapes can fail and the behaviour is still correct
  • Example 2

    • scrape every 10m
    • cache expiration at 50m
    • similar to Example 1: the cache effectively never expires and is resilient to scraping failures
    • the user doesn't need to guess a good value for the cache expiration

There is a similar configuration for the cache expiration in beats/x-pack/metricbeat/module/prometheus/collector/data.go (line 29): counters := collector.NewCounterCache(base.Module().Config().Period * 5)
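
As a self-contained illustration of the timing argument in the examples above (a simulation only, not Beats code; the multiplier of 5 is the value under discussion here, not necessarily what was merged):

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	period := 10 * time.Second
	multiplier := 5 // the multiplier proposed above; 2 was the initial placeholder
	timeout := period * time.Duration(multiplier)

	// With a timeout of multiplier*period, up to multiplier-1 consecutive
	// scrapes can fail before the cached entries expire.
	toleratedFailures := multiplier - 1
	fmt.Printf("period=%s timeout=%s tolerated consecutive scrape failures=%d\n",
		period, timeout, toleratedFailures)
}
```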

gsantoro (Author) commented Jun 7, 2022

Comments on #31785 (review)

  • I will revert the changes. I left them there for you guys to check the behaviour of the new code
  • I agree we should expose an optional config so the user can customise the cache behaviour, and default to cache_expiration = period * 5 otherwise. Thanks for the suggestions, I'll have a look.

ChrsMark (Member) commented Jun 8, 2022

Comments on #31785 (review)

  • I will revert the changes. I left them there for you guys to check the behaviour of the new code

In that case you can just add everything needed to the "How to test this PR locally" section of the PR description (see #25640 for an example).

@gsantoro gsantoro added the backport-7.17 (Automated backport to the 7.17 branch with mergify) label Jun 14, 2022
mergify bot commented Jun 17, 2022

This pull request is now in conflicts. Could you fix it? 🙏
To fix up this pull request, you can check it out locally. See documentation: https://help.github.com/articles/checking-out-pull-requests-locally/

git fetch upstream
git checkout -b feature/cache_expiration upstream/feature/cache_expiration
git merge upstream/main
git push upstream feature/cache_expiration

ChrsMark (Member) left a comment:

A minor suggestion/fix for the changelog. Keep in mind that changelog entries are user-facing content and thus should be accurate and not overly technical.

CHANGELOG.next.asciidoc (outdated review thread, resolved)
Co-authored-by: Chris Mark <chrismarkou92@gmail.com>
ChrsMark (Member) left a comment:

Thanks for the detailed work here!
Wait for green CI and ship it! :shipit:

gizas (Contributor) left a comment:

The only thing missing is a screenshot taken after the implementation, to confirm that the final metrics actually appear.

Also, is there a simple way to verify the value of the cache timeout without debugging?

@@ -27,7 +27,7 @@ data:
   - config:
       - module: kubernetes
         hosts: ["kube-state-metrics:8080"]
-        period: 10s
+        period: 3m

Contributor:

Same here, please revert before merging.

gsantoro (Author):

You are looking at an outdated commit; this has already been resolved.

@@ -156,12 +157,18 @@ spec:
       dnsPolicy: ClusterFirstWithHostNet
       containers:
         - name: metricbeat
-          image: docker.elastic.co/beats/metricbeat:8.4.0

Contributor:

I think you need to add this back? Align with master

gsantoro (Author):

You are looking at an outdated commit; this has already been resolved.

-          image: docker.elastic.co/beats/metricbeat:8.4.0
+          # image: docker.elastic.co/beats/metricbeat:8.3.0-SNAPSHOT
+          image: metricbeat-debugger-image:latest
+          imagePullPolicy: Never

Contributor:

Also, can you remove all the debug fields? They should not be merged.
I think you should create a new folder that describes how to test.

Maybe add the test/debug instructions to the README.

gsantoro (Author):

You are looking at an outdated commit; this has already been resolved.

@@ -33,13 +33,19 @@ func init() {
 const defaultTimeout = 120 * time.Second

 // NewPerfMetricsCache initializes and returns a new PerfMetricsCache
-func NewPerfMetricsCache() *PerfMetricsCache {
+func NewPerfMetricsCache(timeout time.Duration) *PerfMetricsCache {
+	if timeout <= 0 {

Contributor:
Does it make sense to check timeout != nil as well?

gsantoro (Author):

timeout cannot be nil since it is not a reference type; a time.Duration value is always initialized to something (its zero value is 0).
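
For illustration, the guard being discussed might look roughly like this (a simplified sketch; the real PerfMetricsCache holds several value maps rather than a single timeout field):

```go
package util

import "time"

const defaultTimeout = 120 * time.Second

// PerfMetricsCache is simplified here to a single timeout field; the real
// struct holds several value maps for node and container metrics.
type PerfMetricsCache struct {
	timeout time.Duration
}

// NewPerfMetricsCache sketches the guarded constructor: a non-positive
// timeout falls back to the 120s default. time.Duration is a value type,
// so a nil check is neither possible nor needed.
func NewPerfMetricsCache(timeout time.Duration) *PerfMetricsCache {
	if timeout <= 0 {
		timeout = defaultTimeout
	}
	return &PerfMetricsCache{timeout: timeout}
}
```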

func (c *PerfMetricsCache) GetTimeout() time.Duration {
var ans time.Duration = 0

nmATimeout := c.NodeMemAllocatable.GetTimeout()
Contributor:

Instead of multiple if comparisons you can initialise an array and make only one assignment, e.g. https://learningprogramming.net/golang/golang-golang/find-max-and-min-of-array-in-golang/ (a sketch follows below).

Just an idea to consider
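
A minimal sketch of that idea, computing the maximum over a variadic slice instead of chained if comparisons (only NodeMemAllocatable appears in the diff above; the other value-map names are placeholders):

```go
package util

import "time"

// maxTimeout returns the largest of the given timeouts by iterating over a
// slice, instead of chaining multiple if comparisons.
func maxTimeout(timeouts ...time.Duration) time.Duration {
	var largest time.Duration
	for _, t := range timeouts {
		if t > largest {
			largest = t
		}
	}
	return largest
}

// Hypothetical usage inside GetTimeout (only NodeMemAllocatable is shown in
// the diff above; the remaining value maps would be listed the same way):
//
//	ans := maxTimeout(
//		c.NodeMemAllocatable.GetTimeout(),
//		// ...timeouts of the other value maps...
//	)
```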

@gsantoro gsantoro merged commit 4b625fc into elastic:main Jun 17, 2022
mergify bot pushed a commit that referenced this pull request Jun 17, 2022
* fixed cache expiration bug

(cherry picked from commit 4b625fc)

# Conflicts:
#	metricbeat/module/kubernetes/container/container.go
#	metricbeat/module/kubernetes/pod/pod.go
#	metricbeat/module/kubernetes/state_cronjob/state_cronjob.go
#	metricbeat/module/kubernetes/state_daemonset/state_daemonset.go
#	metricbeat/module/kubernetes/state_persistentvolume/state_persistentvolume.go
#	metricbeat/module/kubernetes/state_persistentvolumeclaim/state_persistentvolumeclaim.go
#	metricbeat/module/kubernetes/util/kubernetes.go
#	metricbeat/module/kubernetes/volume/volume.go
mergify bot pushed four more commits that referenced this pull request Jun 17, 2022, each cherry-picking 4b625fc ("fixed cache expiration bug"), most with similar merge conflicts in the kubernetes module files.
gsantoro (Author) commented Jun 17, 2022

@gizas

Only missing a screenshot after the implementation to see that final metrics actually appear.

Sorry, I missed adding these. I took the screenshots but forgot to include them; they are in the description now.

Also is there a simple way to verify the value of cache timeout without debugging?

I have another Dockerfile that I am using to test the backport to 7.17... without debugging. I haven't pushed it to git yet. I am not sure if there is a more official way to do this.

gsantoro pushed five commits that referenced this pull request Jun 20, 2022, including "fixed cache expiration bug".
@gsantoro gsantoro deleted the feature/cache_expiration branch July 12, 2022 12:52
leweafan pushed three commits to leweafan/beats that referenced this pull request Apr 28, 2023.
chrisberkhout pushed a commit that referenced this pull request Jun 1, 2023
* fixed cache expiration bug
Labels

  • backport-7.17: Automated backport to the 7.17 branch with mergify
  • backport-v8.0.0: Automated backport with mergify
  • backport-v8.1.0: Automated backport with mergify
  • backport-v8.2.0: Automated backport with mergify
  • backport-v8.3.0: Automated backport with mergify
  • bug
  • discuss: Issue needs further discussion.
  • enhancement
  • in progress: Pull request is currently in progress.
  • Team:Cloudnative-Monitoring: Label for the Cloud Native Monitoring team
  • test-plan: Add this PR to the manual test plan
6 participants