Feature/cache expiration #31785
Conversation
@@ -83,7 +84,7 @@ metadata:
 data:
   system.yml: |-
     - module: system
-      period: 10s
+      period: 1m
Please revert those values to the original ones before merge
sure
A couple of comments we already discussed offline:
- Can you revert the changes on the config files that are not needed for the patch itself?
- At a quick glance, I think we can use your logic to set the expiration's default value and also make it available as an optional setting. This way, our users can just ignore the setting and the defaults will do the trick, while anyone who wants to specify it still has an option available.

Some key code parts that might help in adding the setting:
- `type kubernetesConfig struct {`: most probably you will need to unpack the config at the metricset level, like what we do inside the `Enricher`'s implementation (`config := ValidatedConfig(base)`)
- Check the `Enricher`'s initialisation: `enricher: util.NewContainerMetadataEnricher(base, true),`
@@ -72,6 +72,12 @@ func New(base mb.BaseMetricSet) (mb.MetricSet, error) {
 	if !ok {
 		return nil, fmt.Errorf("must be child of kubernetes module")
 	}
+
+	newTimeout := base.Module().Config().Period * 2
I think we should have logic so that only when `base.Module().Config().Period` is >= 120 seconds do we set `newTimeout` to `base.Module().Config().Period * 2`. And that should happen only if the user hasn't specified a timeout value in the config.
If we always set `newTimeout` to `scraping_period * 2`, then in the most common case, where the period is 10s, the cache will expire too soon, and maybe that defeats its purpose.
I tried to explain the current behaviour in a PR comment at #31785 (comment). Happy to have a longer chat about this.
In short, with a 10s period and a 20s cache timeout, the cache expires every 20s but it is still replaced every 10s (if all scrapes work).
I would suggest using `period * 5` to have a longer cache and account for possible failures during scraping.
I put together the scenarios below. A cache expiration time of 5 * scraping period also prevents the cache from expiring if one or more scrapes do NOT take place.

Scenarios with the current code (before the modifications in this PR):

Scenarios with the modifications in this PR:
There is a similar config for the expiration of the cache at
Comments on #31785 (review)
In that case you can just add everything needed at
This pull request is now in conflicts. Could you fix it? 🙏
A minor suggestion/fix for the changelog. Keep in mind that changelog entries are user-facing content and thus should be quite accurate and not overly technical.
Co-authored-by: Chris Mark <chrismarkou92@gmail.com>
Thanks for the detailed work here!
Wait for green CI and ship it!
The only thing missing is a screenshot after the implementation, to confirm that the final metrics actually appear.
Also, is there a simple way to verify the value of the cache timeout without debugging?
@@ -27,7 +27,7 @@ data:
 - config:
   - module: kubernetes
     hosts: ["kube-state-metrics:8080"]
-    period: 10s
+    period: 3m
Same here, revert before merge
you are looking at an outdated commit. this has been resolved already
@@ -156,12 +157,18 @@ spec:
  dnsPolicy: ClusterFirstWithHostNet
  containers:
  - name: metricbeat
    image: docker.elastic.co/beats/metricbeat:8.4.0
I think you need to add this back? Align with master
you are looking at an outdated commit. this has been resolved already
    image: docker.elastic.co/beats/metricbeat:8.4.0
    # image: docker.elastic.co/beats/metricbeat:8.3.0-SNAPSHOT
    image: metricbeat-debugger-image:latest
    imagePullPolicy: Never
Also, can you remove all the debug fields? They should not be merged.
I think you should create a new folder that describes how to test.
Maybe add test/debug instructions in the README.
you are looking at an outdated commit. this has been resolved already
@@ -33,13 +33,19 @@ func init() {
 const defaultTimeout = 120 * time.Second

 // NewPerfMetricsCache initializes and returns a new PerfMetricsCache
-func NewPerfMetricsCache() *PerfMetricsCache {
+func NewPerfMetricsCache(timeout time.Duration) *PerfMetricsCache {
+	if timeout <= 0 {
Does it make sense to check timeout != nil as well?
timeout cannot be nil since it is not a reference. It should always be initialized to something
func (c *PerfMetricsCache) GetTimeout() time.Duration {
	var ans time.Duration = 0

	nmATimeout := c.NodeMemAllocatable.GetTimeout()
Instead of multiple if comparisons, you can initialise an array and make only one assignment, e.g. https://learningprogramming.net/golang/golang-golang/find-max-and-min-of-array-in-golang/
Just an idea to consider.
* fixed cache expiration bug (cherry picked from commit 4b625fc)

  Conflicts:
  - metricbeat/module/kubernetes/container/container.go
  - metricbeat/module/kubernetes/pod/pod.go
  - metricbeat/module/kubernetes/state_cronjob/state_cronjob.go
  - metricbeat/module/kubernetes/state_daemonset/state_daemonset.go
  - metricbeat/module/kubernetes/state_persistentvolume/state_persistentvolume.go
  - metricbeat/module/kubernetes/state_persistentvolumeclaim/state_persistentvolumeclaim.go
  - metricbeat/module/kubernetes/util/kubernetes.go
  - metricbeat/module/kubernetes/volume/volume.go
* fixed cache expiration bug (cherry picked from commit 4b625fc)
Sorry, I missed adding these. I took the screenshots and then forgot to add them. They are in the description now.
I have another Dockerfile I am using to test the backport to 7.17... without debugging. I haven't pushed it to git yet. I am not sure if there is a more official way to do this.
* Feature/cache expiration (#31785)
* fixed cache expiration bug
What does this PR do?

Make the internal cache PerfMetricsCache, used by various metricsets in the kubernetes module of metricbeat, not expire between scrapes. This avoids missing metrics or calculating incorrect metrics, as explained in issue #31778.

This PR dynamically changes the timeout of the internal cache PerfMetricsCache in the kubernetes module so that it is always larger than the scraping period set by the users in the configs. In particular, the cache timeout is always set to 2x the largest `period` setting of any of the metricsets.

Why is it important?

It fixes a bug, reported by a customer in an SDH, that prevented some metrics like `kubernetes.container.cpu.usage.limit.pct` from being populated. The customer first reported the issue after setting a `period` of 15 minutes in the kubernetes module settings; the bug was caused by the cache timeout being hardcoded to 2 minutes. This PR fixes the missing metrics and makes it possible to set any value for `period`.

Checklist

- I have added an entry in CHANGELOG.next.asciidoc or CHANGELOG-developer.next.asciidoc.
.How to test this PR locally
period
setting in the kubernetes module larger than 2 minutes. Observe the following metrics to be correctly computed (values >=0 but NOT missing):Before the code change:
Some memory metrics for
kubernetes.container.memory.usage.limit.pct
are computed but most of them are missing.Some metrics for
kubernetes.container.cpu.usage.limit.pct
are correctly computed but most of them are missing.The percentage of metrics missing depends on the different configs for the period of the various metricsets. In order to have this fixed, the metrics should be there all the time with a value >= 0.
After the code changes:
Related issues
Link related issues below. Insert the issue link or reference after the word "Closes" if merging this should automatically close it.