-
Notifications
You must be signed in to change notification settings - Fork 1.3k
[Colocation] Make monitor interval and high usage limit configurable #4924
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
[Colocation] Make monitor interval and high usage limit configurable #4924
Conversation
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: The full list of commands accepted by this bot can be found here. DetailsNeeds approval from an approver in each of these files:Approvers can indicate their approval by writing |
|
Welcome @kajal-jotwani! It looks like this is your first PR to volcano-sh/volcano 🎉 |
Summary of ChangesHello @kajal-jotwani, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request enhances the node colocation feature by making key monitoring parameters configurable. It allows administrators to customize how frequently node resources are monitored and the threshold for consecutive high usage detections before eviction is triggered. This change improves flexibility and adaptability for different operational environments while ensuring data integrity through input validation and maintaining existing default behaviors. Highlights
🧠 New Feature in Public Preview: You can now enable Memory to help Gemini Code Assist learn from your team's feedback. This makes future code reviews more consistent and personalized to your project's style. Click here to enable Memory in your admin console. Using Gemini Code AssistThe full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips. Invoking Gemini You can request assistance from Gemini at any point by creating a comment using either
Customization To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a Limitations & Feedback Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here. You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension. Footnotes
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code Review
This pull request introduces configurability for the monitor interval and high usage limit, which is a valuable enhancement. The implementation of configuration and validation is well done. However, I've identified a critical deadlock issue due to incorrect lock ordering in node_monitor.go. Additionally, there's a potential performance problem with the use of time.Ticker for dynamic intervals that could lead to high CPU usage under certain conditions. I've provided detailed comments and code suggestions to address these issues.
| func (m *monitor) runWithDynamicInterval(f func(), stop <-chan struct{}) { | ||
| ticker := time.NewTicker(m.getMonitorInterval()) | ||
| defer ticker.Stop() | ||
|
|
||
| for { | ||
| select { | ||
| case <-stop: | ||
| return | ||
| case <-ticker.C: | ||
| f() | ||
| newInterval := m.getMonitorInterval() | ||
| ticker.Reset(newInterval) | ||
| } | ||
| } | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Using time.Ticker with Reset in this manner can lead to performance issues. If the function f() takes longer to execute than the ticker's interval, the ticker will still fire and buffer a tick. The next loop iteration will then execute f() immediately, without waiting for the new interval. This can cause f() to run in a tight loop, potentially causing high CPU usage.
A safer and more conventional approach for this pattern is to use time.Timer, which guarantees a delay between executions, regardless of how long f() takes.
func (m *monitor) runWithDynamicInterval(f func(), stop <-chan struct{}) {
interval := m.getMonitorInterval()
timer := time.NewTimer(interval)
defer timer.Stop()
for {
select {
case <-stop:
return
case <-timer.C:
f()
newInterval := m.getMonitorInterval()
if newInterval != interval {
interval = newInterval
timer.Reset(interval)
} else {
timer.Reset(interval)
}
}
}
}Add support for configurable monitor interval and high usage count limit in colocation scenarios. Previously these were hardcoded to 10 seconds and 6 counts respectively. The new configuration supports hot-reload without requiring agent restart. Signed-off-by: kajal-jotwani <kajaljotwani06@gmail.com>
355a2a9 to
eb87e3d
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
This PR adds configurable parameters for node monitoring interval and high usage count limit in the colocation feature. Previously, these values were hardcoded (10 seconds and 6 consecutive detections respectively).
Key changes:
- Added two new configuration fields (
MonitorIntervalandHighUsageCountLimit) to theEvictingconfig struct - Implemented dynamic interval updates using a custom ticker-based monitoring loop
- Added validation to ensure both parameters are positive numbers
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 8 comments.
| File | Description |
|---|---|
| pkg/agent/config/api/types.go | Added MonitorInterval and HighUsageCountLimit fields to the Evicting struct with JSON tags |
| pkg/agent/config/api/validate.go | Added validation logic to ensure both new parameters are positive numbers |
| pkg/agent/config/utils/utils.go | Added default constants and updated default configuration and JSON template with the new parameters |
| pkg/agent/events/probes/nodemonitor/node_monitor.go | Replaced hardcoded constants with configurable fields, implemented dynamic interval monitoring with runWithDynamicInterval, and updated RefreshCfg and nodeHasPressure to use the new fields |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| if e.MonitorInterval != nil && *e.MonitorInterval <= 0 { | ||
| errs = append(errs, errors.New(IllegalMonitorInterval)) | ||
| } | ||
| if e.HighUsageCountLimit != nil && *e.HighUsageCountLimit <= 0 { | ||
| errs = append(errs, errors.New(IllegalHighUsageCountLimit)) | ||
| } |
Copilot
AI
Jan 7, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The new validation logic for MonitorInterval and HighUsageCountLimit lacks test coverage. Since the repository contains tests for other validation logic in validate_test.go, test cases should be added to verify the behavior when these fields are set to invalid values (zero or negative numbers).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
added tests for validate.go
| // Update monitor interval if configured | ||
| if cfg.EvictingConfig != nil && cfg.EvictingConfig.MonitorInterval != nil { | ||
| m.monitorInterval = time.Duration(*cfg.EvictingConfig.MonitorInterval) * time.Second | ||
| } | ||
|
|
||
| // Update high usage count limit if configured | ||
| if cfg.EvictingConfig != nil && cfg.EvictingConfig.HighUsageCountLimit != nil { | ||
| m.highUsageCountLimit = *cfg.EvictingConfig.HighUsageCountLimit | ||
| } | ||
|
|
Copilot
AI
Jan 7, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The new dynamic interval update logic in RefreshCfg and the runWithDynamicInterval function lack test coverage. Since the repository contains tests for the monitor in node_monitor_test.go, test cases should be added to verify that monitor interval and high usage count limit updates are properly applied during runtime configuration refresh.
Signed-off-by: kajal-jotwani <kajaljotwani06@gmail.com>
Signed-off-by: kajal-jotwani <kajaljotwani06@gmail.com>
767e743 to
5e2e0f1
Compare
Signed-off-by: kajal-jotwani <kajaljotwani06@gmail.com>
|
/ok-to-test |
Signed-off-by: kajal-jotwani <kajaljotwani06@gmail.com>
What type of PR is this?
kind/feature
area colocation
What this PR does / why we need it:
Currently, the node monitor uses hardcoded values for monitoring
interval (10 seconds) and high usage count limit (6 consecutive
detections).
This change adds two new configurable parameters to EvictingConfig:
Both parameters maintain backward compatibility with defaults of
10 seconds and 6 times respectively. The values are validated to
ensure they are positive numbers and can be configured via the
volcano-agent ConfigMap.
Which issue(s) this PR fixes:
Fixes #4911
Special notes for your reviewer:
Does this PR introduce a user-facing change?