[Colocation] Make monitor interval and high usage limit configurable #4924

kajal-jotwani · 2026-01-07T07:41:32Z

What type of PR is this?

kind/feature
area colocation

What this PR does / why we need it:

Currently, the node monitor uses hardcoded values for monitoring
interval (10 seconds) and high usage count limit (6 consecutive
detections).
This change adds two new configurable parameters to EvictingConfig:

monitorInterval: controls how often node resources are monitored
highUsageCountLimit: defines consecutive high readings before eviction

Both parameters maintain backward compatibility with defaults of
10 seconds and 6 times respectively. The values are validated to
ensure they are positive numbers and can be configured via the
volcano-agent ConfigMap.

Which issue(s) this PR fixes:

Fixes #4911

Special notes for your reviewer:

Does this PR introduce a user-facing change?

volcano-sh-bot · 2026-01-07T07:41:38Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign jessestutler for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Details

Needs approval from an approver in each of these files:

OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

volcano-sh-bot · 2026-01-07T07:41:43Z

Welcome @kajal-jotwani! It looks like this is your first PR to volcano-sh/volcano 🎉

gemini-code-assist · 2026-01-07T07:41:53Z

Summary of Changes

Hello @kajal-jotwani, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request enhances the node colocation feature by making key monitoring parameters configurable. It allows administrators to customize how frequently node resources are monitored and the threshold for consecutive high usage detections before eviction is triggered. This change improves flexibility and adaptability for different operational environments while ensuring data integrity through input validation and maintaining existing default behaviors.

Highlights

Configurable Node Monitoring Parameters: Introduced monitorInterval and highUsageCountLimit as configurable parameters within the EvictingConfig for the node monitor. Previously, these values were hardcoded.
Backward Compatibility and Defaults: The new parameters maintain backward compatibility by defaulting to 10 seconds for monitorInterval and 6 consecutive detections for highUsageCountLimit, matching the previous hardcoded behavior.
Input Validation: Added validation to ensure that both monitorInterval and highUsageCountLimit are positive numbers when configured, preventing invalid settings.
Dynamic Interval Updates: The node monitor's run loop has been refactored to dynamically adjust the monitoring interval, allowing changes to monitorInterval to take effect without restarting the agent.

🧠 New Feature in Public Preview: You can now enable Memory to help Gemini Code Assist learn from your team's feedback. This makes future code reviews more consistent and personalized to your project's style. Click here to enable Memory in your admin console.

Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature	Command	Description
Code Review	`/gemini review`	Performs a code review for the current pull request in its current state.
Pull Request Summary	`/gemini summary`	Provides a summary of the current pull request in its current state.
Comment	@gemini-code-assist	Responds in comments when explicitly tagged, both in pull request comments and review comments.
Help	`/gemini help`	Displays a list of available commands.

Customization

To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution. ↩

gemini-code-assist

Code Review

This pull request introduces configurability for the monitor interval and high usage limit, which is a valuable enhancement. The implementation of configuration and validation is well done. However, I've identified a critical deadlock issue due to incorrect lock ordering in node_monitor.go. Additionally, there's a potential performance problem with the use of time.Ticker for dynamic intervals that could lead to high CPU usage under certain conditions. I've provided detailed comments and code suggestions to address these issues.

pkg/agent/events/probes/nodemonitor/node_monitor.go

gemini-code-assist · 2026-01-07T07:43:29Z

pkg/agent/events/probes/nodemonitor/node_monitor.go

+func (m *monitor) runWithDynamicInterval(f func(), stop <-chan struct{}) {
+	ticker := time.NewTicker(m.getMonitorInterval())
+	defer ticker.Stop()
+
+	for {
+		select {
+		case <-stop:
+			return
+		case <-ticker.C: 
+			f()
+			newInterval := m.getMonitorInterval()
+			ticker.Reset(newInterval)
+		}
+	}
+}


Using time.Ticker with Reset in this manner can lead to performance issues. If the function f() takes longer to execute than the ticker's interval, the ticker will still fire and buffer a tick. The next loop iteration will then execute f() immediately, without waiting for the new interval. This can cause f() to run in a tight loop, potentially causing high CPU usage.

A safer and more conventional approach for this pattern is to use time.Timer, which guarantees a delay between executions, regardless of how long f() takes.

func (m *monitor) runWithDynamicInterval(f func(), stop <-chan struct{}) { interval := m.getMonitorInterval() timer := time.NewTimer(interval) defer timer.Stop() for { select { case <-stop: return case <-timer.C: f() newInterval := m.getMonitorInterval() if newInterval != interval { interval = newInterval timer.Reset(interval) } else { timer.Reset(interval) } } } }

pkg/agent/events/probes/nodemonitor/node_monitor.go

Add support for configurable monitor interval and high usage count limit in colocation scenarios. Previously these were hardcoded to 10 seconds and 6 counts respectively. The new configuration supports hot-reload without requiring agent restart. Signed-off-by: kajal-jotwani <kajaljotwani06@gmail.com>

Copilot

Pull request overview

This PR adds configurable parameters for node monitoring interval and high usage count limit in the colocation feature. Previously, these values were hardcoded (10 seconds and 6 consecutive detections respectively).

Key changes:

Added two new configuration fields (MonitorInterval and HighUsageCountLimit) to the Evicting config struct
Implemented dynamic interval updates using a custom ticker-based monitoring loop
Added validation to ensure both parameters are positive numbers

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 8 comments.

File	Description
pkg/agent/config/api/types.go	Added `MonitorInterval` and `HighUsageCountLimit` fields to the `Evicting` struct with JSON tags
pkg/agent/config/api/validate.go	Added validation logic to ensure both new parameters are positive numbers
pkg/agent/config/utils/utils.go	Added default constants and updated default configuration and JSON template with the new parameters
pkg/agent/events/probes/nodemonitor/node_monitor.go	Replaced hardcoded constants with configurable fields, implemented dynamic interval monitoring with `runWithDynamicInterval`, and updated `RefreshCfg` and `nodeHasPressure` to use the new fields

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

pkg/agent/config/utils/utils.go

pkg/agent/events/probes/nodemonitor/node_monitor.go

Copilot · 2026-01-07T07:51:34Z

pkg/agent/config/api/validate.go

+	if e.MonitorInterval != nil && *e.MonitorInterval <= 0 {
+		errs = append(errs, errors.New(IllegalMonitorInterval))
+	}
+	if e.HighUsageCountLimit != nil && *e.HighUsageCountLimit <= 0 {
+		errs = append(errs, errors.New(IllegalHighUsageCountLimit))
+	}


The new validation logic for MonitorInterval and HighUsageCountLimit lacks test coverage. Since the repository contains tests for other validation logic in validate_test.go, test cases should be added to verify the behavior when these fields are set to invalid values (zero or negative numbers).

added tests for validate.go

Copilot · 2026-01-07T07:51:35Z

pkg/agent/events/probes/nodemonitor/node_monitor.go

+	// Update monitor interval if configured
+	if cfg.EvictingConfig != nil && cfg.EvictingConfig.MonitorInterval != nil {
+		m.monitorInterval = time.Duration(*cfg.EvictingConfig.MonitorInterval) * time.Second
+	}
+
+	// Update high usage count limit if configured
+	if cfg.EvictingConfig != nil && cfg.EvictingConfig.HighUsageCountLimit != nil {
+		m.highUsageCountLimit = *cfg.EvictingConfig.HighUsageCountLimit
+	}
+


The new dynamic interval update logic in RefreshCfg and the runWithDynamicInterval function lack test coverage. Since the repository contains tests for the monitor in node_monitor_test.go, test cases should be added to verify that monitor interval and high usage count limit updates are properly applied during runtime configuration refresh.

pkg/agent/events/probes/nodemonitor/node_monitor.go

pkg/agent/config/utils/utils.go

pkg/agent/config/api/types.go

Signed-off-by: kajal-jotwani <kajaljotwani06@gmail.com>

hzxuzhonghu · 2026-01-08T09:24:22Z

/ok-to-test

Signed-off-by: kajal-jotwani <kajaljotwani06@gmail.com>

Copilot AI review requested due to automatic review settings January 7, 2026 07:41

volcano-sh-bot added the do-not-merge/invalid-commit-message Indicates that a PR should not merge because it has an invalid commit message. label Jan 7, 2026

volcano-sh-bot requested review from Monokaix and hzxuzhonghu January 7, 2026 07:41

volcano-sh-bot added the size/M Denotes a PR that changes 30-99 lines, ignoring generated files. label Jan 7, 2026

Copilot started reviewing on behalf of kajal-jotwani January 7, 2026 07:41 View session

gemini-code-assist bot reviewed Jan 7, 2026

View reviewed changes

kajal-jotwani force-pushed the feature/4911-configurable-monitor-parameters branch from 355a2a9 to eb87e3d Compare January 7, 2026 07:47

volcano-sh-bot removed the do-not-merge/invalid-commit-message Indicates that a PR should not merge because it has an invalid commit message. label Jan 7, 2026

Copilot AI reviewed Jan 7, 2026

View reviewed changes

kajal-jotwani added 2 commits January 7, 2026 18:32

fix: remove trailing whitespace

5035d60

Signed-off-by: kajal-jotwani <kajaljotwani06@gmail.com>

fix: node monitor locking and unnecessary ticker resets

5e2e0f1

Signed-off-by: kajal-jotwani <kajaljotwani06@gmail.com>

kajal-jotwani force-pushed the feature/4911-configurable-monitor-parameters branch from 767e743 to 5e2e0f1 Compare January 7, 2026 13:04

test: add validation coverage for evicting monitor parameters

284e40d

Signed-off-by: kajal-jotwani <kajaljotwani06@gmail.com>

volcano-sh-bot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Jan 7, 2026

volcano-sh-bot added the ok-to-test Indicates a non-member PR verified by an org member that is safe to test. label Jan 8, 2026

Remove unused wait package import

d704414

Signed-off-by: kajal-jotwani <kajaljotwani06@gmail.com>

[Colocation] Make monitor interval and high usage limit configurable #4924

Are you sure you want to change the base?

[Colocation] Make monitor interval and high usage limit configurable #4924

Conversation

kajal-jotwani commented Jan 7, 2026

What type of PR is this?

What this PR does / why we need it:

Which issue(s) this PR fixes:

Special notes for your reviewer:

Does this PR introduce a user-facing change?

Uh oh!

volcano-sh-bot commented Jan 7, 2026

Uh oh!

volcano-sh-bot commented Jan 7, 2026

Uh oh!

gemini-code-assist bot commented Jan 7, 2026

Summary of Changes

Highlights

Footnotes

Uh oh!

gemini-code-assist bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

Uh oh!

gemini-code-assist bot Jan 7, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Copilot AI Jan 7, 2026

Choose a reason for hiding this comment

Uh oh!

kajal-jotwani Jan 7, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Jan 7, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

hzxuzhonghu commented Jan 8, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants