Skip to content

Conversation

@kajal-jotwani
Copy link

What type of PR is this?

kind/feature
area colocation

What this PR does / why we need it:

Currently, the node monitor uses hardcoded values for monitoring
interval (10 seconds) and high usage count limit (6 consecutive
detections).
This change adds two new configurable parameters to EvictingConfig:

  • monitorInterval: controls how often node resources are monitored
  • highUsageCountLimit: defines consecutive high readings before eviction

Both parameters maintain backward compatibility with defaults of
10 seconds and 6 times respectively. The values are validated to
ensure they are positive numbers and can be configured via the
volcano-agent ConfigMap.

Which issue(s) this PR fixes:

Fixes #4911

Special notes for your reviewer:

Does this PR introduce a user-facing change?


Copilot AI review requested due to automatic review settings January 7, 2026 07:41
@volcano-sh-bot volcano-sh-bot added the do-not-merge/invalid-commit-message Indicates that a PR should not merge because it has an invalid commit message. label Jan 7, 2026
@volcano-sh-bot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign jessestutler for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@volcano-sh-bot
Copy link
Contributor

Welcome @kajal-jotwani! It looks like this is your first PR to volcano-sh/volcano 🎉

@volcano-sh-bot volcano-sh-bot added the size/M Denotes a PR that changes 30-99 lines, ignoring generated files. label Jan 7, 2026
@gemini-code-assist
Copy link

Summary of Changes

Hello @kajal-jotwani, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request enhances the node colocation feature by making key monitoring parameters configurable. It allows administrators to customize how frequently node resources are monitored and the threshold for consecutive high usage detections before eviction is triggered. This change improves flexibility and adaptability for different operational environments while ensuring data integrity through input validation and maintaining existing default behaviors.

Highlights

  • Configurable Node Monitoring Parameters: Introduced monitorInterval and highUsageCountLimit as configurable parameters within the EvictingConfig for the node monitor. Previously, these values were hardcoded.
  • Backward Compatibility and Defaults: The new parameters maintain backward compatibility by defaulting to 10 seconds for monitorInterval and 6 consecutive detections for highUsageCountLimit, matching the previous hardcoded behavior.
  • Input Validation: Added validation to ensure that both monitorInterval and highUsageCountLimit are positive numbers when configured, preventing invalid settings.
  • Dynamic Interval Updates: The node monitor's run loop has been refactored to dynamically adjust the monitoring interval, allowing changes to monitorInterval to take effect without restarting the agent.

🧠 New Feature in Public Preview: You can now enable Memory to help Gemini Code Assist learn from your team's feedback. This makes future code reviews more consistent and personalized to your project's style. Click here to enable Memory in your admin console.

Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature Command Description
Code Review /gemini review Performs a code review for the current pull request in its current state.
Pull Request Summary /gemini summary Provides a summary of the current pull request in its current state.
Comment @gemini-code-assist Responds in comments when explicitly tagged, both in pull request comments and review comments.
Help /gemini help Displays a list of available commands.

Customization

To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

Copy link

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces configurability for the monitor interval and high usage limit, which is a valuable enhancement. The implementation of configuration and validation is well done. However, I've identified a critical deadlock issue due to incorrect lock ordering in node_monitor.go. Additionally, there's a potential performance problem with the use of time.Ticker for dynamic intervals that could lead to high CPU usage under certain conditions. I've provided detailed comments and code suggestions to address these issues.

Comment on lines 92 to 106
func (m *monitor) runWithDynamicInterval(f func(), stop <-chan struct{}) {
ticker := time.NewTicker(m.getMonitorInterval())
defer ticker.Stop()

for {
select {
case <-stop:
return
case <-ticker.C:
f()
newInterval := m.getMonitorInterval()
ticker.Reset(newInterval)
}
}
}

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

Using time.Ticker with Reset in this manner can lead to performance issues. If the function f() takes longer to execute than the ticker's interval, the ticker will still fire and buffer a tick. The next loop iteration will then execute f() immediately, without waiting for the new interval. This can cause f() to run in a tight loop, potentially causing high CPU usage.

A safer and more conventional approach for this pattern is to use time.Timer, which guarantees a delay between executions, regardless of how long f() takes.

func (m *monitor) runWithDynamicInterval(f func(), stop <-chan struct{}) {
	interval := m.getMonitorInterval()
	timer := time.NewTimer(interval)
	defer timer.Stop()

	for {
		select {
		case <-stop:
			return
		case <-timer.C:
			f()
			newInterval := m.getMonitorInterval()
			if newInterval != interval {
				interval = newInterval
				timer.Reset(interval)
			} else {
				timer.Reset(interval)
			}
		}
	}
}

Add support for configurable monitor interval and high usage
count limit in colocation scenarios. Previously these were
hardcoded to 10 seconds and 6 counts respectively.
The new configuration supports hot-reload without requiring
agent restart.

Signed-off-by: kajal-jotwani <kajaljotwani06@gmail.com>
@kajal-jotwani kajal-jotwani force-pushed the feature/4911-configurable-monitor-parameters branch from 355a2a9 to eb87e3d Compare January 7, 2026 07:47
@volcano-sh-bot volcano-sh-bot removed the do-not-merge/invalid-commit-message Indicates that a PR should not merge because it has an invalid commit message. label Jan 7, 2026
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR adds configurable parameters for node monitoring interval and high usage count limit in the colocation feature. Previously, these values were hardcoded (10 seconds and 6 consecutive detections respectively).

Key changes:

  • Added two new configuration fields (MonitorInterval and HighUsageCountLimit) to the Evicting config struct
  • Implemented dynamic interval updates using a custom ticker-based monitoring loop
  • Added validation to ensure both parameters are positive numbers

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 8 comments.

File Description
pkg/agent/config/api/types.go Added MonitorInterval and HighUsageCountLimit fields to the Evicting struct with JSON tags
pkg/agent/config/api/validate.go Added validation logic to ensure both new parameters are positive numbers
pkg/agent/config/utils/utils.go Added default constants and updated default configuration and JSON template with the new parameters
pkg/agent/events/probes/nodemonitor/node_monitor.go Replaced hardcoded constants with configurable fields, implemented dynamic interval monitoring with runWithDynamicInterval, and updated RefreshCfg and nodeHasPressure to use the new fields

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +131 to +136
if e.MonitorInterval != nil && *e.MonitorInterval <= 0 {
errs = append(errs, errors.New(IllegalMonitorInterval))
}
if e.HighUsageCountLimit != nil && *e.HighUsageCountLimit <= 0 {
errs = append(errs, errors.New(IllegalHighUsageCountLimit))
}
Copy link

Copilot AI Jan 7, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The new validation logic for MonitorInterval and HighUsageCountLimit lacks test coverage. Since the repository contains tests for other validation logic in validate_test.go, test cases should be added to verify the behavior when these fields are set to invalid values (zero or negative numbers).

Copilot uses AI. Check for mistakes.
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added tests for validate.go

Comment on lines 118 to 127
// Update monitor interval if configured
if cfg.EvictingConfig != nil && cfg.EvictingConfig.MonitorInterval != nil {
m.monitorInterval = time.Duration(*cfg.EvictingConfig.MonitorInterval) * time.Second
}

// Update high usage count limit if configured
if cfg.EvictingConfig != nil && cfg.EvictingConfig.HighUsageCountLimit != nil {
m.highUsageCountLimit = *cfg.EvictingConfig.HighUsageCountLimit
}

Copy link

Copilot AI Jan 7, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The new dynamic interval update logic in RefreshCfg and the runWithDynamicInterval function lack test coverage. Since the repository contains tests for the monitor in node_monitor_test.go, test cases should be added to verify that monitor interval and high usage count limit updates are properly applied during runtime configuration refresh.

Copilot uses AI. Check for mistakes.
Signed-off-by: kajal-jotwani <kajaljotwani06@gmail.com>
Signed-off-by: kajal-jotwani <kajaljotwani06@gmail.com>
@kajal-jotwani kajal-jotwani force-pushed the feature/4911-configurable-monitor-parameters branch from 767e743 to 5e2e0f1 Compare January 7, 2026 13:04
Signed-off-by: kajal-jotwani <kajaljotwani06@gmail.com>
@volcano-sh-bot volcano-sh-bot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Jan 7, 2026
@hzxuzhonghu
Copy link
Member

/ok-to-test

@volcano-sh-bot volcano-sh-bot added the ok-to-test Indicates a non-member PR verified by an org member that is safe to test. label Jan 8, 2026
Signed-off-by: kajal-jotwani <kajaljotwani06@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

ok-to-test Indicates a non-member PR verified by an org member that is safe to test. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Colocation Enhancement] Support configurable node monitor parameters

3 participants