Contributor

Copilot AI commented Nov 18, 2025

What type of PR is this?

Bug fix

What this PR does / why we need it:

The scheduler panics with a nil pointer dereference during pod scheduling when started without --enable-metrics=true. The Kubernetes scheduler framework plugins unconditionally access k8smetrics.Goroutines, which was initialized only inside the if opt.EnableMetrics || opt.EnablePprof block, so it stayed nil when neither option was set.

Changes:

  • Moved metrics.InitKubeSchedulerRelatedMetrics() outside the if opt.EnableMetrics || opt.EnablePprof block in cmd/scheduler/app/server.go
  • Ensures k8smetrics.Goroutines is always initialized before the scheduler starts

// Before: InitKubeSchedulerRelatedMetrics only called when metrics enabled
if opt.EnableMetrics || opt.EnablePprof {
    metrics.InitKubeSchedulerRelatedMetrics()
    go startMetricsServer(opt)
}

// After: Always initialize required metrics
metrics.InitKubeSchedulerRelatedMetrics()

if opt.EnableMetrics || opt.EnablePprof {
    go startMetricsServer(opt)
}
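
The ordering change above can be tried out in isolation with a minimal, self-contained sketch. Everything in it is a hypothetical stand-in: `gaugeVec`, `options`, `initSchedulerMetrics`, and `startup` are illustrative names, not the real k8s.io/component-base or Volcano types. Only the control flow (initialize always, start the server conditionally) mirrors the change.

```go
package main

import "fmt"

// gaugeVec is a hypothetical stand-in for a labeled gauge metric.
type gaugeVec struct {
	values map[string]float64
}

// Inc increments the gauge for a label. It dereferences g,
// so calling it on a nil *gaugeVec panics.
func (g *gaugeVec) Inc(label string) {
	g.values[label]++
}

// goroutines mimics a package-level metric that is nil until initialized.
var goroutines *gaugeVec

// initSchedulerMetrics stands in for metrics.InitKubeSchedulerRelatedMetrics().
func initSchedulerMetrics() {
	goroutines = &gaugeVec{values: map[string]float64{}}
}

// options is a hypothetical subset of the scheduler's server options.
type options struct {
	EnableMetrics bool
	EnablePprof   bool
}

// startup mirrors the "after" ordering: the metric is always initialized,
// and only the metrics server is conditional. It reports whether the
// server would be started.
func startup(opt options) (serverStarted bool) {
	initSchedulerMetrics()
	if opt.EnableMetrics || opt.EnablePprof {
		serverStarted = true // the real code launches startMetricsServer here
	}
	return serverStarted
}

func main() {
	started := startup(options{}) // both flags off
	goroutines.Inc("example")     // safe: the metric exists even without a server
	fmt.Println(started, goroutines.values["example"])
}
```

With both flags off, the server is not started but the metric is still usable, which is the property the fix relies on.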

Which issue(s) this PR fixes:

Fixes #4729

Special notes for your reviewer:

The panic occurs in k8s.io/kubernetes/pkg/scheduler/framework/parallelize.Parallelizer.Until (parallelism.go:57 in the stack trace below) when it calls Goroutines.WithLabelValues() on a nil metric. This metric must be initialized regardless of whether metrics export is enabled.

Does this PR introduce a user-facing change?

Fix scheduler panic when starting without --enable-metrics=true flag
Original prompt

This section details the original issue you should resolve

<issue_title>volcano-scheduler v1.12.0 without --enable-metrics=true, panic during pod scheduling</issue_title>
<issue_description>### Description

version: volcano v1.12.0
module: volcano-scheduler
volcano-scheduler panics during pod scheduling
problem commit: 981e18b2

root@ubuntu-212-117:/home/fjq/train# kubectl logs -n volcano-system volcano-scheduler-64c8f9df7f-v9ntx
2025/11/18 04:30:58 maxprocs: Updating GOMAXPROCS=5: determined from CPU quota
E1118 04:30:59.813753 1 panic.go:262] "Observed a panic" panic="runtime error: invalid memory address or nil pointer dereference" panicGoValue=""invalid memory address or nil pointer dereference"" stacktrace=<
goroutine 507 [running]:
k8s.io/apimachinery/pkg/util/runtime.logPanic({0x55e62dfe1398, 0x55e62f9c6460}, {0x55e62db46fe0, 0x55e62f8df530})
/root/go/pkg/mod/k8s.io/apimachinery@v0.32.2/pkg/util/runtime/runtime.go:107 +0xbc
k8s.io/apimachinery/pkg/util/runtime.handleCrash({0x55e62dfe1398, 0x55e62f9c6460}, {0x55e62db46fe0, 0x55e62f8df530}, {0x55e62f9c6460, 0x0, 0x10000c0005767e0?})
/root/go/pkg/mod/k8s.io/apimachinery@v0.32.2/pkg/util/runtime/runtime.go:82 +0x5e
k8s.io/apimachinery/pkg/util/runtime.HandleCrash({0x0, 0x0, 0x55e62f9a3b40?})
/root/go/pkg/mod/k8s.io/apimachinery@v0.32.2/pkg/util/runtime/runtime.go:59 +0x108
panic({0x55e62db46fe0?, 0x55e62f8df530?})
/usr/local/go/src/runtime/panic.go:785 +0x132
k8s.io/component-base/metrics.(*GaugeVec).WithLabelValuesChecked(0x0, {0xc0000652e0, 0x1, 0x1})
/root/go/pkg/mod/k8s.io/component-base@v0.32.2/metrics/gauge.go:140 +0x32
k8s.io/component-base/metrics.(*GaugeVec).WithLabelValues(0x50?, {0xc0000652e0?, 0x7f94afaae108?, 0x50?})
/root/go/pkg/mod/k8s.io/component-base@v0.32.2/metrics/gauge.go:178 +0x1c
k8s.io/kubernetes/pkg/scheduler/framework/parallelize.Parallelizer.Until({0xc0000653a8?}, {0x55e62dfe16e0, 0x55e62f9c6460}, 0x0, 0xc00019f180, {0x55e62d3ef5d8?, 0x0?})
/root/go/pkg/mod/k8s.io/kubernetes@v1.32.2/pkg/scheduler/framework/parallelize/parallelism.go:57 +0x6a
k8s.io/kubernetes/pkg/scheduler/framework/plugins/interpodaffinity.(*InterPodAffinity).getExistingAntiAffinityCounts(0xc000e14d80, {0x55e62dfe16e0, 0x55e62f9c6460}, 0xc000948908, 0xc000dc1380, {0xc000ccd380, 0x0, 0x19})
/root/go/pkg/mod/k8s.io/kubernetes@v1.32.2/pkg/scheduler/framework/plugins/interpodaffinity/filtering.go:170 +0x17d
k8s.io/kubernetes/pkg/scheduler/framework/plugins/interpodaffinity.(*InterPodAffinity).PreFilter(0xc000e14d80, {0x55e62dfe16e0, 0x55e62f9c6460}, 0xc000e19840, 0xc000948908)
/root/go/pkg/mod/k8s.io/kubernetes@v1.32.2/pkg/scheduler/framework/plugins/interpodaffinity/filtering.go:254 +0x5bc
volcano.sh/volcano/pkg/scheduler/plugins/predicates.(*predicatesPlugin).OnSessionOpen.func4(0xc00110f880)
/root/go/src/volcano.sh/volcano/pkg/scheduler/plugins/predicates/predicates.go:435 +0x105
volcano.sh/volcano/pkg/scheduler/framework.(*Session).PrePredicateFn(0xc000772008, 0xc00110f880)
/root/go/src/volcano.sh/volcano/pkg/scheduler/framework/session_plugins.go:685 +0x118
volcano.sh/volcano/pkg/scheduler/actions/allocate.(*Action).allocateResourcesForTasks(0xc0004be738, 0xc000db27c0, 0xc000766a00, 0xc000693220, {0xc000af2d00, 0x19, 0x19}, {0x0, 0x0})
/root/go/src/volcano.sh/volcano/pkg/scheduler/actions/allocate/allocate.go:378 +0x927
volcano.sh/volcano/pkg/scheduler/actions/allocate.(*Action).allocateResources(0xc0004be738, 0xc000db2740, 0xc000065d08)
/root/go/src/volcano.sh/volcano/pkg/scheduler/actions/allocate/allocate.go:200 +0xba9
volcano.sh/volcano/pkg/scheduler/actions/allocate.(*Action).Execute(0xc0004be738, 0xc000772008)
/root/go/src/volcano.sh/volcano/pkg/scheduler/actions/allocate/allocate.go:82 +0x314
volcano.sh/volcano/pkg/scheduler.(*Scheduler).runOnce(0xc0005aa300)
/root/go/src/volcano.sh/volcano/pkg/scheduler/scheduler.go:125 +0x367
k8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1(0xc000065f40?)
/root/go/pkg/mod/k8s.io/apimachinery@v0.32.2/pkg/util/wait/backoff.go:226 +0x33
k8s.io/apimachinery/pkg/util/wait.BackoffUntil(0xc00063be40, {0x55e62dfb47c0, 0xc000367b30}, 0x1, 0xc0006a1260)
/root/go/pkg/mod/k8s.io/apimachinery@v0.32.2/pkg/util/wait/backoff.go:227 +0xaf
k8s.io/apimachinery/pkg/util/wait.JitterUntil(0xc00063be40, 0x3b9aca00, 0x0, 0x1, 0xc0006a1260)
/root/go/pkg/mod/k8s.io/apimachinery@v0.32.2/pkg/util/wait/backoff.go:204 +0x7f
k8s.io/apimachinery/pkg/util/wait.Until(...)
/root/go/pkg/mod/k8s.io/apimachinery@v0.32.2/pkg/util/wait/backoff.go:161
created b...


@volcano-sh-bot volcano-sh-bot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. labels Nov 18, 2025
… conditional

Co-authored-by: JesseStutler <38534065+JesseStutler@users.noreply.github.com>
Copilot AI changed the title [WIP] Fix panic in volcano-scheduler when scheduling pods Fix scheduler panic when metrics are disabled Nov 18, 2025
Copilot AI requested a review from JesseStutler November 18, 2025 04:25
@JesseStutler JesseStutler marked this pull request as ready for review November 18, 2025 06:13
Copilot AI review requested due to automatic review settings November 18, 2025 06:13
@volcano-sh-bot volcano-sh-bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Nov 18, 2025
@JesseStutler
Member

/approve

@volcano-sh-bot volcano-sh-bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Nov 18, 2025
Contributor

Copilot AI left a comment

Pull Request Overview

This PR fixes a critical scheduler panic that occurred when the scheduler was started without the --enable-metrics=true flag. The panic was caused by a nil pointer dereference when Kubernetes scheduler framework plugins attempted to access the k8smetrics.Goroutines metric, which was only initialized when metrics were enabled.

Key Changes:

  • Moved metrics.InitKubeSchedulerRelatedMetrics() outside the conditional metrics/pprof block to ensure it's always called
  • Added a clarifying comment explaining why this initialization must always occur regardless of metrics being enabled

Contributor

@archlitchi archlitchi left a comment

/lgtm

@volcano-sh-bot volcano-sh-bot added the lgtm Indicates that a PR is ready to be merged. label Nov 18, 2025
@volcano-sh-bot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: archlitchi, JesseStutler

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@volcano-sh-bot volcano-sh-bot merged commit a762c6f into master Nov 18, 2025
42 of 46 checks passed
@JesseStutler
Member

@copilot Could you also cherry-pick this PR into the release-1.12 and release-1.13 branches and open two separate PRs for them?

@JesseStutler
Member

/cherry-pick release-1.13

@volcano-sh-bot
Contributor

@JesseStutler: #4731 failed to apply on top of branch "release-1.13":

Patch is empty.
When you have resolved this problem, run "git am --continue".
If you prefer to skip this patch, run "git am --skip" instead.
To restore the original branch and stop patching, run "git am --abort".

Details

In response to this:

/cherry-pick release-1.13

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@volcano-sh-bot
Contributor

@JesseStutler: new issue created for failed cherrypick: #4768

Copilot AI added a commit that referenced this pull request Dec 1, 2025
Move InitKubeSchedulerRelatedMetrics() outside the EnableMetrics/EnablePprof
conditional block to ensure k8smetrics.Goroutines is always initialized.
This prevents panic when scheduler starts without --enable-metrics=true.

Fixes #4729 (cherry-pick from #4731 to release-1.13)

Co-authored-by: JesseStutler <38534065+JesseStutler@users.noreply.github.com>
@JesseStutler
Member

/cherrypick release-1.13

@volcano-sh-bot
Contributor

@JesseStutler: #4731 failed to apply on top of branch "release-1.13":

Patch is empty.
When you have resolved this problem, run "git am --continue".
If you prefer to skip this patch, run "git am --skip" instead.
To restore the original branch and stop patching, run "git am --abort".

Details

In response to this:

/cherrypick release-1.13

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@volcano-sh-bot
Contributor

@JesseStutler: new issue created for failed cherrypick: #4834

@JesseStutler
Member

/cherrypick release-1.12

@volcano-sh-bot
Contributor

@JesseStutler: #4731 failed to apply on top of branch "release-1.12":

Patch is empty.
When you have resolved this problem, run "git am --continue".
If you prefer to skip this patch, run "git am --skip" instead.
To restore the original branch and stop patching, run "git am --abort".

Details

In response to this:

/cherrypick release-1.12

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@volcano-sh-bot
Contributor

@JesseStutler: new issue created for failed cherrypick: #4837

Copilot AI added a commit that referenced this pull request Dec 22, 2025
Move InitKubeSchedulerRelatedMetrics() outside conditional block to ensure
k8smetrics.Goroutines is always initialized before scheduler starts.

This is a manual cherry-pick of PR #4731 to the release-1.12 branch.

Co-authored-by: JesseStutler <38534065+JesseStutler@users.noreply.github.com>
Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. lgtm Indicates that a PR is ready to be merged. size/XS Denotes a PR that changes 0-9 lines, ignoring generated files.

Development

Successfully merging this pull request may close these issues.

volcano-scheduler v1.12.0 without --enable-metrics=true, panic during pod scheduling

4 participants