Fix scheduler panic when metrics are disabled #4731
… conditional Co-authored-by: JesseStutler <38534065+JesseStutler@users.noreply.github.com>
/approve
Pull Request Overview
This PR fixes a critical scheduler panic that occurred when the scheduler was started without the --enable-metrics=true flag. The panic was caused by a nil pointer dereference when Kubernetes scheduler framework plugins attempted to access the k8smetrics.Goroutines metric, which was only initialized when metrics were enabled.
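The failure mode described above can be illustrated with a minimal Go sketch. It assumes a hypothetical `gaugeVec` type and `withLabelValues` method as stand-ins; these are not the real k8s.io/component-base types, only a small model of a method that reads a field through a nil pointer receiver.

```go
package main

import "fmt"

// gaugeVec is a hypothetical stand-in for a labeled gauge type.
// It is NOT the k8s.io/component-base metrics type.
type gaugeVec struct {
	counts map[string]int
}

// withLabelValues reads a field of the receiver, so calling it on a
// nil *gaugeVec causes a nil pointer dereference at run time.
func (g *gaugeVec) withLabelValues(label string) int {
	return g.counts[label]
}

// tryAccess calls the method and reports whether the call panicked.
func tryAccess(g *gaugeVec) (panicked bool) {
	defer func() {
		if r := recover(); r != nil {
			panicked = true
		}
	}()
	_ = g.withLabelValues("worker")
	return false
}

func main() {
	var uninitialized *gaugeVec // never initialized, like a metric whose init was skipped
	initialized := &gaugeVec{counts: map[string]int{}}

	fmt.Println("uninitialized panics:", tryAccess(uninitialized))
	fmt.Println("initialized panics:", tryAccess(initialized))
}
```

In this sketch the panic is recovered only so that both cases can be printed side by side; the scheduler log in the linked issue shows the same class of error surfacing during pod scheduling.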
Key Changes:
- Moved metrics.InitKubeSchedulerRelatedMetrics() outside the conditional metrics/pprof block to ensure it's always called
- Added a clarifying comment explaining why this initialization must always occur regardless of metrics being enabled
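The reordering listed above could look roughly like the following sketch. It is not the actual contents of cmd/scheduler/app/server.go: the `options` struct, `setup` function, `gaugeVec` type, and lower-case `initKubeSchedulerRelatedMetrics` are hypothetical stand-ins chosen so the example is self-contained.

```go
package main

import "fmt"

// gaugeVec is a hypothetical stand-in for a metric such as k8smetrics.Goroutines.
type gaugeVec struct {
	values map[string]float64
}

// goroutines stays nil until initKubeSchedulerRelatedMetrics runs.
var goroutines *gaugeVec

// initKubeSchedulerRelatedMetrics is a stand-in for
// metrics.InitKubeSchedulerRelatedMetrics() (assumed behavior: it
// populates the metric variables that scheduler plugins read).
func initKubeSchedulerRelatedMetrics() {
	goroutines = &gaugeVec{values: map[string]float64{}}
}

// options models only the two flags named in the PR description.
type options struct {
	EnableMetrics bool
	EnablePprof   bool
}

// setup follows the fixed ordering: initialize the metric first,
// then decide whether to expose the metrics/pprof endpoints.
func setup(opt options) {
	// Always initialize, regardless of the flags, because plugin code
	// accesses the metric unconditionally.
	initKubeSchedulerRelatedMetrics()

	if opt.EnableMetrics || opt.EnablePprof {
		fmt.Println("starting metrics/pprof endpoint (omitted in this sketch)")
	}
}

func main() {
	setup(options{}) // both flags off, the configuration that used to panic
	fmt.Println("goroutines initialized:", goroutines != nil)
}
```

The design point is that initialization and export are separate concerns: the metric object has to exist for callers, while serving it over HTTP remains optional.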
archlitchi left a comment
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: archlitchi, JesseStutler
The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing
@copilot Could you also cherry-pick this pr into branch
/cherry-pick release-1.13
@JesseStutler: #4731 failed to apply on top of branch "release-1.13":
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
@JesseStutler: new issue created for failed cherrypick: #4768
Move InitKubeSchedulerRelatedMetrics() outside the EnableMetrics/EnablePprof conditional block to ensure k8smetrics.Goroutines is always initialized. This prevents panic when scheduler starts without --enable-metrics=true. Fixes #4729 (cherry-pick from #4731 to release-1.13) Co-authored-by: JesseStutler <38534065+JesseStutler@users.noreply.github.com>
/cherrypick release-1.13
@JesseStutler: #4731 failed to apply on top of branch "release-1.13":
@JesseStutler: new issue created for failed cherrypick: #4834
/cherrypick release-1.12
@JesseStutler: #4731 failed to apply on top of branch "release-1.12":
@JesseStutler: new issue created for failed cherrypick: #4837
Move InitKubeSchedulerRelatedMetrics() outside conditional block to ensure k8smetrics.Goroutines is always initialized before scheduler starts. This is a manual cherry-pick of PR #4731 to the release-1.12 branch. Co-authored-by: JesseStutler <38534065+JesseStutler@users.noreply.github.com>
What type of PR is this?
Bug fix
What this PR does / why we need it:
The scheduler panics with a nil pointer dereference during pod scheduling when started without --enable-metrics=true. The Kubernetes scheduler framework plugins unconditionally access k8smetrics.Goroutines, which was only initialized when metrics were enabled.
Changes:
- Moved metrics.InitKubeSchedulerRelatedMetrics() outside the if opt.EnableMetrics || opt.EnablePprof block in cmd/scheduler/app/server.go
- k8smetrics.Goroutines is now always initialized before the scheduler starts
Which issue(s) this PR fixes:
Fixes #4729
Special notes for your reviewer:
The panic occurs in k8s.io/kubernetes/pkg/scheduler/framework/parallelize.Parallelizer.Until at line 57 when calling Goroutines.WithLabelValues(). This metric must be initialized regardless of whether metrics export is enabled.
Does this PR introduce a user-facing change?
Original prompt
This section details the original issue you should resolve
<issue_title>volcano-scheduler v1.12.0 without --enable-metrics=true, panic while in pod scheduling</issue_title>
<issue_description>### Description
version: volcano v1.12.0
module: volcano-scheduler
volcano-scheduler panic while in pod scheduling
the problem commit:981e18b2
root@ubuntu-212-117:/home/fjq/train# kubectl logs -n volcano-system volcano-scheduler-64c8f9df7f-v9ntx
2025/11/18 04:30:58 maxprocs: Updating GOMAXPROCS=5: determined from CPU quota
E1118 04:30:59.813753 1 panic.go:262] "Observed a panic" panic="runtime error: invalid memory address or nil pointer dereference" panicGoValue=""invalid memory address or nil pointer dereference"" stacktrace=<
goroutine 507 [running]:
k8s.io/apimachinery/pkg/util/runtime.logPanic({0x55e62dfe1398, 0x55e62f9c6460}, {0x55e62db46fe0, 0x55e62f8df530})
/root/go/pkg/mod/k8s.io/apimachinery@v0.32.2/pkg/util/runtime/runtime.go:107 +0xbc
k8s.io/apimachinery/pkg/util/runtime.handleCrash({0x55e62dfe1398, 0x55e62f9c6460}, {0x55e62db46fe0, 0x55e62f8df530}, {0x55e62f9c6460, 0x0, 0x10000c0005767e0?})
/root/go/pkg/mod/k8s.io/apimachinery@v0.32.2/pkg/util/runtime/runtime.go:82 +0x5e
k8s.io/apimachinery/pkg/util/runtime.HandleCrash({0x0, 0x0, 0x55e62f9a3b40?})
/root/go/pkg/mod/k8s.io/apimachinery@v0.32.2/pkg/util/runtime/runtime.go:59 +0x108
panic({0x55e62db46fe0?, 0x55e62f8df530?})
/usr/local/go/src/runtime/panic.go:785 +0x132
k8s.io/component-base/metrics.(*GaugeVec).WithLabelValuesChecked(0x0, {0xc0000652e0, 0x1, 0x1})
/root/go/pkg/mod/k8s.io/component-base@v0.32.2/metrics/gauge.go:140 +0x32
k8s.io/component-base/metrics.(*GaugeVec).WithLabelValues(0x50?, {0xc0000652e0?, 0x7f94afaae108?, 0x50?})
/root/go/pkg/mod/k8s.io/component-base@v0.32.2/metrics/gauge.go:178 +0x1c
k8s.io/kubernetes/pkg/scheduler/framework/parallelize.Parallelizer.Until({0xc0000653a8?}, {0x55e62dfe16e0, 0x55e62f9c6460}, 0x0, 0xc00019f180, {0x55e62d3ef5d8?, 0x0?})
/root/go/pkg/mod/k8s.io/kubernetes@v1.32.2/pkg/scheduler/framework/parallelize/parallelism.go:57 +0x6a
k8s.io/kubernetes/pkg/scheduler/framework/plugins/interpodaffinity.(*InterPodAffinity).getExistingAntiAffinityCounts(0xc000e14d80, {0x55e62dfe16e0, 0x55e62f9c6460}, 0xc000948908, 0xc000dc1380, {0xc000ccd380, 0x0, 0x19})
/root/go/pkg/mod/k8s.io/kubernetes@v1.32.2/pkg/scheduler/framework/plugins/interpodaffinity/filtering.go:170 +0x17d
k8s.io/kubernetes/pkg/scheduler/framework/plugins/interpodaffinity.(*InterPodAffinity).PreFilter(0xc000e14d80, {0x55e62dfe16e0, 0x55e62f9c6460}, 0xc000e19840, 0xc000948908)
/root/go/pkg/mod/k8s.io/kubernetes@v1.32.2/pkg/scheduler/framework/plugins/interpodaffinity/filtering.go:254 +0x5bc
volcano.sh/volcano/pkg/scheduler/plugins/predicates.(*predicatesPlugin).OnSessionOpen.func4(0xc00110f880)
/root/go/src/volcano.sh/volcano/pkg/scheduler/plugins/predicates/predicates.go:435 +0x105
volcano.sh/volcano/pkg/scheduler/framework.(*Session).PrePredicateFn(0xc000772008, 0xc00110f880)
/root/go/src/volcano.sh/volcano/pkg/scheduler/framework/session_plugins.go:685 +0x118
volcano.sh/volcano/pkg/scheduler/actions/allocate.(*Action).allocateResourcesForTasks(0xc0004be738, 0xc000db27c0, 0xc000766a00, 0xc000693220, {0xc000af2d00, 0x19, 0x19}, {0x0, 0x0})
/root/go/src/volcano.sh/volcano/pkg/scheduler/actions/allocate/allocate.go:378 +0x927
volcano.sh/volcano/pkg/scheduler/actions/allocate.(*Action).allocateResources(0xc0004be738, 0xc000db2740, 0xc000065d08)
/root/go/src/volcano.sh/volcano/pkg/scheduler/actions/allocate/allocate.go:200 +0xba9
volcano.sh/volcano/pkg/scheduler/actions/allocate.(*Action).Execute(0xc0004be738, 0xc000772008)
/root/go/src/volcano.sh/volcano/pkg/scheduler/actions/allocate/allocate.go:82 +0x314
volcano.sh/volcano/pkg/scheduler.(*Scheduler).runOnce(0xc0005aa300)
/root/go/src/volcano.sh/volcano/pkg/scheduler/scheduler.go:125 +0x367
k8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1(0xc000065f40?)
/root/go/pkg/mod/k8s.io/apimachinery@v0.32.2/pkg/util/wait/backoff.go:226 +0x33
k8s.io/apimachinery/pkg/util/wait.BackoffUntil(0xc00063be40, {0x55e62dfb47c0, 0xc000367b30}, 0x1, 0xc0006a1260)
/root/go/pkg/mod/k8s.io/apimachinery@v0.32.2/pkg/util/wait/backoff.go:227 +0xaf
k8s.io/apimachinery/pkg/util/wait.JitterUntil(0xc00063be40, 0x3b9aca00, 0x0, 0x1, 0xc0006a1260)
/root/go/pkg/mod/k8s.io/apimachinery@v0.32.2/pkg/util/wait/backoff.go:204 +0x7f
k8s.io/apimachinery/pkg/util/wait.Until(...)
/root/go/pkg/mod/k8s.io/apimachinery@v0.32.2/pkg/util/wait/backoff.go:161
created b...