Feature: OpenShift Virtualization Higher Density #1679
base: master
Conversation
Hi @iholder101. Thanks for your PR. I'm waiting for an openshift member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Once the patch is verified, the new status will be reflected by the ok-to-test label. I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
#### Timeline
* GA higher workload density in OpenShift Virtualization in 2024
is this timeline still accurate?
Yes, I believe it is.
Please keep me honest @stu-gott.
@haircommander Hi, I've checked the Jira planning, we are on track, so yes this is indeed accurate.
@haircommander The timeline bullet "GA higher workload density in OpenShift Virtualization in 2024"
relates to phase 1 only. Maybe we should note that in brackets.
Hey @haircommander!
@enp0s3 and I reworked the PR. Can you please have another look?
/ok-to-test
A feature describing CNV's path to higher density based on:
- phase 1: wasp
- phase 2: kube swap

Signed-off-by: Itamar Holder <iholder@redhat.com>
Force-pushed from e7d1686 to a30cc7c (Compare)
Inactive enhancement proposals go stale after 28d of inactivity. See https://github.com/openshift/enhancements#life-cycle for details. Mark the proposal as fresh by commenting /remove-lifecycle stale. If this proposal is safe to close now please do so with /close. /lifecycle stale
Stale enhancement proposals rot after 7d of inactivity. See https://github.com/openshift/enhancements#life-cycle for details. Mark the proposal as fresh by commenting /remove-lifecycle rotten. If this proposal is safe to close now please do so with /close. /lifecycle rotten
/remove-lifecycle rotten
Inactive enhancement proposals go stale after 28d of inactivity. See https://github.com/openshift/enhancements#life-cycle for details. Mark the proposal as fresh by commenting /remove-lifecycle stale. If this proposal is safe to close now please do so with /close. /lifecycle stale
/remove-lifecycle stale
Signed-off-by: Igor Bezukh <ibezukh@redhat.com>
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull-request has been approved by: The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing `/approve` in a comment.
@iholder101: all tests passed! Full PR test history. Your PR dashboard. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.
Thanks for this proposal for a much-requested feature. I think it is high time to have it merged.
## Summary

Fit more workloads onto a given node - achieve a higher workload
density - by overcommitting it's memory resources. Due to timeline
density - by overcommitting it's memory resources. Due to timeline
density - by overcommitting its memory resources. Due to timeline
## Motivation

Today, OpenShift Virtualization is reserving memory (`requests.memory`)
according to the needs of the virtual machine and it's infrastructure
according to the needs of the virtual machine and it's infrastructure
according to the needs of the virtual machine and its infrastructure
given node leads to the observation that _on average_ there is no memory
pressure and often a rather low memory utilization - despite the fact that
much memory has been reserved.
given node leads to the observation that _on average_ there is no memory
pressure and often a rather low memory utilization - despite the fact that
much memory has been reserved.
given node leads to the observation that _on average_ much of the reserved memory is not utilized.
I think it is not precise to say there is no pressure. There is. But we can reduce it, because the memory causing the pressure is not used and can be swapped out.
### Non-Goals

* Complete life-cycling of the WASP Agent. We are not intending to write
This is the first time WASP agent is mentioned. Please add a URL.
* Complete life-cycling of the WASP Agent. We are not intending to write
  an Operator for memory over commit for two reasons:
  * [Kubernetes SWAP] is close, writing a fully fledged operator seems
    to be no good use of resources
to be no good use of resources
to be no good use of developer resources
## Test Plan

Add e2e tests for the WASP agent repository for regression testing against
OpenShift.
I think that we should include a bit more detail here about how we are (already) testing it. Most importantly: configure 200% over-commitment, fill up the cluster with dormant VMs, and verify that the cluster stays responsive and survives upgrade.
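To make that concrete, here is a minimal sketch of what such a density check could look like, assuming the official `kubernetes` Python client and a KubeVirt-style `VirtualMachine` custom resource. The namespace, VM names, memory sizes, and the way the 200% target is derived are illustrative assumptions, not the actual WASP e2e suite.

```python
# Sketch only: assumes the `kubernetes` Python client and a KubeVirt-style
# VirtualMachine CRD (kubevirt.io/v1). All names and sizes are illustrative.
from kubernetes import client, config

config.load_kube_config()
api = client.CustomObjectsApi()

NODE_ALLOCATABLE_GIB = 256   # assumed per-node allocatable memory
OVERCOMMIT_PERCENT = 200     # target: 200% over-commitment
GUEST_MEM_GIB = 4            # guest-visible memory per dormant VM
VM_COUNT = (NODE_ALLOCATABLE_GIB * OVERCOMMIT_PERCENT // 100) // GUEST_MEM_GIB

def dormant_vm(name: str) -> dict:
    """A minimal 'dormant' VM spec: it boots and then sits idle."""
    return {
        "apiVersion": "kubevirt.io/v1",
        "kind": "VirtualMachine",
        "metadata": {"name": name},
        "spec": {
            "running": True,
            "template": {
                "spec": {
                    "domain": {
                        "memory": {"guest": f"{GUEST_MEM_GIB}Gi"},
                        "devices": {},
                    },
                },
            },
        },
    }

# Fill the cluster with dormant VMs until guest memory exceeds physical RAM.
for i in range(VM_COUNT):
    api.create_namespaced_custom_object(
        group="kubevirt.io",
        version="v1",
        namespace="density-test",   # hypothetical namespace
        plural="virtualmachines",
        body=dormant_vm(f"dormant-vm-{i}"),
    )

# A real test would then poll that all VMIs stay Running, that API calls
# answer within their SLO, and repeat the same checks across an upgrade.
```

A real suite would additionally watch VMI phases and node memory/swap metrics rather than only creating the VMs, but the shape of the check is the same as described in the comment above.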
Consider the following in developing an upgrade/downgrade strategy for this
enhancement:
- What changes (in invocations, configurations, API use, etc.) is an existing
  cluster required to make on upgrade in order to keep previous behavior?
- What changes (in invocations, configurations, API use, etc.) is an existing
  cluster required to make on upgrade in order to make use of the enhancement?

Upgrade expectations:
- Each component should remain available for user requests and
  workloads during upgrades. Ensure the components leverage best practices in handling [voluntary
  disruption](https://kubernetes.io/docs/concepts/workloads/pods/disruptions/). Any exception to
  this should be identified and discussed here.
- Micro version upgrades - users should be able to skip forward versions within a
  minor release stream without being required to pass through intermediate
  versions - i.e. `x.y.N->x.y.N+2` should work without requiring `x.y.N->x.y.N+1`
  as an intermediate step.
- Minor version upgrades - you only need to support `x.N->x.N+1` upgrade
  steps. So, for example, it is acceptable to require a user running 4.3 to
  upgrade to 4.5 with a `4.3->4.4` step followed by a `4.4->4.5` step.
- While an upgrade is in progress, new component versions should
  continue to operate correctly in concert with older component
  versions (aka "version skew"). For example, if a node is down, and
  an operator is rolling out a daemonset, the old and new daemonset
  pods must continue to work correctly even while the cluster remains
  in this partially upgraded state for some time.
this seems like generic content. should we not replace it with something specific, or drop it?
How will the component handle version skew with other components?
What are the guarantees? Make sure this is in the test plan.

Consider the following in developing a version skew strategy for this
enhancement:
- During an upgrade, we will always have skew among components, how will this impact your work?
- Does this enhancement involve coordinating behavior in the control plane and
  in the kubelet? How does an n-2 kubelet without this feature available behave
  when this feature is used?
- Will any other components on the node change? For example, changes to CSI, CRI
  or CNI may require updating that component before the kubelet.
WASP from CNV-X.Y must work with OCP-X.Y as well as OCP-X.(Y+1)
## Operational Aspects of API Extensions

None
I think that this is the place to discuss the fact that all workers have to have the same memory size and the same disk topology, and that deploying and upgrading WASP is a manual step.
Examples:
- The mutating admission webhook "xyz" has FailPolicy=Ignore and hence
  will not block the creation or updates on objects when it fails. When the
  webhook comes back online, there is a controller reconciling all objects, applying
  labels that were not applied during admission webhook downtime.
- Namespaces deletion will not delete all objects in etcd, leading to zombie
  objects when another namespace with the same name is created.

TBD
Let us replace this generic content.
A feature describing OpenShift Virtualization's path to higher density based on:
- phase 1: wasp
- phase 2: kube swap

This is a replacement for #1630.