Feat/delete os pod #206

Open
wants to merge 4 commits into
base: main
from

Conversation

alvlkov

@alvlkov alvlkov commented Dec 13, 2024

What type of PR is this?

This adds a new managed script to delete a pod from an OpenShift reserved namespace.

What this PR does / Why we need it?

This will help fix errors in OpenShift reserved namespaces, particularly when a pod restart is required.

Which Jira/Github issue(s) does this PR fix?

OSD_20528

Special notes for your reviewer

Pre-checks (if applicable)

  • Validated the changes in a ROSA stage cluster
  • Included documentation changes with PR

Contributor

openshift-ci bot commented Dec 13, 2024

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: alvlkov
Once this PR has been reviewed and has the lgtm label, please assign wanghaoran1988 for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-bot

Issues go stale after 90d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle stale

openshift-ci bot added the lifecycle/stale label on Mar 14, 2025
@alvlkov
Author

alvlkov commented Mar 18, 2025

/remove-lifecycle stale

openshift-ci bot removed the lifecycle/stale label on Mar 18, 2025
Contributor

@iamkirkbater iamkirkbater left a comment

Requested change relates to the name of the script. The additional check for a replicaset would be more of a nice-to-have, but we can also add that after this is merged so that we can start using this sooner rather than later.

@@ -0,0 +1,21 @@
# Delete Openshift Pod Script
Contributor

Can we simplify this to just delete-pod instead of delete-os-pod? From a UX perspective, the closer the script name is to the actual oc command, the easier it will be to remember.



main(){
delete_pod
Contributor

Would it be a huge lift here to validate that a pod is owned by a replicaset before proceeding? It would be a nice protection for the rare case where a pod in an openshift namespace isn't managed: by default we would only delete pods that we know will come back. We might also need to add a "force" flag/parameter so the check can be bypassed when we need to.
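
For illustration, a minimal sketch of what such an ownership check could look like. The function names are hypothetical and not taken from this PR; the only cluster call is `oc get pod` with a JSONPath query on `.metadata.ownerReferences`, and it is wrapped in a helper that this sketch never invokes.

```shell
# Sketch only: helper names are hypothetical, not the PR's actual code.

# Print the owner kinds of a pod, space-separated (empty if it has no owner).
# Needs a logged-in `oc`; defined here but not called by the sketch itself.
pod_owner_kinds() {
  local pod="$1" namespace="$2"
  oc get pod "$pod" -n "$namespace" \
    -o jsonpath='{.metadata.ownerReferences[*].kind}'
}

# Return 0 if the given owner-kind list contains ReplicaSet, 1 otherwise.
is_replicaset_owned() {
  local kind
  for kind in $1; do
    [ "$kind" = "ReplicaSet" ] && return 0
  done
  return 1
}
```

In a script this might be used as `is_replicaset_owned "$(pod_owner_kinds "$POD" "$NAMESPACE")"`, where `POD` and `NAMESPACE` are assumed parameter names.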

author: Alex Volkov
allowedGroups:
- CEE
- SREP
Contributor

Suggested change
-   - SREP
+   - SREP
+   - MCSTierTwo

Author

added the suggestions, thanks @iamkirkbater

@feichashao
Contributor

Thanks @iamkirkbater for the review!

I would suggest we add the safeguard in this PR to validate if a pod is backed by a replicaset, otherwise the delete operation can be too wide.
The protection can be "we are not making the situation worse":

  • If we are deleting a non-healthy pod, go ahead and delete it.
  • If we are deleting a healthy pod,
    • If it is the only healthy pod in the replicaset, stop, raise a ticket and review it.
    • If there's another healthy pod besides the one we are going to delete, it is ok to delete.
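
The rule above could be sketched as a small decision function. This is an illustrative reading of the proposal, not code from the PR: `decide_deletion` and its two arguments are hypothetical, and the caller is assumed to have already worked out whether the target pod is healthy and how many other healthy pods share its replicaset.

```shell
# Sketch only: name and arguments are hypothetical.
# decide_deletion TARGET_HEALTHY OTHER_HEALTHY_COUNT
#   TARGET_HEALTHY       "true" if the pod to delete is healthy
#   OTHER_HEALTHY_COUNT  healthy pods in the same replicaset, excluding the target
# Prints "delete" or "stop".
decide_deletion() {
  local target_healthy="$1" other_healthy="$2"
  if [ "$target_healthy" != "true" ]; then
    echo "delete"   # unhealthy pod: deleting it cannot make things worse
  elif [ "$other_healthy" -ge 1 ]; then
    echo "delete"   # at least one other healthy replica keeps serving
  else
    echo "stop"     # the only healthy pod: stop and review first
  fi
}
```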

Another nice-to-have is to keep a list of allowed namespaces instead of matching openshift-*. This sounds like toil, but it gives us an opportunity to review whether we want to allow deletion when a new namespace appears. (This can go in a follow-up PR.)
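
To make the difference concrete, here is a rough sketch comparing the two approaches. The allowlist contents are placeholders chosen for illustration, not a proposed list, and both function names are hypothetical.

```shell
# Sketch only: the namespaces listed here are placeholders, not a proposal.
ALLOWED_NAMESPACES="openshift-monitoring openshift-ingress"

# Prefix match: accepts any namespace that starts with "openshift-".
matches_openshift_prefix() {
  case "$1" in
    openshift-*) return 0 ;;
    *) return 1 ;;
  esac
}

# Explicit allowlist: accepts only namespaces that were reviewed and listed.
is_namespace_allowed() {
  local ns
  for ns in $ALLOWED_NAMESPACES; do
    [ "$ns" = "$1" ] && return 0
  done
  return 1
}
```

With the allowlist, a namespace that appears later is rejected until someone adds it, which is the review opportunity described above.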

@iamkirkbater
Contributor

@feichashao - a few questions:

  1. Can you expand on what you mean by "non-healthy" pod? If we're asking for this in this PR I'd like to be explicit to what we are looking for. For example, if we just mean a "healthy" pod is one in a "Running" state, vs non-healthy which would be "Error", "Completed", "Pending" - etc.
  2. What specifically do you mean by raise a ticket - Do you mean like a JIRA here? Or would exiting out with an Error (if there's not a FORCE parameter set) work here?
  3. For the list of allowed namespaces - one thing I'd like to keep in mind here is that CEE/MCS have a wider scope of what they support than SREP does. While SREP may only limit ourselves to specific managed namespaces, CEE/MCS will be supporting additional things like openshift-virtualization, etc. So limiting them to managed namespaces may not be as efficient as we think it might be.

@feichashao
Contributor

Can you expand on what you mean by "non-healthy" pod? If we're asking for this in this PR I'd like to be explicit to what we are looking for. For example, if we just mean a "healthy" pod is one in a "Running" state, vs non-healthy which would be "Error", "Completed", "Pending" - etc.

I would say healthy = a pod with all containers in the running state. Anything else is non-healthy, e.g. Pending, CrashLoopBackOff, or a pod in the Running state where not all containers are running, shown like:

kube-apiserver-ip-10-119-135-4.ec2.internal           4/5     Running 

(I mocked this)
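
As a rough sketch, that definition could be checked against the READY and STATUS columns of `oc get pods` output. Note the assumption: READY counts containers that are ready, which is used here as a proxy for "all containers running". The function name is hypothetical.

```shell
# Sketch only: treats READY (e.g. "4/5") plus STATUS as a proxy for health.
# is_pod_healthy READY STATUS
is_pod_healthy() {
  local ready="$1" status="$2"
  local ready_count="${ready%/*}" total="${ready#*/}"
  # Healthy only if the pod is Running and every container is counted as ready.
  [ "$status" = "Running" ] && [ "$total" -gt 0 ] && [ "$ready_count" -eq "$total" ]
}
```

With the mocked line above, `is_pod_healthy "4/5" "Running"` returns non-zero, so that pod would count as non-healthy.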

@alvlkov
Author

alvlkov commented Apr 7, 2025

Added replicaset check and --force flag.

  • Successfully deleted a pod owned by a replicaset, regardless of the --force flag
  • Could not delete a pod not owned by a replicaset without the --force flag
  • Successfully deleted a pod not owned by a replicaset with the --force flag
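
The three outcomes above correspond to a gate along these lines. This is a sketch of the described behaviour, not the PR's code: `deletion_permitted` and its "true"/"false" arguments are hypothetical, and how the script actually receives the force setting is not shown here.

```shell
# Sketch only: name and argument convention are hypothetical.
# deletion_permitted OWNED_BY_REPLICASET FORCE   (each "true" or "false")
deletion_permitted() {
  local owned="$1" force="$2"
  if [ "$owned" = "true" ] || [ "$force" = "true" ]; then
    return 0
  fi
  # Unmanaged pod and no override: refuse, and say how to proceed.
  echo "pod is not owned by a replicaset; re-run with --force to delete it" >&2
  return 1
}
```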

@alvlkov alvlkov requested a review from iamkirkbater April 10, 2025 21:19
@alvlkov
Author

alvlkov commented Jul 2, 2025

/retest

Contributor

openshift-ci bot commented Jul 2, 2025

@alvlkov: all tests passed!

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

4 participants