Orphaned Service Catalog objects after Namespace deletion during Service Catalog API server outage #2254
Comments
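The issue is caused by the Namespace controller deletion logic:
1. The Namespace controller fetches the list of resource types (GroupVersionResource) supported by every API server by fetching an APIResourceList object, see https://github.com/kubernetes/kubernetes/blob/master/pkg/controller/namespace/deletion/namespaced_resources_deleter.go#L484
2. Then, for every GroupVersionResource, it issues a delete request to the corresponding API server.
3. Once all known resources are deleted, the Namespace controller removes the finalizer from the Namespace and proceeds with deletion.
In the happy path it works fine, but when an API server goes down, it unregisters itself from the list of "API versions", which you can manually check via kubectl api-versions (for Service Catalog you'll find servicecatalog.k8s.io/v1beta1 in this list). Since that apiVersion is no longer registered, the Namespace controller no longer "sees" the corresponding resource types and leaves them untouched.
@liggitt I believe this logic (and the corresponding issue) applies not only to the Service Catalog API server, but to the API Extensions (CRD) API server as well, and to any other non-core API server. What is the recommended approach to correctly handle this situation? |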
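The deletion flow described in this comment can be modeled with a short sketch. This is a toy model, not the Namespace controller's actual code: the `Obj` type, the `delete_namespace` helper, and the object names are hypothetical, the `serviceinstances` resource name is used for illustration, and finalizers are ignored entirely. It only shows that a resource type missing from discovery is never visited during cleanup.

```python
from typing import List, NamedTuple, Set, Tuple


class Obj(NamedTuple):
    """Hypothetical stand-in for a stored namespaced object."""
    group_version: str
    resource: str
    namespace: str
    name: str


def delete_namespace(
    discovered: Set[Tuple[str, str]],
    objects: List[Obj],
    namespace: str,
) -> List[Obj]:
    """Return the objects left in the store after the namespace is cleaned up.

    Only objects whose (group_version, resource) pair was discovered are
    deleted; anything of an undiscovered type is silently left behind.
    """
    remaining = []
    for obj in objects:
        in_namespace = obj.namespace == namespace
        type_is_known = (obj.group_version, obj.resource) in discovered
        if in_namespace and type_is_known:
            continue  # a delete request would be issued for this object
        remaining.append(obj)
    return remaining


objects = [
    Obj("v1", "configmaps", "ns1", "cfg"),
    Obj("servicecatalog.k8s.io/v1beta1", "serviceinstances", "ns1", "db"),
]

# Happy path: discovery reports both resource types.
all_types = {
    ("v1", "configmaps"),
    ("servicecatalog.k8s.io/v1beta1", "serviceinstances"),
}
# Outage: the Service Catalog group/version is missing from discovery.
outage_types = {("v1", "configmaps")}

healthy_left = delete_namespace(all_types, objects, "ns1")
outage_left = delete_namespace(outage_types, objects, "ns1")

print("left after healthy deletion:", healthy_left)
print("left after deletion during outage:", outage_left)
```

In the outage case the ServiceInstance stays in the store even though its namespace is considered fully cleaned up, which is the orphaning reported in this issue.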
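While this issue may sound like a very rare corner case, in reality it is easy to run into with long asynchronous operations and finalizers. Example:
1. ServiceInstance is created, and asynchronous provisioning is started.
2. Namespace is marked for deletion, and remains in the Terminating state, because the ServiceInstance has a kubernetes/service-catalog finalizer which blocks its deletion.
3. Once there is an outage of the Service Catalog API server (e.g. during an update with downtime), the namespace gets deleted.
4. Service Catalog continues processing the ServiceInstance until it is ready to remove the finalizer, but the namespace is already gone.
It is also easy to run into this issue when Service Catalog crashes due to some bug while the deployment has just one replica (which may not be easy to notice, since Kubernetes silently restarts the crashing pod). |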
Why does step 3 happen if svc-cat didn’t complete its job yet?
On Aug 2, 2018, at 6:33 AM, Nail Islamov ***@***.***> wrote:
While this issue may sound like a very rare corner-case, in reality it's easy to run into this issue with long asynchronous operations and finalizers. Example:
ServiceInstance is created, and asynchronous provisioning is started.
Namespace is marked for deletion, and remains in Terminating state, because ServiceInstance has a kubernetes/service-catalog finalizer which blocks its deletion.
Once there is an outage of Service Catalog API server (e.g. possibly during update with a downtime), namespace gets deleted.
Service Catalog continues processing ServiceInstance up until it is ready to remove the finalizer, but the namespace is gone already.
|
Ah never mind, was reading emails out of order
On Aug 2, 2018, at 6:25 AM, Nail Islamov ***@***.***> wrote:
The issue is caused by the Namespace controller deletion logic:
Namespace controller fetches a list of resource types (GroupVersionResource) supported by every API server via fetching an APIResourceList object, see https://github.com/kubernetes/kubernetes/blob/master/pkg/controller/namespace/deletion/namespaced_resources_deleter.go#L484
Then, for every GroupVersionResource it will issue a delete request to a corresponding API server
Once all known resources deleted, Namespace controller will remove a finalizer from the Namespace and will proceed with deletion.
In the happy path it works fine, but when API server goes down, it unregisters itself from the list of "API versions", which you can manually check via kubectl api-versions (for Service Catalog you'll find servicecatalog.k8s.io/v1beta1 in this list). Since a particular apiVersion is not registered anymore, Namespace controller doesn't "see" corresponding types of resources anymore, and leaves them untouched.
@liggitt I believe this logic (and corresponding issue) applies not only to Service Catalog API server, but to API Extensions (CRD) API server as well, and to any other non-core API server. What is the recommended approach to correctly handle this situation?
|
What version are you running? This should be leaving the APIService registered. If the backing server is unavailable, the aggregator will return 503 errors when accessed, which would fail resource discovery and block namespace deletion. The 503 response handling was fixed in kubernetes/kubernetes#58070 |
We are running k8s 1.11.
@liggitt Based on the behavior I see, I think the Namespace controller is not sending a request directly to the Service Catalog API server. It sends a single request to the aggregated discovery endpoint, which handles unavailable API servers differently: https://github.com/kubernetes/kubernetes/blob/master/staging/src/k8s.io/kube-aggregator/pkg/apiserver/handler_apis.go#L95-L97
When the Service Catalog API server is available, the corresponding APIService status is:
status:
  conditions:
  - lastTransitionTime: 2018-08-02T14:20:42Z
    message: all checks passed
    reason: Passed
    status: "True"
    type: Available
And when I scale the Service Catalog deployment to zero, I get the status:
status:
  conditions:
  - lastTransitionTime: 2018-08-02T14:19:32Z
    message: endpoints for service/service-catalog-apiserver in "service-catalog"
      have no addresses
    reason: MissingEndpoints
    status: "False"
    type: Available
As a result, the Namespace controller doesn't know about the existence of the Service Catalog API server, so it proceeds with the namespace deletion without blocking. |
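The behavior described in this comment can be sketched as a filter over registered APIServices. This is a simplified model, not the kube-aggregator handler itself: the `discovery_group_versions` function, its `include_unavailable` switch, and the flattened `groupVersion` field are assumptions made for illustration. Only the two `Available` conditions are taken from the statuses quoted above.

```python
from typing import Dict, List


def is_available(apiservice: Dict) -> bool:
    """True if the APIService carries an Available condition with status "True"."""
    return any(
        cond["type"] == "Available" and cond["status"] == "True"
        for cond in apiservice["status"]["conditions"]
    )


def discovery_group_versions(
    apiservices: List[Dict], include_unavailable: bool
) -> List[str]:
    """Build the group/version list that discovery would advertise.

    include_unavailable=False models the behavior reported in this thread;
    include_unavailable=True models advertising every registered APIService.
    """
    return [
        svc["groupVersion"]
        for svc in apiservices
        if include_unavailable or is_available(svc)
    ]


apiservices = [
    {
        "groupVersion": "apps/v1",
        "status": {
            "conditions": [
                {"type": "Available", "status": "True", "reason": "Passed"}
            ]
        },
    },
    {
        "groupVersion": "servicecatalog.k8s.io/v1beta1",
        "status": {
            "conditions": [
                {
                    "type": "Available",
                    "status": "False",
                    "reason": "MissingEndpoints",
                }
            ]
        },
    },
]

skipping = discovery_group_versions(apiservices, include_unavailable=False)
including = discovery_group_versions(apiservices, include_unavailable=True)

print("unavailable skipped: ", skipping)
print("unavailable included:", including)
```

With unavailable services skipped, a client that relies on this list has no way to learn that servicecatalog.k8s.io/v1beta1 is registered at all.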
It is possible that the command line api-versions command is tolerating a partial discovery error. The namespace controller does not. What is the response/content of the following endpoints: /apis |
Can you fetch the non-truncated output of those endpoints when the APIService is in the unavailable state? |
@liggitt sure:
{
"kind": "APIGroupList",
"apiVersion": "v1",
"groups": [
{
"name": "apiregistration.k8s.io",
"versions": [
{
"groupVersion": "apiregistration.k8s.io/v1",
"version": "v1"
},
{
"groupVersion": "apiregistration.k8s.io/v1beta1",
"version": "v1beta1"
}
],
"preferredVersion": {
"groupVersion": "apiregistration.k8s.io/v1",
"version": "v1"
}
},
{
"name": "extensions",
"versions": [
{
"groupVersion": "extensions/v1beta1",
"version": "v1beta1"
}
],
"preferredVersion": {
"groupVersion": "extensions/v1beta1",
"version": "v1beta1"
}
},
{
"name": "apps",
"versions": [
{
"groupVersion": "apps/v1",
"version": "v1"
},
{
"groupVersion": "apps/v1beta2",
"version": "v1beta2"
},
{
"groupVersion": "apps/v1beta1",
"version": "v1beta1"
}
],
"preferredVersion": {
"groupVersion": "apps/v1",
"version": "v1"
}
},
{
"name": "events.k8s.io",
"versions": [
{
"groupVersion": "events.k8s.io/v1beta1",
"version": "v1beta1"
}
],
"preferredVersion": {
"groupVersion": "events.k8s.io/v1beta1",
"version": "v1beta1"
}
},
{
"name": "authentication.k8s.io",
"versions": [
{
"groupVersion": "authentication.k8s.io/v1",
"version": "v1"
},
{
"groupVersion": "authentication.k8s.io/v1beta1",
"version": "v1beta1"
}
],
"preferredVersion": {
"groupVersion": "authentication.k8s.io/v1",
"version": "v1"
}
},
{
"name": "authorization.k8s.io",
"versions": [
{
"groupVersion": "authorization.k8s.io/v1",
"version": "v1"
},
{
"groupVersion": "authorization.k8s.io/v1beta1",
"version": "v1beta1"
}
],
"preferredVersion": {
"groupVersion": "authorization.k8s.io/v1",
"version": "v1"
}
},
{
"name": "autoscaling",
"versions": [
{
"groupVersion": "autoscaling/v1",
"version": "v1"
},
{
"groupVersion": "autoscaling/v2beta1",
"version": "v2beta1"
}
],
"preferredVersion": {
"groupVersion": "autoscaling/v1",
"version": "v1"
}
},
{
"name": "batch",
"versions": [
{
"groupVersion": "batch/v1",
"version": "v1"
},
{
"groupVersion": "batch/v1beta1",
"version": "v1beta1"
},
{
"groupVersion": "batch/v2alpha1",
"version": "v2alpha1"
}
],
"preferredVersion": {
"groupVersion": "batch/v1",
"version": "v1"
}
},
{
"name": "certificates.k8s.io",
"versions": [
{
"groupVersion": "certificates.k8s.io/v1beta1",
"version": "v1beta1"
}
],
"preferredVersion": {
"groupVersion": "certificates.k8s.io/v1beta1",
"version": "v1beta1"
}
},
{
"name": "networking.k8s.io",
"versions": [
{
"groupVersion": "networking.k8s.io/v1",
"version": "v1"
}
],
"preferredVersion": {
"groupVersion": "networking.k8s.io/v1",
"version": "v1"
}
},
{
"name": "policy",
"versions": [
{
"groupVersion": "policy/v1beta1",
"version": "v1beta1"
}
],
"preferredVersion": {
"groupVersion": "policy/v1beta1",
"version": "v1beta1"
}
},
{
"name": "rbac.authorization.k8s.io",
"versions": [
{
"groupVersion": "rbac.authorization.k8s.io/v1",
"version": "v1"
},
{
"groupVersion": "rbac.authorization.k8s.io/v1beta1",
"version": "v1beta1"
},
{
"groupVersion": "rbac.authorization.k8s.io/v1alpha1",
"version": "v1alpha1"
}
],
"preferredVersion": {
"groupVersion": "rbac.authorization.k8s.io/v1",
"version": "v1"
}
},
{
"name": "storage.k8s.io",
"versions": [
{
"groupVersion": "storage.k8s.io/v1",
"version": "v1"
},
{
"groupVersion": "storage.k8s.io/v1beta1",
"version": "v1beta1"
}
],
"preferredVersion": {
"groupVersion": "storage.k8s.io/v1",
"version": "v1"
}
},
{
"name": "admissionregistration.k8s.io",
"versions": [
{
"groupVersion": "admissionregistration.k8s.io/v1beta1",
"version": "v1beta1"
}
],
"preferredVersion": {
"groupVersion": "admissionregistration.k8s.io/v1beta1",
"version": "v1beta1"
}
},
{
"name": "apiextensions.k8s.io",
"versions": [
{
"groupVersion": "apiextensions.k8s.io/v1beta1",
"version": "v1beta1"
}
],
"preferredVersion": {
"groupVersion": "apiextensions.k8s.io/v1beta1",
"version": "v1beta1"
}
},
{
"name": "scheduling.k8s.io",
"versions": [
{
"groupVersion": "scheduling.k8s.io/v1beta1",
"version": "v1beta1"
}
],
"preferredVersion": {
"groupVersion": "scheduling.k8s.io/v1beta1",
"version": "v1beta1"
}
},
{
"name": "creator.voyager.atl-paas.net",
"versions": [
{
"groupVersion": "creator.voyager.atl-paas.net/v1",
"version": "v1"
}
],
"preferredVersion": {
"groupVersion": "creator.voyager.atl-paas.net/v1",
"version": "v1"
}
},
{
"name": "reporter.voyager.atl-paas.net",
"versions": [
{
"groupVersion": "reporter.voyager.atl-paas.net/v1",
"version": "v1"
}
],
"preferredVersion": {
"groupVersion": "reporter.voyager.atl-paas.net/v1",
"version": "v1"
}
},
{
"name": "ops-gateway.voyager.atl-paas.net",
"versions": [
{
"groupVersion": "ops-gateway.voyager.atl-paas.net/v1",
"version": "v1"
}
],
"preferredVersion": {
"groupVersion": "ops-gateway.voyager.atl-paas.net/v1",
"version": "v1"
}
},
{
"name": "composition.voyager.atl-paas.net",
"versions": [
{
"groupVersion": "composition.voyager.atl-paas.net/v1",
"version": "v1"
}
],
"preferredVersion": {
"groupVersion": "composition.voyager.atl-paas.net/v1",
"version": "v1"
}
},
{
"name": "formation.voyager.atl-paas.net",
"versions": [
{
"groupVersion": "formation.voyager.atl-paas.net/v1",
"version": "v1"
}
],
"preferredVersion": {
"groupVersion": "formation.voyager.atl-paas.net/v1",
"version": "v1"
}
},
{
"name": "ops.voyager.atl-paas.net",
"versions": [
{
"groupVersion": "ops.voyager.atl-paas.net/v1",
"version": "v1"
}
],
"preferredVersion": {
"groupVersion": "ops.voyager.atl-paas.net/v1",
"version": "v1"
}
},
{
"name": "orchestration.voyager.atl-paas.net",
"versions": [
{
"groupVersion": "orchestration.voyager.atl-paas.net/v1",
"version": "v1"
}
],
"preferredVersion": {
"groupVersion": "orchestration.voyager.atl-paas.net/v1",
"version": "v1"
}
},
{
"name": "smith.atlassian.com",
"versions": [
{
"groupVersion": "smith.atlassian.com/v1",
"version": "v1"
}
],
"preferredVersion": {
"groupVersion": "smith.atlassian.com/v1",
"version": "v1"
}
},
{
"name": "bitnami.com",
"versions": [
{
"groupVersion": "bitnami.com/v1alpha1",
"version": "v1alpha1"
}
],
"preferredVersion": {
"groupVersion": "bitnami.com/v1alpha1",
"version": "v1alpha1"
}
},
{
"name": "clusterregistry.k8s.io",
"versions": [
{
"groupVersion": "clusterregistry.k8s.io/v1alpha1",
"version": "v1alpha1"
}
],
"preferredVersion": {
"groupVersion": "clusterregistry.k8s.io/v1alpha1",
"version": "v1alpha1"
}
},
{
"name": "contour.heptio.com",
"versions": [
{
"groupVersion": "contour.heptio.com/v1beta1",
"version": "v1beta1"
}
],
"preferredVersion": {
"groupVersion": "contour.heptio.com/v1beta1",
"version": "v1beta1"
}
},
{
"name": "metrics.k8s.io",
"versions": [
{
"groupVersion": "metrics.k8s.io/v1beta1",
"version": "v1beta1"
}
],
"preferredVersion": {
"groupVersion": "metrics.k8s.io/v1beta1",
"version": "v1beta1"
}
}
]
}
|
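A response like the one above can be checked for a given group/version with a few lines of Python. This is a minimal sketch: the `advertised_group_versions` helper is hypothetical, and `SAMPLE` is a trimmed excerpt containing only two of the groups from the pasted output, so the check has to be run against the full response to be meaningful.

```python
import json
from typing import Set


def advertised_group_versions(api_group_list: dict) -> Set[str]:
    """Collect every groupVersion string from an APIGroupList document."""
    return {
        version["groupVersion"]
        for group in api_group_list.get("groups", [])
        for version in group.get("versions", [])
    }


# Trimmed excerpt of the /apis response (two of the listed groups).
SAMPLE = json.loads("""
{
  "kind": "APIGroupList",
  "apiVersion": "v1",
  "groups": [
    {
      "name": "apiextensions.k8s.io",
      "versions": [
        {"groupVersion": "apiextensions.k8s.io/v1beta1", "version": "v1beta1"}
      ],
      "preferredVersion": {
        "groupVersion": "apiextensions.k8s.io/v1beta1",
        "version": "v1beta1"
      }
    },
    {
      "name": "metrics.k8s.io",
      "versions": [
        {"groupVersion": "metrics.k8s.io/v1beta1", "version": "v1beta1"}
      ],
      "preferredVersion": {
        "groupVersion": "metrics.k8s.io/v1beta1",
        "version": "v1beta1"
      }
    }
  ]
}
""")

versions = advertised_group_versions(SAMPLE)
print("metrics.k8s.io/v1beta1 advertised:", "metrics.k8s.io/v1beta1" in versions)
print(
    "servicecatalog.k8s.io/v1beta1 advertised:",
    "servicecatalog.k8s.io/v1beta1" in versions,
)
```

The full output pasted above likewise contains no servicecatalog.k8s.io entry while the APIService is unavailable.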
k8s exact version:
|
From Namespace controller does the same (plus extra methods based on the list returned in
|
Automatic merge from submit-queue (batch tested with PRs 66394, 66888, 66932). If you want to cherry-pick this change to another branch, please follow the instructions here: https://github.com/kubernetes/community/blob/master/contributors/devel/cherry-picks.md
Include unavailable apiservices in discovery response
**What this PR does / why we need it**: Include unavailable apiservices into `apis/` discovery endpoint response to fix namespace deletion kubernetes-retired/service-catalog#2254
**Which issue(s) this PR fixes** *(optional, in `fixes #<issue number>(, fixes #<issue_number>, ...)` format, will close the issue(s) when PR gets merged)*: Fixes kubernetes-retired/service-catalog#2254
**Special notes for your reviewer**:
**Release note**:
```release-note
kube-apiserver now includes all registered API groups in discovery, including registered extension API group/versions for unavailable extension API servers.
```
Kubernetes-commit: 28d649c2f5d025c2dd3b2d3c0e39fb6ca7a5b527
The issue is fixed in Kubernetes master. I will keep this issue open until the fix is cherry-picked to k8s 1.11 and released, and we update our dependency to the newer version. |
@nilebox Sorry -- updating my branch closed this by accident. |
k8s 1.11.3 has just been released https://github.com/kubernetes/kubernetes/releases/tag/v1.11.3 |
We can close this issue without updating dependencies to Kubernetes 1.11.3, since the bug was in Kubernetes itself, not in Service Catalog's use of the library. If Service Catalog is running on Kubernetes 1.11.3 or 1.12+, the bug won't occur. |
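The version rule stated in this comment can be encoded as a small check. This is a sketch under assumptions: `parse_version` and `has_discovery_fix` are hypothetical helpers, the input is assumed to look like `v1.11.3` or `1.12.0`, and any version not covered by the statement above (for example 1.10.x) is conservatively reported as not fixed, since the thread says nothing about it.

```python
import re
from typing import Tuple


def parse_version(version: str) -> Tuple[int, int, int]:
    """Parse strings such as "v1.11.3" or "1.12" into (major, minor, patch)."""
    match = re.match(r"v?(\d+)\.(\d+)(?:\.(\d+))?", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    major, minor, patch = match.groups()
    # A missing patch component is treated as 0.
    return int(major), int(minor), int(patch or 0)


def has_discovery_fix(version: str) -> bool:
    """True for Kubernetes 1.11.3+ within 1.11, and for 1.12 or later."""
    major, minor, patch = parse_version(version)
    if (major, minor) == (1, 11):
        return patch >= 3
    return (major, minor) >= (1, 12)


for candidate in ["v1.10.9", "v1.11.2", "v1.11.3", "v1.12.0"]:
    print(candidate, has_discovery_fix(candidate))
```

A check like this could gate a deployment-time warning, but it only restates the comment above and does not inspect the cluster's actual behavior.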
Bug Report
What happened:
During Service Catalog API server outage, namespace deletion proceeds without waiting for Service Catalog objects to finish their deletion.
What you expected to happen:
How to reproduce it (as minimally and precisely as possible):
1. Create namespace ns1
2. Create a ServiceInstance object in namespace ns1
3. Shut down the Service Catalog API server (e.g. set replicas: 0 in the deployment)
4. Delete the ns1 namespace
5. Namespace ns1 is missing, but the ServiceInstance object in this namespace still exists. In some cases, it is impossible to delete this object until you manually create the ns1 namespace again.