Skip to content
This repository has been archived by the owner on May 6, 2022. It is now read-only.

Orphaned Service Catalog objects after Namespace deletion during Service Catalog API server outage #2254

Closed
nilebox opened this issue Aug 2, 2018 · 15 comments · Fixed by kubernetes/kubernetes#66932
Labels
kind/bug Categorizes issue or PR as related to a bug.
Milestone

Comments

@nilebox
Copy link
Contributor

nilebox commented Aug 2, 2018

Bug Report

What happened:
During Service Catalog API server outage, namespace deletion proceeds without waiting for Service Catalog objects to finish their deletion.

What you expected to happen:

  1. Namespace shouldn't proceed with deletion during Service Catalog outage. We may add Service Catalog finalizer to every namespace to achieve this.
  2. Otherwise, when we reconcile Service Catalog object and detect that its parent namespace is missing, we should probably initiate its deletion?

How to reproduce it (as minimally and precisely as possible):

  1. Create namespace ns1
  2. Create a ServiceInstance object in namespace ns1
  3. Undeploy Service Catalog API server (you could simply set replicas: 0 in the deployment)
  4. Delete ns1 namespace
  5. Deploy Service Catalog API server
  6. Namespace ns1 is missing, but ServiceInstance object in this namespace still exists. In some cases, it is impossible to delete this object until you manually create ns1 namespace again.
@nilebox
Copy link
Contributor Author

nilebox commented Aug 2, 2018

The issue is caused by the Namespace controller deletion logic:

In the happy path it works fine, but when API server goes down, it unregisters itself from the list of "API versions", which you can manually check via kubectl api-versions (for Service Catalog you'll find servicecatalog.k8s.io/v1beta1 in this list, and when it's down it will be missing). Since a particular apiVersion is not registered anymore, Namespace controller doesn't "see" corresponding types of resources anymore, and leaves them untouched.

@liggitt I believe this logic (and corresponding issue) applies not only to Service Catalog API server, but to API Extensions (CRD) API server as well, and to any other non-core API server. What is the recommended approach to correctly handle this situation?

@nilebox
Copy link
Contributor Author

nilebox commented Aug 2, 2018

While this issue may sound like a very rare corner-case, in reality it's easy to run into this issue with long asynchronous operations and finalizers. Example:

  1. ServiceInstance is created, and asynchronous provisioning is started.
  2. Namespace is marked for deletion, and remains in Terminating state, because ServiceInstance has a kubernetes/service-catalog finalizer which blocks its deletion.
  3. Once there is an outage of Service Catalog API server (e.g. possibly during update with a downtime), namespace gets deleted.
  4. Service Catalog continues processing ServiceInstance up until it is ready to remove the finalizer, but the namespace is gone already.

Also, it's easy to run into this issue when Service Catalog crashes due to some bug with just one replica in the deployment (which may be not easy to notice since Kubernetes silently restarts the crashing pod).

@duglin
Copy link
Contributor

duglin commented Aug 2, 2018 via email

@duglin
Copy link
Contributor

duglin commented Aug 2, 2018 via email

@liggitt
Copy link

liggitt commented Aug 2, 2018

when API server goes down, it unregisters itself from the list of "API versions"

What version are you running? This should be leaving the APIService registered. If the backing server is unavailable, the aggregator will return 503 errors when accessed, which would fail resource discovery and block namespace deletion. The 503 response handling was fixed in kubernetes/kubernetes#58070

@nilebox
Copy link
Contributor Author

nilebox commented Aug 2, 2018

What version are you running?

We are running k8s 1.11

@liggitt Based on the behavior I see, I think the 503 change you referenced only applies to a proxy that intercepts request to an actual API server, i.e. when you will try to fetch an APIResourceList from a missing API server, you will get 503 from aggregator without it sending actual request to the underlying server.

Namespace controller is not sending a request directly to the Service Catalog API server though. It just sends a single request to the aggregated server discovery endpoint, which handles unavailable API servers differently: https://github.com/kubernetes/kubernetes/blob/master/staging/src/k8s.io/kube-aggregator/pkg/apiserver/handler_apis.go#L95-L97
i.e. it just skips APIService objects that don't have Available: true condition.


When Service Catalog API server is available, corresponding APIService status is:

status:
  conditions:
  - lastTransitionTime: 2018-08-02T14:20:42Z
    message: all checks passed
    reason: Passed
    status: "True"
    type: Available

And kubectl api-versions result contains Service Catalog:

❯ kubectl api-versions | grep servicecatalog
servicecatalog.k8s.io/v1beta1

When I scale Service Catalog deployment to zero, I get the status:

status:
  conditions:
  - lastTransitionTime: 2018-08-02T14:19:32Z
    message: endpoints for service/service-catalog-apiserver in "service-catalog"
      have no addresses
    reason: MissingEndpoints
    status: "False"
    type: Available

But kubectl api-versions returns the result without errors, it's just missing the Service Catalog group.

As a result, Namespace controller doesn't know about the existence of Service Catalog API server, so it proceeds with the namespace deletion without blocking.

@liggitt
Copy link

liggitt commented Aug 2, 2018

It is possible that the command line api-versions command is tolerating a partial discovery error. The namespace controller does not. What is the response/content of the following endpoints:

/apis
/apis/servicecatalog.k8s.io
/apis/servicecatalog.k8s.io/v1beta1

@liggitt
Copy link

liggitt commented Aug 2, 2018

Can you fetch the non-truncated output of those endpoints when the APIService is in the unavailable state?

@nilebox
Copy link
Contributor Author

nilebox commented Aug 2, 2018

@liggitt sure:

  1. /apis
    Response (servicecatalog is missing):
{
  "kind": "APIGroupList",
  "apiVersion": "v1",
  "groups": [
    {
      "name": "apiregistration.k8s.io",
      "versions": [
        {
          "groupVersion": "apiregistration.k8s.io/v1",
          "version": "v1"
        },
        {
          "groupVersion": "apiregistration.k8s.io/v1beta1",
          "version": "v1beta1"
        }
      ],
      "preferredVersion": {
        "groupVersion": "apiregistration.k8s.io/v1",
        "version": "v1"
      }
    },
    {
      "name": "extensions",
      "versions": [
        {
          "groupVersion": "extensions/v1beta1",
          "version": "v1beta1"
        }
      ],
      "preferredVersion": {
        "groupVersion": "extensions/v1beta1",
        "version": "v1beta1"
      }
    },
    {
      "name": "apps",
      "versions": [
        {
          "groupVersion": "apps/v1",
          "version": "v1"
        },
        {
          "groupVersion": "apps/v1beta2",
          "version": "v1beta2"
        },
        {
          "groupVersion": "apps/v1beta1",
          "version": "v1beta1"
        }
      ],
      "preferredVersion": {
        "groupVersion": "apps/v1",
        "version": "v1"
      }
    },
    {
      "name": "events.k8s.io",
      "versions": [
        {
          "groupVersion": "events.k8s.io/v1beta1",
          "version": "v1beta1"
        }
      ],
      "preferredVersion": {
        "groupVersion": "events.k8s.io/v1beta1",
        "version": "v1beta1"
      }
    },
    {
      "name": "authentication.k8s.io",
      "versions": [
        {
          "groupVersion": "authentication.k8s.io/v1",
          "version": "v1"
        },
        {
          "groupVersion": "authentication.k8s.io/v1beta1",
          "version": "v1beta1"
        }
      ],
      "preferredVersion": {
        "groupVersion": "authentication.k8s.io/v1",
        "version": "v1"
      }
    },
    {
      "name": "authorization.k8s.io",
      "versions": [
        {
          "groupVersion": "authorization.k8s.io/v1",
          "version": "v1"
        },
        {
          "groupVersion": "authorization.k8s.io/v1beta1",
          "version": "v1beta1"
        }
      ],
      "preferredVersion": {
        "groupVersion": "authorization.k8s.io/v1",
        "version": "v1"
      }
    },
    {
      "name": "autoscaling",
      "versions": [
        {
          "groupVersion": "autoscaling/v1",
          "version": "v1"
        },
        {
          "groupVersion": "autoscaling/v2beta1",
          "version": "v2beta1"
        }
      ],
      "preferredVersion": {
        "groupVersion": "autoscaling/v1",
        "version": "v1"
      }
    },
    {
      "name": "batch",
      "versions": [
        {
          "groupVersion": "batch/v1",
          "version": "v1"
        },
        {
          "groupVersion": "batch/v1beta1",
          "version": "v1beta1"
        },
        {
          "groupVersion": "batch/v2alpha1",
          "version": "v2alpha1"
        }
      ],
      "preferredVersion": {
        "groupVersion": "batch/v1",
        "version": "v1"
      }
    },
    {
      "name": "certificates.k8s.io",
      "versions": [
        {
          "groupVersion": "certificates.k8s.io/v1beta1",
          "version": "v1beta1"
        }
      ],
      "preferredVersion": {
        "groupVersion": "certificates.k8s.io/v1beta1",
        "version": "v1beta1"
      }
    },
    {
      "name": "networking.k8s.io",
      "versions": [
        {
          "groupVersion": "networking.k8s.io/v1",
          "version": "v1"
        }
      ],
      "preferredVersion": {
        "groupVersion": "networking.k8s.io/v1",
        "version": "v1"
      }
    },
    {
      "name": "policy",
      "versions": [
        {
          "groupVersion": "policy/v1beta1",
          "version": "v1beta1"
        }
      ],
      "preferredVersion": {
        "groupVersion": "policy/v1beta1",
        "version": "v1beta1"
      }
    },
    {
      "name": "rbac.authorization.k8s.io",
      "versions": [
        {
          "groupVersion": "rbac.authorization.k8s.io/v1",
          "version": "v1"
        },
        {
          "groupVersion": "rbac.authorization.k8s.io/v1beta1",
          "version": "v1beta1"
        },
        {
          "groupVersion": "rbac.authorization.k8s.io/v1alpha1",
          "version": "v1alpha1"
        }
      ],
      "preferredVersion": {
        "groupVersion": "rbac.authorization.k8s.io/v1",
        "version": "v1"
      }
    },
    {
      "name": "storage.k8s.io",
      "versions": [
        {
          "groupVersion": "storage.k8s.io/v1",
          "version": "v1"
        },
        {
          "groupVersion": "storage.k8s.io/v1beta1",
          "version": "v1beta1"
        }
      ],
      "preferredVersion": {
        "groupVersion": "storage.k8s.io/v1",
        "version": "v1"
      }
    },
    {
      "name": "admissionregistration.k8s.io",
      "versions": [
        {
          "groupVersion": "admissionregistration.k8s.io/v1beta1",
          "version": "v1beta1"
        }
      ],
      "preferredVersion": {
        "groupVersion": "admissionregistration.k8s.io/v1beta1",
        "version": "v1beta1"
      }
    },
    {
      "name": "apiextensions.k8s.io",
      "versions": [
        {
          "groupVersion": "apiextensions.k8s.io/v1beta1",
          "version": "v1beta1"
        }
      ],
      "preferredVersion": {
        "groupVersion": "apiextensions.k8s.io/v1beta1",
        "version": "v1beta1"
      }
    },
    {
      "name": "scheduling.k8s.io",
      "versions": [
        {
          "groupVersion": "scheduling.k8s.io/v1beta1",
          "version": "v1beta1"
        }
      ],
      "preferredVersion": {
        "groupVersion": "scheduling.k8s.io/v1beta1",
        "version": "v1beta1"
      }
    },
    {
      "name": "creator.voyager.atl-paas.net",
      "versions": [
        {
          "groupVersion": "creator.voyager.atl-paas.net/v1",
          "version": "v1"
        }
      ],
      "preferredVersion": {
        "groupVersion": "creator.voyager.atl-paas.net/v1",
        "version": "v1"
      }
    },
    {
      "name": "reporter.voyager.atl-paas.net",
      "versions": [
        {
          "groupVersion": "reporter.voyager.atl-paas.net/v1",
          "version": "v1"
        }
      ],
      "preferredVersion": {
        "groupVersion": "reporter.voyager.atl-paas.net/v1",
        "version": "v1"
      }
    },
    {
      "name": "ops-gateway.voyager.atl-paas.net",
      "versions": [
        {
          "groupVersion": "ops-gateway.voyager.atl-paas.net/v1",
          "version": "v1"
        }
      ],
      "preferredVersion": {
        "groupVersion": "ops-gateway.voyager.atl-paas.net/v1",
        "version": "v1"
      }
    },
    {
      "name": "composition.voyager.atl-paas.net",
      "versions": [
        {
          "groupVersion": "composition.voyager.atl-paas.net/v1",
          "version": "v1"
        }
      ],
      "preferredVersion": {
        "groupVersion": "composition.voyager.atl-paas.net/v1",
        "version": "v1"
      }
    },
    {
      "name": "formation.voyager.atl-paas.net",
      "versions": [
        {
          "groupVersion": "formation.voyager.atl-paas.net/v1",
          "version": "v1"
        }
      ],
      "preferredVersion": {
        "groupVersion": "formation.voyager.atl-paas.net/v1",
        "version": "v1"
      }
    },
    {
      "name": "ops.voyager.atl-paas.net",
      "versions": [
        {
          "groupVersion": "ops.voyager.atl-paas.net/v1",
          "version": "v1"
        }
      ],
      "preferredVersion": {
        "groupVersion": "ops.voyager.atl-paas.net/v1",
        "version": "v1"
      }
    },
    {
      "name": "orchestration.voyager.atl-paas.net",
      "versions": [
        {
          "groupVersion": "orchestration.voyager.atl-paas.net/v1",
          "version": "v1"
        }
      ],
      "preferredVersion": {
        "groupVersion": "orchestration.voyager.atl-paas.net/v1",
        "version": "v1"
      }
    },
    {
      "name": "smith.atlassian.com",
      "versions": [
        {
          "groupVersion": "smith.atlassian.com/v1",
          "version": "v1"
        }
      ],
      "preferredVersion": {
        "groupVersion": "smith.atlassian.com/v1",
        "version": "v1"
      }
    },
    {
      "name": "bitnami.com",
      "versions": [
        {
          "groupVersion": "bitnami.com/v1alpha1",
          "version": "v1alpha1"
        }
      ],
      "preferredVersion": {
        "groupVersion": "bitnami.com/v1alpha1",
        "version": "v1alpha1"
      }
    },
    {
      "name": "clusterregistry.k8s.io",
      "versions": [
        {
          "groupVersion": "clusterregistry.k8s.io/v1alpha1",
          "version": "v1alpha1"
        }
      ],
      "preferredVersion": {
        "groupVersion": "clusterregistry.k8s.io/v1alpha1",
        "version": "v1alpha1"
      }
    },
    {
      "name": "contour.heptio.com",
      "versions": [
        {
          "groupVersion": "contour.heptio.com/v1beta1",
          "version": "v1beta1"
        }
      ],
      "preferredVersion": {
        "groupVersion": "contour.heptio.com/v1beta1",
        "version": "v1beta1"
      }
    },
    {
      "name": "metrics.k8s.io",
      "versions": [
        {
          "groupVersion": "metrics.k8s.io/v1beta1",
          "version": "v1beta1"
        }
      ],
      "preferredVersion": {
        "groupVersion": "metrics.k8s.io/v1beta1",
        "version": "v1beta1"
      }
    }
  ]
}
  1. /apis/servicecatalog.k8s.io
    Returns 404

  2. /apis/servicecatalog.k8s.io/v1beta1
    Returns 503

@nilebox
Copy link
Contributor Author

nilebox commented Aug 2, 2018

k8s exact version:

Server Version: version.Info{Major:"1", Minor:"11", GitVersion:"v1.11.1", GitCommit:"b1b29978270dc22fecc592ac55d903350454310a", GitTreeState:"clean", BuildDate:"2018-07-17T18:43:26Z", GoVersion:"go1.10.3", Compiler:"gc", Platform:"linux/amd64"}

@nilebox
Copy link
Contributor Author

nilebox commented Aug 2, 2018

From kubectl api-versions -v=8 I see that it's using /api and /apis endpoints. The second is missing servicecatalog.k8s.io API group in the response .

Namespace controller does the same (plus extra methods based on the list returned in /apis):

  1. Function for fetching a list is set there: https://github.com/kubernetes/kubernetes/blob/master/cmd/kube-controller-manager/app/core.go#L265

  2. Which fetches a list of API groups in the first line: https://github.com/kubernetes/kubernetes/blob/master/staging/src/k8s.io/client-go/discovery/discovery_client.go#L257

  3. Which internally invokes /api and then /apis: https://github.com/kubernetes/kubernetes/blob/master/staging/src/k8s.io/client-go/discovery/discovery_client.go#L149

@nilebox nilebox added the kind/bug Categorizes issue or PR as related to a bug. label Aug 3, 2018
@nilebox nilebox added this to the 1.0.0 milestone Aug 4, 2018
k8s-publishing-bot added a commit to kubernetes/kube-aggregator that referenced this issue Aug 8, 2018
Automatic merge from submit-queue (batch tested with PRs 66394, 66888, 66932). If you want to cherry-pick this change to another branch, please follow the instructions <a href="https://github.com/kubernetes/community/blob/master/contributors/devel/cherry-picks.md">here</a>.

Include unavailable apiservices in discovery response

**What this PR does / why we need it**:
Include unavailable apiservices into `apis/` discovery endpoint response to fix namespace deletion kubernetes-retired/service-catalog#2254

**Which issue(s) this PR fixes** *(optional, in `fixes #<issue number>(, fixes #<issue_number>, ...)` format, will close the issue(s) when PR gets merged)*:
Fixes kubernetes-retired/service-catalog#2254

**Special notes for your reviewer**:

**Release note**:

```release-note
kube-apiserver now includes all registered API groups in discovery, including registered extension API group/versions for unavailable extension API servers.
```

Kubernetes-commit: 28d649c2f5d025c2dd3b2d3c0e39fb6ca7a5b527
sttts pushed a commit to sttts/kube-aggregator that referenced this issue Aug 9, 2018
Automatic merge from submit-queue (batch tested with PRs 66394, 66888, 66932). If you want to cherry-pick this change to another branch, please follow the instructions <a href="https://github.com/kubernetes/community/blob/master/contributors/devel/cherry-picks.md">here</a>.

Include unavailable apiservices in discovery response

**What this PR does / why we need it**:
Include unavailable apiservices into `apis/` discovery endpoint response to fix namespace deletion kubernetes-retired/service-catalog#2254

**Which issue(s) this PR fixes** *(optional, in `fixes #<issue number>(, fixes #<issue_number>, ...)` format, will close the issue(s) when PR gets merged)*:
Fixes kubernetes-retired/service-catalog#2254

**Special notes for your reviewer**:

**Release note**:

```release-note
kube-apiserver now includes all registered API groups in discovery, including registered extension API group/versions for unavailable extension API servers.
```

Kubernetes-commit: 28d649c2f5d025c2dd3b2d3c0e39fb6ca7a5b527
@nilebox nilebox reopened this Aug 9, 2018
@nilebox
Copy link
Contributor Author

nilebox commented Aug 9, 2018

The issue is fixed in Kubernetes master, will keep this issue open until it gets cherry picked to k8s 1.11, released and we update dependency to a newer version.
See kubernetes/kubernetes#67154

@cblecker
Copy link

cblecker commented Aug 9, 2018

@nilebox Sorry -- updating my branch closed this by accident.

@nilebox
Copy link
Contributor Author

nilebox commented Sep 10, 2018

k8s 1.11.3 has just been released https://github.com/kubernetes/kubernetes/releases/tag/v1.11.3

@nilebox
Copy link
Contributor Author

nilebox commented Sep 14, 2018

We can close this issue without updating dependencies to kubernetes 1.11.3, since the bug was in the Kubernetes itself, not in the Service Catalog using the library.

If Service Catalog is running in Kubernetes 1.11.3 or 1.12+, the bug won't occur.

@nilebox nilebox closed this as completed Sep 14, 2018
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
kind/bug Categorizes issue or PR as related to a bug.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants