Skip to content

Commit

Permalink
Documentation for debugging CRD-based Zenith (#145)
Browse files Browse the repository at this point in the history
* Documentation for debugging CRD-based Zenith

* Add link to CRD docs

Co-authored-by: Matt Anson <matta@stackhpc.com>

* Small tweak to wording

Co-authored-by: Matt Anson <matta@stackhpc.com>

* And another

Co-authored-by: Matt Anson <matta@stackhpc.com>

* More small tweaks

Co-authored-by: Matt Anson <matta@stackhpc.com>

* Update docs/debugging/zenith-services.md

Co-authored-by: Matt Anson <matta@stackhpc.com>

* Update docs/debugging/zenith-services.md

Co-authored-by: Matt Anson <matta@stackhpc.com>

* Update docs/debugging/zenith-services.md

Co-authored-by: Matt Anson <matta@stackhpc.com>

* Update docs/debugging/zenith-services.md

Co-authored-by: Matt Anson <matta@stackhpc.com>

* Update docs/debugging/zenith-services.md

Co-authored-by: Matt Anson <matta@stackhpc.com>

* Update docs/debugging/zenith-services.md

Co-authored-by: Matt Anson <matta@stackhpc.com>

---------

Co-authored-by: Matt Anson <matta@stackhpc.com>
  • Loading branch information
mkjpryor and m-bull authored May 5, 2024
1 parent aab342d commit cd78185
Showing 1 changed file with 130 additions and 44 deletions.
174 changes: 130 additions & 44 deletions docs/debugging/zenith-services.md
Original file line number Diff line number Diff line change
Expand Up @@ -20,29 +20,74 @@ persist, try restarting the Zenith SSHD:
kubectl -n azimuth rollout restart deployment/zenith-server-sshd
```

## Client not registered in Consul
## Client not appearing in Zenith CRDs

Once a client has connected to SSHD successfully, it should get registered in
[Consul](https://www.consul.io/).
The components of Zenith communicate using three [Kubernetes CRDs](https://kubernetes.io/docs/concepts/extend-kubernetes/api-extension/custom-resources/):

To determine if this is the case, it is useful to access the Consul UI. As discussed
in [Monitoring and alerting](../configuration/14-monitoring.md), the Consul UI
is exposed as `consul.<ingress base domain>`, e.g. `consul.azimuth.example.org`,
and is protected by a username and password.
* `services.zenith.stackhpc.com`
A reserved domain and associated SSH public key.
* `endpoints.zenith.stackhpc.com`
The current endpoints for a Zenith service.
This resource is updated to add the address, port and configuration of the Zenith SSH tunnel as the SSH tunnel is created.
* `leases.zenith.stackhpc.com`
Heartbeat information for an individual SSH tunnel.
Each Zenith SSH tunnel has its own lease resource that is regularly updated with a heartbeat.

The default view shows Consul's view of the services, where you can check if the
service is being registered correctly.
If a Zenith service is not functioning as expected, check the state of the CRDs for
that service.

Clients not registering correctly in Consul usually indicates an issue with Consul
itself. Futher information for debugging Consul issues is provided in
[Debugging Consul](consul.md).
First, check that the service exists and has an SSH key associated:

If the issue persists once Consul issues are ruled out, try restarting SSHD:
```command title="On the K3s node, targetting the HA cluster if deployed"
$ kubectl -n zenith-services get services.zenith
NAME FINGERPRINT AGE
igxvo2okpkq834d1qbgtlhmm6xo4laj0dupn WLo15SbKRadA5q1WIn6dToWT4Q+j05rZ5T+Zc/so4M0 13m
sh20tp1071hl3xtjw5cj4mwdy5t0v7qodj31 G6sdXwUfvdlosCB2yi40TEf5//ie2bgCxytrig4xpTA 13m
```

```sh title="On the K3s node, targetting the HA cluster if deployed"
kubectl -n azimuth rollout restart deployment/zenith-server-sshd
Next, check that there is at least one lease for the service and verify that it is being
regularly renewed:

```command title="On the K3s node, targetting the HA cluster if deployed"
$ kubectl -n zenith-services get leases.zenith
NAME RENEWED TTL REAP AFTER AGE
sh20tp1071hl3xtjw5cj4mwdy5t0v7qodj31-fn75d 7s 20 120 13m
igxvo2okpkq834d1qbgtlhmm6xo4laj0dupn-7tnqm 5s 20 120 14m
```

Finally, check that the endpoint is registered correctly in the endpoints resource:

```command title="On the K3s node, targetting the HA cluster if deployed"
$ kubectl -n zenith-services get endpoints.zenith igxvo2okpkq834d1qbgtlhmm6xo4laj0dupn -o yaml
apiVersion: zenith.stackhpc.com/v1alpha1
kind: Endpoints
metadata:
creationTimestamp: "2024-05-01T13:24:12Z"
generation: 3
name: igxvo2okpkq834d1qbgtlhmm6xo4laj0dupn
namespace: zenith-services
ownerReferences:
- apiVersion: zenith.stackhpc.com/v1alpha1
blockOwnerDeletion: true
kind: Service
name: igxvo2okpkq834d1qbgtlhmm6xo4laj0dupn
uid: 378b39dc-9fce-4553-865e-edad2dd8d8b0
resourceVersion: "7260"
uid: 62badd6e-a5a7-452d-b346-04c2efd75a6c
spec:
endpoints:
7tnqm:
address: 10.42.0.71
config:
backend-protocol: http
skip-auth: false
port: 42109
status: passing
```

The address should be the pod IP of a Zenith SSHD pod, and the port should be the allocated port
reported by the client.

## OIDC credentials not created

Keycloak OIDC credentials for Zenith services for platforms deployed using Azimuth are created
Expand All @@ -67,45 +112,87 @@ the identity operator:
kubectl -n azimuth rollout restart deployment/azimuth-identity-operator
```

If this doesn't work, check the logs for errors:

```sh title="On the K3s node, targetting the HA cluster if deployed"
kubectl -n azimuth logs deployment/azimuth-identity-operator [-f]
```

## Kubernetes resources for the Zenith service have not been created

If the service exists in Consul, it is possible that the process that synchronises Consul
services with Kubernetes resources is not functioning correctly. To check if Kubernetes
resources are being created, run the following command and check that the `Ingress`,
`Service` and `Endpoints` resources have been created for the service:
If the CRDs for the service look correct, it is possible that the component that watches
the Zenith CRDs and creates the Kubernetes ingress for those services is not functioning
correctly.

This component creates Helm releases to deploy the resources for a service, so first check
that a Helm release exists for the service and is in the `deployed` state:

```sh title="On the K3s node, targetting the HA cluster if deployed"
$ helm -n zenith-services list -a
NAME NAMESPACE REVISION UPDATED STATUS CHART APP VERSION
igxvo2okpkq834d1qbgtlhmm6xo4laj0dupn zenith-services 1 2024-05-01 13:24:13.36622944 +0000 UTC deployed zenith-service-0.1.0+846a545e main
sh20tp1071hl3xtjw5cj4mwdy5t0v7qodj31 zenith-services 1 2024-05-01 13:24:41.219330845 +0000 UTC deployed zenith-service-0.1.0+846a545e main
```

Also check the state of the `Ingress`, `Service` and `EndpointSlice`s for the service:

```command title="On the K3s node, targetting the HA cluster if deployed"
$ kubectl -n zenith-services get ingress,service,endpoints
NAME CLASS HOSTS ADDRESS PORTS AGE
ingress.networking.k8s.io/cjzm03yczuj6oqrj3h8htl4u1bbx96qd53g nginx cjzm03yczuj6oqrj3h8htl4u1bbx96qd53g.azimuth.example.org 96.241.100.96 80, 443 2d
ingress.networking.k8s.io/i03xvflgk1zmtcsdm2x5z5lz9qz05027euw nginx i03xvflgk1zmtcsdm2x5z5lz9qz05027euw.azimuth.example.org 96.241.100.96 80, 443 2d
ingress.networking.k8s.io/pxmvy7235x2ggfvf2op615gvz2v59wkqglc nginx pxmvy7235x2ggfvf2op615gvz2v59wkqglc.azimuth.example.org 96.241.100.96 80, 443 2d
ingress.networking.k8s.io/txn3zidfdnru5rg109voh848n51rvicmr1s nginx txn3zidfdnru5rg109voh848n51rvicmr1s.azimuth.example.org 96.241.100.96 80, 443 2d

NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
service/cjzm03yczuj6oqrj3h8htl4u1bbx96qd53g ClusterIP 172.27.10.109 <none> 80/TCP 2d
service/i03xvflgk1zmtcsdm2x5z5lz9qz05027euw ClusterIP 172.30.203.227 <none> 80/TCP 2d
service/pxmvy7235x2ggfvf2op615gvz2v59wkqglc ClusterIP 172.28.51.148 <none> 80/TCP 2d
service/txn3zidfdnru5rg109voh848n51rvicmr1s ClusterIP 172.29.136.245 <none> 80/TCP 2d

NAME ENDPOINTS AGE
endpoints/cjzm03yczuj6oqrj3h8htl4u1bbx96qd53g 172.18.152.99:34665 2d
endpoints/i03xvflgk1zmtcsdm2x5z5lz9qz05027euw 172.18.152.99:39409 2d
endpoints/pxmvy7235x2ggfvf2op615gvz2v59wkqglc 172.18.152.99:44761,172.18.152.99:36483,172.18.152.99:46449 2d
endpoints/txn3zidfdnru5rg109voh848n51rvicmr1s 172.18.152.99:45379 2d
$ kubectl -n zenith-services get ingress,service,endpointslice
NAME CLASS HOSTS ADDRESS PORTS AGE
ingress.networking.k8s.io/igxvo2okpkq834d1qbgtlhmm6xo4laj0dupn-oidc nginx igxvo2okpkq834d1qbgtlhmm6xo4laj0dupn.apps.45-135-57-238.sslip.io 192.168.3.49 80, 443 25m
ingress.networking.k8s.io/igxvo2okpkq834d1qbgtlhmm6xo4laj0dupn nginx igxvo2okpkq834d1qbgtlhmm6xo4laj0dupn.apps.45-135-57-238.sslip.io 192.168.3.49 80, 443 25m
ingress.networking.k8s.io/sh20tp1071hl3xtjw5cj4mwdy5t0v7qodj31 nginx sh20tp1071hl3xtjw5cj4mwdy5t0v7qodj31.apps.45-135-57-238.sslip.io 192.168.3.49 80, 443 24m
ingress.networking.k8s.io/sh20tp1071hl3xtjw5cj4mwdy5t0v7qodj31-oidc nginx sh20tp1071hl3xtjw5cj4mwdy5t0v7qodj31.apps.45-135-57-238.sslip.io 192.168.3.49 80, 443 24m

NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
service/igxvo2okpkq834d1qbgtlhmm6xo4laj0dupn-oidc ClusterIP 10.43.247.54 <none> 80/TCP,44180/TCP 25m
service/igxvo2okpkq834d1qbgtlhmm6xo4laj0dupn ClusterIP 10.43.125.138 <none> 80/TCP 25m
service/sh20tp1071hl3xtjw5cj4mwdy5t0v7qodj31 ClusterIP 10.43.234.107 <none> 80/TCP 24m
service/sh20tp1071hl3xtjw5cj4mwdy5t0v7qodj31-oidc ClusterIP 10.43.7.75 <none> 80/TCP,44180/TCP 24m

NAME ADDRESSTYPE PORTS ENDPOINTS AGE
endpointslice.discovery.k8s.io/igxvo2okpkq834d1qbgtlhmm6xo4laj0dupn-3010d IPv4 42109 10.42.0.71 25m
endpointslice.discovery.k8s.io/igxvo2okpkq834d1qbgtlhmm6xo4laj0dupn-oidc-kqx8h IPv4 4180,44180 10.42.0.86 25m
endpointslice.discovery.k8s.io/sh20tp1071hl3xtjw5cj4mwdy5t0v7qodj31-f2086 IPv4 33857 10.42.0.71 24m
endpointslice.discovery.k8s.io/sh20tp1071hl3xtjw5cj4mwdy5t0v7qodj31-oidc-tkvqv IPv4 4180,44180 10.42.0.88 24m
```

!!! tip
!!! tip "Ingress address not assigned"

If an ingress resource does not have an IP, this may be a sign that the ingress controller
If an ingress resource does not have an address, this may be a sign that the ingress controller
is not correctly configured or not functioning correctly.

If they do not exist, try restarting the Zenith sync component:
!!! info "Services with OIDC authentication"

When a service has OIDC authentication enabled, there will be two of each resource for each
service, one of which will have the suffix `-oidc`.

Each service with OIDC authentication enabled gets a standalone service that is responsible
handling the interactions with the OIDC provider. To check the state of these resources, use:

```command title="On the K3s node, targetting the HA cluster if deployed"
$ kubectl -n zenith-services get deploy,po
NAME READY UP-TO-DATE AVAILABLE AGE
deployment.apps/igxvo2okpkq834d1qbgtlhmm6xo4laj0dupn-oidc 1/1 1 1 33m
deployment.apps/sh20tp1071hl3xtjw5cj4mwdy5t0v7qodj31-oidc 1/1 1 1 33m

NAME READY STATUS RESTARTS AGE
pod/igxvo2okpkq834d1qbgtlhmm6xo4laj0dupn-oidc-7f6656bd98-9rrj2 1/1 Running 0 33m
pod/sh20tp1071hl3xtjw5cj4mwdy5t0v7qodj31-oidc-7ffbff4cd6-sr75w 1/1 Running 0 33m
```

If any of these resources look incorrect, try restarting the Zenith sync component:

```sh title="On the K3s node, targetting the HA cluster if deployed"
kubectl -n azimuth rollout restart deployment/zenith-server-sync
```

If this doesn't work, check the logs for errors:

```sh title="On the K3s node, targetting the HA cluster if deployed"
kubectl -n azimuth logs deployment/zenith-server-sync [-f]
```

## cert-manager fails to obtain a certificate

If you are using cert-manager to dynamically allocate certificates for Zenith services it
Expand All @@ -116,10 +203,9 @@ To check if this is the case, check the state of the certificates for the Zenith

```command title="On the K3s node, targetting the HA cluster if deployed"
$ kubectl -n zenith-services get certificate
NAME READY SECRET AGE
tls-cjzm03yczuj6oqrj3h8htl4u1bbx96qd53g True tls-cjzm03yczuj6oqrj3h8htl4u1bbx96qd53g 2d
tls-i03xvflgk1zmtcsdm2x5z5lz9qz05027euw True tls-i03xvflgk1zmtcsdm2x5z5lz9qz05027euw 2d
tls-txn3zidfdnru5rg109voh848n51rvicmr1s True tls-txn3zidfdnru5rg109voh848n51rvicmr1s 2d
NAME READY SECRET AGE
tls-igxvo2okpkq834d1qbgtlhmm6xo4laj0dupn True tls-igxvo2okpkq834d1qbgtlhmm6xo4laj0dupn 30m
tls-sh20tp1071hl3xtjw5cj4mwdy5t0v7qodj31 True tls-sh20tp1071hl3xtjw5cj4mwdy5t0v7qodj31 29m
```

If the certificate for the service is not ready, check the details for the certificate using
Expand Down

0 comments on commit cd78185

Please sign in to comment.