Can't start k3s from private registry #2093

Closed
ghost opened this issue Aug 4, 2020 · 17 comments

ghost commented Aug 4, 2020

I am trying to start a single-node k3s server on CentOS 7 in an air-gapped environment. I can start it using the manual deployment method, where you put the tarball of Docker images in /var/lib/rancher/k3s/agent/images.
But I want to put the images on my secured, private Docker registry and deploy that way.
I created the registries.yaml file that goes in /etc/rancher/k3s/ and included the FQDN of my registry and the paths to the appropriate certificates. SELinux is disabled, and I have also tried it in permissive mode.
When I run the install script like this:
INSTALL_K3S_SKIP_DOWNLOAD=true ./install.sh
it runs successfully, but if I check the status of the pods they are in the "CrashLoopBackOff" state.
From there I run:
k3s kubectl describe pods --all-namespaces
to see what the problem with the pods is, and this is the output I see:

Name:         coredns-8655855d6-8chct
Namespace:    kube-system
Priority:     0
Node:         hostname.sys/192.168.0.1
Start Time:   Tue, 04 Aug 2020 17:22:27 +0000
Labels:       k8s-app=kube-dns
              pod-template-hash=8655855d6
Annotations:  <none>
Status:       Running
IP:           10.42.0.227
IPs:
  IP:           10.42.0.227
Controlled By:  ReplicaSet/coredns-8655855d6
Containers:
  coredns:
    Container ID:  containerd://a7eb5f94da53e7243fe9c1aee51dcce00a6e36ff4fa61ffab795f3d1a5efcf26
    Image:         rancher/coredns-coredns:1.6.3
    Image ID:      docker.io/rancher/coredns-coredns@sha256:7eb40906c31a1610d9c1aeb5c818da5f68029f3e772ac226e2eac67965537017
    Ports:         53/UDP, 53/TCP, 9153/TCP
    Host Ports:    0/UDP, 0/TCP, 0/TCP
    Args:
      -conf
      /etc/coredns/Corefile
    State:          Waiting
      Reason:       CrashLoopBackOff
    Last State:     Terminated
      Reason:       StartError
      Message:      failed to create containerd task: OCI runtime create failed: container_linux.go:341: creating new parent process caused "container_linux.go:1923: running lstat on namespace path \"/proc/7248/ns/ipc\" caused \"lstat /proc/7248/ns/ipc: no such file or directory\"": unknown
      Exit Code:    128
      Started:      Thu, 01 Jan 1970 00:00:00 +0000
      Finished:     Tue, 04 Aug 2020 17:47:52 +0000
    Ready:          False
    Restart Count:  10
    Limits:
      memory:  170Mi
    Requests:
      cpu:        100m
      memory:     70Mi
    Liveness:     http-get http://:8080/health delay=60s timeout=5s period=10s #success=1 #failure=5
    Readiness:    http-get http://:8181/ready delay=10s timeout=5s period=10s #success=1 #failure=5
    Environment:  <none>
    Mounts:
      /etc/coredns from config-volume (ro)
      /var/run/secrets/kubernetes.io/serviceaccount from coredns-token-9snxh (ro)
Conditions:
  Type              Status
  Initialized       True 
  Ready             False 
  ContainersReady   False 
  PodScheduled      True 
Volumes:
  config-volume:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      coredns
    Optional:  false
  coredns-token-9snxh:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  coredns-token-9snxh
    Optional:    false
QoS Class:       Burstable
Node-Selectors:  beta.kubernetes.io/os=linux
Tolerations:     CriticalAddonsOnly
                 node-role.kubernetes.io/master:NoSchedule
                 node.kubernetes.io/not-ready:NoExecute for 300s
                 node.kubernetes.io/unreachable:NoExecute for 300s
Events:
  Type     Reason          Age                   From                          Message
  ----     ------          ----                  ----                          -------
  Normal   Scheduled       <unknown>             default-scheduler             Successfully assigned kube-system/coredns-8655855d6-8chct to hostname.sys
  Normal   Pulling         27m                   kubelet, hostname.sys  Pulling image "rancher/coredns-coredns:1.6.3"
  Normal   Pulled          27m                   kubelet, hostname.sys  Successfully pulled image "rancher/coredns-coredns:1.6.3"
  Warning  Failed          27m                   kubelet, hostname.sys  Error: failed to get sandbox container task: no running task found: task 6591f7ce943a7344e25cee3637250bcef1f564fb20d1d6c75e97b5055346c992 not found: not found
  Warning  Failed          26m                   kubelet, hostname.sys  Error: failed to create containerd task: OCI runtime create failed: container_linux.go:341: creating new parent process caused "container_linux.go:1923: running lstat on namespace path \"/proc/9260/ns/ipc\" caused \"lstat /proc/9260/ns/ipc: no such file or directory\"": unknown
  Normal   Pulled          26m (x2 over 26m)     kubelet, hostname.sys  Container image "rancher/coredns-coredns:1.6.3" already present on machine
  Normal   Created         26m (x2 over 26m)     kubelet, hostname.sys  Created container coredns
  Warning  Failed          26m                   kubelet, hostname.sys  Error: failed to create containerd task: OCI runtime create failed: container_linux.go:341: creating new parent process caused "container_linux.go:1923: running lstat on namespace path \"/proc/9857/ns/ipc\" caused \"lstat /proc/9857/ns/ipc: no such file or directory\"": unknown
  Warning  BackOff         7m3s (x647 over 26m)  kubelet, hostname.sys  Back-off restarting failed container
  Normal   SandboxChanged  113s (x742 over 27m)  kubelet, hostname.sys  Pod sandbox changed, it will be killed

Does anyone have any clue as to why the sandbox container is not starting and what I need to do to fix this?
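
(For anyone digging into this, a rough sketch of places to look for more detail; it assumes k3s was installed via install.sh and therefore runs as the k3s systemd unit, and uses the crictl that the k3s binary bundles:)

journalctl -u k3s --no-pager | tail -n 100                          # k3s server + kubelet logs
tail -n 100 /var/lib/rancher/k3s/agent/containerd/containerd.log    # embedded containerd log
k3s crictl pods                                                     # sandbox (pause) containers as containerd sees them
k3s crictl ps -a                                                    # all containers, including failed ones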

brandond commented Aug 4, 2020

  1. Please fill out the issue template so that we can capture all the relevant information about your environment.
  2. How did you go about pushing all the airgap images to your private mirror?
  3. What is the content of your registries.yaml?
  4. Are there any interesting errors in /var/lib/rancher/k3s/agent/containerd/containerd.log?

ghost commented Aug 4, 2020

@brandond, Thank you for your quick response!

  1. Where is the issue template?
  2. I did the following to get the airgap images into my registry (a scripted version of these steps is sketched after the log excerpt below):
    docker load -i k3s-airgap-images-amd64.tar
    docker tag docker.io/rancher/<image-name> registry.sys/rancher/<image-name>
    docker push registry.sys/rancher/<image-name>
  3. registries.yaml file looks like this:
mirrors:
  docker.io:
    endpoint:
      - "https://registry.sys"
configs:
  "registry.sys":
    auth:
      username: username 
      password: password 
    tls:
      cert_file: /path/certs/ca.cert # path to the cert file used in the registry
      key_file:  /path/certs/ca.key  # path to the key file used in the registry
      ca_file:   /path/certs/ca.crt  # path to the ca file used in the registry
  4. Errors I noticed in /var/lib/rancher/k3s/agent/containerd/containerd.log:
time="2020-08-04T18:46:43.677588013Z" level=error msg="StopPodSandbox for \"2f6c1a69f2d7c3c3037985280c44a689816fa23d36053a0d065aff7bd5eed278\" failed" error="an error occurred when try to find sandbox \"2f6c1a69f2d7c3c3037985280c44a689816fa23d36053a0d065aff7bd5eed278\": does not exist"
time="2020-08-04T18:46:43.678067097Z" level=info msg="StopPodSandbox for \"7bd1e84d73d0d39c2d9135b06965b20eef08c33fcf32e2384becf0877cf0bd8c\""
time="2020-08-04T18:46:43.678165023Z" level=error msg="StopPodSandbox for \"7bd1e84d73d0d39c2d9135b06965b20eef08c33fcf32e2384becf0877cf0bd8c\" failed" error="an error occurred when try to find sandbox \"7bd1e84d73d0d39c2d9135b06965b20eef08c33fcf32e2384becf0877cf0bd8c\": does not exist"
time="2020-08-04T18:46:43.678754547Z" level=info msg="StopPodSandbox for \"821f4605c0d085c29d209868fe9819c94bb986d7097565935cd2f76bc2d48a14\""
time="2020-08-04T18:46:43.678835483Z" level=error msg="StopPodSandbox for \"821f4605c0d085c29d209868fe9819c94bb986d7097565935cd2f76bc2d48a14\" failed" error="an error occurred when try to find sandbox \"821f4605c0d085c29d209868fe9819c94bb986d7097565935cd2f76bc2d48a14\": does not exist"
time="2020-08-04T18:46:43.679745796Z" level=info msg="StopPodSandbox for \"1b3d3b450249b7c16b058da23bbdcdb7f785c922338dfa93f9268b8fb288a3f3\""
time="2020-08-04T18:46:43.679810285Z" level=error msg="StopPodSandbox for \"1b3d3b450249b7c16b058da23bbdcdb7f785c922338dfa93f9268b8fb288a3f3\" failed" error="an error occurred when try to find sandbox \"1b3d3b450249b7c16b058da23bbdcdb7f785c922338dfa93f9268b8fb288a3f3\": does not exist"
time="2020-08-04T18:46:43.680307555Z" level=info msg="StopPodSandbox for \"fbb8d7669714957d4715ea5b6d79a1d60d6fb9501a02010b9d91a54e8f32a7fc\""
time="2020-08-04T18:46:43.680372352Z" level=error msg="StopPodSandbox for \"fbb8d7669714957d4715ea5b6d79a1d60d6fb9501a02010b9d91a54e8f32a7fc\" failed" error="an error occurred when try to find sandbox \"fbb8d7669714957d4715ea5b6d79a1d60d6fb9501a02010b9d91a54e8f32a7fc\": does not exist"

All the other messages in this log file seemed to be 'info'-level only.
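
For completeness, a scripted version of step 2 might look like this (a rough sketch; it assumes the registry is registry.sys and that the image list k3s-images.txt from the k3s release page sits next to the airgap tarball):

docker load -i k3s-airgap-images-amd64.tar
while read -r image; do
  # e.g. docker.io/rancher/coredns-coredns:1.6.3 -> registry.sys/rancher/coredns-coredns:1.6.3
  target="registry.sys/${image#docker.io/}"
  docker tag "$image" "$target"
  docker push "$target"
done < k3s-images.txt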

brandond commented Aug 4, 2020

The issue template is the bit of content that you see when you click 'New issue' and select 'Bug report' instead of clicking the option to create a new blank issue. You can find it here:
https://github.com/rancher/k3s/issues/new?assignees=&labels=&template=bug_report.md&title=

For the configs section, can you try specifying a URI that matches the endpoint, vs just the hostname? See: https://github.com/containerd/cri/blob/master/docs/registry.md#configure-registry-credentials. That would look something like this:

mirrors:
  docker.io:
    endpoint:
      - "https://registry.sys"
configs:
  "https://registry.sys":
    auth:
      username: username 
      password: password 
    tls:
      cert_file: /path/certs/ca.cert # path to the cert file used in the registry
      key_file:  /path/certs/ca.key  # path to the key file used in the registry
      ca_file:   /path/certs/ca.crt  # path to the ca file used in the registry

The private registry docs could use some work - both on our side and in containerd. I'm hoping to get to that shortly.

Related docs tracking issue: #1802

brandond commented Aug 4, 2020

Your issue also sounds similar to this one, although it seems you are not using multiarch. I wonder if you may need to take some extra steps to get all the right layers tagged and pushed: #1285
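
One quick check along those lines (a sketch; docker manifest may need the experimental CLI enabled depending on the Docker version, and registry.sys is assumed as the private registry):

# compare the upstream tag with what actually landed in the private registry
docker manifest inspect docker.io/rancher/coredns-coredns:1.6.3
docker manifest inspect registry.sys/rancher/coredns-coredns:1.6.3
# note: a plain docker tag + docker push of a loaded image pushes only the
# single architecture that was loaded, not the upstream manifest list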

ghost commented Aug 4, 2020

Environmental Info:
K3s Version:

k3s version v1.18.4+k3s1 (97b7a0e)

Node(s) CPU architecture, OS, and Version:

Linux hostname.sys 3.10.0-1062.18.1.el7.x86_64 #1 SMP Tue Mar 17 23:49:17 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux

Cluster Configuration:

Single node configuration

Describe the bug:

K3s does not start when using a secured, private registry
Steps To Reproduce:

  • Retag and push all the k3s airgap images to my private Docker registry
  • Create a registries.yaml file (contents above) in /etc/rancher/k3s/
  • Run INSTALL_K3S_SKIP_DOWNLOAD=true ./install.sh
  • Observe that the pods have not started

Expected behavior:

  • Output of k3s kubectl get pods --all-namespaces should be:
NAMESPACE     NAME                                     READY   STATUS             RESTARTS   AGE
kube-system   local-path-provisioner-6d59f47c7-mjrzs   1/1     Running            0          10m
kube-system   metrics-server-7566d596c8-lkdnc          1/1     Running            0          10m
kube-system   helm-install-traefik-tfjjh               0/1     Completed          1          10m
kube-system   svclb-traefik-zjj5s                      2/2     Running            0          10m
kube-system   coredns-8655855d6-xsjv7                  1/1     Running            0          10m
kube-system   traefik-758cd5fc85-lrmqr                 1/1     Running            0          10m

Actual behavior:

  • When I run k3s kubectl get pods --all-namespaces:
NAMESPACE     NAME                                     READY   STATUS             RESTARTS   AGE
kube-system   helm-install-traefik-89rjr               0/1     CrashLoopBackOff   36         158m
kube-system   local-path-provisioner-6d59f47c7-cwzs8   0/1     CrashLoopBackOff   36         158m
kube-system   coredns-8655855d6-8chct                  0/1     CrashLoopBackOff   36         158m
kube-system   metrics-server-7566d596c8-jgwm4          0/1     CrashLoopBackOff   36         158m

This behavior prevents me from using any of the k3s features

Additional context / logs:

ghost commented Aug 4, 2020

It is possible this is a certificates issue. I am not very familiar with how to use certs properly; I have just followed a couple of online tutorials to get to where I am.
I have a Docker registry container running with an Nginx reverse proxy (also a container) in front of it to secure it. I created a domain.key and a domain.crt with an openssl command, and those are mounted into the Nginx container. Then I copied the .crt and .key files to my k3s host, and that is what I am referencing in my registries.yaml file. Should I be making an entirely new .crt file for each new host? I'm new to this and really appreciate your attention!

brandond commented Aug 4, 2020

You only need a cert_file and key_file if you want to use a client certificate to authenticate to the registry. You should NOT be providing the registry's HTTPS cert and key as cert_file and key_file. If you're using a self-signed cert on the registry's https endpoint, then you can provide that cert as the ca_file in order to avoid certificate validation failures.

For an internal registry, you can also use http instead of https and skip messing about with certs at all.
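
For reference, a registries.yaml along those lines might look like this (a rough sketch; it assumes the registry is registry.sys, that its self-signed cert lives at /path/certs/ca.crt on the k3s host, and that basic auth is still wanted):

cat > /etc/rancher/k3s/registries.yaml <<'EOF'
mirrors:
  docker.io:
    endpoint:
      - "https://registry.sys"
configs:
  "registry.sys":               # the thread below also tries "https://registry.sys" here
    auth:
      username: username
      password: password
    tls:
      ca_file: /path/certs/ca.crt   # the registry's own cert; no cert_file/key_file
EOF
# registries.yaml is only read at startup, so restart k3s afterwards
systemctl restart k3s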

ghost commented Aug 4, 2020

It is an internal registry; however, as of now it is a requirement that it be secured. Maybe I'll be able to persuade otherwise.
I tried providing only the ca_file in the registries.yaml file like you suggested. I took that to mean that my proxy's cert file is what I should provide as the value for ca_file.
I think it makes more progress... but it fails as well. The output from kubectl describe shows this:
failed to pull and unpack image "docker.io/rancher/pause:3.1": failed to resolve reference "docker.io/rancher/pause:3.1": failed to do request: Head https://registry.sys/v2/rancher/pause/manifests/3.1: x509: certificate signed by unknown authority

Do you know why this would be?

I ran a sanity check with curl:
curl -u username --cacert ../certs/ca.cert https://registry.sys/v2/_catalog
where ../certs/ca.cert is the cert file from my registry's proxy. This curl command succeeds and displays the available images to pull as expected.
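
A further sanity check on the certificate itself could look like this (a rough sketch; it assumes the proxy answers on registry.sys:443):

# print the subject/issuer of the cert the Nginx proxy actually serves
openssl s_client -connect registry.sys:443 -servername registry.sys </dev/null 2>/dev/null \
  | openssl x509 -noout -subject -issuer -dates
# verify the served cert against the file passed as ca_file
openssl s_client -connect registry.sys:443 -servername registry.sys </dev/null 2>/dev/null \
  | openssl x509 > /tmp/served.crt
openssl verify -CAfile ../certs/ca.cert /tmp/served.crt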

brandond commented Aug 4, 2020

Are you using the same ca.cert file in the curl command as you are specifying as ca_file in your registries.yaml? Is the cert on the registry self-signed, or signed by another private CA? Do you see any different behavior if you specify

configs:
  "https://registry.sys":

vs

configs:
  "registry.sys":

ghost commented Aug 4, 2020

Yes, the ca.cert file is the same between the registries.yaml and the curl command.
When I use 'https://registry.sys', I get the 'certificate signed by unknown authority' error.
When I use 'registry.sys', I get the original issue where the sandbox container task is not found.
To me it looks like including the 'https://' is the way to go.
The cert was created with this openssl command:
openssl req -x509 -nodes -days 365 -newkey rsa:2048 -keyout domain.key -out domain.crt

Also, I should mention that the 'certificate signed by unknown authority' error occurs when k3s is trying to pull the rancher/pause image, which to my understanding is the sandbox container image, right?
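
For what it's worth, that openssl invocation prompts for the subject interactively and does not add a subjectAltName. A variant that pins the CN and SAN to the registry hostname would look like this (a rough sketch, assuming registry.sys; this concerns hostname validation rather than the "unknown authority" error itself):

cat > san.cnf <<'EOF'
[req]
distinguished_name = dn
x509_extensions    = v3_req
prompt             = no
[dn]
CN = registry.sys
[v3_req]
subjectAltName = DNS:registry.sys
EOF
openssl req -x509 -nodes -days 365 -newkey rsa:2048 \
  -keyout domain.key -out domain.crt -config san.cnf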

brandond commented Aug 5, 2020

Yes, that would be the sandbox image. It should use the same settings as all the other image pulls, but I'm not sure why it doesn't like the certificate. Upstream containerd has had some issues in this space for a while, so you may be running into that. Just for testing purposes, you might see if you can get it working via http instead of https, then try to turn https back on.
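
For the http test, something along these lines might do (a rough sketch; it assumes the registry container can also be reached over plain http, e.g. directly on port 5000 behind the proxy):

cat > /etc/rancher/k3s/registries.yaml <<'EOF'
mirrors:
  docker.io:
    endpoint:
      - "http://registry.sys:5000"   # plain-http endpoint; port 5000 is an assumption
configs:
  "registry.sys:5000":
    auth:
      username: username
      password: password
EOF
systemctl restart k3s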

ghost commented Aug 7, 2020

@brandond thanks for your help and suggestions! I went ahead and tried to install using just http, but it didn't work, and I actually ended up with an error I had seen before where the sandbox image doesn't start properly.
It seems like there are some issues with the registry deployment method at this stage, so I may just opt for the manual deployment method for the time being.

brandond commented Aug 7, 2020

It looks like there are some multiarch problems with the upstream images. This is causing issues both for arm users and for folks trying to push the airgap images to a private registry. Keep an eye on #1285 for updates.

rerime commented Sep 2, 2020

Same issue with airgap installation
k3s -v
k3s version v1.18.8+k3s1 (6b59531)

kubectl get pods -A -o wide

NAMESPACE     NAME                                     READY   STATUS             RESTARTS   AGE   IP           NODE                            NOMINATED NODE   READINESS GATES
kube-system   coredns-7944c66d8d-rznxx                 1/1     Running            0          20h   10.42.0.5    master   <none>           <none>
kube-system   local-path-provisioner-6d59f47c7-jns7d   1/1     Running            0          52m   10.42.0.8    master   <none>           <none>
kube-system   metrics-server-7566d596c8-9k29s          1/1     Running            0          51m   10.42.0.9    master   <none>           <none>
kube-system   helm-install-traefik-pcrjm               0/1     Completed          4          53m   10.42.0.7    master   <none>           <none>
kube-system   svclb-traefik-st7sq                      2/2     Running            0          51m   10.42.0.10   master   <none>           <none>
kube-system   traefik-758cd5fc85-tvzvm                 1/1     Running            0          49m   10.42.0.12   master   <none>           <none>
kube-system   svclb-traefik-nv5cr                      0/2     CrashLoopBackOff   4          15s   <none>       worker   <none>           <none>

kubectl -n kube-system describe pod svclb-traefik-nv5cr

  Type     Reason          Age                    From                                    Message
  ----     ------          ----                   ----                                    -------
  Normal   Scheduled       <unknown>              default-scheduler                       Successfully assigned kube-system/svclb-traefik-nv5cr to worker
  Warning  Failed          4m56s                  kubelet, worker  Error: failed to create containerd task: OCI runtime create failed: container_linux.go:341: creating new parent process caused "container_linux.go:1923: running lstat on namespace path \"/proc/29420/ns/ipc\" caused \"lstat /proc/29420/ns/ipc: no such file or directory\"": unknown
  Warning  Failed          4m56s                  kubelet, worker  Error: sandbox container "9af6e2a081dcb0186e69733f52ef50006f608e45beea967cc03c2851c719fc74" is not running
  Normal   Pulled          4m56s (x2 over 4m56s)  kubelet, worker  Container image "rancher/klipper-lb:v0.1.2" already present on machine
  Normal   Created         4m56s (x2 over 4m56s)  kubelet, worker  Created container lb-port-80
  Warning  Failed          4m56s                  kubelet, worker  Error: failed to create containerd task: OCI runtime create failed: container_linux.go:341: creating new parent process caused "container_linux.go:1923: running lstat on namespace path \"/proc/29605/ns/ipc\" caused \"lstat /proc/29605/ns/ipc: no such file or directory\"": unknown
  Normal   Pulled          4m56s (x2 over 4m56s)  kubelet, worker  Container image "rancher/klipper-lb:v0.1.2" already present on machine
  Normal   Created         4m56s (x2 over 4m56s)  kubelet, worker  Created container lb-port-443
  Warning  Failed          4m56s                  kubelet, worker  Error: sandbox container "2206f946cce3be6afd697dbbfb907e39e3c0af9f043120cc0ff372f119615fa3" is not running
  Normal   SandboxChanged  4m52s (x5 over 4m56s)  kubelet, worker  Pod sandbox changed, it will be killed and re-created.
  Warning  BackOff         4m52s (x4 over 4m55s)  kubelet, worker  Back-off restarting failed container
  Warning  BackOff         4m52s (x4 over 4m55s)  kubelet, worker  Back-off restarting failed container

brandond commented Sep 2, 2020

@rerime what architecture are you using? I believe there are issues with one of the upstream arm64 images that we're hoping to resolve by the 1.19 release.

rerime commented Sep 7, 2020

@brandond AMD64. Ok, thx!

stale bot commented Jul 31, 2021

This repository uses a bot to automatically label issues which have not had any activity (commit/comment/label) for 180 days. This helps us manage the community issues better. If the issue is still relevant, please add a comment to the issue so the bot can remove the label and we know it is still valid. If it is no longer relevant (or possibly fixed in the latest release), the bot will automatically close the issue in 14 days. Thank you for your contributions.

@stale stale bot added the status/stale label Jul 31, 2021
@stale stale bot closed this as completed Aug 14, 2021