unknown runtime type: "nvidia" #4013

Closed

joey-wang97 opened this issue Mar 28, 2019 · 6 comments
Labels

area/gpu: GPU related items
help wanted: Denotes an issue that needs help from a contributor. Must meet "help wanted" guidelines.
kind/documentation: Categorizes issue or PR as related to documentation.
lifecycle/stale: Denotes an issue or PR has remained open with no activity and has become stale.
priority/awaiting-more-evidence: Lowest priority. Possibly useful, but not yet enough support to actually get it done.
r/2019q2: Issue was last reviewed 2019q2

Comments

@joey-wang97

I have installed the nvidia driver and nvidia-docker2 (https://github.com/NVIDIA/nvidia-docker), and modified the default-runtime in /etc/docker/daemon.json:

[root@localhost ~]# cat /etc/docker/daemon.json 
{
    "default-runtime": "nvidia",
    "runtimes": {
        "nvidia": {
            "path": "nvidia-container-runtime",
            "runtimeArgs": []
        }
    }
}
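
After restarting docker, a quick sanity check on the host is to ask the daemon which runtimes it registered; something like this (the output shown is illustrative for a docker of this vintage, assuming the daemon.json above was picked up):

[root@localhost ~]# docker info | grep -i runtime
 Runtimes: nvidia runc
 Default Runtime: nvidia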

I want to start minikube with nvidia as the runtime, but it fails with an error:

[root@localhost ~]# minikube start --container-runtime=nvidia
o   minikube v0.35.0 on linux (amd64)
!   Failed to generate config: unknown runtime type: "nvidia"

*   Sorry that minikube crashed. If this was unexpected, we would love to hear from you:
-   https://github.com/kubernetes/minikube/issues/new

@joey-wang97 (Author)

I can start minikube without the container-runtime option, but the GPU resources are empty.
I checked the log of my gpushare-device-plugin (https://github.com/AliyunContainerService/gpushare-scheduler-extender#device-plugin):

[root@localhost ~]# kubectl logs gpushare-device-plugin-ds-k6srx -n kube-system
I0327 11:05:00.835040       1 main.go:18] Start gpushare device plugin
I0327 11:05:00.835339       1 gpumanager.go:28] Loading NVML
I0327 11:05:00.835626       1 gpumanager.go:31] Failed to initialize NVML: could not load NVML library.
I0327 11:05:00.835699       1 gpumanager.go:32] If this is a GPU node, did you set the docker default runtime to `nvidia`?
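
A standard docker-level check that the nvidia runtime itself works, independent of Kubernetes, is to run nvidia-smi in a CUDA base image; the image tag here is only an example, and it should print the GPU table if the runtime is wired up:

[root@localhost ~]# docker run --rm --runtime=nvidia nvidia/cuda:10.0-base nvidia-smi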

@joey-wang97 (Author)

I have already restarted docker, but I still get the same error.

@afbjorklund (Collaborator) commented Mar 30, 2019

The minikube container runtime would still be docker, but you need to pass the nvidia runtime to docker:

minikube start --docker-opt default-runtime=nvidia

The use of the word "runtime" here is confusing, since it can refer to both docker/cri-o and runc/nvidia:

https://developer.nvidia.com/nvidia-container-runtime
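
In other words, there are two separate settings (flag names as of minikube v0.35):

minikube start --container-runtime=docker              # Kubernetes-level runtime: docker/containerd/cri-o
minikube start --docker-opt default-runtime=nvidia     # OCI-level runtime for docker itself: runc/nvidia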

@afbjorklund added the area/gpu, kind/documentation, and help wanted labels on Mar 30, 2019
@joey-wang97 (Author)

I executed the above command, but I got an error:

[root@localhost ~]# minikube start --docker-opt default-runtime=nvidia
o   minikube v0.35.0 on linux (amd64)
i   Tip: Use 'minikube start -p <name>' to create a new cluster, or 'minikube delete' to delete this one.
:   Restarting existing virtualbox VM for "minikube" ...
:   Waiting for SSH access ...
-   "minikube" IP address is 192.168.99.109
-   Configuring Docker as the container runtime ...
    - opt default-runtime=nvidia
!   Failed to enable container runtime: command failed: sudo systemctl restart docker
stdout: 
stderr: Job for docker.service failed because the control process exited with error code.
See "systemctl status docker.service" and "journalctl -xe" for details.
: Process exited with status 1

and I checked the docker status; it shows:

[root@localhost ~]# systemctl status docker.service -l
● docker.service - Docker Application Container Engine
   Loaded: loaded (/usr/lib/systemd/system/docker.service; enabled; vendor preset: disabled)
   Active: active (running) since Sun 2019-03-31 21:22:32 EDT; 23min ago
     Docs: https://docs.docker.com
 Main PID: 35787 (dockerd)
    Tasks: 48
   Memory: 875.1M
   CGroup: /system.slice/docker.service
           └─35787 /usr/bin/dockerd -H fd:// --containerd=/run/containerd/containerd.sock

Mar 31 21:22:31 localhost.localdomain dockerd[35787]: time="2019-03-31T21:22:31.916887651-04:00" level=info msg="Default bridge (docker0) is assigned with an IP address 172.17.0.0/16. Daemon option --bip can be used to set a preferred IP address"
Mar 31 21:22:32 localhost.localdomain dockerd[35787]: time="2019-03-31T21:22:32.281133153-04:00" level=info msg="Loading containers: done."
Mar 31 21:22:32 localhost.localdomain dockerd[35787]: time="2019-03-31T21:22:32.310985739-04:00" level=info msg="Docker daemon" commit=774a1f4 graphdriver(s)=overlay2 version=18.09.3
Mar 31 21:22:32 localhost.localdomain dockerd[35787]: time="2019-03-31T21:22:32.311133318-04:00" level=info msg="Daemon has completed initialization"
Mar 31 21:22:32 localhost.localdomain dockerd[35787]: time="2019-03-31T21:22:32.321806040-04:00" level=info msg="API listen on /var/run/docker.sock"
Mar 31 21:22:32 localhost.localdomain systemd[1]: Started Docker Application Container Engine.
Mar 31 21:37:52 localhost.localdomain dockerd[35787]: time="2019-03-31T21:37:52.726829198-04:00" level=error msg="Download failed, retrying: read tcp 192.168.6.121:55186->104.18.125.25:443: read: connection timed out"
Mar 31 21:37:52 localhost.localdomain dockerd[35787]: time="2019-03-31T21:37:52.982655996-04:00" level=error msg="Download failed, retrying: read tcp 192.168.6.121:58090->104.18.121.25:443: read: connection timed out"
Mar 31 21:39:13 localhost.localdomain dockerd[35787]: time="2019-03-31T21:39:13.173604540-04:00" level=info msg="Pull session cancelled"
Mar 31 21:39:20 localhost.localdomain dockerd[35787]: time="2019-03-31T21:39:20.053010635-04:00" level=error msg="Not continuing with pull after error: context canceled"

I didn't find any useful hints there, apart from the download failures.
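
Note that the systemctl output above is from the docker daemon on the host, while the restart that failed ran inside the minikube VM. A sketch of how one could read that daemon's logs instead (assuming the virtualbox VM from the transcript, and that minikube ssh passes the quoted command through):

[root@localhost ~]# minikube ssh "sudo journalctl -u docker --no-pager | tail -n 50"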

@tstromberg (Contributor)

@15050050972 - This error demonstrates that the VM can't access CloudFlare to pull an image. Is this running from China?

Download failed, retrying: read tcp 192.168.6.121:58090->104.18.121.25:443: read: connection timed out

The workaround here might be configuring a proxy: https://github.com/kubernetes/minikube/blob/master/docs/http_proxy.md
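
For example, a sketch based on that document; the proxy address is a placeholder to substitute:

export HTTP_PROXY=http://<proxy-host>:<port>
export HTTPS_PROXY=http://<proxy-host>:<port>
export NO_PROXY=localhost,127.0.0.1,192.168.99.0/24
minikube start --docker-env HTTP_PROXY=$HTTP_PROXY \
               --docker-env HTTPS_PROXY=$HTTPS_PROXY \
               --docker-env NO_PROXY=$NO_PROXY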

@tstromberg added the priority/awaiting-more-evidence and triage/needs-information labels on Apr 4, 2019
@tstromberg added the r/2019q2 label and removed the triage/needs-information label on May 23, 2019
@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot added the lifecycle/stale label on Aug 21, 2019