Local installation of Charmed Kubernetes with GPU never finishes #830
Comments
Hi, sorry you are having an issue. It does look like containerd is getting stuck in a loop, preventing the nodes from coming up, which I guess could be caused by the GPU driver. The kubernetes-worker charm automatically downloads the required drivers, which may be causing the issue if they have been pre-installed. Perhaps @kwmonroe may have some insights here. In the meantime it may be worth trying to set containerd to ignore the GPU to confirm that is the issue:
or trying again without pre-installing the drivers.
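A minimal sketch of the suggestion above to make containerd ignore the GPU. It assumes the containerd charm exposes a `gpu_driver` option that accepts `none` (running `juju config containerd` lists the options actually available), and that the applications are named `containerd` and `kubernetes-worker` as in the default bundle:

```shell
# Assumption: the containerd charm has a `gpu_driver` option (auto | nvidia | none).
GPU_DRIVER=none

if command -v juju >/dev/null 2>&1; then
    # Tell the charm not to install or configure NVIDIA drivers,
    # then check whether containerd and the workers settle.
    juju config containerd gpu_driver="$GPU_DRIVER" \
        && juju status containerd kubernetes-worker \
        || echo "juju command failed; check the application names" >&2
else
    echo "juju client not found on this machine" >&2
fi
```

If the units settle with the GPU ignored, that points at the driver setup rather than at the bundle itself.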
Well, if I install without pre-installing the GPU driver and CUDA, the process finishes normally:
`ubuntu@ip-10-10-11-82:~$ juju status`
App Version Status Scale Charm Channel Rev Exposed Message
Unit Workload Agent Machine Public address Ports Message
Machine State Address Inst id Base AZ Message
@iiot-architect can you provide some details on your instance? i just deployed a g5.xlarge and got:
the charmed k8s bundle is pretty heavyweight -- especially deployed to lxd. i doubt 4 cores and 16g ram will be enough, but i'm positive an 8G root filesystem won't be :) is it possible you've run out of disk space on your instance?
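One quick way to rule disk space in or out, sketched under the assumption that LXD uses a storage pool named `default` (the pool name is a guess; `lxc storage list` shows the real ones):

```shell
# Human-readable free space on the host's root filesystem.
df -h /

# Same figure in KiB (POSIX output format; the 4th column is "Available").
ROOT_AVAIL_KB=$(df -Pk / | awk 'NR==2 {print $4}')
echo "root filesystem: ${ROOT_AVAIL_KB} KiB available"

# Assumption: the LXD storage pool is called "default".
if command -v lxc >/dev/null 2>&1; then
    lxc storage info default || echo "no usable LXD pool named 'default'" >&2
fi
```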
Dear kwmonroe. No, disk space is not the problem.
According to the official blog, I think the NVIDIA driver and CUDA should be installed on the host in advance:
It seems that the containerd configuration isn't taking effect:
That blog post is 6 years old so I'm not sure how much of it is reliable any more.
Dear evilnick
Well, it seems that it's irrelevant.
In addition, I changed the instance type from g5.xlarge to g4ad.2xlarge, with the driver installed in advance, but the result hardly changed.
Dear kwmonroe. Thanks for your help.
Sure, the configuration process finished normally.
I also added the GPU to each of the workers' LXD containers, but nothing changed:
I'm trying to install Charmed Kubernetes with an NVIDIA GPU locally on an Amazon EC2 instance (g5.xlarge):
However, the process still hasn't finished after more than 3 hours:
kubernetes-control-plane keeps alternating between the messages 'Restarting snap.kubelet.daemon service' and 'Waiting for 4 kube-system pods to start'.
containerd likewise keeps alternating between 'Unpacking containerd resource' and 'containerd resource binary containerd-stress failed a version check'.
The following software was installed on the instance before the installation process:
NVIDIA GPU Driver:
https://us.download.nvidia.com/tesla/535.154.05/nvidia-driver-local-repo-ubuntu2204-535.154.05_1.0-1_amd64.deb
NVIDIA CUDA:
https://us.download.nvidia.com/tesla/535.154.05/nvidia-driver-local-repo-ubuntu2204-535.154.05_1.0-1_amd64.deb
I tried both 1.28/stable and 1.27/stable, but the symptoms were almost the same.
How can I fix this problem?