On Ubuntu, nvidia-smi suddenly fails to run with the error `NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA ...`

Problem description

The system runs Ubuntu 24.04. A Docker container on it suddenly refused to start, and restarting it produced: Error response from daemon: Cannot restart container wusongdama: could not select device driver "" with capabilities: [[gpu]]. Running nvidia-smi failed with: NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.

Cause

Ubuntu updated the kernel automatically, so the installed NVIDIA driver no longer matched the running kernel and could not be loaded.
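One way to confirm this diagnosis is to compare the running kernel with the kernels the NVIDIA module has been built for. The sketch below assumes the driver is managed through DKMS (not true of every installation) and that `dkms status` prints one line per module and kernel pair. The helper name `built_for_kernel` is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: reads `dkms status`-style text on stdin and succeeds
# only if some line mentioning nvidia also mentions the given kernel release.
built_for_kernel() {
    kernel="$1"
    grep -i 'nvidia' | grep -qF -- "$kernel"
}

# Typical use on the affected machine (assumes dkms is installed):
#   dkms status | built_for_kernel "$(uname -r)" \
#       || echo "no NVIDIA module built for $(uname -r)"
```

If the check fails right after a kernel update, that is consistent with the mismatch described above, though it does not rule out other causes.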

Solution

Uninstall the old driver and install a new one:

sudo apt purge 'nvidia-*' -y
sudo apt autoremove -y
sudo apt install nvidia-driver-560 -y

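After these steps nvidia-smi printed its normal output, but restarting the Docker container still failed, most likely because purging nvidia-* also removed nvidia-container-toolkit. It has to be reinstalled and the Docker runtime reconfigured: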

sudo apt install -y nvidia-container-toolkit
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker
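To check that the configure step took effect, one can look for the `nvidia` runtime entry that `nvidia-ctk runtime configure --runtime=docker` is expected to write into Docker's daemon configuration. This is a minimal sketch, assuming the default path `/etc/docker/daemon.json`. The function name `has_nvidia_runtime` is made up for illustration, and the grep is a rough text match rather than a real JSON parse.

```shell
#!/bin/sh
# Hypothetical helper: succeeds if the given Docker daemon config
# defines a runtime named "nvidia" (rough text match, not JSON parsing).
has_nvidia_runtime() {
    config="${1:-/etc/docker/daemon.json}"
    [ -r "$config" ] && grep -q '"nvidia"[[:space:]]*:' "$config"
}

# Typical use:
#   has_nvidia_runtime && echo "nvidia runtime registered" \
#       || echo "nvidia runtime missing"
# An end-to-end check (assumes the ubuntu image can be pulled):
#   sudo docker run --rm --gpus all ubuntu nvidia-smi
```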

After that, the problem was resolved.

Disable Ubuntu's automatic upgrades

sudo systemctl stop unattended-upgrades
sudo systemctl disable unattended-upgrades
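Stopping the service may not be the whole story: as far as I understand, periodic upgrade runs are also governed by the `APT::Periodic::Unattended-Upgrade` setting, which on Ubuntu usually lives in `/etc/apt/apt.conf.d/20auto-upgrades`. The sketch below reads that value. The file path is an assumption about a default install, and the function name `unattended_upgrade_setting` is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: prints the value of APT::Periodic::Unattended-Upgrade
# found in the given apt config file, or "unset" if the key or file is absent.
unattended_upgrade_setting() {
    conf="${1:-/etc/apt/apt.conf.d/20auto-upgrades}"
    value=$(sed -n 's/^[[:space:]]*APT::Periodic::Unattended-Upgrade[[:space:]]*"\([^"]*\)".*/\1/p' "$conf" 2>/dev/null | tail -n 1)
    echo "${value:-unset}"
}

# Typical use: a printed "0" means periodic unattended upgrades are off.
#   unattended_upgrade_setting
```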