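Installing the NVIDIA Tesla A100 80G PCIe driver on a Dell R750 running Ubuntu 20.04
I could not find a ready-made solution for this problem anywhere online, so I wrote this document.
Environment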
NVIDIA A100 80GB PCIe GPU
Dell PowerEdge R750
Ubuntu 20.04
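Conditions to confirm beforehand
- Windows was installed on the server as a test; with the driver installed there, nvidia-smi produced normal output, which shows that the server supports this GPU
- The NVIDIA A100 80GB PCIe GPU is installed in a PCIe Gen4 x16 slot
- Secure Boot is disabled in the BIOS
- iDRAC firmware version 5.00.10.20 added support for the NVIDIA A100 80GB PCIe GPU in the PowerEdge R750, PowerEdge R750xa, and PowerEdge R7525, so the firmware should be at least that version
- The NVIDIA A100 is installed in PCIe slot 7 or slot 2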
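Some of the conditions above can be checked from a running Ubuntu system. The following is only a sketch and not part of the original procedure: the helper name count_nvidia_devices is made up for illustration, and it relies on the fact that lspci -nn prints NVIDIA's PCI vendor ID as [10de:xxxx].

```shell
# count_nvidia_devices: hypothetical helper, not from the original article.
# Reads `lspci -nn` output on stdin and prints how many lines belong to
# NVIDIA devices (PCI vendor ID 10de).
count_nvidia_devices() {
    # grep -c prints 0 but exits non-zero when nothing matches,
    # so "|| true" keeps the helper from failing in that case.
    grep -c -i '\[10de:' || true
}
```

Typical use would be lspci -nn | count_nvidia_devices, which should print 2 on a server with two cards like the one in this article. The Secure Boot state can be read with mokutil --sb-state, assuming the mokutil package is installed.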
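Reproducing the problem
user@user:~$ nvidia-smi
NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.
# I also spent three days trying to install NVIDIA-Linux-x86_64-515.65.07.run without success. Switching gcc/g++ versions, changing the Ubuntu kernel, and every other fix I could find online all failed, so I gave up on that route.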
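Solution
Driver installation
sudo apt-get install nvidia-driver-515
# The driver can also be installed through the graphical "Additional Drivers" tool
After installation the same error still appears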
user@user:~$ nvidia-smi
NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.
Edit /etc/default/grub
# Edit /etc/default/grub
user@user:~$ sudo vim /etc/default/grub
# Add pci=realloc=off inside GRUB_CMDLINE_LINUX="" (keep any parameters that are already there)
GRUB_CMDLINE_LINUX="pci=realloc=off"
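The same edit can be made without opening an editor. This is a minimal sketch, not the method used in this article: add_grub_param is a hypothetical helper, it assumes GNU sed (as shipped with Ubuntu), and it assumes the file already has a double-quoted GRUB_CMDLINE_LINUX="..." line. It keeps existing parameters, writes a .bak copy, and does nothing if the parameter is already present.

```shell
# add_grub_param FILE PARAM   (hypothetical helper, for illustration only)
# Appends PARAM to GRUB_CMDLINE_LINUX in FILE unless it is already there.
add_grub_param() {
    local file param current
    file=$1
    param=$2

    # Bail out if the expected line is missing instead of silently doing nothing.
    if ! grep -q '^GRUB_CMDLINE_LINUX="' "$file"; then
        echo "no GRUB_CMDLINE_LINUX=\"...\" line in $file" >&2
        return 1
    fi

    # Current value between the double quotes (empty for GRUB_CMDLINE_LINUX="").
    current=$(sed -n 's/^GRUB_CMDLINE_LINUX="\(.*\)"$/\1/p' "$file")

    # Already present as a whole word: leave the file untouched.
    case " $current " in
        *" $param "*) return 0 ;;
    esac

    cp "$file" "$file.bak"   # backup before editing in place

    if [ -z "$current" ]; then
        sed -i "s|^GRUB_CMDLINE_LINUX=\"\"\$|GRUB_CMDLINE_LINUX=\"$param\"|" "$file"
    else
        sed -i "s|^GRUB_CMDLINE_LINUX=\"\(.*\)\"\$|GRUB_CMDLINE_LINUX=\"\1 $param\"|" "$file"
    fi
}
```

From a root shell this would be called as add_grub_param /etc/default/grub pci=realloc=off, followed by update-grub as in the next step.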
Update grub
user@user:~$ sudo update-grub
Reboot
sudo reboot
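After the reboot it may be worth confirming that the kernel really booted with the new parameter before running nvidia-smi. The sketch below reads /proc/cmdline, which holds the command line of the running kernel. The helper name has_kernel_param and its optional second argument are assumptions added for illustration.

```shell
# has_kernel_param PARAM [CMDLINE_FILE]   (hypothetical helper)
# Succeeds if PARAM appears as a whole word on the kernel command line.
# CMDLINE_FILE defaults to /proc/cmdline; the override exists only for testing.
has_kernel_param() {
    local param cmdline_file
    param=$1
    cmdline_file=${2:-/proc/cmdline}
    # Put each parameter on its own line, then look for an exact, literal match.
    tr -s '[:space:]' '\n' < "$cmdline_file" | grep -qxF -- "$param"
}
```

For example, has_kernel_param pci=realloc=off && echo "parameter is active" prints the message only when the parameter took effect. If it did not, the usual suspects are a forgotten update-grub or an edit made to the wrong line.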
The moment of truth
user@user:~$ nvidia-smi
Wed Nov 16 13:04:49 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 515.65.01    Driver Version: 515.65.01    CUDA Version: 11.7     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A100 80G...  Off  | 00000000:17:00.0 Off |                    0 |
| N/A   36C    P0    67W / 300W |      0MiB / 81920MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA A100 80G...  Off  | 00000000:CA:00.0 Off |                    0 |
| N/A   36C    P0    63W / 300W |      0MiB / 81920MiB |      2%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
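How the fix was found
The root of all evil / where it all began
# Key search phrase
Ubuntu Server 20.04 LTS for Dell EMC PowerEdge Servers Release Notes
# The setting that solves the problem
pci=realloc=off
Tracing the solution back to its sources
Searching for pci=realloc=off shows that it is a kernel boot parameter; it turns off the kernel's reallocation of PCI bridge resources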
https://blog.csdn.net/liuzq/article/details/89682079
Background on the kernel command line (/proc/cmdline)
https://www.kernel.org/doc/html/v4.14/admin-guide/kernel-parameters.html
How to add a kernel parameter
https://linux.cn/article-2268-1.html
# Edit /etc/default/grub
user@user:~$ sudo vim /etc/default/grub
# Add pci=realloc=off inside GRUB_CMDLINE_LINUX="" (keep any parameters that are already there)
GRUB_CMDLINE_LINUX="pci=realloc=off"
# Update grub
user@user:~$ sudo update-grub