Reference: https://github.com/NVIDIA/nccl
1. Install NCCL
(1) Clone the nccl repository
yum -y install git
git clone https://github.com/NVIDIA/nccl.git
(2) Build
cd nccl
There are three options:
i. Build with the defaults
make -j src.build
ii. Point the build at a different CUDA install (CUDA_HOME defaults to /usr/local/cuda)
make src.build CUDA_HOME=<path to cuda install>
iii. Build only for the GTX 1080 Ti architecture (compute capability 6.1), which keeps the binary smaller
make -j src.build NVCC_GENCODE="-gencode=arch=compute_61,code=sm_61"
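Option iii hard-codes 61 because the GTX 1080 Ti has compute capability 6.1. For another GPU the same flag can be derived from its capability number. A minimal sketch, assuming you already know the compute capability of your card; gencode_flag is a hypothetical helper name, not part of NCCL:

```shell
# gencode_flag is a hypothetical helper, not part of NCCL.
# It turns a compute capability such as "6.1" into the value
# expected by NVCC_GENCODE.
gencode_flag() {
    cap=$(printf '%s' "$1" | tr -d '.')
    printf '%s\n' "-gencode=arch=compute_${cap},code=sm_${cap}"
}

# Usage (run inside the nccl directory):
#   make -j src.build NVCC_GENCODE="$(gencode_flag 6.1)"
```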
(3) Install (CentOS): build the RPM packages, which are written to build/pkg/rpm/
sudo yum install rpm-build rpmdevtools
make pkg.redhat.build
ls build/pkg/rpm/
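2. NCCL tests
(1) Clone the nccl-tests repository and enter it
git clone https://github.com/NVIDIA/nccl-tests.git
cd nccl-tests
(2) Build
make NCCL_HOME=/root/nccl/build
General form: make CUDA_HOME=/path/to/cuda NCCL_HOME=/path/to/nccl
Before building, check that CUDA is usable (nvcc -V). If nvcc is not found, add the CUDA paths to ~/.bashrc and reload it: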
vim ~/.bashrc
export LD_LIBRARY_PATH="/usr/local/cuda-9.0/lib64:$LD_LIBRARY_PATH"
export PATH="/usr/local/cuda-9.0/bin:$PATH"
source ~/.bashrc
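The two export lines above prepend the CUDA directories again every time ~/.bashrc is sourced. When LD_LIBRARY_PATH starts out empty they also leave a trailing colon, which the dynamic loader treats as the current directory. A small sketch of a guard against both, using a hypothetical helper named path_prepend; the CUDA 9.0 paths are the ones from this setup:

```shell
# path_prepend DIR CURRENT prints CURRENT with DIR in front,
# unless DIR is already listed. No stray colon is added when
# CURRENT is empty. Hypothetical helper intended for ~/.bashrc.
path_prepend() {
    case ":$2:" in
        *":$1:"*) printf '%s\n' "$2" ;;
        *)        printf '%s\n' "$1${2:+:$2}" ;;
    esac
}

export PATH="$(path_prepend /usr/local/cuda-9.0/bin "${PATH:-}")"
export LD_LIBRARY_PATH="$(path_prepend /usr/local/cuda-9.0/lib64 "${LD_LIBRARY_PATH:-}")"
```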
(3) Try it out
./build/all_reduce_perf -b 8 -e 128M -f 2 -g 2
Error 1
./build/all_reduce_perf: error while loading shared libraries: libcudart.so.9.0: cannot open shared object file: No such file or directory
Fix (get nvcc working first, then copy the libraries)
[root@localhost nccl-tests]# sudo cp /usr/local/cuda-9.0/lib64/libcudart.so.9.0 /usr/local/lib/libcudart.so.9.0 && sudo ldconfig
[root@localhost nccl-tests]# sudo cp /usr/local/cuda-9.0/lib64/libcublas.so.9.0 /usr/local/lib/libcublas.so.9.0 && sudo ldconfig
[root@localhost nccl-tests]# sudo cp /usr/local/cuda-9.0/lib64/libcurand.so.9.0 /usr/local/lib/libcurand.so.9.0 && sudo ldconfig
Error 2
[root@localhost nccl-tests]# ./build/all_reduce_perf -b 8 -e 128M -f 2 -g 2
./build/all_reduce_perf: error while loading shared libraries: libnccl.so.2: cannot open shared object file: No such file or directory
Fix
[root@localhost nccl-tests]# sudo cp /root/nccl/build/lib/libnccl.so.2 /usr/lib/libnccl.so.2 && sudo ldconfig
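Both errors come from the dynamic loader failing to find a library, and running the test surfaces them one at a time. A minimal sketch that lists every unresolved library up front, assuming ldd and awk are available; missing_libs is a hypothetical helper name:

```shell
# Print the name of each shared library that the loader cannot
# resolve for the given binary (hypothetical helper).
missing_libs() {
    ldd "$1" 2>/dev/null | awk '/not found/ { print $1 }'
}

# Usage (from the nccl-tests directory):
#   missing_libs ./build/all_reduce_perf
#
# An alternative to copying files into system directories is to add
# the library directory to LD_LIBRARY_PATH, for example:
#   export LD_LIBRARY_PATH="/root/nccl/build/lib:$LD_LIBRARY_PATH"
```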
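3. Problems when running the training code
Commands used (rank 0 on the master node, rank 1 on the worker node):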
python main.py data -a resnet50 --dist-url 'tcp://10.141.221.203:203' --dist-backend 'nccl' --multiprocessing-distributed --world-size 2 --rank 0 --b 4
python main.py data -a resnet50 --dist-url 'tcp://10.141.221.203:203' --dist-backend 'nccl' --multiprocessing-distributed --world-size 2 --rank 1 --b 4
(1) Turn off the firewall
systemctl status firewalld.service
systemctl stop firewalld.service
(2) Set the environment variable
export NCCL_SOCKET_IFNAME=em1
NIC configuration file: /etc/sysconfig/network-scripts/ifcfg-em1
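NCCL_SOCKET_IFNAME should name a network interface that the other node can reach (em1 here). A sketch for looking that name up from an IP address, assuming the iproute2 ip tool is installed; iface_for_ip is a hypothetical helper that reads the output of "ip -o -4 addr show" on standard input:

```shell
# Read `ip -o -4 addr show` output on stdin and print the name of
# the interface that owns the given IPv4 address (hypothetical helper).
iface_for_ip() {
    awk -v ip="$1" 'index($4, ip "/") == 1 { print $2 }'
}

# Usage on the master, whose address appears in --dist-url
# (on the worker, pass the worker's own address instead):
#   export NCCL_SOCKET_IFNAME="$(ip -o -4 addr show | iface_for_ip 10.141.221.203)"
```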
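With --world-size changed to 1, the job runs fine on the master node, which suggests the problem is in the worker node's connection to the master.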
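One way to narrow this down is to check, from the worker, whether the master's rendezvous port is reachable at all. A rough sketch, assuming bash (for its /dev/tcp redirection) and the coreutils timeout command; port_open is a hypothetical helper name. Note also that port 203 is below 1024, so binding it on the master normally requires root.

```shell
# Succeeds if a TCP connection to HOST PORT can be opened within
# 3 seconds (hypothetical helper; relies on bash's /dev/tcp).
port_open() {
    timeout 3 bash -c 'exec 3<>"/dev/tcp/$1/$2"' _ "$1" "$2" 2>/dev/null
}

# Usage on the worker, while the rank 0 process is running on the master:
#   port_open 10.141.221.203 203 && echo reachable || echo unreachable
```

A failure here points at the network path or firewall rather than at NCCL or the training script.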