Environment
Linux
GPU Tesla K80
Steps
0. Download DeepBench
Download the DeepBench package from the official repository: https://github.com/baidu-research/DeepBench
Using git:
git clone https://github.com/baidu-research/DeepBench
1. Build
-
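Environment setup
The NVIDIA benchmarks require CUDA, cuDNN, MPI, and NCCL.
The first three can be loaded directly with module; these notes use CUDA 8.0, cuDNN 5.1, and openmpi 1.10.2. NCCL is taken from a self-installed path.
Most of the problems that show up later come down to the versions of these libraries.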
export MODULEPATH=/BIGDATA/app/modulefiles_GPU/:/BIGDATA/app/modulefiles
module load CUDA/8.0
module load cudnn/5.1-CUDA8.0
module load openmpi/1.10.2-gcc4.9.2
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/HOME/user_name/nccl/path/lib
From the DeepBench directory, go into the NVIDIA directory:
cd code/nvidia
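Since most later failures trace back to library versions and paths, it can help to confirm that each library is actually reachable through LD_LIBRARY_PATH before building. The following is a minimal sketch, not part of DeepBench: find_in_ld_path is a hypothetical helper, and the library file names in the loop (libcudart.so, libcudnn.so, libmpi.so, libnccl.so) are the usual names for these packages and are an assumption about this installation.

```shell
# find_in_ld_path NAME: print the first LD_LIBRARY_PATH directory that holds
# a file whose name starts with NAME; return 1 if no directory has one.
# (Hypothetical helper for illustration only.)
find_in_ld_path() {
    name="$1"
    old_ifs="$IFS"
    IFS=':'
    for dir in ${LD_LIBRARY_PATH:-}; do
        [ -n "$dir" ] || continue          # skip empty path entries
        for f in "$dir"/"$name"*; do
            if [ -e "$f" ]; then
                IFS="$old_ifs"
                printf '%s\n' "$dir"
                return 0
            fi
        done
    done
    IFS="$old_ifs"
    return 1
}

# Report where each library is found (library names are assumed).
for lib in libcudart.so libcudnn.so libmpi.so libnccl.so; do
    if where="$(find_in_ld_path "$lib")"; then
        echo "$lib: found in $where"
    else
        echo "$lib: NOT found on LD_LIBRARY_PATH"
    fi
done
```

A "NOT found" line for a library that a benchmark links against is a hint to recheck the module load and export lines above.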
-
build
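Use the build method given in the official README. The build itself does not seem to need yhrun. An ARCH setting has to be appended to the make command; sm_70 is left out of the list because CUDA 8.0 does not support it.
yhrun -n 1 make CUDA_PATH=/BIGDATA/app/CUDA/8.0 CUDNN_PATH=/BIGDATA/app/cuDNN/5.1-CUDA8.0 MPI_PATH=/BIGDATA/app/openmpi/1.10.2-gcc4.9.2 NCCL_PATH=/HOME/user_name/nccl ARCH=sm_30,sm_32,sm_35,sm_50,sm_52,sm_60,sm_61,sm_62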
Alternatively, set these values by editing the Makefile.
The benchmarks can also be built separately, for example conv:
make conv
# In full:
yhrun -n 1 make CUDA_PATH=/BIGDATA/app/CUDA/8.0 CUDNN_PATH=/BIGDATA/app/cuDNN/5.1-CUDA8.0 MPI_PATH=/BIGDATA/app/openmpi/1.10.2-gcc4.9.2 NCCL_PATH=/HOME/user_name/nccl ARCH=sm_30,sm_32,sm_35,sm_50,sm_52,sm_60,sm_61,sm_62 conv
The build succeeds:
mkdir -p bin
/BIGDATA/app/CUDA/8.0/bin/nvcc conv_bench.cu -DPAD_KERNELS=1 -o bin/conv_bench -I ../kernels/ -I /BIGDATA/app/CUDA/8.0/include -I /BIGDATA/app/cuDNN/5.1-CUDA8.0/include/ -L /BIGDATA/app/cuDNN/5.1-CUDA8.0/lib64/ -L /BIGDATA/app/CUDA/8.0/lib64 -lcurand -lcudnn --generate-code arch=compute_30,code=sm_30 --generate-code arch=compute_32,code=sm_32 --generate-code arch=compute_35,code=sm_35 --generate-code arch=compute_50,code=sm_50 --generate-code arch=compute_52,code=sm_52 --generate-code arch=compute_60,code=sm_60 --generate-code arch=compute_61,code=sm_61 --generate-code arch=compute_62,code=sm_62 -std=c++11
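The nvcc line in the build log shows how each entry of ARCH turns into one --generate-code flag. The sketch below reproduces that expansion in plain shell to make the mapping explicit. arch_to_gencode is a hypothetical helper written for these notes; the real Makefile may implement the expansion differently.

```shell
# arch_to_gencode LIST: expand a comma-separated list such as "sm_30,sm_35"
# into the matching nvcc --generate-code flags, one per architecture.
# (Hypothetical helper; it mirrors the flags seen in the build log.)
arch_to_gencode() {
    flags=""
    old_ifs="$IFS"
    IFS=','
    for sm in $1; do
        num="${sm#sm_}"                    # "sm_35" -> "35"
        flags="${flags}--generate-code arch=compute_${num},code=sm_${num} "
    done
    IFS="$old_ifs"
    printf '%s\n' "${flags% }"             # drop the trailing space
}

arch_to_gencode "sm_35,sm_52"
```

Every architecture in the list has to be known to the installed nvcc, so the ARCH value and the CUDA version need to be chosen together.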
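Before running, set LD_LIBRARY_PATH. Each entry has to be the directory that actually contains the shared libraries (the lib or lib64 subdirectory), not the installation root:
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/BIGDATA/app/CUDA/8.0/lib64:/BIGDATA/app/cuDNN/5.1-CUDA8.0/lib64:/BIGDATA/app/openmpi/1.10.2-gcc4.9.2/lib:/HOME/user_name/nccl/path/lib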
2. Run the benchmarks
-
gemm benchmark
In the nvidia directory:
yhrun -n 1 ./bin/gemm_bench
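With CUDA 8.0 and cuDNN 5.1 the run fails with the error below. CUDA is preinstalled on Tianhe, and I do not know how to change that installation.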
terminate called after throwing an instance of 'std::runtime_error'
what(): sgemm failed
1760 16 1760 0 0 half
yhrun: error: gn26: task 0: Aborted (core dumped)
With CUDA 7.0 and cuDNN 4.0 it runs normally.
Partial results:
### CUDA7.0 cudnn4.0 openmpi1.10.2 nccl1 ###
Running training benchmark
Times
----------------------------------------------------------------------------------------
m n k a_t b_t precision time (usec)
1760 16 1760 0 0 float 340 .
...
(rest omitted)
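Raw GEMM timings are easier to compare across sizes when converted to throughput. A GEMM with dimensions m, n, k performs about 2*m*n*k floating-point operations, so TFLOPS = 2*m*n*k / (time in usec * 1e6). The sketch below applies that formula to rows in the format printed above. gemm_tflops is a hypothetical helper, and it assumes the seven-column row layout m n k a_t b_t precision time.

```shell
# gemm_tflops: read rows "m n k a_t b_t precision time_usec" on stdin and
# print "m n k precision tflops", using 2*m*n*k operations per GEMM.
# Header and separator lines are skipped because their first field is
# not a number. (Hypothetical helper for illustration only.)
gemm_tflops() {
    awk '$1 ~ /^[0-9]+$/ && $7 ~ /^[0-9]+$/ && $7 > 0 {
        tflops = 2 * $1 * $2 * $3 / ($7 * 1e6)
        printf "%s %s %s %s %.3f\n", $1, $2, $3, $6, tflops
    }'
}

# The first row of the results above:
printf '1760 16 1760 0 0 float 340\n' | gemm_tflops
```

For that row the result is roughly 0.29 TFLOPS. Saved benchmark output can be piped through the same function, for example `gemm_tflops < gemm.log`, where gemm.log is a placeholder file name.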
-
conv benchmark
In the nvidia directory:
yhrun -n 1 ./bin/conv_bench
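CUDA 8.0 + cuDNN 6.0: compiles but does not run.
CUDA 7.0 + cuDNN 4.0: does not compile; the compiler reports many missing items, possibly because these versions are too old.
CUDA 8.0 + cuDNN 5.1: fails partway through. At the 11th test case a runtime_error aborts the run: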
Illegal algorithm passed to get_fwd_algo_string. Algo: 7
In conv_bench.cu, at the end of the std::string get_fwd_algo_string() function, change
else {
std::stringstream ss;
ss << "Illegal algorithm passed to get_fwd_algo_string. Algo: " << fwd_algo_ << std::endl;
throw std::runtime_error(ss.str());
}
to
else {
return "#unknown";
}
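After recompiling and rerunning, the benchmark gets past the problematic case. The 11th row shows #unknown as its algorithm, and so do many of the later rows.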
### CUDA8.0 cudnn5.1 openmpi1.10.2 nccl1 ###
Running training benchmark
Times
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
w h c n k f_w f_h pad_w pad_h stride_w stride_h precision fwd_time (usec) bwd_inputs_time (usec) bwd_params_time (usec) total_time (usec) fwd_algo
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
700 161 1 4 32 20 5 0 0 2 2 float 929 1136 1074 3139 IMPLICIT_GEMM
700 161 1 8 32 20 5 0 0 2 2 float 1587 2168 1928 5683 IMPLICIT_GEMM
700 161 1 16 32 20 5 0 0 2 2 float 2813 4337 3508 10658 IMPLICIT_PRECOMP_GEMM
700 161 1 32 32 20 5 0 0 2 2 float 6368 8659 6899 21926 IMPLICIT_GEMM
341 79 32 4 32 10 5 0 0 2 2 float 2174 4076 2506 8756 IMPLICIT_PRECOMP_GEMM
341 79 32 8 32 10 5 0 0 2 2 float 4211 8128 5007 17346 IMPLICIT_PRECOMP_GEMM
341 79 32 16 32 10 5 0 0 2 2 float 8459 16200 9985 34644 IMPLICIT_PRECOMP_GEMM
341 79 32 32 32 10 5 0 0 2 2 float 16903 32380 20188 69471 IMPLICIT_PRECOMP_GEMM
480 48 1 16 16 3 3 1 1 1 1 float 752 1014 1515 3281 IMPLICIT_GEMM
240 24 16 16 32 3 3 1 1 1 1 float 863 1332 1258 3453 IMPLICIT_GEMM
120 12 32 16 64 3 3 1 1 1 1 float 613 652 1005 2270 #unknown
...
(rest omitted)
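With the patched get_fwd_algo_string, it is useful to know how many cases ended up as #unknown compared with the named algorithms. The sketch below tallies the last column of the conv output. count_fwd_algos is a hypothetical helper, and it assumes the 17-column row layout shown in the table above.

```shell
# count_fwd_algos: read conv_bench output on stdin and print one line per
# forward algorithm in the form "<count> <algorithm>".
# Only data rows are counted: first field numeric and at least 17 fields.
# (Hypothetical helper for illustration only.)
count_fwd_algos() {
    awk '$1 ~ /^[0-9]+$/ && NF >= 17 { count[$NF]++ }
         END { for (a in count) print count[a], a }' | LC_ALL=C sort -k2
}

# Three rows taken from the results above:
printf '%s\n' \
    '700 161 1 4 32 20 5 0 0 2 2 float 929 1136 1074 3139 IMPLICIT_GEMM' \
    '480 48 1 16 16 3 3 1 1 1 1 float 752 1014 1515 3281 IMPLICIT_GEMM' \
    '120 12 32 16 64 3 3 1 1 1 1 float 613 652 1005 2270 #unknown' |
    count_fwd_algos
```

A large #unknown count suggests the installed cuDNN returns algorithm IDs that this version of conv_bench.cu does not have names for.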
-
rnn benchmark
In the nvidia directory:
yhrun -n 1 ./bin/rnn_bench
With CUDA 8.0 and cuDNN 5.1 it runs normally.
### CUDA8.0 cudnn5.1 openmpi1.10.2 nccl1 ###
Running training benchmark
Times
----------------------------------------------------------------------------------------
type hidden N timesteps precision fwd_time (usec) bwd_time (usec)
vanilla 1760 16 50 float 19590 17450
vanilla 1760 32 50 float 18289 18044
...
lstm 512 16 25 float 3888 5551
lstm 512 32 25 float 3922 5603
...
gru 2816 32 1500 float 2638524 2475404
gru 2816 32 750 float 1319982 1240556
...
(rest omitted)
-
all reduce benchmark
nccl_single_all_reduce
In the nvidia directory:
yhrun -n 1 ./bin/nccl_single_all_reduce 2
Runs normally:
NCCL AllReduce
Num Ranks: 2
---------------------------------------------------------------------------
# of floats bytes transferred Time (msec)
---------------------------------------------------------------------------
100000 400000 0.109
3097600 12390400 1.344
...
(rest omitted)
nccl_mpi_all_reduce
In the nvidia directory:
yhrun -n 2 -N 2 mpirun -np 2 ./bin/nccl_mpi_all_reduce
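It runs but produces no results. The message below complains about a file that cannot be opened, although the component file it names does exist in that directory, and I do not know why it is reported. Read closely, the file that cannot be found is libscif.so.0, a library the mca_btl_scif component depends on. The message also ends with "(ignored)", so this warning on its own is probably not the reason for the missing results.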
mca: base: component_find: unable to open /BIGDATA/app/openmpi/1.10.2-gcc4.9.2/lib/openmpi/mca_btl_scif: libscif.so.0: cannot open shared object file: No such file or directory (ignored)
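3. Run the tests with yhbatch
Because the tests take a long time and the VPN connection keeps dropping, the jobs can be submitted with yhbatch instead.
Create a file test.sh with the following contents: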
#! /bin/bash
yhrun -n xx xxx_bench   # replace with the actual yhrun command
Then submit it with yhbatch:
yhbatch -n 1 ./test.sh
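This submits the job.
When the job finishes there will be a slurm-<jobid>.out file, which contains everything that would otherwise have been printed to the console.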
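When several benchmarks need to run unattended, the batch script can be generated rather than typed by hand. The sketch below writes a script with one yhrun line per benchmark binary. write_job_script is a hypothetical helper, and the fixed "-n 1" and "./bin/" prefix are assumptions taken from the single-task commands used earlier in these notes; the MPI all-reduce benchmark would need a different launch line.

```shell
# write_job_script FILE BENCH...: write a batch script to FILE that runs
# each named benchmark binary under yhrun, one after another, and make
# the script executable. (Hypothetical helper for illustration only.)
write_job_script() {
    script="$1"
    shift
    {
        echo '#!/bin/bash'
        for bench in "$@"; do
            echo "yhrun -n 1 ./bin/${bench}"
        done
    } > "$script"
    chmod +x "$script"
}

# Example (run from the nvidia directory):
#   write_job_script test.sh gemm_bench conv_bench rnn_bench
#   yhbatch -n 1 ./test.sh
```

The benchmarks then run in sequence inside one job, and all of their output lands in the same job output file.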