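Pitfalls I hit while installing NVIDIA/apex
1. First, check that the system CUDA version matches the CUDA version PyTorch was built with. If the two differ, the apex installation will fail.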
(sent) [root@117d5c68ae2a finetuning_and_classification]# nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2018 NVIDIA Corporation
Built on Sat_Aug_25_21:08:01_CDT_2018
Cuda compilation tools, release 10.0, V10.0.130
(sent) [root@117d5c68ae2a finetuning_and_classification]# python
Python 3.6.5 |Anaconda, Inc.| (default, Apr 29 2018, 16:14:56)
[GCC 7.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> print(torch.version.cuda)
10.0.130
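The manual comparison above can be scripted. The following is a minimal sketch, not part of apex: the helper names `parse_nvcc_release` and `same_major_minor` are made up for illustration, and it assumes that matching on major.minor (for example 10.0 against 10.0.130) is the comparison that matters.

```python
import re
import shutil
import subprocess


def parse_nvcc_release(nvcc_output):
    """Return the 'major.minor' release from `nvcc --version` output, or None."""
    match = re.search(r"release (\d+)\.(\d+)", nvcc_output)
    return f"{match.group(1)}.{match.group(2)}" if match else None


def same_major_minor(version_a, version_b):
    """Compare two version strings on their first two components only."""
    return version_a.split(".")[:2] == version_b.split(".")[:2]


if __name__ == "__main__":
    if shutil.which("nvcc") is None:
        print("nvcc not found on PATH")
    else:
        result = subprocess.run(
            ["nvcc", "--version"],
            stdout=subprocess.PIPE,
            universal_newlines=True,
        )
        system_cuda = parse_nvcc_release(result.stdout)
        try:
            import torch
            torch_cuda = torch.version.cuda  # None on CPU-only builds
        except ImportError:
            torch_cuda = None
        print("system CUDA :", system_cuda)
        print("PyTorch CUDA:", torch_cuda)
        if system_cuda and torch_cuda:
            print("match:", same_major_minor(system_cuda, torch_cuda))
```

If the last line prints `match: False`, fix the mismatch before trying to build apex.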
2. Segmentation fault (core dumped) at runtime
Note: gcc 5.x works best (the version must be below 6.0). Installing it through conda is convenient:
conda install -c psi4 gcc-5
Reference: https://github.com/NVIDIA/apex/issues/35
I successfully install the apex but face the segmentation fault when training.
So I check the installation again. I notice that I ignore the warning when installing apex. My bad.
Remember to update the gcc 4.x to gcc 5.x , which may also lead to segmentation fault
In a nutshell, the solution is
conda install -c psi4 gcc-5
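Before building apex, it can help to confirm which gcc is actually first on PATH. This is a rough sketch under the assumption that `gcc -dumpversion` prints a version such as `5.4.0` (newer compilers may print only the major number); `gcc_major` is a hypothetical helper name.

```python
import re
import shutil
import subprocess


def gcc_major(version_text):
    """Extract the major version from text such as '5.4.0' or '7'; None if absent."""
    match = re.search(r"(\d+)(?:\.\d+)*", version_text)
    return int(match.group(1)) if match else None


if __name__ == "__main__":
    if shutil.which("gcc") is None:
        print("gcc not found on PATH")
    else:
        result = subprocess.run(
            ["gcc", "-dumpversion"],
            stdout=subprocess.PIPE,
            universal_newlines=True,
        )
        major = gcc_major(result.stdout)
        print("gcc major version:", major)
        # The note above recommends gcc 5.x, i.e. a major version of exactly 5.
        print("is gcc 5.x:", major == 5)
```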
3. If you hit errors related to FusedLayerNorm, the likely cause is that apex was installed without its CUDA extensions.
Reference: https://github.com/NVIDIA/apex/issues/214
Also, before reinstalling Apex, you need to make sure any old conflicting installs are removed, and if you installed using the direct setup.py command, you also need to make sure stale apex/build and apex.egg-info are removed. Try
$ pip uninstall apex
$ pip uninstall apex (repeat until you're sure it's uninstalled...)
$ cd apex
$ rm -rf build
$ rm -rf apex.egg-info
$ pip install -v --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" .
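Or: $ python setup.py install --cuda_ext --cpp_ext
If the build fails with error: command 'gcc' failed with exit status 1, switch to this older apex commit:
git checkout f3a960f80244cf9e80558ab30f7f7e8cbf03c0a0
The plain checkout failed for me, so I forced it (note that -f discards uncommitted local changes):
git checkout -f f3a960f80244cf9e80558ab30f7f7e8cbf03c0a0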
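To see whether the compiled extensions from item 3 ended up in the environment, one option is to ask the import system whether it can locate them. This is only a sketch: the module names in `ASSUMED_EXTENSIONS` are my assumption about what apex builds with `--cpp_ext` and `--cuda_ext` and may differ between apex versions, and `missing_modules` is a hypothetical helper.

```python
import importlib.util

# Assumed names of apex's compiled extension modules (not confirmed by the
# notes above); adjust them to whatever your apex version actually builds.
ASSUMED_EXTENSIONS = ["apex_C", "amp_C", "fused_layer_norm_cuda"]


def missing_modules(names):
    """Return the top-level module names the import system cannot locate."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules(ASSUMED_EXTENSIONS)
    if missing:
        print("not found:", ", ".join(missing))
    else:
        print("all assumed extension modules were located")
```

Note that `find_spec` only locates a module without loading it, so a module that is found can still fail at import time, for example because of the GLIBCXX problem in item 4.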
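4. The "version `GLIBCXX_3.4.20' not found" error
Reference: https://blog.csdn.net/zhangyingna667/article/details/107290495?utm_medium=distribute.pc_relevant_t0.none-task-blog-BlogCommendFromBaidu-1.control&depth_1-utm_source=distribute.pc_relevant_t0.none-task-blog-BlogCommendFromBaidu-1.control
Cause: when gcc was upgraded, the shared library it produced (libstdc++.so.6, which provides the GLIBCXX symbols) did not replace the one belonging to the old gcc, so the old library is still the one being loaded.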
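To find out which GLIBCXX versions a given libstdc++.so.6 provides, you can scan the file for its version tags, much like running `strings` on it and filtering for GLIBCXX. A minimal sketch follows; the default path is an assumption and varies by system, and `glibcxx_versions` is a hypothetical helper name.

```python
import re
import sys


def glibcxx_versions(data):
    """Return the sorted, de-duplicated GLIBCXX_x.y.z tags found in raw bytes."""
    found = set(re.findall(rb"GLIBCXX_(\d+(?:\.\d+)*)", data))
    # Sort numerically so that 3.4.9 comes before 3.4.20.
    ordered = sorted(found, key=lambda v: tuple(int(p) for p in v.split(b".")))
    return ["GLIBCXX_" + v.decode("ascii") for v in ordered]


if __name__ == "__main__":
    # Assumed default location; pass the real path as the first argument.
    path = sys.argv[1] if len(sys.argv) > 1 else "/usr/lib64/libstdc++.so.6"
    try:
        with open(path, "rb") as handle:
            tags = glibcxx_versions(handle.read())
    except OSError as exc:
        print(f"cannot read {path}: {exc}")
    else:
        print("\n".join(tags))
        print("GLIBCXX_3.4.20 present:", "GLIBCXX_3.4.20" in tags)
```

Running it against both the system library and the copy shipped with the newer gcc shows which of the two contains GLIBCXX_3.4.20.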
(sent) [root@117d5c68ae2a finetuning_and_classification]# python
Python 3.6.5 |Anaconda, Inc.| (default, Apr 29 2018, 16:14:56)
[GCC 7.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> import apex
>>> input = torch.rand(3, 10).cuda()
>>> fln = apex.normalization.FusedLayerNorm(10).cuda()
>>> fln(input)
If the session above runs without errors, apex is ready to use.
References:
1. https://juejin.cn/post/6844903817499115534
2. https://zhuanlan.zhihu.com/p/140347418?utm_source=wechat_session
3. http://ws.nju.edu.cn/blog/2019/10/%e5%9c%a8conda%e5%ae%89%e8%a3%85%e7%9a%84cuda%e7%8e%af%e5%a2%83%e4%b8%ad%e5%ae%89%e8%a3%85apex/