1. Problem description
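I had previously installed CUDA 10.1 and cuDNN using the script from the TensorFlow website, and tensorflow imported fine in Python, so I assumed everything was working. Today, when I started training on some data, the following errors appeared: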
2020-05-10 20:51:10.929736: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libcuda.so.1'; dlerror: libcuda.so.1: cannot open shared object file: No such file or directory
2020-05-10 20:51:10.930780: E tensorflow/stream_executor/cuda/cuda_driver.cc:313] failed call to cuInit: UNKNOWN ERROR (303)
2020-05-10 20:51:10.931057: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:169] retrieving CUDA diagnostic information for host: G5-5587
2020-05-10 20:51:10.931156: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:176] hostname: G5-5587
2020-05-10 20:51:10.932180: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:200] libcuda reported version is: Not found: was unable to find libcuda.so DSO loaded into this program
2020-05-10 20:51:10.934319: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:204] kernel reported version is: 418.87.1
2020-05-10 20:51:10.942166: I tensorflow/core/platform/cpu_feature_guard.cc:143] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2020-05-10 20:51:11.080120: I tensorflow/core/platform/profile_utils/cpu_utils.cc:102] CPU Frequency: 2208000000 Hz
2020-05-10 20:51:11.084013: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x18214d5a0 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2020-05-10 20:51:11.084074: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (0): Host, Default Version
Epoch 1/100
2020-05-10 20:51:39.497298: W tensorflow/core/framework/op_kernel.cc:1730] OP_REQUIRES failed at cast_op.cc:123 : Unimplemented: Cast string to float is not supported
The key message is that 'libcuda.so.1' cannot be opened.
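The same check can be done outside TensorFlow. The sketch below is an illustration, not part of the original troubleshooting: it asks the dynamic loader to open a library by name through Python's ctypes module. The helper name can_load is hypothetical; the library name libcuda.so.1 is taken from the log above.

```python
import ctypes


def can_load(library_name):
    """Return True if the dynamic loader can open `library_name`."""
    try:
        ctypes.CDLL(library_name)
    except OSError:
        # Raised when the shared object cannot be found or opened.
        return False
    return True


if __name__ == "__main__":
    # libcuda.so.1 is the library TensorFlow failed to load in the log above.
    print("libcuda.so.1 loadable:", can_load("libcuda.so.1"))
```

If this prints False, the problem lies with the driver's user-space library rather than with TensorFlow itself.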
2. Troubleshooting
First, check the GPU information:
nvidia-smi  # command entered
Output:
Sun May 10 20:41:16 2020
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.87.01    Driver Version: 418.87.01    CUDA Version: N/A      |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 106...  Off  | 00000000:01:00.0 Off |                  N/A |
| N/A   57C    P8     6W /  N/A |    224MiB /  6078MiB |     17%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0      1183      G   /usr/lib/xorg/Xorg                           132MiB |
|    0      2058      G   compiz                                        78MiB |
|    0      2374      G   fcitx-qimpanel                                 6MiB |
|    0      2889      G   /usr/lib/firefox/firefox                       1MiB |
|    0      3462      G   /usr/lib/firefox/firefox                       1MiB |
+-----------------------------------------------------------------------------+
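The header of this output carries the two fields that matter here: the driver version (418.87.01) and the CUDA version (N/A). As a small illustration, the sketch below pulls both out of nvidia-smi text with a regular expression. The function name parse_smi_header is hypothetical, and the pattern assumes the header layout shown above, which may differ in other driver releases.

```python
import re

# Matches the header line of the nvidia-smi output shown above.
HEADER_RE = re.compile(
    r"Driver Version:\s*(?P<driver>\S+)\s+CUDA Version:\s*(?P<cuda>\S+)"
)


def parse_smi_header(text):
    """Return (driver_version, cuda_version) or None if no header is found."""
    match = HEADER_RE.search(text)
    if match is None:
        return None
    return match.group("driver"), match.group("cuda")


if __name__ == "__main__":
    sample = "| NVIDIA-SMI 418.87.01 Driver Version: 418.87.01 CUDA Version: N/A |"
    print(parse_smi_header(sample))
```

A CUDA version of N/A is at least consistent with the earlier log, where libcuda could not be found, though that reading is my assumption and not something the output states.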
Next, run:
nvcc --version
Output:
The program 'nvcc' is currently not installed. You can install it by typing:
sudo apt install nvidia-cuda-toolkit
The message suggests installing nvidia-cuda-toolkit, so I did:
sudo apt install nvidia-cuda-toolkit
After the installation succeeded, I reran the code that had failed and got this output:
2020-05-10 21:07:53.675736: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcuda.so.1
2020-05-10 21:07:53.686611: E tensorflow/stream_executor/cuda/cuda_driver.cc:313] failed call to cuInit: CUDA_ERROR_COMPAT_NOT_SUPPORTED_ON_DEVICE: forward compatibility was attempted on non supported HW
2020-05-10 21:07:53.687012: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:169] retrieving CUDA diagnostic information for host: G5-5587
2020-05-10 21:07:53.687077: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:176] hostname: G5-5587
2020-05-10 21:07:53.688438: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:200] libcuda reported version is: 440.64.0
2020-05-10 21:07:53.689387: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:204] kernel reported version is: 418.87.1
2020-05-10 21:07:53.689492: E tensorflow/stream_executor/cuda/cuda_diagnostics.cc:313] kernel version 418.87.1 does not match DSO version 440.64.0 -- cannot find working devices in this configuration
2020-05-10 21:07:53.696530: I tensorflow/core/platform/cpu_feature_guard.cc:143] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2020-05-10 21:07:53.863791: I tensorflow/core/platform/profile_utils/cpu_utils.cc:102] CPU Frequency: 2208000000 Hz
2020-05-10 21:07:53.866341: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x18334c140 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2020-05-10 21:07:53.866364: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (0): Host, Default Version
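libcuda.so.1 now loads, but cuInit still fails. Note the version mismatch message: the libcuda library that was just installed reports 440.64.0, while the loaded kernel driver is still 418.87.1.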
kernel version 418.87.1 does not match DSO version 440.64.0
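This comparison can be automated when scanning long logs. The sketch below is an illustration under the assumption that the log uses the exact "kernel reported version is:" and "libcuda reported version is:" wording seen above; the function names are hypothetical. It extracts both version strings and reports whether they agree.

```python
import re

# Only numeric versions are captured, so "Not found: ..." yields no match.
KERNEL_RE = re.compile(r"kernel reported version is: (\d+(?:\.\d+)+)")
DSO_RE = re.compile(r"libcuda reported version is: (\d+(?:\.\d+)+)")


def driver_versions(log_text):
    """Return (kernel_version, libcuda_version); either may be None."""
    kernel = KERNEL_RE.search(log_text)
    dso = DSO_RE.search(log_text)
    return (
        kernel.group(1) if kernel else None,
        dso.group(1) if dso else None,
    )


def versions_match(log_text):
    """True only when both versions were found and are identical."""
    kernel, dso = driver_versions(log_text)
    return kernel is not None and kernel == dso


if __name__ == "__main__":
    sample = (
        "cuda_diagnostics.cc:200] libcuda reported version is: 440.64.0\n"
        "cuda_diagnostics.cc:204] kernel reported version is: 418.87.1\n"
    )
    print(driver_versions(sample), versions_match(sample))
```

On the first log in this post, where libcuda was not found at all, the libcuda version comes back as None and versions_match returns False.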
So the fix is to uninstall nvidia-cuda-toolkit and then reinstall a specific version.
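Along the way I also learned how to deal with the nvcc message "The program 'nvcc' is currently not installed. You can install it by typing:". If CUDA is already installed under /usr/local/cuda, adding the following to ~/.bashrc is enough: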
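# cuda 10.1
# 64-bit CUDA installs keep their libraries in lib64; keep any existing entries.
export LD_LIBRARY_PATH=/usr/local/cuda/lib64${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}
# Make nvcc and the other CUDA tools visible on PATH.
export PATH=$PATH:/usr/local/cuda/bin
# cuda 10.1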
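To confirm that the PATH change took effect after opening a new shell (or running source ~/.bashrc), a quick check is to look up nvcc the same way the shell would. The sketch below is illustrative: find_nvcc is a hypothetical helper built on the standard library's shutil.which, and /usr/local/cuda/bin is the directory from the lines above.

```python
import shutil


def find_nvcc(search_path=None):
    """Return the full path to nvcc, or None if it is not found.

    With search_path=None the current PATH is searched; otherwise only
    the given directory string (PATH syntax) is searched.
    """
    return shutil.which("nvcc", path=search_path)


if __name__ == "__main__":
    print("nvcc on PATH:", find_nvcc())
    # Directory taken from the ~/.bashrc lines above.
    print("nvcc in /usr/local/cuda/bin:", find_nvcc("/usr/local/cuda/bin"))
```

If the first line prints None but the second prints a path, the toolkit is installed and only the PATH entry is missing from the current shell.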