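When running AI programs in Jupyter, the notebook keeps popping up a notice along the lines of "the server appears to have died, but it will restart immediately", as in the screenshot:
[Screenshot: Jupyter notice that the kernel died and will restart]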
Running the same code as a .py file usually fails as well:
(py37) twsm@twsm-PR4904P:~/project/paper$ python train_merge_kashgari.py
2020-05-20 11:48:56.445877: F tensorflow/stream_executor/lib/statusor.cc:34] Attempting to fetch value instead of handling error Internal: failed initializing StreamExecutor for CUDA device ordinal 0: Internal: failed call to cuDevicePrimaryCtxRetain: CUDA_ERROR_OUT_OF_MEMORY: out of memory; total memory reported: 16914055168
Aborted (core dumped)
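In most cases the cause is GPU memory that has not been released. The nvidia-smi command shows GPU usage and the IDs of the processes occupying it:
[Screenshot: nvidia-smi output listing GPU usage and process IDs]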
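The same check can be scripted. The following is a minimal sketch, assuming nvidia-smi is on the PATH and supports the --query-compute-apps query; the helper names parse_compute_apps and gpu_processes are made up for illustration and are not part of any library.

```python
import subprocess

# nvidia-smi query that prints one CSV row per compute process.
QUERY = [
    "nvidia-smi",
    "--query-compute-apps=pid,process_name,used_memory",
    "--format=csv,noheader,nounits",
]


def parse_compute_apps(csv_text):
    """Parse the CSV rows into (pid, process_name, used_memory_mib) tuples."""
    apps = []
    for line in csv_text.splitlines():
        if not line.strip():
            continue
        # Split from both ends so a comma inside the process name is tolerated.
        pid_field, rest = line.split(",", 1)
        name, mem_field = rest.rsplit(",", 1)
        mem_field = mem_field.strip()
        # The memory column may be non-numeric (e.g. "[N/A]"); report None then.
        mem = int(mem_field) if mem_field.isdigit() else None
        apps.append((int(pid_field), name.strip(), mem))
    return apps


def gpu_processes():
    """Run nvidia-smi and return the parsed list of compute processes."""
    result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return parse_compute_apps(result.stdout)
```

On a machine with an NVIDIA driver, calling gpu_processes() returns one tuple per process that holds GPU memory; the PID in the first field is the one to look up with ps.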
Alternatively, fuser lists the processes that have the NVIDIA device files open:
(py37) twsm@twsm-PR4904P:~/project/paper$ fuser -v /dev/nvidia*
USER PID ACCESS COMMAND
/dev/nvidia0: twsm 9087 F...m ZMQbg/1
/dev/nvidia1: twsm 9087 F...m ZMQbg/1
/dev/nvidia2: twsm 9087 F...m ZMQbg/1
/dev/nvidia3: twsm 9087 F...m ZMQbg/1
/dev/nvidiactl: twsm 9087 F...m ZMQbg/1
/dev/nvidia-uvm: twsm 9087 F.... ZMQbg/1
Use ps -ef to look at the Python processes:
(py37) twsm@twsm-PR4904P:~/project/paper$ ps -ef | grep python
twsm 3535 2093 0 5月19 pts/9 00:00:14 /home/twsm/anaconda3/envs/py37/bin/python /home/twsm/anaconda3/envs/py37/bin/jupyter-notebook
twsm 9087 3535 99 10:49 ? 00:09:46 /home/twsm/anaconda3/envs/py37/bin/python -m ipykernel_launcher -f /home/twsm/.local/share/jupyter/runtime/kernel-4380b8f8-ed1b-44f7-b15b-499849b9ef77.json
twsm 9515 3535 0 10:53 ? 00:00:00 /home/twsm/anaconda3/envs/py37/bin/python -m ipykernel_launcher -f /home/twsm/.local/share/jupyter/runtime/kernel-7b392a0b-8b3a-49de-a921-83cbb9839bb3.json
twsm 9532 2029 0 10:57 pts/8 00:00:00 grep --color=auto python
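Sure enough, process 9087 is there. It is an ipykernel_launcher process whose parent is the jupyter-notebook process (PID 3535), which means it is a notebook kernel.
In fact, this code was no longer running in Jupyter, yet its kernel process was still alive and still holding the GPUs.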
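To pick out kernel processes from ps -ef output without reading it by eye, the command lines can be matched against the kernel-<id>.json connection file name. This is a rough sketch, assuming the ps -ef column layout shown above (UID, PID, PPID, ...); find_kernels is a hypothetical helper name.

```python
import re

# Matches the connection-file argument of an ipykernel process and captures the kernel ID.
KERNEL_RE = re.compile(r"ipykernel_launcher\s+-f\s+\S*kernel-([0-9a-f-]+)\.json")


def find_kernels(ps_output):
    """Return {pid: (parent_pid, kernel_id)} for ipykernel processes in `ps -ef` output."""
    kernels = {}
    for line in ps_output.splitlines():
        match = KERNEL_RE.search(line)
        if not match:
            continue
        fields = line.split()
        # ps -ef columns: UID PID PPID C STIME TTY TIME CMD
        kernels[int(fields[1])] = (int(fields[2]), match.group(1))
    return kernels
```

Feeding it the ps output above yields two entries, 9087 and 9515, both with parent 3535; comparing those PIDs with the one reported by nvidia-smi or fuser identifies which kernel is holding the GPU.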
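Usually this process is the kernel of some notebook you ran a moment ago. Select that notebook in Jupyter Notebook, shut it down, and then run the original program again:
[Screenshot: shutting down the notebook kernel in Jupyter Notebook]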
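If the notebook page cannot be reached, the leftover kernel can also be stopped from outside Jupyter by signalling its PID. This is not what the steps above do, only a fallback sketch, assuming a Linux system and that the process belongs to your own user; terminate is a hypothetical helper name.

```python
import os
import signal


def terminate(pid):
    """Send SIGTERM to the process; return False if no such process exists."""
    try:
        os.kill(pid, signal.SIGTERM)
    except ProcessLookupError:
        return False
    return True
```

For the session above this would be terminate(9087). Afterwards, rerun nvidia-smi to confirm that the GPU memory has been freed before starting the training script again.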