Slurm compute cluster troubleshooting notes

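Problem: Slurm commands fail with the errors below. The cause is that the master node was rebooted, so the Slurm cluster has to be brought back up.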
squeue: error: If munged is up, restart with --num-threads=10

squeue: error: Munge encode failed: Failed to access "/var/run/munge/munge.socket.2": No such file or directory

squeue: error: slurm_send_node_msg: auth_g_create: REQUEST_JOB_INFO has authentication error

slurm_load_jobs error: Protocol authentication error

systemctl restart munge.service


The munge service fails to restart.

Check the system log with journalctl, or open /var/log/messages with vim.


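The /var/run/munge directory is missing (on most systemd-based systems /var/run lives on a tmpfs and is emptied at reboot), so recreate it and hand it back to the munge user:
mkdir /var/run/munge && chown munge:munge /var/run/munge
After running these two commands, restart the munge service: systemctl restart munge.service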
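
The same repair can be wrapped in a small helper so it is easy to rerun after the next reboot. This is only a minimal sketch and not part of the original fix: the function name ensure_runtime_dir is made up for illustration, and the 0755 mode is an assumption based on the usual munge packaging defaults. The commented-out munge -n | unmunge round trip is a common way to confirm that munged is answering again.

```shell
#!/bin/sh
# Hypothetical helper: recreate a daemon's runtime directory if it is missing.
# Usage: ensure_runtime_dir DIR OWNER:GROUP
ensure_runtime_dir() {
    dir=$1
    owner=$2
    # Create the directory only when it does not exist yet.
    if [ ! -d "$dir" ]; then
        mkdir -p "$dir" || return 1
    fi
    # Hand the directory to the daemon user and set the assumed mode.
    chown "$owner" "$dir" || return 1
    chmod 0755 "$dir"
}

# On the management node, as root (uncomment to use):
# ensure_runtime_dir /var/run/munge munge:munge && systemctl restart munge.service
# munge -n | unmunge    # encode and decode a credential to check munged
```

The helper is safe to run repeatedly: if the directory already exists it only resets the owner and mode.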


The compute nodes show as down. Restart the slurmd service on every compute node over SSH:
clush -a "systemctl restart slurmd"
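
To see which nodes still need attention after the restart, the node states can be filtered. The following is a sketch under the assumption that the input comes from sinfo -h -N -o "%N %T", which prints one "NODENAME STATE" pair per line; list_down_nodes is a hypothetical helper name, not a Slurm command.

```shell
#!/bin/sh
# Hypothetical helper: print the unique names of nodes whose state starts
# with "down" (this also matches "down*", i.e. down and not responding).
# Reads "NODENAME STATE" lines on stdin.
list_down_nodes() {
    awk '$2 ~ /^down/ { print $1 }' | sort -u
}

# On the management node (uncomment to use):
# sinfo -h -N -o "%N %T" | list_down_nodes
```

The sort -u step is there because sinfo -N lists a node once per partition it belongs to.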

Check that the shared directory is mounted on each compute node: df -h


The management node's shared directory is not mounted. Run systemctl restart nfs on the management node, after which the mount completes.
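
Reading df -h by eye gets tedious with many nodes, so the mount check can also be scripted. This is a rough sketch: is_nfs_mounted is a hypothetical helper, and /share is a placeholder because the notes do not say where the shared directory is mounted. It expects a mount table in /proc/mounts format on stdin.

```shell
#!/bin/sh
# Hypothetical helper: succeed if MOUNTPOINT is listed in the mount table
# read from stdin (/proc/mounts format) with an NFS filesystem type.
# Usage: is_nfs_mounted MOUNTPOINT < /proc/mounts
is_nfs_mounted() {
    awk -v mp="$1" '$2 == mp && $3 ~ /^nfs/ { found = 1 } END { exit !found }'
}

# On a compute node; /share is a placeholder path (uncomment to use):
# is_nfs_mounted /share < /proc/mounts && echo "shared directory is mounted"
```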


Bring the down compute nodes back online:

scontrol update NodeName=comput1 State=idle

scontrol update NodeName=comput4 State=idle
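
With more than a couple of nodes, typing one scontrol command per node does not scale. The sketch below generates the same commands from a list of node names so they can be reviewed before being run; gen_resume_cmds is a hypothetical helper name, and the node names in the usage line are the two from these notes.

```shell
#!/bin/sh
# Hypothetical helper: print one "scontrol update" command per node name
# read from stdin (one name per line, blank lines are skipped).
gen_resume_cmds() {
    while read -r node; do
        if [ -n "$node" ]; then
            printf 'scontrol update NodeName=%s State=idle\n' "$node"
        fi
    done
}

# Review the generated commands first, then pipe them to sh to execute:
# printf 'comput1\ncomput4\n' | gen_resume_cmds
# printf 'comput1\ncomput4\n' | gen_resume_cmds | sh
```

Printing the commands instead of running them directly keeps the step a dry run by default. Slurm also accepts State=RESUME for returning a down node to service, which may be preferable on some setups.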


All compute nodes are back online in the idle state.

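Unfortunately, the shared directory lives on the management node, so when the management node goes down all running jobs die, and jobs left in the PD (pending) state can only be cancelled and resubmitted.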

Last edited: 2023.04.27 15:50:15
