train_loader = torch.utils.data.DataLoader(
    dataset=train_dataset,
    batch_size=batch_size,
    # shuffle=True,  # must stay off when a sampler is supplied
    drop_last=False,
    num_workers=num_workers,
    sampler=train_sample,
    pin_memory=False,
)
Setting num_workers > 0 raises:
RuntimeError: Cannot re-initialize CUDA in forked subprocess. To use CUDA with multiprocessing, you must use the 'spawn' start method
Cause: CUDA cannot be re-initialized in a forked child process. What does that mean?
Solutions:
- Call torch.multiprocessing.set_start_method('spawn'). The spawn start method supports using CUDA across worker processes, but it did not work in my case.
- Set num_workers=0. The error goes away, but data loading then drastically slows down training.
- Recommended: inspect the dataset definition code; the traceback also points to the offending line. In my case, the tensor returned by the dataset's __getitem__ had already been moved with .to(device), which triggers the multiprocessing error:
def __getitem__(self, index: int):
    mytensor = torch.tensor(
        self.array,
        dtype=self.dtype
    ).to(device)  # <-- this runs inside a forked worker and breaks
    return mytensor
Delete the .to(device) call.
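For reference, a minimal sketch of the corrected dataset. The class name `MyDataset` and the constructor are hypothetical scaffolding; `array` and `dtype` come from the snippet above:

```python
import torch
from torch.utils.data import Dataset

class MyDataset(Dataset):
    """Corrected dataset: __getitem__ returns plain CPU tensors."""

    def __init__(self, array, dtype=torch.float32):
        self.array = array
        self.dtype = dtype

    def __len__(self):
        return len(self.array)

    def __getitem__(self, index: int):
        # no .to(device) here -- worker processes must not touch CUDA
        return torch.tensor(self.array[index], dtype=self.dtype)
```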
Then, in the training loop, move each batch to the device:
for x, y in train_loader:
    x = x.to(device)
    y = y.to(device)
Problem solved!
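Putting the fix together, a minimal runnable sketch. The toy tensors stand in for train_dataset, num_workers=2 is an arbitrary choice, and device falls back to the CPU when CUDA is unavailable:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# falls back to CPU when no GPU is present
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# toy stand-in for train_dataset: 8 samples of 4 features, with labels
xs = torch.randn(8, 4)
ys = torch.randint(0, 2, (8,))

# num_workers > 0 is now safe: the dataset hands out CPU tensors only
train_loader = DataLoader(TensorDataset(xs, ys), batch_size=4, num_workers=2)

for x, y in train_loader:
    # batches arrive on the CPU; move them here, not inside the dataset
    x, y = x.to(device), y.to(device)
```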