Reference: https://github.com/pytorch/pytorch/issues/11929#issuecomment-649760983

This issue can be solved, and the solution is simple:
- Do not open the hdf5 file inside `__init__`.
- Open it lazily at the first data iteration.

Here is an illustration:
```python
import h5py
import torch

class LXRTDataLoader(torch.utils.data.Dataset):
    def __init__(self):
        """Do not open the hdf5 file here!"""

    def open_hdf5(self):
        self.img_hdf5 = h5py.File('img.hdf5', 'r')
        self.dataset = self.img_hdf5['dataset']  # if you want a dataset handle

    def __getitem__(self, item: int):
        if not hasattr(self, 'img_hdf5'):
            self.open_hdf5()
        img0 = self.img_hdf5['dataset'][0]  # do the loading here
        img1 = self.dataset[1]
        return img0, img1
```
Then a dataloader with `num_workers` > 0 can be used as usual:

```python
train_loader = torch.utils.data.DataLoader(
    dataset=train_dataset,
    batch_size=32,
    num_workers=4,
)
```
Explanation
The multi-processing actually happens when you create the data iterator (e.g., when entering `for datum in dataloader:`):
```python
for i in range(self._num_workers):
    index_queue = multiprocessing_context.Queue()
    # index_queue.cancel_join_thread()
    w = multiprocessing_context.Process(
        target=_utils.worker._worker_loop,
        args=(self._dataset_kind, self._dataset, index_queue,
              self._worker_result_queue, self._workers_done_event,
              self._auto_collation, self._collate_fn, self._drop_last,
              self._base_seed + i, self._worker_init_fn, i, self._num_workers))
```
In short, creating the iterator spawns multiple worker processes, each of which "copies" the state of the current process. If we open the hdf5 file only at the first data iteration, that open happens inside the worker, so each subprocess gets its own dedicated file object.
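This copy-on-fork behavior can be illustrated without h5py. In the minimal sketch below, `LazyDataset` and its methods are hypothetical stand-ins: the "handle" is just the opening process's PID instead of an `h5py.File`, which makes it visible that each worker creates its own handle while the parent's copy stays unopened.

```python
import multiprocessing as mp
import os

class LazyDataset:
    """Hypothetical stand-in for a Dataset that opens its file lazily."""

    def __init__(self):
        # Nothing is opened here: __init__ runs in the parent process.
        self._handle = None

    def get_handle(self):
        # Created on first use, i.e. inside whichever process touches the data.
        if self._handle is None:
            self._handle = os.getpid()  # stand-in for h5py.File('img.hdf5', 'r')
        return self._handle

def _worker(ds, q):
    q.put(ds.get_handle())

def demo(n_workers=2):
    ctx = mp.get_context("fork")  # matches the DataLoader default on Linux
    ds = LazyDataset()
    q = ctx.Queue()
    procs = [ctx.Process(target=_worker, args=(ds, q)) for _ in range(n_workers)]
    for p in procs:
        p.start()
    handles = [q.get() for _ in range(n_workers)]
    for p in procs:
        p.join()
    # Each worker created its own handle; the parent's copy is still unopened.
    return handles, ds._handle

if __name__ == "__main__":
    handles, parent_handle = demo()
    print(sorted(handles), parent_handle)
```

Running this prints two distinct worker PIDs and `None` for the parent, which is exactly the situation we want for the hdf5 file objects.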
If you instead open the hdf5 file in `__init__` and set `num_workers` > 0, two issues can arise:
- Writing behavior becomes non-deterministic. (We do not need to write to the hdf5 file, so this issue can be ignored here.)
- The state of the copied hdf5 file object might not faithfully reflect the current state of the file.

The approach above bypasses both issues.
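An equivalent variant, not from the original comment, is to do the per-worker open in a `worker_init_fn`, which PyTorch runs once inside each freshly started worker; `torch.utils.data.get_worker_info().dataset` is that worker's private copy of the dataset. The sketch below uses a toy dataset whose `open_file` method records the opening process's PID in place of an `h5py.File`:

```python
import os
import torch
from torch.utils.data import DataLoader, Dataset, get_worker_info

class LazyFileDataset(Dataset):
    """Toy dataset; open_file stands in for opening the hdf5 handle."""

    def __init__(self, n=8):
        self.n = n
        self.handle = None  # not opened in __init__

    def open_file(self):
        self.handle = os.getpid()  # stand-in for h5py.File('img.hdf5', 'r')

    def __len__(self):
        return self.n

    def __getitem__(self, idx):
        # By the time __getitem__ runs, worker_init_fn has opened the handle.
        return self.handle

def worker_init_fn(worker_id):
    # Runs inside the worker process, on the worker's own dataset copy.
    get_worker_info().dataset.open_file()

if __name__ == "__main__":
    loader = DataLoader(LazyFileDataset(), batch_size=4, num_workers=2,
                        worker_init_fn=worker_init_fn)
    pids = {int(x) for batch in loader for x in batch}
    print(len(pids))  # distinct worker handles, none belonging to the parent
```

This keeps `__getitem__` free of the `hasattr` check, at the cost of requiring every consumer of the dataset to pass the `worker_init_fn`.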