Reference: https://github.com/pytorch/pytorch/issues/11929#issuecomment-649760983

This issue can be solved, and the solution is simple:
- Do not open the hdf5 file inside `__init__`.
- Open it lazily at the first data iteration.

Here is an illustration:
```python
import h5py
import torch

class LXRTDataLoader(torch.utils.data.Dataset):
    def __init__(self):
        """Do not open the hdf5 file here!"""

    def open_hdf5(self):
        self.img_hdf5 = h5py.File('img.hdf5', 'r')
        self.dataset = self.img_hdf5['dataset']  # if you want a dataset handle

    def __getitem__(self, item: int):
        if not hasattr(self, 'img_hdf5'):
            self.open_hdf5()
        img0 = self.img_hdf5['dataset'][0]  # do the loading here
        img1 = self.dataset[1]
        return img0, img1
```
Then a dataloader with `num_workers` > 0 can be used as usual:

```python
train_loader = torch.utils.data.DataLoader(
    dataset=train_dataset,
    batch_size=32,
    num_workers=4,
)
```
Explanation
The multi-processing actually happens when you create the data iterator (e.g., when entering `for datum in dataloader:`):
```python
for i in range(self._num_workers):
    index_queue = multiprocessing_context.Queue()
    # index_queue.cancel_join_thread()
    w = multiprocessing_context.Process(
        target=_utils.worker._worker_loop,
        args=(self._dataset_kind, self._dataset, index_queue,
              self._worker_result_queue, self._workers_done_event,
              self._auto_collation, self._collate_fn, self._drop_last,
              self._base_seed + i, self._worker_init_fn, i, self._num_workers))
```
In short, creating the iterator spawns multiple worker processes, each of which "copies" the state of the current process. If we open the hdf5 file only at the first data iteration, that open happens inside the worker, so each subprocess gets its own dedicated file object.
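This copy-on-fork behavior can be illustrated without h5py. In the minimal sketch below, `LazyDataset` and its methods are hypothetical stand-ins: the "handle" is just the opening process's PID instead of an `h5py.File`, which makes it visible that each worker creates its own handle while the parent's copy stays unopened.

```python
import multiprocessing as mp
import os

class LazyDataset:
    """Hypothetical stand-in for a Dataset that opens its file lazily."""

    def __init__(self):
        # Nothing is opened here: __init__ runs in the parent process.
        self._handle = None

    def get_handle(self):
        # Created on first use, i.e. inside whichever process touches the data.
        if self._handle is None:
            self._handle = os.getpid()  # stand-in for h5py.File('img.hdf5', 'r')
        return self._handle

def _worker(ds, q):
    q.put(ds.get_handle())

def demo(n_workers=2):
    ctx = mp.get_context("fork")  # matches the DataLoader default on Linux
    ds = LazyDataset()
    q = ctx.Queue()
    procs = [ctx.Process(target=_worker, args=(ds, q)) for _ in range(n_workers)]
    for p in procs:
        p.start()
    handles = [q.get() for _ in range(n_workers)]
    for p in procs:
        p.join()
    # Each worker created its own handle; the parent's copy is still unopened.
    return handles, ds._handle

if __name__ == "__main__":
    handles, parent_handle = demo()
    print(sorted(handles), parent_handle)
```

Running this prints two distinct worker PIDs and `None` for the parent, which is exactly the situation we want for the hdf5 file objects.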
If you instead open the hdf5 file in `__init__` and set `num_workers` > 0, two issues can arise:
- Writing behavior becomes non-deterministic. (We do not need to write to the hdf5 file, so this issue can be ignored here.)
- The state of the copied hdf5 file object might not faithfully reflect the current state of the file.

The approach above bypasses both issues.
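An equivalent variant, not from the original comment, is to do the per-worker open in a `worker_init_fn`, which PyTorch runs once inside each freshly started worker; `torch.utils.data.get_worker_info().dataset` is that worker's private copy of the dataset. The sketch below uses a toy dataset whose `open_file` method records the opening process's PID in place of an `h5py.File`:

```python
import os
import torch
from torch.utils.data import DataLoader, Dataset, get_worker_info

class LazyFileDataset(Dataset):
    """Toy dataset; open_file stands in for opening the hdf5 handle."""

    def __init__(self, n=8):
        self.n = n
        self.handle = None  # not opened in __init__

    def open_file(self):
        self.handle = os.getpid()  # stand-in for h5py.File('img.hdf5', 'r')

    def __len__(self):
        return self.n

    def __getitem__(self, idx):
        # By the time __getitem__ runs, worker_init_fn has opened the handle.
        return self.handle

def worker_init_fn(worker_id):
    # Runs inside the worker process, on the worker's own dataset copy.
    get_worker_info().dataset.open_file()

if __name__ == "__main__":
    loader = DataLoader(LazyFileDataset(), batch_size=4, num_workers=2,
                        worker_init_fn=worker_init_fn)
    pids = {int(x) for batch in loader for x in batch}
    print(len(pids))  # distinct worker handles, none belonging to the parent
```

This keeps `__getitem__` free of the `hasattr` check, at the cost of requiring every consumer of the dataset to pass the `worker_init_fn`.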