Fixing h5py's failure to load data in parallel inside a PyTorch DataLoader

Reference: https://github.com/pytorch/pytorch/issues/11929#issuecomment-649760983

This issue can be solved, and the solution is simple:

  • Do not open the hdf5 file inside __init__.
  • Open it at the first data iteration instead.

Here is an illustration:

import h5py
import torch

class LXRTDataLoader(torch.utils.data.Dataset):
    def __init__(self):
        """Do not open the hdf5 file here!!"""

    def open_hdf5(self):
        self.img_hdf5 = h5py.File('img.hdf5', 'r')
        self.dataset = self.img_hdf5['dataset']  # if you want a dataset handle

    def __getitem__(self, item: int):
        if not hasattr(self, 'img_hdf5'):
            self.open_hdf5()  # each worker opens its own file handle
        img0 = self.img_hdf5['dataset'][0]  # do the loading here
        img1 = self.dataset[1]
        return img0, img1

Then a dataloader with num_workers > 0 can be used as normal:

train_dataset = LXRTDataLoader()
train_loader = torch.utils.data.DataLoader(
    dataset=train_dataset,
    batch_size=32,
    num_workers=4,
)
Explanation

The multiprocessing actually happens when you create the data iterator (e.g., when entering for datum in dataloader:). Simplified, PyTorch's DataLoader spawns its workers like this:

 for i in range(self._num_workers): 
     index_queue = multiprocessing_context.Queue() 
     # index_queue.cancel_join_thread() 
     w = multiprocessing_context.Process( 
         target=_utils.worker._worker_loop, 
         args=(self._dataset_kind, self._dataset, index_queue, 
               self._worker_result_queue, self._workers_done_event, 
               self._auto_collation, self._collate_fn, self._drop_last, 
               self._base_seed + i, self._worker_init_fn, i, self._num_workers)) 

In short, it creates multiple processes that "copy" the state of the current process. Thus, if we open the hdf5 file at the first data iteration, the opened file object is dedicated to each subprocess.
If you instead open the hdf5 file in __init__ and set num_workers > 0, it might cause two issues:

  • Writing behavior becomes non-deterministic. (We do not need to write to hdf5, so this issue can be ignored.)
  • The copied state of the hdf5 object might not faithfully reflect the current state of the file.

The approach above bypasses both issues.