PyTorch ships with a Dataset class and a DataLoader that returns data batch by batch; for a usage example, see this post.
In this article we will look at how the DataLoader is implemented.
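Before digging into the internals, here is a minimal sketch of the usage we will be dissecting: a map-style Dataset (the `SquaresDataset` name and contents are made up for illustration) fed through a DataLoader.

```python
import torch
from torch.utils.data import Dataset, DataLoader

class SquaresDataset(Dataset):
    """Map-style dataset: __getitem__ and __len__ are all DataLoader needs."""
    def __len__(self):
        return 10

    def __getitem__(self, idx):
        return torch.tensor([idx, idx * idx])

loader = DataLoader(SquaresDataset(), batch_size=4, shuffle=False)
for batch in loader:
    print(batch.shape)  # (4, 2) for the first two batches, (2, 2) for the last
```

With 10 samples and a batch size of 4, the final batch only has 2 samples; the `drop_last` flag discussed below controls whether that partial batch is kept.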
A few parameters in the DataLoader class's init function are worth noting:
if sampler is None:  # give default samplers
    if self._dataset_kind == _DatasetKind.Iterable:
        # See NOTE [ Custom Samplers and IterableDataset ]
        sampler = _InfiniteConstantSampler()
    else:  # map-style
        if shuffle:
            sampler = RandomSampler(dataset)
        else:
            sampler = SequentialSampler(dataset)

if batch_size is not None and batch_sampler is None:
    # auto_collation without custom batch_sampler
    batch_sampler = BatchSampler(sampler, batch_size, drop_last)
Here, depending on whether shuffle is true, either RandomSampler(dataset) or SequentialSampler(dataset) is used. The batch_sampler is produced by BatchSampler; the code that assembles a batch is shown below. It is essentially a generator that pulls indices from the sampler until it has collected batch_size of them:
def __iter__(self):
    batch = []
    for idx in self.sampler:
        batch.append(idx)
        if len(batch) == self.batch_size:
            yield batch
            batch = []
    if len(batch) > 0 and not self.drop_last:
        yield batch
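We can watch this chunking behavior directly by running torch's BatchSampler on a small sampler (a quick sketch, assuming the standard torch.utils.data.sampler import path):

```python
import torch
from torch.utils.data.sampler import SequentialSampler, BatchSampler

sampler = SequentialSampler(range(7))  # yields indices 0..6 in order

# drop_last=False: the trailing partial batch is kept
batches = list(BatchSampler(sampler, batch_size=3, drop_last=False))
print(batches)  # [[0, 1, 2], [3, 4, 5], [6]]

# drop_last=True: the trailing partial batch is discarded
batches_dropped = list(BatchSampler(sampler, batch_size=3, drop_last=True))
print(batches_dropped)  # [[0, 1, 2], [3, 4, 5]]
```

Note that BatchSampler yields lists of indices, not data; the DataLoader later uses these index lists to fetch samples from the dataset.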
So how is the sampler itself constructed?
Let's start with SequentialSampler: it simply wraps the dataset, and its iter function returns the integers from 0 up to the dataset size, in order.
class SequentialSampler(Sampler):
    r"""Samples elements sequentially, always in the same order.

    Arguments:
        data_source (Dataset): dataset to sample from
    """

    def __init__(self, data_source):
        self.data_source = data_source

    def __iter__(self):
        return iter(range(len(self.data_source)))

    def __len__(self):
        return len(self.data_source)
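Since SequentialSampler only calls len() on its data_source, anything with a length works as a demo (a minimal sketch):

```python
import torch
from torch.utils.data.sampler import SequentialSampler

s = SequentialSampler(["a", "b", "c"])  # any object with __len__ works here
print(list(s))  # [0, 1, 2] -- indices, not the data itself
```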
RandomSampler is a bit more involved, but the principle is the same as SequentialSampler, except that its iter function returns indices within the dataset size in random order rather than sequentially.
class RandomSampler(Sampler):
    r"""Samples elements randomly. If without replacement, then sample from a shuffled dataset.
    If with replacement, then user can specify :attr:`num_samples` to draw.

    Arguments:
        data_source (Dataset): dataset to sample from
        replacement (bool): samples are drawn with replacement if ``True``, default=``False``
        num_samples (int): number of samples to draw, default=`len(dataset)`. This argument
            is supposed to be specified only when `replacement` is ``True``.
    """

    def __init__(self, data_source, replacement=False, num_samples=None):
        self.data_source = data_source
        self.replacement = replacement
        self._num_samples = num_samples

        if not isinstance(self.replacement, bool):
            raise ValueError("replacement should be a boolean value, but got "
                             "replacement={}".format(self.replacement))

        if self._num_samples is not None and not replacement:
            raise ValueError("With replacement=False, num_samples should not be specified, "
                             "since a random permute will be performed.")

        if not isinstance(self.num_samples, int) or self.num_samples <= 0:
            raise ValueError("num_samples should be a positive integer "
                             "value, but got num_samples={}".format(self.num_samples))

    @property
    def num_samples(self):
        # dataset size might change at runtime
        if self._num_samples is None:
            return len(self.data_source)
        return self._num_samples

    def __iter__(self):
        n = len(self.data_source)
        if self.replacement:
            return iter(torch.randint(high=n, size=(self.num_samples,), dtype=torch.int64).tolist())
        return iter(torch.randperm(n).tolist())

    def __len__(self):
        return self.num_samples
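The two branches in __iter__ above correspond to the two sampling modes, which we can check directly (a small sketch; the exact permutation is random, so only aggregate properties are asserted):

```python
import torch
from torch.utils.data.sampler import RandomSampler

data = list(range(5))

# replacement=False (default): torch.randperm, a permutation of all indices,
# so each index appears exactly once
perm = list(RandomSampler(data))
print(sorted(perm))  # [0, 1, 2, 3, 4]

# replacement=True: torch.randint, num_samples independent draws,
# so duplicates are possible and the count can exceed len(data)
draws = list(RandomSampler(data, replacement=True, num_samples=8))
print(len(draws))  # 8
```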
Next comes the setup of collate_fn:
if collate_fn is None:
    if self._auto_collation:
        collate_fn = _utils.collate.default_collate
    else:
        collate_fn = _utils.collate.default_convert
The job of collate_fn is to merge each data field across samples into a tensor with a batch dimension. What the DataLoader returns is a batch-sized tensor: for example, with a batch size of 4 and images of shape (3, 64, 64), the DataLoader yields a tensor of shape (4, 3, 64, 64). collate_fn is what stacks the individual images together into that single tensor.
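We can reproduce the (4, 3, 64, 64) example by calling default_collate by hand on a list of (image, label) samples (a sketch using the internal _utils.collate path referenced in the snippet above, which is a private module and may move between torch versions):

```python
import torch
from torch.utils.data._utils.collate import default_collate

# Four samples, each an (image, label) pair with image shape (3, 64, 64)
samples = [(torch.zeros(3, 64, 64), i) for i in range(4)]

images, labels = default_collate(samples)
print(images.shape)  # torch.Size([4, 3, 64, 64]) -- images stacked along a new batch dim
print(labels)        # tensor([0, 1, 2, 3]) -- scalar labels collected into one tensor
```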
If you want the DataLoader to output data in a different form, you can define your own collate_fn; there is an example of that in this post as well.
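As one possible sketch of a custom collate_fn (the pad_collate function here is hypothetical, not from the source): suppose samples are variable-length 1-D tensors that default_collate cannot stack; a custom function can pad them to a common length first.

```python
import torch
from torch.utils.data import DataLoader

def pad_collate(batch):
    """Pad 1-D tensors in the batch to the longest length, then stack."""
    max_len = max(x.size(0) for x in batch)
    padded = [torch.nn.functional.pad(x, (0, max_len - x.size(0))) for x in batch]
    return torch.stack(padded)

# A plain list of tensors works as a map-style dataset (it has __getitem__/__len__)
data = [torch.ones(n) for n in (2, 3, 5)]
loader = DataLoader(data, batch_size=3, collate_fn=pad_collate)

batch = next(iter(loader))
print(batch.shape)  # torch.Size([3, 5])
```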
References:
https://github.com/pytorch/pytorch/tree/e870a9a87042805cd52973e36534357f428a0748/torch/utils/data
https://pytorch.org/docs/stable/data.html