Classification is one of the most basic tasks in CV, so it naturally comes with a classification loss. In code you generally see one of two styles, and today I want to understand them properly:
First style:
criterion = nn.CrossEntropyLoss()
loss = criterion(output, target)
Second style:
output = F.log_softmax(x, dim=1)
loss = F.nll_loss(output, target)
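To make the comparison concrete, here is a minimal sketch of my own (the linear layer, feature size and class count are made up purely for illustration) showing where each style sits; both give the same value on the same logits:

import torch
import torch.nn as nn
import torch.nn.functional as F

x = torch.randn(4, 20)                 # a fake batch: 4 samples, 20 features
target = torch.randint(0, 10, (4,))    # fake class-index labels in [0, 10)
fc = nn.Linear(20, 10)                 # stand-in for the last fully connected layer

# First style: forward() returns raw logits, CrossEntropyLoss applies log_softmax + NLL internally
logits = fc(x)
criterion = nn.CrossEntropyLoss()
loss1 = criterion(logits, target)

# Second style: forward() ends with log_softmax, nll_loss is applied outside
log_probs = F.log_softmax(fc(x), dim=1)
loss2 = F.nll_loss(log_probs, target)

print(loss1, loss2)                    # the two values are identical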
1.1 torch.nn.functional.log_softmax()
The full signature of this function is:
torch.nn.functional.log_softmax(input, dim=None, _stacklevel=3, dtype=None)
I ran into this function recently while writing a classification task from scratch: in the MNIST training script of the official PyTorch examples, the model's forward() suddenly calls it, whereas I remembered nothing coming after the final fully connected layer (from here).
Under torch.nn.functional.softmax(input, dim=None, _stacklevel=3, dtype=None) the docs have this note:
This function doesn’t work directly with NLLLoss, which expects the Log to be computed between the Softmax and itself. Use log_softmax instead (it’s faster and has better numerical properties).
In other words, NLLLoss expects its input to be the result of a softmax followed by a log, which is why log_softmax() is used here and F.nll_loss is applied afterwards. On top of that, log_softmax() is faster and more numerically stable than composing the two operations, and its dtype parameter lets you cast the input to a different data type to avoid overflow.
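To see the numerical-stability point for yourself, here is a tiny test of my own (the values are arbitrary): composing log() with softmax() underflows to -inf when a probability rounds to zero, while log_softmax() stays finite, and dtype can upcast the computation:

import torch
import torch.nn.functional as F

x = torch.tensor([[0., 1000.]])

# log(softmax(x)): the first probability underflows to 0, so log() returns -inf
print(torch.log(F.softmax(x, dim=1)))          # tensor([[-inf, 0.]])

# log_softmax computes the same quantity in one fused, numerically stable op
print(F.log_softmax(x, dim=1))                 # tensor([[-1000., 0.]])

# dtype casts the input before the op, e.g. to do the computation in double precision
print(F.log_softmax(x, dim=1, dtype=torch.float64))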
Here is a test of softmax and log_softmax:
>>> import torch
>>> import torch.nn.functional as F
>>> a = torch.tensor([[1, 1, 1], [2, 2, 2]], dtype=torch.float)
>>> a
tensor([[1., 1., 1.],
        [2., 2., 2.]])
>>> F.softmax(a)
tensor([[0.3333, 0.3333, 0.3333],
        [0.3333, 0.3333, 0.3333]])
>>> F.softmax(a, dim=0)
tensor([[0.2689, 0.2689, 0.2689],
        [0.7311, 0.7311, 0.7311]])
# It looks like softmax effectively defaults to dim=1 here (for this 2-D input), not None as the official docs state
>>> F.log_softmax(a, dim=1)
tensor([[-1.0986, -1.0986, -1.0986],
        [-1.0986, -1.0986, -1.0986]])
>>> F.log_softmax(a, dim=0)
tensor([[-1.3133, -1.3133, -1.3133],
        [-0.3133, -0.3133, -0.3133]])
# Calling log_softmax() without specifying dim raises a warning; this function is just softmax and log fused into one op.
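As a quick sanity check of the "two-in-one" claim, here is my own continuation of the session above, reusing the same tensor a:

>>> torch.allclose(torch.log(F.softmax(a, dim=1)), F.log_softmax(a, dim=1))
True
>>> torch.allclose(torch.log(F.softmax(a, dim=0)), F.log_softmax(a, dim=0))
True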
One more aside: torch.nn.Softmax(dim=None) and torch.nn.LogSoftmax(dim=None) do exactly the same thing as F.softmax() and F.log_softmax(); a look at the source shows the module simply calls the functional version (and likewise for LogSoftmax):
class Softmax(Module):
    r"""Applies the Softmax function to an n-dimensional input Tensor
    rescaling them so that the elements of the n-dimensional output Tensor
    lie in the range [0,1] and sum to 1.

    Softmax is defined as:

    .. math::
        \text{Softmax}(x_{i}) = \frac{\exp(x_i)}{\sum_j \exp(x_j)}

    Shape:
        - Input: :math:`(*)` where `*` means, any number of additional
          dimensions
        - Output: :math:`(*)`, same shape as the input

    Returns:
        a Tensor of the same dimension and shape as the input with
        values in the range [0, 1]

    Arguments:
        dim (int): A dimension along which Softmax will be computed (so every slice
            along dim will sum to 1).

    .. note::
        This module doesn't work directly with NLLLoss,
        which expects the Log to be computed between the Softmax and itself.
        Use `LogSoftmax` instead (it's faster and has better numerical properties).

    Examples::

        >>> m = nn.Softmax(dim=1)
        >>> input = torch.randn(2, 3)
        >>> output = m(input)
    """
    __constants__ = ['dim']

    def __init__(self, dim=None):
        super(Softmax, self).__init__()
        self.dim = dim

    def __setstate__(self, state):
        self.__dict__.update(state)
        if not hasattr(self, 'dim'):
            self.dim = None

    def forward(self, input):
        return F.softmax(input, self.dim, _stacklevel=5)

    def extra_repr(self):
        return 'dim={dim}'.format(dim=self.dim)
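And a quick check of my own that the module and functional forms really return the same thing (again reusing the tensor a from the earlier session):

>>> m = torch.nn.Softmax(dim=1)
>>> torch.equal(m(a), F.softmax(a, dim=1))
True
>>> lm = torch.nn.LogSoftmax(dim=1)
>>> torch.equal(lm(a), F.log_softmax(a, dim=1))
True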
1.2 torch.nn.functional.nll_loss()
The full signature of this function is:
torch.nn.functional.nll_loss(input, target, weight=None, size_average=None, ignore_index=-100, reduce=None, reduction='mean')
That is, the negative log likelihood loss; with $q$ the predicted distribution and $p$ the ground-truth distribution, the formula looks like this:

$$\text{loss}(p, q) = -\sum_{c} p_c \log q_c$$

Here $q$ is the per-class probability distribution predicted by the network, and $p$ is the ground-truth distribution, i.e. the so-called target label. F.log_softmax() above computes exactly $\log q$, which is one of the two inputs of F.nll_loss() (in the docs' words: "The input given through a forward call is expected to contain log-probabilities of each class"). The other input is the label: the label we pass in is a class index (printing the labels of the MNIST and CIFAR10 datasets that ship with PyTorch confirms this), and F.nll_loss() effectively treats it as a one-hot encoding during the computation (in the docs' words: "The target that this loss expects should be a class index in the range [0, C-1] where C = number of classes"). So this is really just a cross-entropy loss, the same as the other loss function discussed below (for more on cross-entropy see this blog post; some relations are also given at the end). Let's look at a few examples:
>>> a = torch.tensor([[10,10,10,10,10],[20,20,20,20,20]], dtype=torch.float)
>>> target = torch.tensor([0, 0])
>>> loss = F.nll_loss(F.log_softmax(a), target)
>>> loss
tensor(1.6094)
>>> loss = F.nll_loss(F.log_softmax(a), target,size_average=False)
Warning (from warnings module):
File "E:\Miniconda3\lib\site-packages\torch\nn\_reduction.py", line 49
warnings.warn(warning.format(ret))
UserWarning: size_average and reduce args will be deprecated, please use reduction='sum' instead.
>>> loss
tensor(3.2189)
>>> loss = F.nll_loss(F.log_softmax(a), target,reduction='sum')
>>> loss
tensor(3.2189)
>>> loss = F.nll_loss(F.log_softmax(a), target,reduction='mean')
>>> loss
tensor(1.6094)
>>> loss = F.nll_loss(F.log_softmax(a), target,reduction='none')
>>> loss
tensor([1.6094, 1.6094])
Note: avoid the size_average and reduce arguments of F.nll_loss(); use the reduction argument instead. With reduction='sum' the per-example losses in the batch are summed, and with reduction='mean' they are averaged.
The computation above can be broken down step by step, as in the sketch below:
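Roughly, nll_loss picks out the log-probability at the target index of each row, negates it, and then applies the chosen reduction. A small sketch of my own, reusing a and target from the session above:

log_probs = F.log_softmax(a, dim=1)            # shape (2, 5); every entry is -1.6094 here
picked = log_probs[torch.arange(2), target]    # log-probability of the target class for each example
per_example = -picked                          # tensor([1.6094, 1.6094])  == reduction='none'
print(per_example.sum())                       # tensor(3.2189)            == reduction='sum'
print(per_example.mean())                      # tensor(1.6094)            == reduction='mean'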
1.3 torch.nn.CrossEntropyLoss()
The full signature of this function is:
torch.nn.CrossEntropyLoss(weight=None, size_average=None, ignore_index=-100, reduce=None, reduction='mean')
Of course there is also a corresponding functional form, F.cross_entropy(). Judging by the parameters, it does roughly the same job as F.nll_loss() above, so why does this function exist as well?
Because, in the docs' own words: This criterion combines nn.LogSoftmax() and nn.NLLLoss() in one single class.
That is, you can pass the raw logits and the target straight in to compute the cross-entropy loss (logits meaning the network's direct output, without any softmax or log applied, so there is nothing extra to write in forward(); this is exactly the first style shown at the beginning). Everything else works the same as above, so I won't repeat it here.
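As a quick check of my own, feeding the raw logits a and the target from the nll_loss examples above straight into CrossEntropyLoss / F.cross_entropy reproduces the same numbers:

>>> criterion = torch.nn.CrossEntropyLoss()
>>> criterion(a, target)
tensor(1.6094)
>>> F.cross_entropy(a, target)
tensor(1.6094)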
Bonus
To sum up, in your own code just pick one of the two styles. Below are a few extra notes on cross-entropy:
Entropy is a concept from information theory, and cross-entropy measures the "distance" between two distributions.
Entropy:

$$H(p) = -\sum_i p_i \log p_i$$

CrossEntropy:

$$H(p, q) = -\sum_i p_i \log q_i$$

Relation between Entropy and CrossEntropy:

$$H(p, q) = H(p) + D_{KL}(p \parallel q)$$

When the two distributions are equal, $p = q$: $D_{KL}(p \parallel q) = 0$, i.e. $H(p, q) = H(p)$.
For one-hot encoding, $H(p) = 0$, so here $H(p, q) = D_{KL}(p \parallel q)$; in this case the learning objective is simply to bring the two distributions as close together as possible.
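A tiny numeric check of these relations (the distributions are made up by me): with a one-hot p, H(p) = 0, so the cross-entropy collapses to -log of the predicted probability at the true class, which is exactly what the NLL loss computes:

import torch

p = torch.tensor([0., 1., 0.])            # one-hot ground truth, so H(p) = 0
q = torch.tensor([0.2, 0.7, 0.1])         # a made-up predicted distribution
H_pq = -(p * torch.log(q)).sum()          # H(p, q) = -sum_i p_i * log(q_i)
print(H_pq)                               # tensor(0.3567), i.e. -log(0.7)
# Since H(p) = 0 for one-hot p, H(p, q) here also equals the KL divergence D_KL(p || q).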
Note
Targets are generally of type LongTensor, so when you write your own Dataset subclass, remember to cast the returned target to int64 (i.e. a LongTensor); class-index targets of a floating-point type will make nll_loss / CrossEntropyLoss complain.
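For example, a minimal Dataset sketch (the class name and fields are hypothetical, just for illustration) that returns the target as int64:

import torch
from torch.utils.data import Dataset

class MyDataset(Dataset):                  # hypothetical dataset for illustration
    def __init__(self, samples, labels):
        self.samples = samples             # e.g. a list/array of images
        self.labels = labels               # e.g. a list of integer class indices

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, idx):
        x = torch.as_tensor(self.samples[idx], dtype=torch.float32)
        # class-index targets for nll_loss / CrossEntropyLoss must be int64 (LongTensor)
        y = torch.as_tensor(self.labels[idx], dtype=torch.int64)
        return x, y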