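FCN: Reading Fully Convolutional Networks for Semantic Segmentation, with a PyTorch Implementation

Author: 心有宝宝人自圆

Notice: you are welcome to repost the images or text of this article, but please credit the source.

Foreword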

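The FCN introduced in this article is the founding work of the fully convolutional school of semantic segmentation; most later semantic segmentation research adopted this structure.

The goal of semantic segmentation is to predict a class for every pixel of an image (dense prediction).

The main content of the paper is easy to follow, but some of the tricks the authors describe are confusing (and in the end the authors mostly did not use them 😓). You can skip these tricks when learning how FCN works, but I will still share how I understand them 😀

At the end I give my code and results (PyTorch). Training this network was really hard 😭 Watch out for the following pitfalls:

a) When resizing, the interpolation method must be NEAREST rather than the default bilinear; otherwise pixels of the ground-truth label image get mislabeled.

b) Be patient during training, otherwise your segmentation output stays black (about 80 epochs without non-maximum suppression, about 40 with it).

c) About the loss: depending on how the loss is designed, gradients may explode; if the loss design is not the cause, avoid a large batch size. The loss design can also affect convergence time, that is, how long the segmentation output stays black.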

1. Introduction

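The goal of semantic segmentation is to make a prediction for every pixel: each pixel is labeled with the class of the object that contains it. FCN was the first semantic segmentation method trained end-to-end, pixels-to-pixels. With the fully connected layers removed, FCN accepts input images of arbitrary size and produces dense predictions. Both learning and inference run on the whole image at once through dense feedforward computation and backpropagation, and the in-network upsampling layers are what make pixelwise prediction possible despite the downsampling in the network.

The authors point out an inherent tension in semantic segmentation between global (semantic) information and location information. These correspond to the features of the high and low layers of the network, which the authors vividly call "what" and "where". High-level features have been downsampled, so they represent class information better, whereas low-level features retain the fine details of the object. Semantic segmentation needs both encoded together; otherwise predictions along object boundaries become inaccurate. To address this, the authors design a skip architecture that combines the two kinds of information.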

2. Related Work

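Before FCN, semantic segmentation was mostly trained patch-wise (for example with a fine-tuned R-CNN system): selective search picks proposal regions of a certain size, a CNN extracts features from each proposal region, and the features are passed to a classifier. This pipeline is not end-to-end. The proposal region size is usually fixed a priori, which limits the receptive field of the model and makes it sensitive only to features at certain scales, and randomly chosen proposal regions may overlap heavily, wasting computation and storage.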

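3. Fully Convolutional Networks

Let $f$ denote a convolution or pooling operation, $x$ the input, $y$ the output, $k$ the kernel size, and $s$ the stride (subsampling factor). A layer computes

$y_{ij}=f_{ks}(\{x_{si+\delta i,\,sj+\delta j}\}_{0\le\delta i,\delta j\le k-1})$

and the following rule says that consecutive convolution or pooling layers can be merged into one equivalent step ($f$ may also be an elementwise nonlinear activation, but that plays no role in the downsampling):

$f_{ks}\circ g_{k's'}=(f\circ g)_{k'+(k-1)s',\,ss'}$

This rule also explains why a 5x5, stride-1 convolution can be replaced by two 3x3, stride-1 convolutions with the same receptive field: $3+(3-1)\cdot 1=5$ and $1\cdot 1=1$.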
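To make the composition rule concrete, here is a tiny sketch of my own (the helper name `compose` is hypothetical, not from the paper). It only tracks the kernel size and stride of the merged layer, using kernel size $k'+(k-1)s'$ and stride $ss'$, where $g$ with $(k', s')$ is applied first and $f$ with $(k, s)$ second.

```python
def compose(f, g):
    """Merge layer g (applied first) and layer f (applied second).

    Each layer is a (kernel_size, stride) pair. The result is the
    (kernel_size, stride) of the single equivalent layer:
    kernel k' + (k - 1) * s', stride s * s'.
    """
    k, s = f
    k_prime, s_prime = g
    return (k_prime + (k - 1) * s_prime, s * s_prime)


# Two stacked 3x3, stride-1 convolutions cover a 5x5 window with stride 1.
two_3x3 = compose((3, 1), (3, 1))
print(two_3x3)  # (5, 1)

# A 3x3, stride-1 convolution followed by a 2x2, stride-2 max-pool.
conv_pool = compose((2, 2), (3, 1))
print(conv_pool)  # (4, 2)
```

Chaining `compose` over all layers of a network gives the receptive field and the total stride of its last layer.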

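About the loss function:

The loss of an image is the sum of the losses over all spatial locations: $l(x;\theta)=\sum_{ij}l^{'}(x_{ij};\theta)$

Its gradient is therefore the sum of the gradients at each spatial location, so stochastic gradient descent on a whole image equals stochastic gradient descent on $l^{'}$ with all spatial locations of the image taken as one minibatch.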

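3.1 Adapting classifiers for dense prediction

A traditional classifier (fully connected layers) takes fixed-size input and produces non-spatial output, so fully connected layers are regarded as having a fixed size and discarding spatial information. However, a fully connected layer can also be viewed as a convolution whose kernel covers its entire input, which lets us convert between the two. A convolutional layer accepts input of any size and outputs classification maps. Replacing fully connected layers with convolutions is also more computationally efficient.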
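The equivalence between a fully connected layer and a convolution whose kernel covers the whole input can be checked numerically. The following is a minimal sketch with made-up small sizes (8 channels, a 7x7 feature map, 4 classes); these numbers are my assumptions for illustration, not the VGG-16 sizes.

```python
import torch
from torch import nn

torch.manual_seed(0)

channels, size, n_classes = 8, 7, 4  # hypothetical toy sizes

fc = nn.Linear(channels * size * size, n_classes)
conv = nn.Conv2d(channels, n_classes, kernel_size=size)

# Reuse the fc parameters as a convolution kernel that covers the whole input.
with torch.no_grad():
    conv.weight.copy_(fc.weight.view(n_classes, channels, size, size))
    conv.bias.copy_(fc.bias)

x = torch.randn(1, channels, size, size)
fc_out = fc(x.flatten(1))   # (1, n_classes), non-spatial
conv_out = conv(x)          # (1, n_classes, 1, 1), same numbers

# On a larger input the convolution slides and yields a classification map.
x_large = torch.randn(1, channels, 12, 9)
score_map = conv(x_large)   # (1, n_classes, 6, 3)
print(fc_out.shape, conv_out.shape, score_map.shape)
```

The same `view` trick is what the model code later in this article uses to turn fc6 and fc7 of VGG-16 into conv6 and conv7.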


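However, because of downsampling, the output classification maps (the coarse output) have smaller spatial dimensions than the original input.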

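3.2 Shift-and-stitch is filter rarefaction

To turn the coarse output of a fully convolutional network into dense predictions in the original input space, the authors discuss the trick of input shifting and output interlacing (though this is not the upsampling strategy they finally chose 😓)

Given a downsampling factor $f$ (the stride), shift the original input, starting from the top-left corner (0, 0), to the right and down by every offset in $[0, f-1]$ pixels. This yields $f^2$ inputs, which pass through the fully convolutional network to produce $f^2$ outputs. Interlacing these outputs gives an output with the spatial size of the original input, where each prediction corresponds to the pixel at the center of its receptive field. Shift-and-stitch is clearly different from traditional upsampling methods such as bilinear interpolation. However, it does not actually make use of the finer detail in the lower layers.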
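Here is a toy sketch of shift-and-stitch as I understand it. The "network" is a single unpadded 3x3 convolution with stride $f=2$ (my assumption, chosen so the result can be checked against the same filter run densely with stride 1), and the shift is done by cropping the input instead of padding it, which is a simplification.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

f = 2                              # downsampling factor (stride)
weight = torch.randn(1, 1, 3, 3)   # toy single-layer "network"
x = torch.randn(1, 1, 9, 9)


def net(inp, stride):
    """Toy fully convolutional net: one unpadded 3x3 convolution."""
    return F.conv2d(inp, weight, stride=stride)


# Reference: the same filter evaluated densely (stride 1).
dense = net(x, stride=1)           # (1, 1, 7, 7)

# Shift-and-stitch: run the strided net on f * f shifted inputs
# and interlace the coarse outputs.
stitched = torch.zeros_like(dense)
for dy in range(f):
    for dx in range(f):
        shifted = x[:, :, dy:, dx:]              # shift by (dy, dx) via cropping
        coarse = net(shifted, stride=f)          # coarse output for this shift
        stitched[:, :, dy::f, dx::f] = coarse    # interlace into the dense grid

print(torch.allclose(stitched, dense, atol=1e-5))
```

The cost is visible in the loop: $f^2$ forward passes are needed to fill in one dense output.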

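The authors then consider another trick: decreasing the subsampling inside the network (equivalent in effect to upsampling the original image), which can likewise make the output dimensions match the input. However, the filters then have smaller receptive fields and computation takes longer.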

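3.3 Upsampling is backwards strided convolution

Inside a neural network, a natural way to upsample is to run a convolution backwards: upsampling by a factor $f$ simply swaps the forward and backward passes of a stride-$f$ convolution. The authors therefore upsample with deconvolution (transposed convolution) layers. The parameters of a transposed convolution layer are learnable, but in the authors' repository they are set to fixed values (the authors actually use bilinear interpolation).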
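The output size of a transposed convolution explains the cropping that appears in the model code further down. A small sketch, assuming PyTorch's `nn.ConvTranspose2d` with default dilation and no output padding (the helper `deconv_out_size` and the 2-channel layer are mine, for illustration only):

```python
import torch
from torch import nn


def deconv_out_size(size_in, kernel_size, stride, padding=0):
    """Output length of a transposed convolution (dilation 1, no output_padding)."""
    return (size_in - 1) * stride - 2 * padding + kernel_size


# A 32x upsampling layer with kernel_size = 2 * stride and no padding.
up = nn.ConvTranspose2d(2, 2, kernel_size=64, stride=32, bias=False)
x = torch.zeros(1, 2, 10, 10)
y = up(x)

print(y.shape[-1], deconv_out_size(10, 64, 32))   # both are 10 * 32 + 32
print(deconv_out_size(5, 4, 2, padding=1))        # exactly 2 * 5
```

Without padding, the output is `kernel_size - stride` pixels larger than `stride * size_in`, so the extra border has to be cropped away; with a suitable `padding` the layer can produce exactly `stride * size_in`.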

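3.4 Patchwise training is loss sampling

In stochastic optimization, the gradient computation is driven by the training distribution. Both patchwise training and fully convolutional training can be made to produce any distribution, although their relative efficiency depends on the overlap and the minibatch size. In general the latter is more efficient than the former (fewer batches).

Sampling in patchwise training can reduce class imbalance and mitigate spatial correlation. In fully convolutional training, class balance can be achieved by weighting the loss, and spatial correlation can be addressed by sampling the loss. However, the results in Section 4.3 of the paper show that sampling has no significant effect on the result (class imbalance matters little for FCN) and does not make convergence faster or better.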
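Both ideas, weighting the loss and sampling the loss, are easy to express on top of a per-pixel loss map. This is only a sketch of the idea; the class weights and the keep probability `p` are made-up values, not settings from the paper.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

n_classes = 21
logits = torch.randn(2, n_classes, 8, 8)          # (b, C, h, w) scores
target = torch.randint(0, n_classes, (2, 8, 8))   # (b, h, w) class indices

# Per-pixel losses before any reduction.
pixel_loss = F.cross_entropy(logits, target, reduction='none')   # (b, h, w)

# (1) Loss weighting: hypothetical weights that down-weight the background class.
class_weight = torch.ones(n_classes)
class_weight[0] = 0.1
weighted_loss = F.cross_entropy(logits, target, weight=class_weight)

# (2) Loss sampling: keep each pixel with probability p, drop the rest.
p = 0.5
keep = p > torch.rand_like(pixel_loss)
sampled_loss = pixel_loss[keep].sum() / keep.sum().clamp(min=1)

print(pixel_loss.shape, weighted_loss.item(), sampled_loss.item())
```

Averaging `pixel_loss` over all pixels gives back the ordinary whole-image loss, which is what the code in this article ends up using.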

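4. Segmentation Architecture

(Transfer learning + fine-tuning)

Take a pretrained classification network and cut it at its fully connected layers: the layers before them are used as they are, and the fully connected layers are converted to convolutional layers (except for GoogLeNet).

Replace the final output layer with a 1x1 convolution whose number of output channels equals the number of classes.

(The result of forward-propagating an input through the structure above is called the coarse output.)

Add deconvolution layers to the network for upsampling (in practice, fixed bilinear interpolation).

The authors propose a novel skip architecture that combines high-level semantic information with low-level detail.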


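FCN-32s: upsample the coarse output 32x directly with deconv (bilinear interpolation).
FCN-16s: upsample the coarse output 2x with deconv (bilinear interpolation); apply a 1x1 convolution to the pool4 output so that its number of channels equals the number of classes (an extra predictor); add the two results (call the sum "coarse output 2x" for convenience) and upsample it 16x with deconv (bilinear interpolation).
FCN-8s: upsample coarse output 2x by another 2x with deconv (bilinear interpolation); apply a 1x1 convolution to the pool3 output so that its number of channels equals the number of classes (an extra predictor); add the two results and upsample the sum 8x with deconv (bilinear interpolation).

When the skip architecture is extended to even lower layers, the model hits diminishing returns: metrics such as mean IoU no longer improve noticeably, so the skip architecture stops at 8s.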

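Experimental framework

Optimizer: SGD with momentum 0.9 and weight decay $5\times10^{-4}$ or $2\times10^{-4}$ (training is not sensitive to these parameters, but it is sensitive to the learning rate); learning rates of $10^{-3}$, $10^{-4}$, and $5\times10^{-5}$ for FCN-AlexNet, FCN-VGG16, and FCN-GoogLeNet respectively; Dropout is used on the convolutional layers converted from the original classifier.
Fine-tuning: takes a long time (fine-tuning FCN-32s took the authors 3 days...); FCN-16s and FCN-8s are then fine-tuned starting from FCN-32s.
Patch sampling: training on whole images works about as well as sampling patches and needs less time to converge, so whole images are used directly.
Class balancing: the imbalance between positive and negative classes (background is the negative class) has no significant effect on training, so the authors compute the loss over all pixels without hard negative mining.
Dense prediction: upsampling uses deconv (bilinear interpolation); the other tricks from Section 3 are not used.
Data augmentation: random mirroring and jittering the images by translating them up to 32 pixels (the coarsest scale of prediction) gave no noticeable improvement.
More training data: better results.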
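In PyTorch, the optimizer settings listed above could be written roughly as follows. This is a sketch: the tiny `model` is a stand-in I made up so the snippet runs on its own, and the learning rate is the value quoted for the VGG-16 variant.

```python
import torch
from torch import nn

# Stand-in for an FCN: one trainable scoring layer plus one frozen
# (bilinear-style) upsampling layer. Hypothetical, for illustration only.
model = nn.Sequential(
    nn.Conv2d(3, 21, kernel_size=1),
    nn.ConvTranspose2d(21, 21, kernel_size=4, stride=2, bias=False),
)
model[1].weight.requires_grad = False   # fixed upsampling weights

# Only pass the parameters that are still trainable to the optimizer.
trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.SGD(trainable, lr=1e-4, momentum=0.9, weight_decay=5e-4)

print(len(trainable), optimizer.defaults['lr'])
```

Filtering on `requires_grad` keeps the frozen deconv weights out of the weight-decay update.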

5. My code

I use the PASCAL VOC2012 dataset and train on its predefined trainval split.

Label every segmentation image: each pixel is assigned its class index (0-20, where 0 is background).

import numpy as np
import torch

# RGB value of each colour and the class it labels
VOC_COLORMAP = [[0, 0, 0], [128, 0, 0], [0, 128, 0], [128, 128, 0],
                [0, 0, 128], [128, 0, 128], [0, 128, 128], [128, 128, 128],
                [64, 0, 0], [192, 0, 0], [64, 128, 0], [192, 128, 0],
                [64, 0, 128], [192, 0, 128], [64, 128, 128], [192, 128, 128],
                [0, 64, 0], [128, 64, 0], [0, 192, 0], [128, 192, 0],
                [0, 64, 128]]

VOC_CLASSES = ['background', 'aeroplane', 'bicycle', 'bird', 'boat',
               'bottle', 'bus', 'car', 'cat', 'chair', 'cow',
               'diningtable', 'dog', 'horse', 'motorbike', 'person',
               'potted plant', 'sheep', 'sofa', 'train', 'tv/monitor']

# CLASSES_LABEL = {k: v for k, v in zip(VOC_COLORMAP, VOC_CLASSES)}

# Assign a class index to every (R, G, B) combination
colormap2label = torch.zeros((256, 256, 256), dtype=torch.long)
for i, color in enumerate(VOC_COLORMAP):
    colormap2label[color[0], color[1], color[2]] = i


def get_pixel_label(segmentation_image):
    """
    Assign a class label to every pixel of a segmentation label image.
    :param segmentation_image: the label image, a PIL image
    :return: a tensor of (image.height, image.width) holding the class label of each pixel
    """
    cmap = np.array(segmentation_image.convert('RGB'), dtype=np.uint8)
    # Convert to int64 first: uint8 index tensors would be treated as masks.
    idx = torch.from_numpy(cmap.astype(np.int64))
    # Look up all pixels at once; the result already has shape (height, width).
    return colormap2label[idx[:, :, 0], idx[:, :, 1], idx[:, :, 2]]
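To see what the colour lookup does, here is a tiny worked example of my own. It uses a flat NumPy table indexed by `(R * 256 + G) * 256 + B` instead of the 3-D tensor above (a variation for illustration, not the code used in this article) and only the first three VOC colours.

```python
import numpy as np

# First three VOC colours: background, aeroplane, bicycle.
COLORS = np.array([[0, 0, 0], [128, 0, 0], [0, 128, 0]])


def encode(rgb):
    """Pack the last axis (R, G, B) into one integer per pixel."""
    rgb = rgb.astype(np.int64)
    return (rgb[..., 0] * 256 + rgb[..., 1]) * 256 + rgb[..., 2]


# Flat lookup table: packed colour -> class index (unknown colours map to 0).
lut = np.zeros(256 ** 3, dtype=np.uint8)
lut[encode(COLORS)] = np.arange(len(COLORS))

# A 2x2 label image given as RGB values.
image = np.array([[[0, 0, 0], [128, 0, 0]],
                  [[0, 128, 0], [128, 0, 0]]], dtype=np.uint8)
labels = lut[encode(image)]
print(labels.tolist())  # [[0, 1], [2, 1]]
```

Either way, any colour that is not in the colour map silently falls back to class 0, which is worth keeping in mind when checking labels.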

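Network architecture

Only FCN32s and FCN8s are listed here, both built on the pretrained VGG-16 model (note that the deconv weights are initialized to bilinear interpolation and are not trained further).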

import torch
from torch import nn
import torchvision
import numpy as np

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')


def get_bilinear_weights(in_channels, out_channels, kernel_size):
    """
    Build the weights of a bilinear-interpolation upsampling kernel.
    :param in_channels: number of input channels of the transposed convolution
    :param out_channels: number of output channels of the transposed convolution
    :param kernel_size: kernel size of the transposed convolution
    :return: the weights, a tensor of shape (in_channels, out_channels, kernel_size, kernel_size)
    """
    factor = (kernel_size + 1) // 2
    if kernel_size % 2 == 1:
        center = factor - 1  # arrays are 0-indexed, hence the -1
    else:
        center = factor - 0.5  # center = factor + 0.5 - 1
    og = np.ogrid[:kernel_size, :kernel_size]
    filt = (1 - abs(og[0] - center) / factor) * (1 - abs(og[1] - center) / factor)
    weight = np.zeros((in_channels, out_channels, kernel_size, kernel_size), dtype=np.float32)
    # Fill only the diagonal kernels (channel i -> channel i); assumes in_channels == out_channels
    weight[range(in_channels), range(out_channels), :, :] = filt
    return torch.from_numpy(weight)


class FCN32s(nn.Module):
    def __init__(self, n_classes):
        super(FCN32s, self).__init__()

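        # Use the pretrained VGG-16 directly: drop the classifier and convert the fc layers to convolutions.
        # fc6 becomes conv6 with a 7x7 kernel, so the output of that layer loses 6 pixels in each dimension;
        # after 32x upsampling that is a loss of 192 pixels in the input space, so inputs smaller than 192x192 would raise an error.
        # These lost pixels must be made up by padding so that the upsampled size matches the original input size.
        # Any padding value in (96, 112) would actually achieve this.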

        self.conv1_1 = nn.Conv2d(3, 64, kernel_size=3, padding=100)
        self.conv1_2 = nn.Conv2d(64, 64, kernel_size=3, padding=1)
        self.pool1 = nn.MaxPool2d(2, 2)

        self.conv2_1 = nn.Conv2d(64, 128, kernel_size=3, padding=1)
        self.conv2_2 = nn.Conv2d(128, 128, kernel_size=3, padding=1)
        self.pool2 = nn.MaxPool2d(2, 2)

        self.conv3_1 = nn.Conv2d(128, 256, kernel_size=3, padding=1)
        self.conv3_2 = nn.Conv2d(256, 256, kernel_size=3, padding=1)
        self.conv3_3 = nn.Conv2d(256, 256, kernel_size=3, padding=1)
        self.pool3 = nn.MaxPool2d(2, 2)

        self.conv4_1 = nn.Conv2d(256, 512, kernel_size=3, padding=1)
        self.conv4_2 = nn.Conv2d(512, 512, kernel_size=3, padding=1)
        self.conv4_3 = nn.Conv2d(512, 512, kernel_size=3, padding=1)
        self.pool4 = nn.MaxPool2d(2, 2)

        self.conv5_1 = nn.Conv2d(512, 512, kernel_size=3, padding=1)
        self.conv5_2 = nn.Conv2d(512, 512, kernel_size=3, padding=1)
        self.conv5_3 = nn.Conv2d(512, 512, kernel_size=3, padding=1)
        self.pool5 = nn.MaxPool2d(2, 2)

        self.conv6 = nn.Conv2d(512, 4096, kernel_size=7)
        self.dropout6 = nn.Dropout2d()

        self.conv7 = nn.Conv2d(4096, 4096, kernel_size=1)
        self.dropout7 = nn.Dropout2d()

        self.load_pretrained_layers()

        self.score = nn.Conv2d(4096, n_classes, 1)

        # I believe this kernel_size is the authors' own choice; by default it is twice the stride
        self.upsample = nn.ConvTranspose2d(n_classes, n_classes, kernel_size=64, stride=32, bias=False)

        self.upsample.weight.data = get_bilinear_weights(n_classes, n_classes, kernel_size=64)
        self.upsample.weight.requires_grad = False

    def forward(self, x):
        # We assume the height and width of the input image are both divisible by 32
        out = torch.relu(self.conv1_1(x))  # (b, 64, h+198, w+198)
        out = torch.relu(self.conv1_2(out))  # (b, 64, h+198, w+198)
        out = self.pool1(out)  # (b, 64, h/2 + 99, w/2 + 99)

        out = torch.relu(self.conv2_1(out))  # (b, 128, h/2+99, w/2+99)
        out = torch.relu(self.conv2_2(out))  # (b, 128, h/2+99, w/2+99)
        out = self.pool2(out)  # (b, 128, h/4 + 49, w/4 + 49)

        out = torch.relu(self.conv3_1(out))
        out = torch.relu(self.conv3_2(out))
        out = torch.relu(self.conv3_3(out))
        out = self.pool3(out)  # (b, 256, h/8 + 24, w/8 + 24)

        out = torch.relu(self.conv4_1(out))
        out = torch.relu(self.conv4_2(out))
        out = torch.relu(self.conv4_3(out))
        out = self.pool4(out)  # (b, 512, h/16 + 12, w/16 + 12)

        out = torch.relu(self.conv5_1(out))
        out = torch.relu(self.conv5_2(out))
        out = torch.relu(self.conv5_3(out))
        out = self.pool5(out)  # (b, 512, h/32 + 6, w/32 + 6)

        out = torch.relu(self.conv6(out))  # (b, 4096, h/32, w/32)
        out = self.dropout6(out)

        out = torch.relu(self.conv7(out))
        out = self.dropout7(out)

        out = self.score(out)

        # Because of the transposed-convolution kernel size, the 32x upsampled output is (kernel_size - stride) larger than the original size
        out = self.upsample(out)  # (b, n_classes, h+32, w+32)

        return out[:, :, 16:16 + x.shape[2], 16:16 + x.shape[3]].contiguous()

    def load_pretrained_layers(self):
        state_dict = self.state_dict()
        param_names = list(state_dict.keys())

        pretrained_state_dict = torchvision.models.vgg16(pretrained=True).state_dict()
        pretrained_param_names = list(pretrained_state_dict.keys())

        for i, param in enumerate(param_names[:-4]):
            state_dict[param] = pretrained_state_dict[pretrained_param_names[i]]

        state_dict['conv6.weight'] = pretrained_state_dict['classifier.0.weight'].view(4096, 512, 7, 7)
        state_dict['conv6.bias'] = pretrained_state_dict['classifier.0.bias']

        state_dict['conv7.weight'] = pretrained_state_dict['classifier.3.weight'].view(4096, 4096, 1, 1)
        state_dict['conv7.bias'] = pretrained_state_dict['classifier.3.bias']
        self.load_state_dict(state_dict)
    
class FCN8s(nn.Module):
    def __init__(self, n_classes):
        super(FCN8s, self).__init__()

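        # Use the pretrained VGG-16 directly: drop the classifier and convert the fc layers to convolutions.
        # fc6 becomes conv6 with a 7x7 kernel, so the output of that layer loses 6 pixels in each dimension;
        # after 32x upsampling that is a loss of 192 pixels in the input space, so inputs smaller than 192x192 would raise an error.
        # These lost pixels must be made up by padding so that the upsampled size matches the original input size.
        # Any padding value in (96, 112) would actually achieve this.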

        self.conv1_1 = nn.Conv2d(3, 64, kernel_size=3, padding=100)
        self.conv1_2 = nn.Conv2d(64, 64, kernel_size=3, padding=1)
        self.pool1 = nn.MaxPool2d(2, 2)

        self.conv2_1 = nn.Conv2d(64, 128, kernel_size=3, padding=1)
        self.conv2_2 = nn.Conv2d(128, 128, kernel_size=3, padding=1)
        self.pool2 = nn.MaxPool2d(2, 2)

        self.conv3_1 = nn.Conv2d(128, 256, kernel_size=3, padding=1)
        self.conv3_2 = nn.Conv2d(256, 256, kernel_size=3, padding=1)
        self.conv3_3 = nn.Conv2d(256, 256, kernel_size=3, padding=1)
        self.pool3 = nn.MaxPool2d(2, 2)

        self.conv4_1 = nn.Conv2d(256, 512, kernel_size=3, padding=1)
        self.conv4_2 = nn.Conv2d(512, 512, kernel_size=3, padding=1)
        self.conv4_3 = nn.Conv2d(512, 512, kernel_size=3, padding=1)
        self.pool4 = nn.MaxPool2d(2, 2)

        self.conv5_1 = nn.Conv2d(512, 512, kernel_size=3, padding=1)
        self.conv5_2 = nn.Conv2d(512, 512, kernel_size=3, padding=1)
        self.conv5_3 = nn.Conv2d(512, 512, kernel_size=3, padding=1)
        self.pool5 = nn.MaxPool2d(2, 2)

        self.conv6 = nn.Conv2d(512, 4096, kernel_size=7)
        self.dropout6 = nn.Dropout2d()

        self.conv7 = nn.Conv2d(4096, 4096, kernel_size=1)
        self.dropout7 = nn.Dropout2d()

        self.load_pretrained_layers()

        self.score = nn.Conv2d(4096, n_classes, 1)
        self.score_pool4 = nn.Conv2d(512, n_classes, 1)
        self.score_pool3 = nn.Conv2d(256, n_classes, 1)

        # I believe this kernel_size is the authors' own choice; by default it is twice the stride
        self.upsample_2x = nn.ConvTranspose2d(n_classes, n_classes, kernel_size=4, stride=2, bias=False)
        self.upsample_8x = nn.ConvTranspose2d(n_classes, n_classes, kernel_size=16, stride=8, bias=False)

        self.upsample_2x.weight.data = get_bilinear_weights(n_classes, n_classes, kernel_size=4)
        self.upsample_2x.weight.requires_grad = False
        self.upsample_8x.weight.data = get_bilinear_weights(n_classes, n_classes, kernel_size=16)
        self.upsample_8x.weight.requires_grad = False

    def forward(self, x):
        # We assume the height and width of the input image are both divisible by 32
        out = torch.relu(self.conv1_1(x))  # (b, 64, h+198, w+198)
        out = torch.relu(self.conv1_2(out))  # (b, 64, h+198, w+198)
        out = self.pool1(out)  # (b, 64, h/2 + 99, w/2 + 99)

        out = torch.relu(self.conv2_1(out))  # (b, 128, h/2+99, w/2+99)
        out = torch.relu(self.conv2_2(out))  # (b, 128, h/2+99, w/2+99)
        out = self.pool2(out)  # (b, 128, h/4 + 49, w/4 + 49)

        out = torch.relu(self.conv3_1(out))
        out = torch.relu(self.conv3_2(out))
        out = torch.relu(self.conv3_3(out))
        out = self.pool3(out)  # (b, 256, h/8 + 24, w/8 + 24)
        pool3 = out

        out = torch.relu(self.conv4_1(out))
        out = torch.relu(self.conv4_2(out))
        out = torch.relu(self.conv4_3(out))
        out = self.pool4(out)  # (b, 512, h/16 + 12, w/16 + 12)
        pool4 = out

        out = torch.relu(self.conv5_1(out))
        out = torch.relu(self.conv5_2(out))
        out = torch.relu(self.conv5_3(out))
        out = self.pool5(out)  # (b, 512, h/32 + 6, w/32 + 6)

        out = torch.relu(self.conv6(out))  # (b, 4096, h/32, w/32)
        out = self.dropout6(out)

        out = torch.relu(self.conv7(out))
        out = self.dropout7(out)

        out = self.score(out)

        # Each transposed convolution makes its output (kernel_size - stride) larger than stride x its input size
        out = self.upsample_2x(out)  # (b, n_classes, h/16 + 2, w/16 + 2)
        pool4 = self.score_pool4(pool4)  # (b, n_classes, h/16 + 12, w/16 + 12)
        out = out + pool4[:, :, 5:5 + out.size(2), 5:5 + out.size(3)]  # (b, n_classes, h/16 + 2, w/16 + 2)

        out = self.upsample_2x(out)  # (b, n_classes, h/8 + 4 + 2, w/8 + 4 + 2)
        pool3 = self.score_pool3(pool3)  # (b, n_classes, h/8 + 24, w/8 + 24)
        out = out + pool3[:, :, 9:9 + out.size(2), 9:9 + out.size(3)]  # (b, n_classes, h/8 + 6, w/8 + 6)

        out = self.upsample_8x(out)  # (b, n_classes, h + 48 + 8, w + 48 + 8)

        return out[:, :, 28:28 + x.shape[2], 28:28 + x.shape[3]].contiguous()

    def load_pretrained_layers(self):
        state_dict = self.state_dict()
        param_names = list(state_dict.keys())

        pretrained_state_dict = torchvision.models.vgg16(pretrained=True).state_dict()
        pretrained_param_names = list(pretrained_state_dict.keys())

        for i, param in enumerate(param_names[:-4]):
            state_dict[param] = pretrained_state_dict[pretrained_param_names[i]]

        state_dict['conv6.weight'] = pretrained_state_dict['classifier.0.weight'].view(4096, 512, 7, 7)
        state_dict['conv6.bias'] = pretrained_state_dict['classifier.0.bias']

        state_dict['conv7.weight'] = pretrained_state_dict['classifier.3.weight'].view(4096, 4096, 1, 1)
        state_dict['conv7.bias'] = pretrained_state_dict['classifier.3.bias']
        self.load_state_dict(state_dict)

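Since the imbalance between positive and negative classes does not affect FCN (see Section 4), the pixel loss is computed with plain cross-entropy (note that this is the 2D version).

(Hard negative mining could also be used to speed up convergence; this simpler approach is used here for brevity.)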

class LossFunction(nn.Module):
    def __init__(self):
        super(LossFunction, self).__init__()
        self.loss = nn.NLLLoss()

    def forward(self, pred, target):
        # pred: (b, n_classes, h, w) raw scores; target: (b, h, w) class indices
        pred = nn.functional.log_softmax(pred, dim=1)
        loss = self.loss(pred, target)
        return loss
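As a sanity check on the "2D version": applying `log_softmax` over the channel dimension and then `nn.NLLLoss` gives the same value as `nn.CrossEntropyLoss` applied directly to the raw scores. A minimal sketch with random tensors (the shapes are arbitrary choices of mine):

```python
import torch
from torch import nn

torch.manual_seed(0)

pred = torch.randn(2, 21, 4, 4)            # (b, n_classes, h, w) raw scores
target = torch.randint(0, 21, (2, 4, 4))   # (b, h, w) class indices

# Two-step version, as in the LossFunction class.
two_step = nn.NLLLoss()(nn.functional.log_softmax(pred, dim=1), target)

# One-step version on the raw scores.
one_step = nn.CrossEntropyLoss()(pred, target)

print(torch.allclose(two_step, one_step))
```

Both average the loss over every pixel of every image in the batch by default.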

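I will not write out the Dataset and DataLoader construction or the train and valid functions in detail (they are much the same in every project 😓)

Notes:

When resizing during data augmentation, the interpolation method must be NEAREST rather than the default bilinear; otherwise pixels of the ground-truth label image get mislabeled, which causes problems.
Training takes patience; the authors trained their 32s model for 3 days.
About batch_size: if you choose not to resize, you can set batch_size to 1.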
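The NEAREST point is easy to demonstrate. The sketch below resizes a tiny label map with `torch.nn.functional.interpolate` as a stand-in for the image-resize call (an assumption on my part; the same effect appears with any bilinear resize). The label map contains only classes 0 and 15.

```python
import torch
import torch.nn.functional as F

# A 4x4 label map: left half is class 0, right half is class 15.
label = torch.zeros(1, 1, 4, 4)
label[..., 2:] = 15.0

nearest = F.interpolate(label, size=(8, 8), mode='nearest')
bilinear = F.interpolate(label, size=(8, 8), mode='bilinear', align_corners=False)

valid = {0.0, 15.0}
nearest_values = set(nearest.unique().tolist())
bilinear_values = set(bilinear.unique().tolist())

print(nearest_values)    # only the original class indices
print(bilinear_values)   # also contains blended values along the boundary
```

The blended values are not valid class indices at all, or worse, after rounding they land on an unrelated class, so every object boundary in the label image gets corrupted.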

For some evaluation metrics, see wkentaro/pytorch-fcn; the way it computes them is very clever.
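My guess is that the clever part is accumulating a confusion matrix with a single `np.bincount` call. The sketch below is written from scratch to show that idea (the function names are mine, and it is not copied from that repository), with mean IoU computed from the matrix.

```python
import numpy as np


def confusion_matrix(label_true, label_pred, n_classes):
    """Rows are true classes, columns are predicted classes."""
    mask = (label_true >= 0) & (n_classes > label_true)   # drop ignored labels
    index = n_classes * label_true[mask].astype(int) + label_pred[mask]
    return np.bincount(index, minlength=n_classes ** 2).reshape(n_classes, n_classes)


def mean_iou(hist):
    """Mean intersection over union, skipping classes that never occur."""
    intersection = np.diag(hist)
    union = hist.sum(axis=1) + hist.sum(axis=0) - intersection
    with np.errstate(divide='ignore', invalid='ignore'):
        iou = intersection / union
    return np.nanmean(iou)


true = np.array([[0, 0], [1, 1]])
pred = np.array([[0, 1], [1, 1]])
hist = confusion_matrix(true, pred, n_classes=2)
miou = mean_iou(hist)
print(hist.tolist(), miou)   # [[1, 1], [0, 2]] and (1/2 + 2/3) / 2
```

Summing `hist` over all validation images before calling `mean_iou` gives the dataset-level mIoU.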

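Results:

6. My problems

Judging from the segmentation results above, the output looks acceptable... but the metric values would not go up... Maybe it was my training time (I only trained for about a day, which may be the biggest problem; the ability to segment complex images needs improvement 😓), but mIoU only reached 0.28... and was hard to push any higher, which bothered me a lot (maybe it really does need 3 days of training 😭)

Update: I finally found out why mIoU would not go up.
The problem was actually silly: at the end of the model's load_pretrained_layers() I had forgotten to call self.load_state_dict(), so the pretrained parameters were never used and the network was effectively trained from scratch 😭
This one mistake made training extremely long and kept the output black for a long time. FCN32s is not very accurate, so it still takes a bit longer to converge, but nowhere near as slowly as training from scratch.
My heart is broken 😭

I thought about where the problem might be: the dataset may be too small, or some classes may be hard to recognize (the IoU of some classes is clearly worse), or the training data may be imbalanced or the annotations inaccurate... or the real ability of the FCN model may not be as good as imagined... One could try letting the network learn the deconv parameters, or rebuild the network in the encoder-decoder style (slower, but it should improve the prediction of details).

If you have time, train longer and see how it goes. From what I have seen, training on autonomous-driving datasets (Cityscapes) gives somewhat better results (the dataset has no background class).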

References

[1] Long, J., Shelhamer, E., & Darrell, T. (2015). Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 3431-3440).

[2] Dive into Deep Learning (《动手学深度学习》).

[3] wkentaro/pytorch-fcn.

Please credit the source when reposting.