Deep Learning Notes (5): Training Issues of Classification Networks, Part 1

In this part, we will formally set up a simple but powerful classification network to recognize the digits 0-9 in the MNIST dataset.

Yep, we will build a classification network and train it from scratch.

We will also introduce some techniques to improve the training performance of your model.

This part was designed and completed by Jiaxin Zhuang ( zhuangjx5@mail2.sysu.edu.cn ) and Feifei Xue ( xueff@mail2.sysu.edu.cn ). If you have any questions about this part, or think there are still things to improve, don't hesitate to email us or add us on WeChat.

Outline

  1. Outline
    1. Required modules (if you use your own computer, just pip install them!)
    2. Common Setup
  2. Classification network
    1. Short introduction of MNIST
    2. Define a feedforward neural network
  3. Training
    1. Define a model, loss function, metric, and data augmentation for the training data
    2. Pre-set hyper-parameters
    3. Initialize model parameters
    4. Repeat over a certain number of epochs
      1. Shuffle the whole training data
      2. For each mini-batch data
        1. load mini-batch data
        2. compute gradient of loss over parameters
        3. update parameters with gradient descent
    5. save model
  4. Training advanced
    1. l2_norm
    2. dropout
    3. batch_normalization
    4. data augmentation
  5. Visualization of training and validation phase
    1. Add tensorboardX to write summaries to TensorBoard
    2. Download the log file to your local machine
    3. Run TensorBoard on your PC and open http://localhost:6666 to browse it
  6. Gradient
    1. Gradient vanishing
    2. Gradient exploding
%load_ext autoreload
%autoreload 2

1. Setup

1.1 Required Modules

numpy: NumPy is the fundamental package for scientific computing in Python.

pytorch: End-to-end deep learning platform.

torchvision: This package consists of popular datasets, model architectures, and common image transformations for computer vision.

tensorflow: An open source machine learning framework.

tensorboard: A suite of visualization tools to make training easier to understand, debug, and optimize TensorFlow programs.

tensorboardX: Tensorboard for Pytorch.

matplotlib: It is a Python 2D plotting library which produces publication quality figures in a variety of hardcopy formats and interactive environments across platforms.

1.2 Common Setup

# Load all necessary modules here, for clarity
import torch
import numpy as np
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
# from torchvision.datasets import MNIST
import torchvision
from torchvision import transforms
from torch.optim import lr_scheduler
from tensorboardX import SummaryWriter
from collections import OrderedDict
import matplotlib.pyplot as plt
from tqdm import tqdm
# Whether to put data on the GPU, depending on whether a GPU is available
# cuda = torch.cuda.is_available()
# In case the default GPU does not have enough space, you can choose which device to use
# torch.cuda.set_device(device) # device: id

# Since the GPUs in the lab are not enough for everyone, we prefer CPU computation
cuda = torch.device('cpu')

2. Classification Model

We will define a simple feedforward neural network to classify MNIST.

2.1 Short introduction of MNIST

The MNIST database (Modified National Institute of Standards and Technology database) is a large database of handwritten digits that is commonly used for training various image processing systems.

The MNIST database contains 60,000 training images and 10,000 testing images, with roughly 6,000 training images and 1,000 test images per class.

Each image is a 28x28 grayscale image.

They look like the images below.

MnistExamples.png

2.2 Define A FeedForward Neural Network

We will define a feedforward neural network with 3 hidden layers.

Each hidden layer is followed by an activation function; we will try sigmoid and ReLU respectively.

For simplicity, each hidden layer has the same number of neurons.

In practice, however, different hidden layers often use different numbers of neurons.

2.2.1 Activation Function

There are many useful activation functions, and you can choose any of them. Usually we use ReLU as the activation function in our networks.

2.2.1.1 ReLU

Applies the rectified linear unit function element-wise

\begin{equation}
\mathrm{ReLU}(x) = \max(0, x)
\end{equation}

1_oePAhrm74RNnNEolprmTaQ.png

2.2.1.2 Sigmoid

Applies the element-wise function:

\begin{equation}
Sigmoid(x)=\frac{1}{1+e^{-x}}
\end{equation}

320px-Logistic-curve.svg.png
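As a quick sanity check (a minimal sketch, not part of the assignment), we can apply both activation functions element-wise to a small tensor and compare the outputs:

# Compare ReLU and Sigmoid on a few values (illustrative only)
x = torch.tensor([-2.0, -0.5, 0.0, 0.5, 2.0])
print(nn.ReLU()(x))     # tensor([0.0000, 0.0000, 0.0000, 0.5000, 2.0000])
print(nn.Sigmoid()(x))  # tensor([0.1192, 0.3775, 0.5000, 0.6225, 0.8808])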

2.2.2 Network's Input and output

Inputs: For every batch

[batchSize, channels, height, width] -> [B,C,H,W]

Outputs: prediction scores for each image, e.g. [0.001, 0.0034, ..., 0.3]

[batchSize, classes]

Network Structure

    Inputs                Linear/Function        Output
    [128, 1, 28, 28]   -> Linear(28*28, 100) -> [128, 100]  # first hidden layer
                       -> ReLU               -> [128, 100]  # ReLU activation function (or sigmoid)
                       -> Linear(100, 100)   -> [128, 100]  # second hidden layer
                       -> ReLU               -> [128, 100]  # ReLU activation function (or sigmoid)
                       -> Linear(100, 100)   -> [128, 100]  # third hidden layer
                       -> ReLU               -> [128, 100]  # ReLU activation function (or sigmoid)
                       -> Linear(100, 10)    -> [128, 10]   # classification layer
class FeedForwardNeuralNetwork(nn.Module):
    """
    Inputs                Linear/Function        Output
    [128, 1, 28, 28]   -> Linear(28*28, 100) -> [128, 100]  # first hidden lyaer
                       -> ReLU               -> [128, 100]  # relu activation function, may sigmoid
                       -> Linear(100, 100)   -> [128, 100]  # second hidden lyaer
                       -> ReLU               -> [128, 100]  # relu activation function, may sigmoid
                       -> Linear(100, 100)   -> [128, 100]  # third hidden lyaer
                       -> ReLU               -> [128, 100]  # relu activation function, may sigmoid
                       -> Linear(100, 10)    -> [128, 10]   # Classification Layer                                                          
   """
    def __init__(self, input_size, hidden_size, output_size, activation_function='RELU'):
        super(FeedForwardNeuralNetwork, self).__init__()
        self.use_dropout = False
        self.use_bn = False
        self.hidden1 = nn.Linear(input_size, hidden_size)  # Linear function 1: 784 --> 100 
        self.hidden2 = nn.Linear(hidden_size, hidden_size) # Linear function 2: 100 --> 100
        self.hidden3 = nn.Linear(hidden_size, hidden_size) # Linear function 3: 100 --> 100
        # Linear function 4 (readout): 100 --> 10
        self.classification_layer = nn.Linear(hidden_size, output_size)
        self.dropout = nn.Dropout(p=0.5) # Drop out with prob = 0.5
        self.hidden1_bn = nn.BatchNorm1d(hidden_size) # Batch Normalization 
        self.hidden2_bn = nn.BatchNorm1d(hidden_size)
        self.hidden3_bn = nn.BatchNorm1d(hidden_size)
        
        # Non-linearity
        if activation_function == 'SIGMOID':
            self.activation_function1 = nn.Sigmoid()
            self.activation_function2 = nn.Sigmoid()
            self.activation_function3 = nn.Sigmoid()
        elif activation_function == 'RELU':
            self.activation_function1 = nn.ReLU()
            self.activation_function2 = nn.ReLU()
            self.activation_function3 = nn.ReLU()
        
    def forward(self, x):
        """Defines the computation performed at every call.
           Should be overridden by all subclasses.
        Args:
            x: [batch_size, channel, height, width], input for network
        Returns:
            out: [batch_size, n_classes], output from network
        """
        
        x = x.view(x.size(0), -1) # flatten x in [128, 784]
        out = self.hidden1(x)
        out = self.activation_function1(out) # Non-linearity 1
        if self.use_bn == True:
            out = self.hidden1_bn(out)
        out = self.hidden2(out)
        out = self.activation_function2(out)
        if self.use_bn == True:
            out = self.hidden2_bn(out)
        out = self.hidden3(out)
        if self.use_bn == True:
            out = self.hidden3_bn(out)
        out = self.activation_function3(out)
        if self.use_dropout == True:
            out = self.dropout(out)
        out = self.classification_layer(out)
        return out
    
    def set_use_dropout(self, use_dropout):
        """Whether to use dropout. Auxiliary function for our exp, not necessary.
        Args:
            use_dropout: True, False
        """
        self.use_dropout = use_dropout
        
    def set_use_bn(self, use_bn):
        """Whether to use batch normalization. Auxiliary function for our exp, not necessary.
        Args:
            use_bn: True, False
        """
        self.use_bn = use_bn
        
    def get_grad(self):
        """Return average grad for hidden2, hidden3. Auxiliary function for our exp, not necessary.
        """
        hidden2_average_grad = np.mean(np.sqrt(np.square(self.hidden2.weight.grad.detach().numpy())))
        hidden3_average_grad = np.mean(np.sqrt(np.square(self.hidden3.weight.grad.detach().numpy())))
        return hidden2_average_grad, hidden3_average_grad
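As a quick shape check (a minimal sketch, not required), we can pass a random MNIST-sized batch through the network and confirm the output shape is [batchSize, classes]:

# Sanity check: a fake batch of 128 MNIST-sized images -> 10 class scores each
net = FeedForwardNeuralNetwork(input_size=28*28, hidden_size=100, output_size=10)
dummy = torch.randn(128, 1, 28, 28)  # [B, C, H, W]
print(net(dummy).shape)              # torch.Size([128, 10])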

3. Training

We will define the training function here. Additionally, hyper-parameters, the loss function, and the metric are included here too.

3.1 Pre-set hyper-parameters

We set the hyper-parameters as below.

The hyper-parameters include the following:

  • learning_rate: usually we start from a fairly large lr like 1e-1, 1e-2, or 1e-3, and decay it as training proceeds (a learning-rate scheduler sketch follows the hyper-parameter cell below).
  • n_epochs: the number of training epochs must be large enough for the model to converge. Usually we set a fairly large number for the first training run.
  • batch_size: usually, a bigger batch size means better GPU usage and fewer epochs to converge. Powers of 2 are commonly used, e.g. 2, 4, 8, 16, 32, 64, 128, 256.
### Hyper parameters

batch_size = 128 # batch size is 128
n_epochs = 5 # train for 5 epochs
learning_rate = 0.01 # learning rate is 0.01
input_size = 28*28 # input image has size 28x28
hidden_size = 100 # hidden neurons is 100 for each layer
output_size = 10 # classes of prediction
l2_norm = 0 # not to use l2 penalty
dropout = False # not to use
get_grad = False # not to obtain grad
# create a model object
model = FeedForwardNeuralNetwork(input_size=input_size, hidden_size=hidden_size, output_size=output_size)
# Cross entropy
loss_fn = torch.nn.CrossEntropyLoss()
# l2_norm can be done in SGD
optimizer = torch.optim.SGD(model.parameters(), lr=learning_rate, weight_decay=l2_norm) 
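The lr_scheduler module imported above is not used in the experiments below, but here is a hedged sketch of how the learning rate could be decayed as epochs go on. The step_size and gamma values are illustrative assumptions, and sketch_optimizer is a throwaway object so that the real optimizer above stays untouched:

# Sketch only: decay the learning rate by 10x every 2 epochs (assumed values)
sketch_optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
sketch_scheduler = lr_scheduler.StepLR(sketch_optimizer, step_size=2, gamma=0.1)
for epoch in range(5):
    # ... one epoch of training with sketch_optimizer would go here ...
    sketch_scheduler.step()
    print(epoch, sketch_optimizer.param_groups[0]['lr'])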

3.2 Initialize model parameters

PyTorch provides a default initialization (uniform initialization) for linear layers, but there are still other useful initialization methods; a Kaiming-uniform sketch follows the list of initializers below.

Read more about initialization from this link

    torch.nn.init.normal_
    torch.nn.init.uniform_
    torch.nn.init.constant_
    torch.nn.init.eye_
    torch.nn.init.xavier_uniform_
    torch.nn.init.xavier_normal_
    torch.nn.init.kaiming_uniform_
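For instance, a Kaiming (He) uniform reset for the ReLU layers could look like the sketch below; this is illustrative only and is not one of the initializers asked for in Problem 1.

# Sketch: Kaiming uniform for weights (suited to ReLU), zeros for biases
def weight_bias_reset_kaiming(model):
    for m in model.modules():
        if isinstance(m, nn.Linear):
            torch.nn.init.kaiming_uniform_(m.weight, a=0, mode='fan_in', nonlinearity='relu')
            torch.nn.init.constant_(m.bias, 0)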

3.2.1 Initialize normal parameters

def show_weight_bias(model):
    """Show some weights and bias distribution every layers in model. 
       !!YOU CAN READ THIS CODE LATER!! 
    """
    # Create a figure and a set of subplots
    fig, axs = plt.subplots(2,3, sharey=False, tight_layout=True)
    
    # weight and bias for every hidden layer
    h1_w = model.hidden1.weight.detach().numpy().flatten()
    h1_b = model.hidden1.bias.detach().numpy().flatten()
    h2_w = model.hidden2.weight.detach().numpy().flatten()
    h2_b = model.hidden2.bias.detach().numpy().flatten()
    h3_w = model.hidden3.weight.detach().numpy().flatten()
    h3_b = model.hidden3.bias.detach().numpy().flatten()
    
    axs[0,0].hist(h1_w)
    axs[0,1].hist(h2_w)
    axs[0,2].hist(h3_w)
    axs[1,0].hist(h1_b)
    axs[1,1].hist(h2_b)
    axs[1,2].hist(h3_b)
    
    # set title for every sub plots
    axs[0,0].set_title('hidden1_weight')
    axs[0,1].set_title('hidden2_weight')
    axs[0,2].set_title('hidden3_weight')
    axs[1,0].set_title('hidden1_bias')
    axs[1,1].set_title('hidden2_bias')
    axs[1,2].set_title('hidden3_bias')
# Show default initialization for every hidden layer by pytorch
# it's uniform distribution 
show_weight_bias(model)
image
# If you want to use another initialization method, you can use the code below
# and define your own initialization

def weight_bias_reset(model):
    """Custom initialization, you can use your favorable initialization method.
    """
    for m in model.modules():
        if isinstance(m, nn.Linear):
            # initialize linear layer with mean and std
            mean, std = 0, 0.1 
            
            # Initialization method
            torch.nn.init.normal_(m.weight, mean, std)
            torch.nn.init.normal_(m.bias, mean, std)
            
#             Another way to initialize
#             m.weight.data.normal_(mean, std)
#             m.bias.data.normal_(mean, std)
weight_bias_reset(model) # reset parameters for each hidden layer
show_weight_bias(model) # show weight and bias distribution, normal distribution now.
image

3.2.2 Problem 1: Other initialization methods

Initialize the weights using torch.nn.init.constant_, torch.nn.init.xavier_uniform_, and torch.nn.init.xavier_normal_. Initialize the model with each of these functions in turn, and show the parameter distribution of the model's hidden layers using show_weight_bias (there should be six cells here). About the '_X', 'X_' and '_X_' naming conventions in Python, view here.

# TODO
def weight_bias_reset_constant(model):
    """
    Constant initialization
    """ 
    for m in model.modules():
        if isinstance(m, nn.Linear):
            val = 0
            torch.nn.init.constant_(m.weight, val)
            torch.nn.init.constant_(m.bias, val)
# TODO
weight_bias_reset_constant(model) # reset parameters for each hidden layer
show_weight_bias(model) # show weight and bias distribution (constant now)
# Reset parameters and show their distribution
image
# TODO

def weight_bias_reset_xavier_uniform(model):
    """Xavier uniform for weights (gain=1); bias set to 0.
    """
    for m in model.modules():
        if isinstance(m, nn.Linear):
            torch.nn.init.xavier_uniform_(m.weight, gain=1)
            torch.nn.init.constant_(m.bias, 0)
# TODO
weight_bias_reset_xavier_uniform(model) # reset parameters for each hidden layer
show_weight_bias(model) # show weight and bias distribution (Xavier uniform now)
# Reset parameters and show their distribution
image
# TODO

def weight_bias_reset_xavier_normal(model):
    """Xavier normal for weights (gain=1); bias set to 0.
    """
    for m in model.modules():
        if isinstance(m, nn.Linear):
            torch.nn.init.xavier_normal_(m.weight, gain=1)
            torch.nn.init.constant_(m.bias, 0)
# TODO
weight_bias_reset_xavier_normal(model) # reset parameters for each hidden layer
show_weight_bias(model) # show weight and bias distribution (Xavier normal now)
# Reset parameters and show their distribution
image

3.3 Repeat over a certain number of epochs

  • Shuffle the whole training data
train_loader = torch.utils.data.DataLoader(train_dataset, batch_size=batch_size, shuffle=True, **kwargs)
  • For each mini-batch of data (a minimal one-step sketch follows this list)

    • load mini-batch data
    for batch_idx, (data, target) in enumerate(train_loader):
        ...

    • compute gradient of loss over parameters
     output = net(data) # make prediction
     loss = loss_fn(output, target)  # compute loss
     loss.backward() # compute gradient of loss over parameters

    • update parameters with gradient descent
    optimizer.step() # update parameters with gradient descent 
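Note that before calling loss.backward() you also need to clear the gradients left over from the previous step with optimizer.zero_grad(), as the train function below does. A minimal self-contained one-step sketch on a hypothetical fake batch:

# One training step on fake data (illustrative only)
net = FeedForwardNeuralNetwork(input_size=28*28, hidden_size=100, output_size=10)
opt = torch.optim.SGD(net.parameters(), lr=0.01)
data = torch.randn(4, 1, 28, 28)      # fake mini-batch
target = torch.randint(0, 10, (4,))   # fake labels
opt.zero_grad()                       # clear old gradients
loss = nn.CrossEntropyLoss()(net(data), target)
loss.backward()                       # compute gradient of loss over parameters
opt.step()                            # update parameters with gradient descent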
    

3.3.1 Shuffle the whole training data

Data loading.

Please pay attention to data augmentation (a sketch of an augmented training transform follows the list below).

Read more data augmentation methods from this link.

torchvision.transforms.RandomVerticalFlip
torchvision.transforms.RandomHorizontalFlip
...
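Below is a hedged sketch of what an augmented training transform could look like. The rotation angle is an assumed value, and note that horizontal or vertical flips would change the meaning of digits (e.g. 6 vs 9), so a small random rotation is used instead. The experiments in this notebook keep the plain transform without augmentation.

# Sketch only: light augmentation for MNIST (not used in the experiments below)
augmented_train_transform = transforms.Compose([
    transforms.RandomRotation(10),  # rotate by up to +/-10 degrees
    transforms.ToTensor(),
    transforms.Normalize((0.1307,), (0.3081,))
])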
# define preprocessing methods for training and evaluation data

train_transform = transforms.Compose([
    transforms.ToTensor(), # Convert a PIL Image or numpy.ndarray to tensor.
    # Normalize a tensor image with mean 0.1307 and standard deviation 0.3081
    transforms.Normalize((0.1307,), (0.3081,))
])

test_transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.1307,), (0.3081,))
])
# use MNIST provided by torchvision

# torchvision.datasets provide MNIST dataset for classification

train_dataset = torchvision.datasets.MNIST(root='./data', 
                            train=True, 
                            transform=train_transform,
                            download=True)

test_dataset = torchvision.datasets.MNIST(root='./data', 
                           train=False, 
                           transform=test_transform,
                           download=False)
# pay attention: train_dataset doesn't load any data yet
# it just defines how to fetch and preprocess the data
train_dataset
Dataset MNIST
    Number of datapoints: 60000
    Split: train
    Root Location: ./data
    Transforms (if any): Compose(
                             ToTensor()
                             Normalize(mean=(0.1307,), std=(0.3081,))
                         )
    Target Transforms (if any): None
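Indexing the dataset is what actually triggers loading and preprocessing of a single sample; each item is an (image, label) pair. A quick check (not required):

# Fetch one preprocessed sample; the image tensor has shape [1, 28, 28]
img, label = train_dataset[0]
print(img.shape, label)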
# Data loader. 

# Combines a dataset and a sampler, 
# and provides single- or multi-process iterators over the dataset.

train_loader = torch.utils.data.DataLoader(dataset=train_dataset, 
                                           batch_size=batch_size, 
                                           shuffle=True) # shuffle the training data every epoch

test_loader = torch.utils.data.DataLoader(dataset=test_dataset, 
                                          batch_size=batch_size, 
                                          shuffle=False)
# functions to show an image

def imshow(img):
    """show some imgs in datasets
        !!YOU CAN READ THIS CODE LATER!! """
    
    npimg = img.numpy() # convert tensor to numpy
    plt.imshow(np.transpose(npimg, (1, 2, 0))) # [channel, height, width] -> [height, width, channel]
    plt.show()
# get some random training images by batch

dataiter = iter(train_loader)
images, labels = next(dataiter) # get a batch of images

# show images
imshow(torchvision.utils.make_grid(images))
Clipping input data to the valid range for imshow with RGB data ([0..1] for floats or [0..255] for integers).
image

3.3.2 & 3.3.3 compute gradient of loss over parameters & update parameters with gradient descent

def train(train_loader, model, loss_fn, optimizer, get_grad=False):
    """train model using loss_fn and optimizer. When thid function is called, model trains for one epoch.
    Args:
        train_loader: train data
        model: prediction model
        loss_fn: loss function to judge the distance between target and outputs
        optimizer: optimize the loss function
        get_grad: True, False
    Returns:
        total_loss: loss
        average_grad2: average grad for hidden 2 in this epoch
        average_grad3: average grad for hidden 3 in this epoch
    """
    
    # set the module in training mode, affecting modules such as Dropout, BatchNorm, etc.
    model.train()
    
    total_loss = 0
    grad_2 = 0.0 # store sum of grads for the hidden 2 layer
    grad_3 = 0.0 # store sum of grads for the hidden 3 layer
    
    for batch_idx, (data, target) in enumerate(train_loader):
        optimizer.zero_grad() # clear gradients of all optimized torch.Tensors'
        outputs = model(data) # make predictions 
        loss = loss_fn(outputs, target) # compute loss 
        total_loss += loss.item() # accumulate every batch loss in an epoch
        loss.backward() # compute gradient of loss over parameters 
        
        if get_grad == True:
            g2, g3 = model.get_grad() # get grads for hidden 2 and 3 layers in this batch
            grad_2 += g2 # accumulate grad for hidden 2
            grad_3 += g3 # accumulate grad for hidden 3
            
        optimizer.step() # update parameters with gradient descent 
            
    average_loss = total_loss / len(train_loader) # average loss in this epoch
    average_grad2 = grad_2 / len(train_loader) # average grad for hidden 2 in this epoch
    average_grad3 = grad_3 / len(train_loader) # average grad for hidden 3 in this epoch
    
    return average_loss, average_grad2, average_grad3
def evaluate(loader, model, loss_fn):
    """test model's prediction performance on loader.  
    When thid function is called, model is evaluated.
    Args:
        loader: data for evaluation
        model: prediction model
        loss_fn: loss function to judge the distance between target and outputs
    Returns:
        total_loss
        accuracy
    """
    
    # context manager that disables gradient computation
    with torch.no_grad():
        
        # set the module in evaluation mode
        model.eval()
        
        correct = 0.0 # count the number of correct predictions
        total_loss = 0  # accumulate loss
        
        for batch_idx, (data, target) in enumerate(loader):
            outputs = model(data) # make predictions 
            # torch.max returns the maximum value of each row of the input tensor
            # along the given dimension dim; the second return value is the index
            # location of each maximum value found (argmax)
            _, predicted = torch.max(outputs, 1)
            # Detach: Returns a new Tensor, detached from the current graph.
            #The result will never require gradient.
            correct += (predicted == target).sum().detach().numpy()
            loss = loss_fn(outputs, target)  # compute loss 
            total_loss += loss.item() # accumulate every batch loss in a epoch
            
        accuracy = correct*100.0 / len(loader.dataset) # accuracy in an epoch
        
    return total_loss, accuracy

Define the function fit, which uses train and evaluate

def fit(train_loader, val_loader, model, loss_fn, optimizer, n_epochs, get_grad=False):
    """train and val model here, we use train_epoch to train model and 
    val_epoch to val model prediction performance
    Args: 
        train_loader: train data
        val_loader: validation data
        model: prediction model
        loss_fn: loss function to judge the distance between target and outputs
        optimizer: optimize the loss function
        n_epochs: training epochs
        get_grad: Whether to get grad of hidden2 layer and hidden3 layer
    Returns:
        train_accs: train accuracy of each epoch, a list
        train_losses: train loss of each epoch, a list
        val_losses: validation loss of each epoch, a list
        val_accs: validation accuracy of each epoch, a list
    """
    
    grad_2 = [] # save grad for hidden 2 every epoch
    grad_3 = [] # save grad for hidden 3 every epoch
    
    train_accs = [] # save train accuracy every epoch
    train_losses = [] # save train loss every epoch
    
    # addition
    val_accs = [] # save test accuracy every epoch
    val_losses = [] # save test loss every epoch
    
    for epoch in range(n_epochs): # train for n_epochs 
        # train model on training datasets, optimize loss function and update model parameters 
        train_loss, average_grad2, average_grad3 = train(train_loader, model, loss_fn, optimizer, get_grad)
        
        # evaluate model performance on train dataset
        _, train_accuracy = evaluate(train_loader, model, loss_fn)
        message = 'Epoch: {}/{}. Train set: Average loss: {:.4f}, Accuracy: {:.4f}'.format(epoch+1, \
                                                                n_epochs, train_loss, train_accuracy)
        print(message)
    
        # save train_losses, train_accuracy, grad
        train_accs.append(train_accuracy)
        train_losses.append(train_loss)
        grad_2.append(average_grad2)
        grad_3.append(average_grad3)
    
        # evaluate model performance on val dataset
        val_loss, val_accuracy = evaluate(val_loader, model, loss_fn)
        val_loss /= len(val_loader)
        message = 'Epoch: {}/{}. Validation set: Average loss: {:.4f}, Accuracy: {:.4f}'.format(epoch+1, \
                                                                n_epochs, val_loss, val_accuracy)
        
        # save test_losses, test_accuracy
        val_accs.append(val_accuracy)
        val_losses.append(val_loss)
        print(message)
        
        
    # Whether to get grad for showing
    if get_grad == True:
        fig, ax = plt.subplots() # add a set of subplots to this figure
        ax.plot(grad_2, label='Gradient for Hidden 2 Layer') # plot grad 2 
        ax.plot(grad_3, label='Gradient for Hidden 3 Layer') # plot grad 3 
        plt.ylim(top=0.004)
        # place a legend on axes
        legend = ax.legend(loc='best', shadow=True, fontsize='x-large')
    return train_accs, train_losses, val_losses, val_accs
def show_curve(ys_train, ys_test, title):
    """plot curlve for Loss and Accuacy
    
    !!YOU CAN READ THIS LATER, if you are interested
    
    Args:
        ys: loss or acc list
        title: Loss or Accuracy
    """
    x = np.array(range(len(ys_train)))
    y_train = np.array(ys_train)
    y_test = np.array(ys_test)
    plt.plot(x, y_train, label='train', c='b')
    plt.plot(x, y_test, label='test', c='r')
    plt.axis()
    plt.title('{} Curve:'.format(title))
    plt.xlabel('Epoch')
    plt.ylabel('{} Value'.format(title))
    plt.legend()
    plt.show()

3.3.3 Problem 2

Run the fit function and, based on the training-set accuracy at the end, answer whether the model has been trained to overfit. Use the provided show_curve function to plot how the loss and accuracy change during training.

Hints: Because Jupyter keeps variables in context, the model and the optimizer need to be re-declared. They can be redefined using the following code. Note that the default initialization is used here.

Running the cells below shows the corresponding curves for the training set and the evaluation on the test set.

image

Apparently, this model is not overfitting.

  • The final accuracy on the test (validation) set, about 92%, is relatively high, showing that the model trained on the training set generalizes well.
  • After some modifications, I plotted the training trends so that we can see the accuracy at each epoch. We can clearly see that the curves of the two sets do not diverge.
### Hyper parameters
batch_size = 128 # batch size is 128
n_epochs = 5 # train for 5 epochs
learning_rate = 0.01 # learning rate is 0.01
input_size = 28*28 # input image has size 28x28
hidden_size = 100 # hidden neurons is 100 for each layer
output_size = 10 # classes of prediction
l2_norm = 0 # not to use l2 penalty
dropout = False # not to use
get_grad = False # not to obtain grad

# declare a model
model = FeedForwardNeuralNetwork(input_size=input_size, hidden_size=hidden_size, output_size=output_size)
# Cross entropy
loss_fn = torch.nn.CrossEntropyLoss()
# l2_norm can be done in SGD
optimizer = torch.optim.SGD(model.parameters(), lr=learning_rate, weight_decay=l2_norm) 
train_accs, train_losses, test_losses, test_accs = fit(train_loader, test_loader, model, loss_fn, optimizer, n_epochs, get_grad)
Epoch: 1/5. Train set: Average loss: 1.8414, Accuracy: 77.0133
Epoch: 1/5. Validation set: Average loss: 0.8558, Accuracy: 77.3200
Epoch: 2/5. Train set: Average loss: 0.5834, Accuracy: 87.0433
Epoch: 2/5. Validation set: Average loss: 0.4298, Accuracy: 87.2900
Epoch: 3/5. Train set: Average loss: 0.3840, Accuracy: 89.7033
Epoch: 3/5. Validation set: Average loss: 0.3398, Accuracy: 89.6800
Epoch: 4/5. Train set: Average loss: 0.3219, Accuracy: 91.0183
Epoch: 4/5. Validation set: Average loss: 0.2970, Accuracy: 91.2300
Epoch: 5/5. Train set: Average loss: 0.2858, Accuracy: 92.0283
Epoch: 5/5. Validation set: Average loss: 0.2669, Accuracy: 92.0200
# TODO
show_curve(train_accs, test_accs, 'Accs')
show_curve(train_losses, test_losses, 'Losses')
image
image

3.3.4 Problem 3

Set n_epochs to 10 to observe whether the model overfits the training set, and use show_curve to draw the diagram. The learning rate can then be adjusted so that the model overfits the training set within 5 epochs. Choose an appropriate learning rate, train the model, and use show_curve to plot the curves to verify your learning rate.

Hints: Because Jupyter keeps variables in context, the model and the optimizer need to be re-declared. They can be redefined using the following code. Note that the default initialization is used here.

Although there is no direct link between the learning rate and overfitting, we can still observe overfitting under a certain lr. First, let's look at an example:

When lr=0.75~0.8, the model is overfitting.

image

test_losses increase while train_losses decrease, indicating that this model is overfitting.

Notice: Under the same settings, the model will not always overfit, so MNIST is not a very appropriate dataset for demonstrating overfitting. (The samples in it are pretty good!)

### n_epoch = 10
batch_size = 128 # batch size is 128
n_epochs = 10 # train for 10 epochs
learning_rate = 0.01 # learning rate is 0.01
input_size = 28*28 # input image has size 28x28
hidden_size = 100 # hidden neurons is 100 for each layer
output_size = 10 # classes of prediction
l2_norm = 0 # not to use l2 penalty
dropout = False # not to use
get_grad = False # not to obtain grad

# declare a model
model = FeedForwardNeuralNetwork(input_size=input_size, hidden_size=hidden_size, output_size=output_size)
# Cross entropy
loss_fn = torch.nn.CrossEntropyLoss()
# l2_norm can be done in SGD
optimizer = torch.optim.SGD(model.parameters(), lr=learning_rate, weight_decay=l2_norm) 
# TODO
train_accs, train_losses, test_losses, test_accs = fit(train_loader, test_loader, model, loss_fn, optimizer, n_epochs, get_grad)
Epoch: 1/10. Train set: Average loss: 1.8201, Accuracy: 78.0017
Epoch: 1/10. Validation set: Average loss: 0.8345, Accuracy: 79.0000
Epoch: 2/10. Train set: Average loss: 0.5614, Accuracy: 87.2917
Epoch: 2/10. Validation set: Average loss: 0.4105, Accuracy: 87.4200
Epoch: 3/10. Train set: Average loss: 0.3783, Accuracy: 89.4333
Epoch: 3/10. Validation set: Average loss: 0.3371, Accuracy: 89.7800
Epoch: 4/10. Train set: Average loss: 0.3224, Accuracy: 90.8267
Epoch: 4/10. Validation set: Average loss: 0.2963, Accuracy: 91.0500
Epoch: 5/10. Train set: Average loss: 0.2864, Accuracy: 91.8383
Epoch: 5/10. Validation set: Average loss: 0.2665, Accuracy: 92.0500
Epoch: 6/10. Train set: Average loss: 0.2590, Accuracy: 92.6567
Epoch: 6/10. Validation set: Average loss: 0.2432, Accuracy: 92.6200
Epoch: 7/10. Train set: Average loss: 0.2365, Accuracy: 93.3117
Epoch: 7/10. Validation set: Average loss: 0.2240, Accuracy: 93.2600
Epoch: 8/10. Train set: Average loss: 0.2174, Accuracy: 93.8033
Epoch: 8/10. Validation set: Average loss: 0.2082, Accuracy: 93.6900
Epoch: 9/10. Train set: Average loss: 0.2010, Accuracy: 94.3150
Epoch: 9/10. Validation set: Average loss: 0.1945, Accuracy: 94.0900
Epoch: 10/10. Train set: Average loss: 0.1866, Accuracy: 94.7133
Epoch: 10/10. Validation set: Average loss: 0.1826, Accuracy: 94.3700
# TODO
show_curve(train_accs, test_accs, 'Accs')
show_curve(train_losses, test_losses, 'Losses')
image
image
### To overfit

batch_size = 128 # batch size is 128
n_epochs = 5 # train for 5 epochs
#learning_rate = 0.01 # learning rate is 0.01
learning_rate = 0.75 # overfitting learning rate
input_size = 28*28 # input image has size 28x28
hidden_size = 100 # hidden neurons is 100 for each layer
output_size = 10 # classes of prediction
l2_norm = 0 # not to use l2 penalty
dropout = False # not to use
get_grad = False # not to obtain grad

# declare a model
model = FeedForwardNeuralNetwork(input_size=input_size, hidden_size=hidden_size, output_size=output_size)
# Cross entropy
loss_fn = torch.nn.CrossEntropyLoss()
# l2_norm can be done in SGD
optimizer = torch.optim.SGD(model.parameters(), lr=learning_rate, weight_decay=l2_norm) 

train_accs, train_losses, test_losses, test_accs = fit(train_loader, test_loader, model, loss_fn, optimizer, n_epochs, get_grad)
Epoch: 1/5. Train set: Average loss: 0.8179, Accuracy: 86.2683
Epoch: 1/5. Validation set: Average loss: 0.4780, Accuracy: 86.3000
Epoch: 2/5. Train set: Average loss: 0.2292, Accuracy: 94.1483
Epoch: 2/5. Validation set: Average loss: 0.2332, Accuracy: 93.5000
Epoch: 3/5. Train set: Average loss: 0.1527, Accuracy: 94.5600
Epoch: 3/5. Validation set: Average loss: 0.2268, Accuracy: 93.6900
Epoch: 4/5. Train set: Average loss: 0.1276, Accuracy: 95.8450
Epoch: 4/5. Validation set: Average loss: 0.1981, Accuracy: 94.8500
Epoch: 5/5. Train set: Average loss: 0.1082, Accuracy: 96.3633
Epoch: 5/5. Validation set: Average loss: 0.1864, Accuracy: 95.2300
# TODO
show_curve(train_accs, test_accs, 'Accs')
show_curve(train_losses, test_losses, 'Losses')
image
image