Abstract

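These are my study notes on the neural network part of Stanford's cs231n course: a review of what I learned and a summary of the mistakes I made while coding. The notes follow the order in which the assignment is implemented. All figures come from the course slides. About 90% of the TODO code is my own, and the rest is adapted from solutions shared by others online.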

neural network

Two-layer neural network

The assignment first asks for a simple two-layer neural network.

Forward pass:

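Things to watch out for:

The loss is computed with the softmax function, which requires exponentials and can easily overflow. To avoid this, subtract each sample's largest score from all of that sample's scores before exponentiating. Subtracting a per-sample constant changes neither the softmax output nor the gradient.
The per-sample maximum is computed as

a = np.max(scores, axis=1,keepdims=True)

keepdims=True is essential here: it keeps the result at shape (N, 1), so the subtraction broadcasts across each row. Without it the result has shape (N,), which does not broadcast correctly against the (N, C) scores.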
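The max-subtraction trick above can be sketched as a small standalone function. This is a minimal illustration rather than the assignment code: the name stable_softmax and the sample scores are assumptions made for the example.

```python
import numpy as np

def stable_softmax(scores):
    """Row-wise softmax of an (N, C) score matrix, shifted for numerical stability."""
    # keepdims=True gives shape (N, 1), so the subtraction broadcasts per row.
    row_max = np.max(scores, axis=1, keepdims=True)
    exp_scores = np.exp(scores - row_max)
    return exp_scores / np.sum(exp_scores, axis=1, keepdims=True)

# Illustrative scores (an assumption); np.exp(1000.0) alone would overflow to inf.
scores = np.array([[1000.0, 1001.0, 1002.0],
                   [-5.0, 0.0, 5.0]])
probs = stable_softmax(scores)

# Compare the shape of the per-row maximum with and without keepdims.
print(np.max(scores, axis=1).shape)                 # (2,)
print(np.max(scores, axis=1, keepdims=True).shape)  # (2, 1)
```

With shape (2,) the subtraction from a (2, 3) array raises a broadcasting error. When N happens to equal C it would run without error but subtract the wrong values, which is harder to notice.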

# Compute the forward pass
    scores = None
    #############################################################################
    # TODO: Perform the forward pass, computing the class scores for the input. #
    # Store the result in the scores variable, which should be an array of      #
    # shape (N, C).                                                             #
    #############################################################################
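    h1=X.dot(W1)+b1.T          # first affine layer, shape (N, H)
    h12=np.maximum(0,h1)       # ReLU output, kept for the backward pass
    h1=np.maximum(0,h1)        # h1 now also holds the ReLU output (same values as h12)
    scores=h1.dot(W2)+b2.T     # second affine layer, shape (N, C)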
       
    # If the targets are not given then jump out, we're done
    if y is None:
      return scores

    # Compute the loss
    loss = None
    #############################################################################
    # TODO: Finish the forward pass, and compute the loss. This should include  #
    # both the data loss and L2 regularization for W1 and W2. Store the result  #
    # in the variable loss, which should be a scalar. Use the Softmax           #
    # classifier loss. So that your results match ours, multiply the            #
    # regularization loss by 0.5                                                #
    #############################################################################
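    # To keep the exponentials from overflowing, subtract each sample's maximum
    # score from all of its scores before exponentiating; subtracting a constant
    # does not affect the gradient.
    a = np.max(scores, axis=1,keepdims=True)
    # keepdims is essential here; without it the subtraction cannot broadcast
    scores -= a
    a = np.exp(scores)

    c1 = np.sum(a, axis=1)               # softmax denominator per sample
    c2 = 1 / c1
    c3 = a[np.arange(a.shape[0]), y]     # softmax numerator (correct class)
    # Note: b1 is reused here for the correct-class probability, so from this
    # point on it no longer refers to the first-layer bias.
    b1 = c2 * c3
    L1 = np.log(b1)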

    loss=-np.sum(L1)
    #loss = -np.sum(np.log(a[np.arange(a.shape[0]), y] / np.sum(a, axis=1)))
    loss /= X.shape[0]
    loss += 0.5* reg * np.sum(W1 * W1)
    loss +=0.5* reg * np.sum(W2 * W2)

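Backward pass:

Backpropagation computes the gradients step by step, working backwards from the loss. The code uses the name db1 twice, which is a poor choice: the first db1 is the gradient of the correct-class probability (stored in b1 during the forward pass), and it is easy to confuse with the real gradient of the bias b1.
Things to watch out for:

When computing dc1, I took a few extra steps to turn dc1 from a 1-D vector into a two-dimensional (N, 1) matrix, again because a vector of shape (N,) does not broadcast across the rows of an (N, C) array. My approach feels clumsy; a shorter way is dc1[:, np.newaxis] or dc1.reshape(-1, 1).
Take special care when computing da: the gradient for a[y] has to be handled separately from the other elements.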

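Softmax formula (figure): L_i = -\log\left( \frac{e^{f_{y_i}}}{\sum_j e^{f_j}} \right), where a = e^{f} in the code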

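As the formula shows, a[y] is the numerator, and it also appears in the denominator, so the gradients from both paths must be added together.
The SVM computational graph below shows the same situation: W flows into two branches, so its gradient is the sum of the gradients from both branches, for the same reason as in softmax.

SVM computational graph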

# Add the two parts of the gradient together
da[:,np.arange(da.shape[1])]=dc1
da[np.arange(da.shape[0]),y]+=dc3
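To see that adding the two paths really gives the right answer, the chain-rule result can be compared with the well-known closed form (softmax probabilities minus one-hot labels, divided by N). The sketch below reuses the names a, c1, c3, da and dc1 from these notes; the random scores, the labels, and the name p are assumptions for illustration. It also uses dc1[:, np.newaxis] in place of the manual reshape.

```python
import numpy as np

rng = np.random.RandomState(0)
N, C = 4, 3
scores = rng.randn(N, C)        # illustrative scores (assumption)
y = np.array([0, 2, 1, 2])      # illustrative labels (assumption)

# Forward pass, mirroring the intermediate names used in the notes.
a = np.exp(scores - np.max(scores, axis=1, keepdims=True))
c1 = np.sum(a, axis=1)          # denominator, shape (N,)
c3 = a[np.arange(N), y]         # numerator a[y], shape (N,)
p = c3 / c1                     # probability of the correct class

# Backward pass for loss = -mean(log(p)).
dp = -1.0 / p / N
dc3 = dp / c1                   # path through the numerator
dc1 = dp * c3 * (-1.0 / c1**2)  # path through the denominator

da = np.zeros_like(a)
da += dc1[:, np.newaxis]        # every a[i, j] feeds the denominator
da[np.arange(N), y] += dc3      # a[i, y_i] also feeds the numerator
dscores_chain = a * da          # d(exp(s))/ds = exp(s)

# Closed form: (softmax probabilities - one-hot labels) / N.
probs = a / c1[:, np.newaxis]
dscores_closed = probs.copy()
dscores_closed[np.arange(N), y] -= 1.0
dscores_closed /= N

print(np.allclose(dscores_chain, dscores_closed))
```

If the second path were dropped, the two results would differ in the correct-class column only.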

Code:

# Backward pass: compute gradients
    grads = {}
    #############################################################################
    # TODO: Compute the backward pass, computing the derivatives of the weights #
    # and biases. Store the results in the grads dictionary. For example,       #
    # grads['W1'] should store the gradient on W1, and be a matrix of same size #
    #############################################################################
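    # Gradient of the loss with respect to the correct-class probability
    # (named b1 in the forward pass); this db1 is not the bias gradient.
    db1=-1/b1*1/X.shape[0]
    dc2=db1*c3
    dc3=db1*c2
    dc1=dc2*(-1/np.square(c1))
    # The next steps turn dc1 from a 1-D vector into an (N, 1) matrix so that
    # it broadcasts across the columns of da
    z1=np.zeros([dc1.shape[0],1])
    z1[:,0]=dc1
    dc1=z1

    da=np.zeros(a.shape)
    da[:,np.arange(da.shape[1])]=dc1
    # Note: both c1 and c3 depend on a[y], so the two gradients must be added
    da[np.arange(da.shape[0]),y]+=dc3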
    dscores=a*da
    dW2=h12.T.dot(dscores)
    dh12=dscores.dot(W2.T)
    dh1=dh12*(h1>0)
    dW1=X.T.dot(dh1)
    db2=np.sum(dscores.T,axis=1)
    db1=np.sum(dh1.T,axis=1)

    dW1+=reg*W1
    dW2+=reg*W2

    grads['W1']=dW1
    grads['W2']=dW2
    grads['b1']=db1
    grads['b2']=db2
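A common way to catch mistakes in a backward pass like the one above is to compare it with a numerical gradient. The sketch below is a compact re-implementation, not the assignment code: two_layer_loss and numerical_grad are hypothetical helper names, the data is random, and the closed-form softmax gradient replaces the step-by-step version.

```python
import numpy as np

def two_layer_loss(params, X, y, reg):
    """Softmax loss and gradients for affine -> ReLU -> affine (a compact sketch)."""
    W1, b1, W2, b2 = params['W1'], params['b1'], params['W2'], params['b2']
    N = X.shape[0]
    hidden = np.maximum(0, X.dot(W1) + b1)
    scores = hidden.dot(W2) + b2
    probs = np.exp(scores - np.max(scores, axis=1, keepdims=True))
    probs /= np.sum(probs, axis=1, keepdims=True)
    loss = -np.mean(np.log(probs[np.arange(N), y]))
    loss += 0.5 * reg * (np.sum(W1 * W1) + np.sum(W2 * W2))

    dscores = probs.copy()
    dscores[np.arange(N), y] -= 1.0
    dscores /= N
    dhidden = dscores.dot(W2.T) * (hidden > 0)   # ReLU passes gradient where it was active
    grads = {
        'W2': hidden.T.dot(dscores) + reg * W2,
        'b2': np.sum(dscores, axis=0),
        'W1': X.T.dot(dhidden) + reg * W1,
        'b1': np.sum(dhidden, axis=0),
    }
    return loss, grads

def numerical_grad(f, x, h=1e-5):
    """Centered-difference gradient of the scalar f() with respect to array x."""
    grad = np.zeros_like(x)
    for idx in np.ndindex(*x.shape):
        old = x[idx]
        x[idx] = old + h
        f_plus = f()
        x[idx] = old - h
        f_minus = f()
        x[idx] = old                     # restore the original value
        grad[idx] = (f_plus - f_minus) / (2 * h)
    return grad

# Small random problem (an assumption for illustration).
rng = np.random.RandomState(1)
X = rng.randn(5, 4)
y = np.array([0, 1, 2, 2, 1])
params = {'W1': 0.5 * rng.randn(4, 6), 'b1': 0.1 * rng.randn(6),
          'W2': 0.5 * rng.randn(6, 3), 'b2': 0.1 * rng.randn(3)}

_, grads = two_layer_loss(params, X, y, reg=0.05)
max_errors = {}
for name in params:
    num = numerical_grad(lambda: two_layer_loss(params, X, y, reg=0.05)[0], params[name])
    max_errors[name] = np.max(np.abs(num - grads[name]))
    print(name, max_errors[name])
```

If the analytic gradients are correct, every printed error should be tiny. A large error for one parameter points to the step of the backward pass that computes it.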

Training the network

Training uses mini-batch gradient descent, so a batch has to be sampled first.

inde = np.random.choice(X.shape[0], batch_size, replace=True)
X_batch = X[inde, :]
y_batch = y[inde]
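The sampling step can be tried on a tiny array to see what replace does. This is a sketch with made-up data: num_train, the toy X and y, and the seeded RandomState are assumptions for illustration.

```python
import numpy as np

rng = np.random.RandomState(0)
num_train, batch_size = 10, 4
X = np.arange(num_train * 3).reshape(num_train, 3)  # toy data: row i starts at 3 * i
y = np.arange(num_train)                            # toy labels: label i for row i

# With replacement: an int argument samples from range(num_train); indices may repeat.
idx = rng.choice(num_train, batch_size, replace=True)
X_batch, y_batch = X[idx], y[idx]

# Without replacement: every index in the batch is distinct.
idx_unique = rng.choice(num_train, batch_size, replace=False)
```

Indexing X and y with the same index array keeps each sample paired with its label.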

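Parameter update:

      # Update parameters with momentum
      # (v_w1, v_w2, v_b1 and v_b2 are assumed to be initialized to zero before the training loop)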

      mu=0.9
      v_w1=mu*v_w1-learning_rate*grads['W1']
      self.params['W1'] +=v_w1
      v_w2 = mu * v_w2 - learning_rate * grads['W2']
      self.params['W2'] += v_w2
      v_b1 = mu * v_b1 - learning_rate * grads['b1']
      self.params['b1'] += v_b1
      v_b2 = mu * v_b2 - learning_rate * grads['b2']
      self.params['b2'] += v_b2

      # Update parameters with SGD
      # self.params['W1'] +=-learning_rate*grads['W1']
      # self.params['W2'] += -learning_rate*grads['W2']
      # self.params['b1'] += -learning_rate*grads['b1']
      # self.params['b2'] +=-learning_rate*grads['b2']

I tried both momentum and SGD; more update rules are covered in detail in the multi-layer network section below.
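The two update rules can also be written once as loops over the parameter dictionary instead of four copies each. The sketch below is an illustration under assumptions: sgd_step and momentum_step are hypothetical helper names, and the quadratic toy objective stands in for the network loss.

```python
import numpy as np

def sgd_step(params, grads, learning_rate):
    """Vanilla SGD: move each parameter against its gradient."""
    for name in params:
        params[name] -= learning_rate * grads[name]

def momentum_step(params, grads, velocity, learning_rate, mu=0.9):
    """Momentum update: v = mu * v - lr * grad, then param += v."""
    for name in params:
        velocity[name] = mu * velocity[name] - learning_rate * grads[name]
        params[name] += velocity[name]

# Toy problem (an assumption): minimize 0.5 * ||w||^2, whose gradient is w itself.
params_sgd = {'w': np.array([1.0, -2.0])}
params_mom = {'w': np.array([1.0, -2.0])}
# The velocity has one entry per parameter and starts at zero.
velocity = {name: np.zeros_like(p) for name, p in params_mom.items()}

for _ in range(200):
    sgd_step(params_sgd, {'w': params_sgd['w'].copy()}, learning_rate=0.1)
    momentum_step(params_mom, {'w': params_mom['w'].copy()}, velocity, learning_rate=0.1)

print(np.linalg.norm(params_sgd['w']), np.linalg.norm(params_mom['w']))
```

Both runs shrink w towards the minimum at zero. The velocity dictionary has to persist between iterations, which is why it is created outside the loop.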

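Everything above is the code we have to implement in the assignment's Neural_net.py file; the complete code is posted at the end.

Tuning hyperparameters

Once every part above is done, the network can be trained on the data. Below is the IPython code that sweeps the hyperparameters and picks the best ones: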

best_net = None # store the best model into this   
#best parameters by Yan Wei:lr 0.000150 reg 0.040000 hs 100  val accuracy: 0.514000


#################################################################################
# TODO: Tune hyperparameters using the validation set. Store your best trained  #
# model in best_net.                                                            #
#                                                                               #
# To help debug your network, it may help to use visualizations similar to the  #
# ones we used above; these visualizations will have significant qualitative    #
# differences from the ones we saw above for the poorly tuned network.          #
#                                                                               #
# Tweaking hyperparameters by hand can be fun, but you might find it useful to  #
# write code to sweep through possible combinations of hyperparameters          #
# automatically like we did on the previous exercises.                          #
#################################################################################
pass
input_size = 32 * 32 * 3
hidden_size_2 = 50
num_classes = 10
results = {}
best_val = -1
best_stats=-1
learning_rates = [1.5e-4,2e-4,3e-4]
regularization_strengths = [0.02,0.03,0.04]
hidden_size_test=[100]

for lr in learning_rates:
    for reg in regularization_strengths:
        for hs in hidden_size_test:
            

            net2 = TwoLayerNet(input_size, hs, num_classes)

            # Train the network
            stats2 = net2.train(X_train, y_train, X_val, y_val,
                        num_iters=1000, batch_size=200,
                        learning_rate=lr, learning_rate_decay=0.95,
                        reg=reg, verbose=True)

            # Predict on the validation set
            val_acc2 = (net2.predict(X_val) == y_val).mean()
            print('Validation accuracy: %f' % val_acc2)
            print('lr: %f  reg: %f hs: %d' % (lr, reg, hs))
            if val_acc2>best_val:
                best_val=val_acc2
                best_net=net2
                best_stats=stats2
            results[(lr,reg,hs)]=val_acc2

for lr, reg,hs in sorted(results):
    val_accuracy = results[(lr, reg,hs)]
    print('lr %f reg %f hs %d  val accuracy: %f' % (
                lr, reg, hs, val_accuracy))

print('best validation accuracy achieved during cross-validation: %f' % best_val)

# Plot the loss function and train / validation accuracies of the best model
# (best_stats, not stats2, which only holds the last run of the sweep)
plt.subplot(2, 1, 1)
plt.plot(best_stats['loss_history'])
plt.title('Loss history')
plt.xlabel('Iteration')
plt.ylabel('Loss')

plt.subplot(2, 1, 2)
plt.plot(best_stats['train_acc_history'], label='train')
plt.plot(best_stats['val_acc_history'], label='val')
plt.legend(['train', 'val'])
plt.title('Classification accuracy history')
plt.xlabel('Epoch')
plt.ylabel('Classification accuracy')
plt.show()

Below are some screenshots of the results: