UFLDL is an early deep learning introduction written by Andrew Ng's team. Its rhythm of theory followed by exercises is excellent: every time I wanted to rush through the theory so I could get to coding, because the whole code framework is already laid out for you, with detailed comments, so we only have to implement a small amount of core code ourselves. Very easy to get started!
I can't find a Chinese translation of this part of the new version any more, -_-, so I'm writing it up now while it's fresh, otherwise I'll lose the feel for it!
Section 8 is: Convolutional Neural Network.
I consider this the hardest exercise in the whole tutorial, because it combines the earlier multilayer neural network with convolution and pooling. You not only have to write the forward propagation, but also the backpropagation through the convolutional network (the hardest part of this section), gradient checking, and so on. Just learning how backpropagation works in a convolutional network took me two days, never mind writing the code.
Before the exercise, there are two topics to cover:
Optimization: Stochastic Gradient Descent
Stochastic Gradient Descent (SGD) differs from Batch Gradient Descent in that it does not compute the gradient over the entire training set on every update; instead it computes the gradient on a single training example, as the following update rules show:
BGD: $\theta = \theta - \alpha \, \nabla_\theta \, \mathbb{E}[J(\theta)]$ (the expectation is taken over the full training set)
SGD: $\theta = \theta - \alpha \, \nabla_\theta \, J(\theta; x^{(i)}, y^{(i)})$
If you want more detail, you can read the relevant part of the CS229 notes here.
Compared with BGD, SGD in practice does not really compute the gradient and update the parameters on just a single training example either; it uses a mini-batch instead, typically of size 128, 256, or some other power of 2. Compared with a single example, this converges more stably and takes advantage of efficient vectorized computation. In deep learning, "SGD" by default means mini-batch SGD, which you will see in minFuncSGD.m.
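To make the mini-batch form of the update explicit (my own notation, not from the tutorial), for a mini-batch $B$ of size $b$:
$\theta = \theta - \frac{\alpha}{b} \sum_{i \in B} \nabla_\theta \, J(\theta; x^{(i)}, y^{(i)})$
In the code below, the $1/b$ factor is already folded into the cost and gradient returned by cnnCost.m, so the plain SGD update would just be theta = theta - alpha * grad.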
This section also introduces gradient descent with Momentum, which works a bit better than plain SGD. The update rule is:
$v = \gamma v + \alpha \, \nabla_\theta \, J(\theta; x^{(i)}, y^{(i)})$
$\theta = \theta - v$
where $\gamma$ is the momentum constant.
In the code this is only a line or two; here is minFuncSGD.m:
function [opttheta] = minFuncSGD(funObj,theta,data,labels,...
options)
% called from cnnTrain.m like this:
% opttheta = minFuncSGD(@(x,y,z) cnnCost(x,y,z,numClasses,filterDim,...
% numFilters,poolDim),theta,images,labels,options);
% Runs stochastic gradient descent with momentum to optimize the
% parameters for the given objective.
%
% Parameters:
% funObj - function handle which accepts as input theta,
% data, labels and returns cost and gradient w.r.t
% to theta.
% theta - unrolled parameter vector
% data - stores data in m x n x numExamples tensor
% labels - corresponding labels in numExamples x 1 vector
% options - struct to store specific options for optimization
%
% Returns:
% opttheta - optimized parameter vector
%
% Options (* required)
% epochs* - number of epochs through data
% alpha* - initial learning rate
% minibatch* - size of minibatch
% momentum - momentum constant, defaults to 0.9
%%======================================================================
%% Setup
assert(all(isfield(options,{'epochs','alpha','minibatch'})),...
'Some options not defined');
if ~isfield(options,'momentum')
options.momentum = 0.9;
end;
epochs = options.epochs;
alpha = options.alpha;
minibatch = options.minibatch;
m = length(labels); % training set size
% Setup for momentum
mom = 0.5;
momIncrease = 20;
velocity = zeros(size(theta));
%%======================================================================
%% SGD loop
it = 0; % iterations
for e = 1:epochs
% randomly permute indices of data for quick minibatch sampling
rp = randperm(m);
for s=1:minibatch:(m-minibatch+1) % step through the shuffled data one minibatch at a time
it = it + 1;
% increase momentum after momIncrease iterations
if it == momIncrease
mom = options.momentum;
end;
% get next randomly selected minibatch
mb_data = data(:,:,rp(s:s+minibatch-1));
mb_labels = labels(rp(s:s+minibatch-1));
% evaluate the objective function on the next minibatch
[cost grad] = funObj(theta,mb_data,mb_labels);
% Instructions: Add in the weighted velocity vector to the
% gradient evaluated above scaled by the learning rate.
% Then update the current weights theta according to the
% sgd update rule
%%% YOUR CODE HERE %%%
velocity = mom * velocity + alpha * grad;
theta = theta - velocity;
fprintf('Epoch %d: Cost on iteration %d is %f\n',e,it,cost);
end;
% anneal the learning rate by a factor of two after each epoch
alpha = alpha/2.0;
end;
opttheta = theta;
end
Convolutional Neural Network
The figure in the tutorial shows the architecture of the first layer of the convolutional neural network; see that section of the tutorial for the notation. The forward propagation here is very similar to the earlier Convolution and Pooling exercise, so we can simply call the cnnConvolve and cnnPool functions from before.
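For concreteness, with the default settings I used from cnnTrain.m (28x28 MNIST images, filterDim = 9, numFilters = 20, poolDim = 2), each convolved feature map is 28 - 9 + 1 = 20 pixels on a side, mean pooling shrinks it to 20 / 2 = 10, so the softmax layer sees hiddenSize = 10 * 10 * 20 = 2000 inputs per image.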
The hard part is that the backpropagation here is no longer the same as Back Propagation in a fully connected network. As mentioned before, if layer $l$ and layer $l+1$ are fully connected, the error propagation and gradient computation are:
$\delta^{(l)} = \left((W^{(l)})^T \delta^{(l+1)}\right) \bullet f'(z^{(l)}), \quad \nabla_{W^{(l)}} J = \delta^{(l+1)} (a^{(l)})^T, \quad \nabla_{b^{(l)}} J = \delta^{(l+1)}$
But if layer $l$ is a convolution and subsampling (pooling) layer, the error is propagated like this:
$\delta_k^{(l)} = \operatorname{upsample}\!\left((W_k^{(l)})^T \delta_k^{(l+1)}\right) \bullet f'\!\left(z_k^{(l)}\right)$
where $k$ indexes the filter, and for mean pooling the upsample operation distributes each pooled error uniformly over its pooling region. The gradient is computed like this:
$\nabla_{W_k^{(l)}} J(W,b;x,y) = \sum_{i=1}^{m} \left(a_i^{(l)}\right) \ast \operatorname{rot90}\!\left(\delta_k^{(l+1)}, 2\right), \quad \nabla_{b_k^{(l)}} J(W,b;x,y) = \sum_{a,b} \left(\delta_k^{(l+1)}\right)_{a,b}$
where $a^{(l)}$ is the input to layer $l$ ($a^{(1)}$ is the input image) and $\ast$ denotes the "valid" convolution.
This part is genuinely hard to understand on a first read. To really grasp what the tutorial is saying, I suggest reading my code directly; you can also refer to the following blog posts (a small standalone sketch of the two key operations follows this list):
1. This one uses figures to show that backpropagation in a convolutional network can itself be expressed as a convolution (it seems to be inaccessible from inside China).
2. This one is the most closely related and is worth a careful read, because the references at the end of its slides are exactly UFLDL; unfortunately the slides cannot be downloaded.
3. This one writes up the backpropagation of convolutional neural networks with notation quite close to UFLDL's, and the derivation is explained in great detail.
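Before the full cnnCost.m, here is a tiny standalone sketch of the two operations that make conv-layer backprop different. This is my own toy example (6x6 image, 3x3 filter, 2x2 mean pooling), not part of the starter code:
poolDim   = 2;                              % mean-pooling region (toy value)
img       = rand(6,6);                      % one toy "image"
deltaPool = [1 2; 3 4];                     % toy error at the pooled layer (2x2)
% upsample: each pooled error is spread evenly over its poolDim x poolDim
% region; the 1/poolDim^2 factor appears because mean pooling averaged
% poolDim^2 activations in the forward pass
deltaConv = (1/poolDim^2) * kron(deltaPool, ones(poolDim));   % 4x4
% (in cnnCost.m this is then multiplied elementwise by f'(z) = a.*(1-a))
% filter gradient: a 'valid' convolution of the input with the rotated error;
% rot90(...,2) undoes the kernel flip that conv2 performs internally
gradW = conv2(img, rot90(deltaConv, 2), 'valid');             % 3x3, same size as the filter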
Then here is my cnnCost.m code:
function [cost, grad, preds] = cnnCost(theta,images,labels,numClasses,...
filterDim,numFilters,poolDim,pred)
% Calculate cost and gradient for a single layer convolutional
% neural network followed by a softmax layer with cross entropy
% objective.
%
% Parameters:
% theta - unrolled parameter vector
% images - stores images in imageDim x imageDim x numImages
% array
% numClasses - number of classes to predict
% filterDim - dimension of convolutional filter
% numFilters - number of convolutional filters
% poolDim - dimension of pooling area
% pred - boolean only forward propagate and return
% predictions
%
%
% Returns:
% cost - cross entropy cost
% grad - gradient with respect to theta (if pred==False)
% preds - list of predictions for each example (if pred==True)
if ~exist('pred','var') % pred defaults to false
pred = false;
end;
imageDim = size(images,1); % height/width of image
numImages = size(images,3); % number of images
%% Reshape parameters and setup gradient matrices
% Wc is filterDim x filterDim x numFilters parameter matrix
% bc is the corresponding bias
% Wd is numClasses x hiddenSize parameter matrix where hiddenSize
% is the number of output units from the convolutional layer
% bd is corresponding bias
[Wc, Wd, bc, bd] = cnnParamsToStack(theta,imageDim,filterDim,numFilters,...
poolDim,numClasses);
% Same sizes as Wc,Wd,bc,bd. Used to hold gradient w.r.t above params.
Wc_grad = zeros(size(Wc));
Wd_grad = zeros(size(Wd));
bc_grad = zeros(size(bc));
bd_grad = zeros(size(bd));
%%======================================================================
%% STEP 1a: Forward Propagation
% In this step you will forward propagate the input through the
% convolutional and subsampling (mean pooling) layers. You will then use
% the responses from the convolution and pooling layer as the input to a
% standard softmax layer.
%% Convolutional Layer
% For each image and each filter, convolve the image with the filter, add
% the bias and apply the sigmoid nonlinearity. Then subsample the
% convolved activations with mean pooling. Store the results of the
% convolution in activations and the results of the pooling in
% activationsPooled. You will need to save the convolved activations for
% backpropagation.
convDim = imageDim-filterDim+1; % dimension of convolved output 20
outputDim = (convDim)/poolDim; % dimension of subsampled output 10
% convDim x convDim x numFilters x numImages tensor for storing activations
activations = zeros(convDim,convDim,numFilters,numImages);
% outputDim x outputDim x numFilters x numImages tensor for storing
% subsampled activations
activationsPooled = zeros(outputDim,outputDim,numFilters,numImages);
%%% YOUR CODE HERE %%%
activations = cnnConvolve(filterDim,numFilters,images,Wc,bc);
% pool
activationsPooled = cnnPool(poolDim,activations);
% Reshape activations into 2-d matrix, hiddenSize x numImages,
% for Softmax layer
activationsPooled = reshape(activationsPooled,[],numImages);
%% Softmax Layer
% Forward propagate the pooled activations calculated above into a
% standard softmax layer. For your convenience we have reshaped
% activationPooled into a hiddenSize x numImages matrix. Store the
% results in probs.
% numClasses x numImages for storing probability that each image belongs to
% each class.
probs = zeros(numClasses,numImages);
%%% YOUR CODE HERE %%%
out = Wd * activationsPooled;
out = bsxfun(@plus,out,bd);
% out = sigmoid(out); % when my gradient check failed earlier, this line was still active; the pooled activations must feed the softmax directly, with no extra sigmoid here
out = exp(out);
probs = bsxfun(@rdivide,out,sum(out));
preds = probs;
%%======================================================================
%% STEP 1b: Calculate Cost
% In this step you will use the labels given as input and the probs
% calculated above to evaluate the cross entropy objective. Store your
% results in cost.
cost = 0; % save objective into cost
%%% YOUR CODE HERE %%%
I = sub2ind(size(probs),labels',1:size(probs,2));
cost = (-1) * sum(log(probs(I)));
lambda = 0.0001;
weightDecayCost = (lambda/2) * (sum(Wd(:) .^ 2) + sum(Wc(:) .^ 2));
cost = cost / numImages + weightDecayCost;
% Makes predictions given probs and returns without backpropagating errors.
if pred
[~,preds] = max(probs,[],1);
preds = preds';
grad = 0;
return;
end;
%%======================================================================
%% STEP 1c: Backpropagation
% Backpropagate errors through the softmax and convolutional/subsampling
% layers. Store the errors for the next step to calculate the gradient.
% Backpropagating the error w.r.t the softmax layer is as usual. To
% backpropagate through the pooling layer, you will need to upsample the
% error with respect to the pooling layer for each filter and each image.
% Use the kron function and a matrix of ones to do this upsampling
% quickly.
%%% YOUR CODE HERE %%%
hAct = cell(3,1);
targets = zeros(size(probs)); % one-hot encoding of the labels
targets(I) = 1;
for l = 3:-1:2 % unlike the earlier exercise there is no ei.num_layer here, so 3 is hard-coded
if(l == 3)
hAct{l}.delta = -(targets - probs); % the output layer uses the softmax (cross-entropy) loss, so this differs from the squared-error case; everything else is the same
else
% hAct{l}.delta = (Wd'* hAct{l+1}.delta) .* (activationsPooled
% .*(1- activationsPooled)); % must NOT multiply by the activation derivative here (no sigmoid before the softmax)
hAct{l}.delta = (Wd'* hAct{l+1}.delta);
end
end
hAct{2}.delta = reshape(hAct{2}.delta,outputDim, outputDim, numFilters, numImages);
hAct{1}.delta = zeros(convDim, convDim, numFilters, numImages);
% upsample: error propagation into the convolutional layer works a bit differently
for imageNum = 1:numImages
for filterNum = 1:numFilters
e = hAct{2}.delta(:, :, filterNum, imageNum);
hAct{1}.delta(:, :, filterNum, imageNum) = (1/poolDim^2) * kron(e, ones(poolDim));
end
end
hAct{1}.delta = hAct{1}.delta .* activations .* (1 - activations);
%%======================================================================
%% STEP 1d: Gradient Calculation
% After backpropagating the errors above, we can use them to calculate the
% gradient with respect to all the parameters. The gradient w.r.t the
% softmax layer is calculated as usual. To calculate the gradient w.r.t.
% a filter in the convolutional layer, convolve the backpropagated error
% for that filter with each image and aggregate over images.
%%% YOUR CODE HERE %%%
Wd_grad = (1/numImages) * hAct{3}.delta * activationsPooled'+lambda * Wd;
bd_grad = (1/numImages).*sum(hAct{3}.delta, 2);
for filterNum = 1 : numFilters
for imageNum = 1 : numImages
Wc_grad(:, :, filterNum) = Wc_grad(:, :, filterNum) + conv2(images(:, :, imageNum), rot90(hAct{1}.delta(:, :, filterNum, imageNum), 2), 'valid');
end
Wc_grad(:, :, filterNum) = (1/numImages) * Wc_grad(:, :, filterNum);
end
Wc_grad = Wc_grad + lambda * Wc;
for filterNum = 1 : numFilters
e = hAct{1}.delta(:, :, filterNum, :);
bc_grad(filterNum) = (1/numImages) * sum(e(:));
end
%% Unroll gradient into grad vector for minFunc
grad = [Wc_grad(:) ; Wd_grad(:) ; bc_grad(:) ; bd_grad(:)];
end
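Before training on the full set, it is worth checking the analytic gradient numerically on a tiny configuration. The sketch below is roughly what the DEBUG block in cnnTrain.m does; the helper names (cnnInitParams, computeNumericalGradient) and the images/labels variables are assumed to come from the starter code, so treat this as a sketch rather than the exact script:
% gradient check on a tiny model so it runs fast
db_filterDim  = 9;  db_numFilters = 2;
db_poolDim    = 5;  db_numClasses = 10;
db_images = images(:,:,1:10);               % just 10 training images
db_labels = labels(1:10);
db_theta  = cnnInitParams(28,db_filterDim,db_numFilters,db_poolDim,db_numClasses);
[~, grad] = cnnCost(db_theta,db_images,db_labels,db_numClasses,...
                    db_filterDim,db_numFilters,db_poolDim);
numGrad = computeNumericalGradient(@(x) cnnCost(x,db_images,db_labels,...
                    db_numClasses,db_filterDim,db_numFilters,db_poolDim),db_theta);
% the relative difference should be on the order of 1e-9 if backprop is correct
disp(norm(numGrad - grad) / norm(numGrad + grad));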
Results
Result on an i5 6500U:
Result on an 8th-gen i7:
Apart from some difference in running time, the accuracy of the two runs is about the same.
Reference: https://blog.csdn.net/lingerlanlan/article/details/41390443
If I have misunderstood anything, please point it out; if you have better ideas, feel free to discuss in the comments below!