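This post covers using a softmax classifier for classification tasks. In many cases softmax and SVM classify about equally well; they differ mainly in their mathematical interpretation. Softmax is framed in terms of probabilities: its outputs are normalized class probabilities that sum to 1.
There is a good article that introduces softmax and logistic regression and how they relate. After reading it, it becomes clear that softmax differs from binary logistic regression only in how the probabilities are computed: http://blog.csdn.net/zhangliyao22/article/details/48379291
Another one: http://www.cnblogs.com/guyj/p/3800519.html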
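As a minimal sketch of the "outputs are probabilities" point, the snippet below maps a made-up score vector to softmax probabilities. The helper name `softmax` and the example scores are illustrative assumptions, not something taken from the assignment code.

```python
import numpy as np


def softmax(scores):
    """Map a 1-D array of raw class scores to probabilities (hypothetical helper)."""
    shifted = scores - np.max(scores)  # shifting by the max does not change the result
    exp_scores = np.exp(shifted)
    return exp_scores / np.sum(exp_scores)


example_scores = np.array([3.0, 1.0, 0.2])  # made-up scores for three classes
probs = softmax(example_scores)
print(probs)          # three values strictly between 0 and 1
print(np.sum(probs))  # sums to 1 up to floating-point rounding
```

The largest score gets the largest probability, but unlike a hard argmax every class keeps some probability mass.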
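To make the link between softmax and binary logistic regression concrete, here is a small sketch under the assumption of just two classes: the softmax probability of the first class equals the logistic sigmoid of the score difference. The helper names and the scores are made up for illustration.

```python
import numpy as np


def sigmoid(z):
    """Logistic function used by binary logistic regression."""
    return 1.0 / (1.0 + np.exp(-z))


def softmax(scores):
    """Softmax over a 1-D score vector (hypothetical helper)."""
    exp_scores = np.exp(scores - np.max(scores))
    return exp_scores / np.sum(exp_scores)


two_class_scores = np.array([2.0, -1.0])  # made-up scores for classes 0 and 1
p_softmax = softmax(two_class_scores)[0]
p_logistic = sigmoid(two_class_scores[0] - two_class_scores[1])
print(np.isclose(p_softmax, p_logistic))  # True
```

This follows from e^a / (e^a + e^b) = 1 / (1 + e^(b - a)), so with two classes only the score difference matters.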
loss function:
The score function f(x, W) = Wx itself does not change; only its interpretation changes to a probabilistic one, and the loss function changes accordingly.
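Personally I like to explain the meaning of softmax in terms of cross-entropy, H(p, q) = - sum_x p(x) * log q(x), where p is the true distribution and q is the predicted one.
To elaborate: for each data point the classifier computes a score for each of the n classes, and softmax turns those scores into the predicted distribution q. The "true" distribution p, however, is a one-hot vector [0,0,...0,1,0,...0,0]. With these two distributions it is easy to convert between the softmax loss and the cross-entropy formula. Since the true distribution is 0 for every class except yi, the final formula is Li = - ( 1 * log ( exp(score(yi)) / sum_j exp(score(j)) ) ). The loss approaches 0 as the probability assigned to the correct class approaches 1; a prediction that is merely correct but not confident still has a positive loss. 0 is the ideal minimum, and during training we minimize this cross-entropy cost function.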
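The reduction from the full cross-entropy sum to a single log term can be checked numerically. The sketch below uses made-up scores and hypothetical helper names (`softmax`, `cross_entropy`); it is only an illustration of the formula, not the assignment code.

```python
import numpy as np


def softmax(scores):
    """Softmax over a 1-D score vector (hypothetical helper)."""
    exp_scores = np.exp(scores - np.max(scores))
    return exp_scores / np.sum(exp_scores)


def cross_entropy(p_true, q_pred):
    """H(p, q) = -sum_x p(x) * log q(x)."""
    return -np.sum(p_true * np.log(q_pred))


scores = np.array([2.0, 1.0, -1.0])  # made-up scores for three classes
y_i = 0                              # assumed index of the correct class

q = softmax(scores)
p_true = np.zeros_like(q)
p_true[y_i] = 1.0                    # one-hot "true" distribution

full_loss = cross_entropy(p_true, q)  # sum over all classes
short_loss = -np.log(q[y_i])          # only the correct-class term survives
print(full_loss, short_loss)          # the two values agree

# A confident, correct prediction drives the loss toward 0.
confident_loss = -np.log(softmax(np.array([20.0, 0.0, 0.0]))[0])
# Equal scores give a uniform prediction and a loss of log(number of classes).
uniform_loss = -np.log(softmax(np.zeros(3))[0])
print(confident_loss, uniform_loss)
```

The uniform case is a handy sanity check: with small random initial weights the loss should start near log(C) for C classes.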
gradient
Here I relied on a very well written blog post: http://eli.thegreenplace.net/2016/the-softmax-function-and-its-derivative/
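Result: dSi/daj = Si * ((i == j) - Sj), where (i == j) is 1 if i equals j and 0 otherwise.
This is the derivative of the softmax function itself, that is, of the expression inside the log in the equivalent form of the loss function (without the minus sign), not of the loss after the log has been applied. Si and Sj are the scores computed by the matrix product Wx and then mapped by softmax to probabilities between 0 and 1. The index i refers to the class whose probability is being computed for this input, and j refers to the j-th raw value aj with respect to which the partial derivative is taken. This expression is not the final dL/dW; it is only dP/dA, where A holds the raw values produced by the matrix product Wx before they are mapped to probabilities, and P is the softmax output. The map from A to P is N -> N, so its Jacobian is an N by N matrix (N is the number of classes). By the chain rule, obtaining dL/dW additionally requires dL/dP and dA/dW.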
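The N by N Jacobian described above can be compared against finite differences. This is a minimal sketch with made-up inputs; `softmax_jacobian` and `numerical_jacobian` are hypothetical helper names, and the closed form used is dSi/daj = Si * ((i == j) - Sj) from the referenced post.

```python
import numpy as np


def softmax(a):
    """Softmax over a 1-D vector of raw scores (hypothetical helper)."""
    exp_a = np.exp(a - np.max(a))
    return exp_a / np.sum(exp_a)


def softmax_jacobian(a):
    """Analytic Jacobian: J[i, j] = S_i * ((i == j) - S_j)."""
    s = softmax(a)
    return np.diag(s) - np.outer(s, s)


def numerical_jacobian(f, a, h=1e-6):
    """Central-difference estimate of the Jacobian of f at a."""
    n = a.size
    jac = np.zeros((n, n))
    for j in range(n):
        step = np.zeros(n)
        step[j] = h
        jac[:, j] = (f(a + step) - f(a - step)) / (2 * h)
    return jac


a = np.array([1.0, 2.0, 0.5, -1.0])  # made-up raw scores for N = 4 classes
analytic = softmax_jacobian(a)
numeric = numerical_jacobian(softmax, a)
print(analytic.shape)                      # (4, 4)
print(np.max(np.abs(analytic - numeric)))  # small, limited by finite-difference error
```

Each column of the Jacobian sums to 0, because the probabilities always sum to 1 no matter how a single raw score is nudged.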
The detailed derivation:
Code:
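One way to write the chain-rule steps out, as a sketch for a single example x with correct class y_i. The notation is an assumption chosen to match the code: a_k is the raw score of class k (column k of W dotted with x), S_k is its softmax probability, and \delta is the Kronecker delta.

```latex
\begin{aligned}
a_k &= \sum_{d} x_d \, W_{dk}, \qquad
S_k = \frac{e^{a_k}}{\sum_{m} e^{a_m}}, \qquad
L_i = -\log S_{y_i} \\[4pt]
\frac{\partial S_{y_i}}{\partial a_j} &= S_{y_i}\,(\delta_{y_i j} - S_j) \\[4pt]
\frac{\partial L_i}{\partial a_j}
  &= -\frac{1}{S_{y_i}} \frac{\partial S_{y_i}}{\partial a_j}
   = S_j - \delta_{y_i j} \\[4pt]
\frac{\partial a_j}{\partial W_{dk}} &= x_d \, \delta_{jk} \\[4pt]
\frac{\partial L_i}{\partial W_{dk}}
  &= \sum_{j} \frac{\partial L_i}{\partial a_j} \frac{\partial a_j}{\partial W_{dk}}
   = (S_k - \delta_{y_i k}) \, x_d
\end{aligned}
```

Averaging the last line over the N examples and adding the derivative of the regularization term 0.5 * reg * sum(W * W), which is reg * W, gives the full gradient with respect to W.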
import numpy as np


def softmax_loss_naive(W, X, y, reg):
    """
    Softmax loss function, naive implementation (with loops)
    Inputs have dimension D, there are C classes, and we operate on minibatches
    of N examples.
    Inputs:
    - W: A numpy array of shape (D, C) containing weights.
    - X: A numpy array of shape (N, D) containing a minibatch of data.
    - y: A numpy array of shape (N,) containing training labels; y[i] = c means
      that X[i] has label c, where 0 <= c < C.
    - reg: (float) regularization strength
    Returns a tuple of:
    - loss as single float
    - gradient with respect to weights W; an array of same shape as W
    """
    # Initialize the loss and gradient to zero.
    loss = 0.0
    dW = np.zeros_like(W)
    num_train = X.shape[0]
    num_classes = W.shape[1]
    #############################################################################
    # TODO: Compute the softmax loss and its gradient using explicit loops.     #
    # Store the loss in loss and the gradient in dW. If you are not careful     #
    # here, it is easy to run into numeric instability. Don't forget the        #
    # regularization!                                                           #
    #############################################################################
    for i in range(num_train):
        scores = X[i].dot(W)
        # Shift by the max score so np.exp cannot overflow; the shift cancels
        # in the softmax ratio.
        log_c = np.max(scores)
        p = []
        for j in range(num_classes):
            p.append(np.exp(scores[j] - log_c))
        # Cross-entropy loss: minus the log-probability of the correct class.
        loss += -np.log(p[y[i]] / np.sum(p))
        # Gradient for column j: (probability of class j - 1{j == y[i]}) * x_i.
        for j in range(num_classes):
            dW[:, j] += (p[j] / np.sum(p) - (j == y[i])) * X[i, :]
    # Average over the minibatch, then add L2 regularization.
    loss /= num_train
    loss += 0.5 * reg * np.sum(W * W)
    dW /= num_train
    dW += reg * W
    #############################################################################
    #                          END OF YOUR CODE                                 #
    #############################################################################
    return loss, dW
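The max-subtraction step in the loop version (the `log_c` shift) is what the "numeric instability" warning in the TODO comment is about. A small sketch with deliberately large, made-up scores shows the difference; the variable names here are illustrative only.

```python
import numpy as np

large_scores = np.array([1000.0, 999.0, 998.0])  # made-up scores, large on purpose

# Direct evaluation: np.exp overflows to inf, and inf / inf yields nan.
with np.errstate(over="ignore", invalid="ignore"):
    unstable = np.exp(large_scores) / np.sum(np.exp(large_scores))
print(unstable)  # every entry is nan

# Shifting by the max leaves the softmax unchanged mathematically,
# because the common factor exp(-max) cancels in the ratio.
shifted = large_scores - np.max(large_scores)
stable = np.exp(shifted) / np.sum(np.exp(shifted))
print(stable)    # finite probabilities that sum to 1
```

After the shift the largest exponent is exp(0) = 1, so overflow cannot occur; at worst very negative entries underflow to 0.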
def softmax_loss_vectorized(W, X, y, reg):
    """
    Softmax loss function, vectorized version.
    Inputs and outputs are the same as softmax_loss_naive.
    """
    # Initialize the loss and gradient to zero.
    loss = 0.0
    dW = np.zeros_like(W)
    num_train = X.shape[0]
    #############################################################################
    # TODO: Compute the softmax loss and its gradient using no explicit loops.  #
    # Store the loss in loss and the gradient in dW. If you are not careful     #
    # here, it is easy to run into numeric instability. Don't forget the        #
    # regularization!                                                           #
    #############################################################################
    scores = X.dot(W)
    # Shift each row by its max for numerical stability.
    scores -= np.max(scores, axis=1, keepdims=True)
    scores_exp = np.exp(scores)
    sum_p = np.sum(scores_exp, axis=1, keepdims=True)
    p = scores_exp / sum_p
    # Pick each example's probability for its own label. Indexing with
    # scores_exp[:, y] would select whole columns and give an (N, N) array.
    p_yi = p[np.arange(num_train), y]
    loss = np.mean(-np.log(p_yi))
    # One-hot encoding of the labels, shape (N, C).
    binary = np.zeros(p.shape)
    binary[np.arange(num_train), y] = 1
    dW = np.dot(X.transpose(), p - binary)
    loss += 0.5 * reg * np.sum(W * W)
    dW /= num_train
    dW += reg * W
    #############################################################################
    #                          END OF YOUR CODE                                 #
    #############################################################################
    return loss, dW
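A common way to gain confidence in an analytic gradient like the one above is to compare it with a central-difference estimate on small random data. The sketch below is self-contained, so it restates the loss in a compact helper (`softmax_loss`, a hypothetical name) rather than importing the functions above; the data sizes, seed, and `reg` value are arbitrary assumptions.

```python
import numpy as np


def softmax_loss(W, X, y, reg):
    """Compact softmax loss and gradient, restated for this check (hypothetical helper)."""
    num_train = X.shape[0]
    scores = X.dot(W)
    scores = scores - np.max(scores, axis=1, keepdims=True)  # stability shift
    p = np.exp(scores)
    p /= np.sum(p, axis=1, keepdims=True)
    loss = -np.mean(np.log(p[np.arange(num_train), y])) + 0.5 * reg * np.sum(W * W)
    dscores = p.copy()
    dscores[np.arange(num_train), y] -= 1  # p minus the one-hot labels
    dW = X.T.dot(dscores) / num_train + reg * W
    return loss, dW


def numerical_gradient(f, W, h=1e-5):
    """Central-difference gradient of the scalar function f with respect to W."""
    grad = np.zeros_like(W)
    for idx in np.ndindex(*W.shape):
        old = W[idx]
        W[idx] = old + h
        plus = f(W)
        W[idx] = old - h
        minus = f(W)
        W[idx] = old  # restore the original entry
        grad[idx] = (plus - minus) / (2 * h)
    return grad


rng = np.random.default_rng(0)
N, D, C = 5, 4, 3                       # tiny made-up problem size
X = rng.standard_normal((N, D))
y = rng.integers(0, C, size=N)
W = 0.01 * rng.standard_normal((D, C))
reg = 0.1

loss, dW = softmax_loss(W, X, y, reg)
dW_num = numerical_gradient(lambda w: softmax_loss(w, X, y, reg)[0], W)
print(loss)                         # close to log(C) because W is small
print(np.max(np.abs(dW - dW_num)))  # should be tiny if the gradient is right
```

The same check can be pointed at softmax_loss_naive and softmax_loss_vectorized; the two should also agree with each other on both the loss and the gradient.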