cs231n - assignment1 - knn

Following the notebook for the kNN part of cs231n assignment1, I worked through the whole exercise. The kNN algorithm itself is very simple; the exercise is mainly meant to familiarize the learner with the following:

  1. The usage of some numpy APIs
  2. A concrete feel for how numpy matrix operations improve an algorithm's efficiency
  3. Using cross validation to choose a hyper-parameter (in this exercise, the k of the kNN algorithm)

Although the algorithm is simple in principle, I still ran into quite a few problems and spent three or four hours on it, mostly because I was unfamiliar with the numpy API; I consulted the docs and wrote small demo snippets to verify things as I went. Implementing def compute_distances_no_loops(self, X) also took a while, because I did not know the matrix formula involved. Below I record the assignment step by step, in the order of the notebook.

1. compute_distances_two_loops

First, compute the L2 distance between each test sample and each training sample in the most straightforward way, with two nested for loops.
The L2 distance is defined as follows:

$$d_2(I_1, I_2) = \sqrt{\sum_p \left(I_1^p - I_2^p\right)^2}$$

where the sum runs over all pixel positions p of the two images.

  def compute_distances_two_loops(self, X):
    """
    Compute the distance between each test point in X and each training point
    in self.X_train using a nested loop over both the training data and the 
    test data.

    Inputs:
    - X: A numpy array of shape (num_test, D) containing test data.

    Returns:
    - dists: A numpy array of shape (num_test, num_train) where dists[i, j]
      is the Euclidean distance between the ith test point and the jth training
      point.
    """
    num_test = X.shape[0]
    num_train = self.X_train.shape[0]
    dists = np.zeros((num_test, num_train))
    for i in xrange(num_test):
      for j in xrange(num_train):
        #####################################################################
        # TODO:                                                             #
        # Compute the l2 distance between the ith test point and the jth    #
        # training point, and store the result in dists[i, j]. You should   #
        # not use a loop over dimension.                                    #
        #####################################################################
        dists[i, j] = np.sqrt(np.sum(np.square(X[i] - self.X_train[j])))
        #####################################################################
        #                       END OF YOUR CODE                            #
        #####################################################################
    return dists

This implementation is very inefficient: as the timing comparison later in the exercise shows, it takes about 30 seconds on my machine.
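
Before relying on the inner expression, I checked it on toy data first; here is a minimal sanity check along those lines (my own snippet, not part of the assignment):

import numpy as np

X_train_toy = np.array([[0., 0.], [3., 4.]])
x_test = np.array([0., 0.])

for j in range(X_train_toy.shape[0]):
    # Same expression as in dists[i, j] above, on a 3-4-5 right triangle
    d = np.sqrt(np.sum(np.square(x_test - X_train_toy[j])))
    print(d)  # prints 0.0, then 5.0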

2. predict_labels

Next, predict a label for each test sample by implementing the following function:

  def predict_labels(self, dists, k=1):
    """
    Given a matrix of distances between test points and training points,
    predict a label for each test point.

    Inputs:
    - dists: A numpy array of shape (num_test, num_train) where dists[i, j]
      gives the distance between the ith test point and the jth training point.

    Returns:
    - y: A numpy array of shape (num_test,) containing predicted labels for the
      test data, where y[i] is the predicted label for the test point X[i].  
    """
    num_test = dists.shape[0]
    y_pred = np.zeros(num_test)
    for i in xrange(num_test):
      # A list of length k storing the labels of the k nearest neighbors to
      # the ith test point.
      #closest_y = []
      #########################################################################
      # TODO:                                                                 #
      # Use the distance matrix to find the k nearest neighbors of the ith    #
      # testing point, and use self.y_train to find the labels of these       #
      # neighbors. Store these labels in closest_y.                           #
      # Hint: Look up the function numpy.argsort.                             #
      #########################################################################
      sorted_index = np.argsort(dists[i])
      closest_y = self.y_train[sorted_index[:k]]
      #########################################################################
      # TODO:                                                                 #
      # Now that you have found the labels of the k nearest neighbors, you    #
      # need to find the most common label in the list closest_y of labels.   #
      # Store this label in y_pred[i]. Break ties by choosing the smaller     #
      # label.                                                                #
      #########################################################################

      # Note: don't name the comprehension variable `i` here. In Python 2 the
      # list-comprehension variable leaks into the enclosing scope and would
      # clobber the outer loop's `i` (see the discussion below the code).
      # timeLabel = sorted([(np.sum(closest_y == i), i) for i in set(closest_y)])[-1]

      # Sort by count, then by negated label, so that the last element has the
      # highest count and, among ties, the smaller label (as the TODO requires).
      timeLabel = sorted([(np.sum(closest_y == y_), -y_) for y_ in set(closest_y)])[-1]
      y_pred[i] = -timeLabel[1]

      #########################################################################
      #                           END OF YOUR CODE                            # 
      #########################################################################

    return y_pred

This function is the core of the kNN algorithm: for each test sample, take the distances to all training samples computed above, pick the k nearest training samples, tally the labels among those k, and output the most frequent label as the prediction for that test sample.
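
As an aside, a majority vote via np.bincount satisfies the "break ties by choosing the smaller label" requirement automatically, because np.argmax returns the first (i.e. smallest) index among ties. A tiny sketch with made-up labels:

import numpy as np

closest_y = np.array([2, 5, 2, 5, 1])  # hypothetical labels of the k nearest neighbors
counts = np.bincount(closest_y)        # counts[label] = number of occurrences
print(np.argmax(counts))               # 2: labels 2 and 5 tie, argmax picks the smaller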

Implementing this took me quite a while, because I misjudged the scope of the loop variable in a Python list comprehension. On my first attempt I used the name i inside the comprehension:

timeLabel = sorted([(np.sum(closest_y == i), i) for i in set(closest_y)])[-1]

In Python 2, the comprehension variable leaks into the enclosing scope, so this i clobbered the i of the outer loop:

for i in xrange(num_test):

and the computed accuracy never came out right. It took me half a day of digging to find the cause +_+
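
Here is a minimal reproduction of the pitfall. Under Python 2 semantics the comprehension variable leaks into the enclosing scope; Python 3 gives comprehensions their own scope, so this bug cannot happen there:

for i in range(3):
    squares = [i * i for i in [7, 8, 9]]  # reuses the name i
    print(i)  # Python 2 prints 9 on every iteration; Python 3 prints 0, 1, 2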

3. Evaluating accuracy

Combining the two steps above and comparing the predictions with the ground-truth labels of the test set gives the accuracy of the kNN model:

dists = classifier.compute_distances_two_loops(X_test)
# Now implement the function predict_labels and run the code below:
# We use k = 1 (which is Nearest Neighbor).
y_test_pred = classifier.predict_labels(dists, k=1)

# Compute and print the fraction of correctly predicted examples
num_correct = np.sum(y_test_pred == y_test)
accuracy = float(num_correct) / num_test
print 'Got %d / %d correct => accuracy: %f' % (num_correct, num_test, accuracy)

The accuracy comes out to about 27%:

Got 137 / 500 correct => accuracy: 0.274000

Trying again with k = 5 raises the accuracy slightly, to about 28.6%:

Got 143 / 500 correct => accuracy: 0.286000

4. Speeding up the distance computation

The exercise then has us implement two more versions of the distance computation, to demonstrate how dramatically efficiency can differ between implementations of the same calculation.

  1. compute_distances_one_loop
    Implementation with a single for loop:
  def compute_distances_one_loop(self, X):
    """
    Compute the distance between each test point in X and each training point
    in self.X_train using a single loop over the test data.

    Input / Output: Same as compute_distances_two_loops
    """
    num_test = X.shape[0]
    num_train = self.X_train.shape[0]
    dists = np.zeros((num_test, num_train))
    for i in xrange(num_test):
      #######################################################################
      # TODO:                                                               #
      # Compute the l2 distance between the ith test point and all training #
      # points, and store the result in dists[i, :].                        #
      #######################################################################
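      # X[i] has shape (D,) while self.X_train has shape (num_train, D), so
      # numpy broadcasting turns the subtraction into a (num_train, D) array
      # of differences; summing the squares over axis=1 then yields all
      # num_train distances in a single vectorized expression.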
      dists[i, :] = np.sqrt(np.sum(np.square(X[i] - self.X_train), axis=1))
      #######################################################################
      #                         END OF YOUR CODE                            #
      #######################################################################
    return dists
  2. compute_distances_no_loops
    The pure numpy matrix-operation version:
  def compute_distances_no_loops(self, X):
    """
    Compute the distance between each test point in X and each training point
    in self.X_train using no explicit loops.

    Input / Output: Same as compute_distances_two_loops
    """
    num_test = X.shape[0]
    num_train = self.X_train.shape[0]
    dists = np.zeros((num_test, num_train)) 
    #########################################################################
    # TODO:                                                                 #
    # Compute the l2 distance between all test points and all training      #
    # points without using any explicit loops, and store the result in      #
    # dists.                                                                #
    #                                                                       #
    # You should implement this function using only basic array operations; #
    # in particular you should not use functions from scipy.                #
    #                                                                       #
    # HINT: Try to formulate the l2 distance using matrix multiplication    #
    #       and two broadcast sums.                                         #
    #########################################################################

    # Reference: http://blog.csdn.net/zhyh1435589631/article/details/54236643

    dists = np.sqrt(
      self.getNormMatrix(X, num_train).T + self.getNormMatrix(self.X_train, num_test) - 2 * np.dot(X, self.X_train.T))
    #########################################################################
    #                         END OF YOUR CODE                              #
    #########################################################################
    return dists

  def getNormMatrix(self, x, lines_num):
    """
    Return a (lines_num, x.shape[0]) matrix in which every row is the vector
    of squared L2 norms of the rows of x.
    """
    return np.ones((lines_num, 1)) * np.sum(np.square(x), axis=1)

The no-loop version uses a matrix identity to compute the distance matrix between the test set and the training set. The derivation goes as follows.

Let P be an M×D matrix and C an N×D matrix, where each row is one sample vector:

$$P = \begin{bmatrix} P_1 \\ \vdots \\ P_M \end{bmatrix}, \qquad C = \begin{bmatrix} C_1 \\ \vdots \\ C_N \end{bmatrix}, \qquad P_i, C_j \in \mathbb{R}^D$$

First, expand the squared distance between row i of P and row j of C, then regroup the terms:

$$d(P_i, C_j) = \lVert P_i - C_j \rVert = \sqrt{\lVert P_i \rVert^2 + \lVert C_j \rVert^2 - 2\, P_i C_j^{\mathsf T}}$$

Finally, generalize this to the full matrices:

$$\mathrm{dists}_{ij} = \sqrt{\lVert P_i \rVert^2 + \lVert C_j \rVert^2 - 2\, (P C^{\mathsf T})_{ij}}$$

Here the two squared-norm terms are broadcast down the columns and across the rows respectively; in the code above, getNormMatrix materializes each of them as a full (num_test, num_train) matrix.
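
It is worth checking the identity against an explicit computation on random data before trusting it; a quick self-contained check (toy shapes, my own snippet):

import numpy as np

P = np.random.randn(4, 6)   # M x D "test" matrix
C = np.random.randn(5, 6)   # N x D "training" matrix

# Vectorized form: ||P_i||^2 + ||C_j||^2 - 2 P_i . C_j, via two broadcast sums
sq = np.sum(P**2, axis=1)[:, None] + np.sum(C**2, axis=1)[None, :] - 2 * P.dot(C.T)
fast = np.sqrt(np.maximum(sq, 0))  # clamp tiny negatives caused by rounding

# Reference: explicit pairwise differences
slow = np.sqrt(((P[:, None, :] - C[None, :, :]) ** 2).sum(axis=2))
print(np.allclose(fast, slow))     # True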

Finally, timing the three versions of the distance computation gives:

Two loop version took 29.605000 seconds
One loop version took 67.478000 seconds
No loop version took 0.277000 seconds

As we can see, numpy's vectorized matrix arithmetic is dramatically faster. (Interestingly, the one-loop version is even slower than the two-loop version here, presumably because every iteration materializes a large (num_train, D) intermediate array.)
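
The notebook wraps each call in a small timing helper to produce the numbers above; a sketch of such a helper (classifier and X_test are the notebook's objects):

import time

def time_function(f, *args):
    """Return how many seconds f(*args) takes to run."""
    tic = time.time()
    f(*args)
    toc = time.time()
    return toc - tic

# e.g. time_function(classifier.compute_distances_two_loops, X_test)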

5. Cross validation

Next, we use cross validation to choose the value of the hyper-parameter k.

Cross validation works like this: split the training set into n parts (5 in the figure below), each called a fold; then iterate over the n folds, each time taking one fold as the validation set and the union of the remaining n-1 folds as the training set, train, and compute the accuracy.
Repeat this whole procedure for each candidate value of k, and finally pick the most suitable k based on the accuracies.

[Figure: 5-fold cross validation, with each fold taking a turn as the validation set]

num_folds = 5
k_choices = [1, 3, 5, 8, 10, 12, 15, 20, 50, 100]

X_train_folds = []
y_train_folds = []
################################################################################
# TODO:                                                                        #
# Split up the training data into folds. After splitting, X_train_folds and    #
# y_train_folds should each be lists of length num_folds, where                #
# y_train_folds[i] is the label vector for the points in X_train_folds[i].     #
# Hint: Look up the numpy array_split function.                                #
################################################################################
X_train_folds = np.array_split(X_train, num_folds)
y_train_folds = np.array_split(y_train, num_folds)
################################################################################
#                                 END OF YOUR CODE                             #
################################################################################
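
np.array_split was new to me; it chops an array along axis 0 into a list of sub-arrays (and, unlike np.split, tolerates sizes that don't divide evenly). A quick demo:

import numpy as np

folds = np.array_split(np.arange(10), 5)    # a list of 5 sub-arrays
print([f.tolist() for f in folds])          # [[0, 1], [2, 3], [4, 5], [6, 7], [8, 9]]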

# A dictionary holding the accuracies for different values of k that we find
# when running cross-validation. After running cross-validation,
# k_to_accuracies[k] should be a list of length num_folds giving the different
# accuracy values that we found when using that value of k.
k_to_accuracies = {}


################################################################################
# TODO:                                                                        #
# Perform k-fold cross validation to find the best value of k. For each        #
# possible value of k, run the k-nearest-neighbor algorithm num_folds times,   #
# where in each case you use all but one of the folds as training data and the #
# last fold as a validation set. Store the accuracies for all fold and all     #
# values of k in the k_to_accuracies dictionary.                               #
################################################################################
for k in k_choices:
    for f in range(num_folds):
        # Stack every fold except fold f back into one training set.
        X_train_tmp = np.array(X_train_folds[:f] + X_train_folds[f + 1:])
        y_train_tmp = np.array(y_train_folds[:f] + y_train_folds[f + 1:])
        X_train_tmp = X_train_tmp.reshape(-1, X_train_tmp.shape[2])
        y_train_tmp = y_train_tmp.reshape(-1)

        # Fold f serves as the validation set in this round.
        X_va = np.array(X_train_folds[f])
        y_va = np.array(y_train_folds[f])

        classifier.train(X_train_tmp, y_train_tmp)
        dists = classifier.compute_distances_no_loops(X_va)

        y_test_pred = classifier.predict_labels(dists, k)

        # Record the fraction of correctly predicted validation examples.
        num_correct = np.sum(y_test_pred == y_va)
        accuracy = float(num_correct) / y_va.shape[0]
        k_to_accuracies.setdefault(k, []).append(accuracy)
################################################################################
#                                 END OF YOUR CODE                             #
################################################################################

# Print out the computed accuracies
for k in sorted(k_to_accuracies):
    for accuracy in k_to_accuracies[k]:
        print 'k = %d, accuracy = %f' % (k, accuracy)

The output is:

k = 1, accuracy = 0.263000
k = 1, accuracy = 0.257000
k = 1, accuracy = 0.264000
k = 1, accuracy = 0.278000
k = 1, accuracy = 0.266000
k = 3, accuracy = 0.252000
k = 3, accuracy = 0.281000
k = 3, accuracy = 0.266000
k = 3, accuracy = 0.290000
k = 3, accuracy = 0.281000
k = 5, accuracy = 0.266000
k = 5, accuracy = 0.285000
k = 5, accuracy = 0.290000
k = 5, accuracy = 0.303000
k = 5, accuracy = 0.284000
k = 8, accuracy = 0.270000
k = 8, accuracy = 0.310000
k = 8, accuracy = 0.281000
k = 8, accuracy = 0.290000
k = 8, accuracy = 0.291000
k = 10, accuracy = 0.276000
k = 10, accuracy = 0.298000
k = 10, accuracy = 0.296000
k = 10, accuracy = 0.289000
k = 10, accuracy = 0.288000
k = 12, accuracy = 0.268000
k = 12, accuracy = 0.302000
k = 12, accuracy = 0.287000
k = 12, accuracy = 0.280000
k = 12, accuracy = 0.280000
k = 15, accuracy = 0.269000
k = 15, accuracy = 0.299000
k = 15, accuracy = 0.294000
k = 15, accuracy = 0.291000
k = 15, accuracy = 0.283000
k = 20, accuracy = 0.265000
k = 20, accuracy = 0.291000
k = 20, accuracy = 0.290000
k = 20, accuracy = 0.282000
k = 20, accuracy = 0.282000
k = 50, accuracy = 0.274000
k = 50, accuracy = 0.289000
k = 50, accuracy = 0.276000
k = 50, accuracy = 0.264000
k = 50, accuracy = 0.273000
k = 100, accuracy = 0.265000
k = 100, accuracy = 0.274000
k = 100, accuracy = 0.265000
k = 100, accuracy = 0.259000
k = 100, accuracy = 0.265000

From these results, k = 8 gives the highest accuracy here.
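
Rather than reading the best k off the printed list, the per-k mean accuracy can be computed directly from the k_to_accuracies dict built above; a small helper along these lines (my own addition, not required by the notebook):

mean_accs = {k: sum(v) / len(v) for k, v in k_to_accuracies.items()}
best_k = max(mean_accs, key=mean_accs.get)
print('best k = %d, mean accuracy = %f' % (best_k, mean_accs[best_k]))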

Then, with k = 8, train on the full training set once more and finally evaluate the accuracy on the test set:

best_k = 8

classifier = KNearestNeighbor()
classifier.train(X_train, y_train)
y_test_pred = classifier.predict(X_test, k=best_k)

# Compute and display the accuracy
num_correct = np.sum(y_test_pred == y_test)
accuracy = float(num_correct) / num_test
print 'Got %d / %d correct => accuracy: %f' % (num_correct, num_test, accuracy)

The final accuracy is:

Got 145 / 500 correct => accuracy: 0.290000

As we can see, even after tuning, kNN reaches only about 29% accuracy, which makes it clear that kNN is not well suited to image classification tasks.
