cs231n - assignment1 - knn

Following the notebook for the kNN part of cs231n assignment1, I worked through the whole exercise. The kNN algorithm itself is very simple; the exercise is mainly meant to familiarize the learner with the following:

  1. The usage of some numpy APIs
  2. A concrete feel for how numpy matrix operations improve an algorithm's efficiency
  3. Using cross validation to choose a hyper-parameter (in this exercise, the k of the kNN algorithm)

Although the algorithm is simple in principle, I still ran into quite a few problems and spent three or four hours on it, mostly because I was unfamiliar with the numpy API; I consulted the docs and wrote small demo snippets to verify things as I went. Implementing def compute_distances_no_loops(self, X) also took a while, because I did not know the matrix formula involved. Below I record the assignment step by step, in the order of the notebook.

1. compute_distances_two_loops

First, compute the L2 distance between each test sample and each training sample in the most straightforward way, with two nested for loops.
The L2 distance is defined as follows:

$$d_2(I_1, I_2) = \sqrt{\sum_p \left(I_1^p - I_2^p\right)^2}$$

where the sum runs over all pixel positions p of the two images.

  def compute_distances_two_loops(self, X):
    """
    Compute the distance between each test point in X and each training point
    in self.X_train using a nested loop over both the training data and the 
    test data.

    Inputs:
    - X: A numpy array of shape (num_test, D) containing test data.

    Returns:
    - dists: A numpy array of shape (num_test, num_train) where dists[i, j]
      is the Euclidean distance between the ith test point and the jth training
      point.
    """
    num_test = X.shape[0]
    num_train = self.X_train.shape[0]
    dists = np.zeros((num_test, num_train))
    for i in xrange(num_test):
      for j in xrange(num_train):
        #####################################################################
        # TODO:                                                             #
        # Compute the l2 distance between the ith test point and the jth    #
        # training point, and store the result in dists[i, j]. You should   #
        # not use a loop over dimension.                                    #
        #####################################################################
        dists[i, j] = np.sqrt(np.sum(np.square(X[i] - self.X_train[j])))
        #####################################################################
        #                       END OF YOUR CODE                            #
        #####################################################################
    return dists

This implementation is very inefficient: as the timing comparison later in the exercise shows, it takes about 30 seconds on my machine.
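
Before relying on the inner expression, I checked it on toy data first; here is a minimal sanity check along those lines (my own snippet, not part of the assignment):

import numpy as np

X_train_toy = np.array([[0., 0.], [3., 4.]])
x_test = np.array([0., 0.])

for j in range(X_train_toy.shape[0]):
    # Same expression as in dists[i, j] above, on a 3-4-5 right triangle
    d = np.sqrt(np.sum(np.square(x_test - X_train_toy[j])))
    print(d)  # prints 0.0, then 5.0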

2. predict_labels

Next, predict a label for each test sample by implementing the following function:

  def predict_labels(self, dists, k=1):
    """
    Given a matrix of distances between test points and training points,
    predict a label for each test point.

    Inputs:
    - dists: A numpy array of shape (num_test, num_train) where dists[i, j]
      gives the distance between the ith test point and the jth training point.

    Returns:
    - y: A numpy array of shape (num_test,) containing predicted labels for the
      test data, where y[i] is the predicted label for the test point X[i].  
    """
    num_test = dists.shape[0]
    y_pred = np.zeros(num_test)
    for i in xrange(num_test):
      # A list of length k storing the labels of the k nearest neighbors to
      # the ith test point.
      #closest_y = []
      #########################################################################
      # TODO:                                                                 #
      # Use the distance matrix to find the k nearest neighbors of the ith    #
      # testing point, and use self.y_train to find the labels of these       #
      # neighbors. Store these labels in closest_y.                           #
      # Hint: Look up the function numpy.argsort.                             #
      #########################################################################
      sorted_index = np.argsort(dists[i])
      closest_y = self.y_train[sorted_index[:k]]
      #########################################################################
      # TODO:                                                                 #
      # Now that you have found the labels of the k nearest neighbors, you    #
      # need to find the most common label in the list closest_y of labels.   #
      # Store this label in y_pred[i]. Break ties by choosing the smaller     #
      # label.                                                                #
      #########################################################################

      # Note: don't name the comprehension variable `i` here. In Python 2 the
      # list-comprehension variable leaks into the enclosing scope and would
      # clobber the outer loop's `i` (see the discussion below the code).
      # timeLabel = sorted([(np.sum(closest_y == i), i) for i in set(closest_y)])[-1]

      # Sort by count, then by negated label, so that the last element has the
      # highest count and, among ties, the smaller label (as the TODO requires).
      timeLabel = sorted([(np.sum(closest_y == y_), -y_) for y_ in set(closest_y)])[-1]
      y_pred[i] = -timeLabel[1]

      #########################################################################
      #                           END OF YOUR CODE                            # 
      #########################################################################

    return y_pred

This function is the core of the kNN algorithm: for each test sample, take the distances to all training samples computed above, pick the k nearest training samples, tally the labels among those k, and output the most frequent label as the prediction for that test sample.
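
As an aside, a majority vote via np.bincount satisfies the "break ties by choosing the smaller label" requirement automatically, because np.argmax returns the first (i.e. smallest) index among ties. A tiny sketch with made-up labels:

import numpy as np

closest_y = np.array([2, 5, 2, 5, 1])  # hypothetical labels of the k nearest neighbors
counts = np.bincount(closest_y)        # counts[label] = number of occurrences
print(np.argmax(counts))               # 2: labels 2 and 5 tie, argmax picks the smaller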

Implementing this took me quite a while, because I misjudged the scope of the loop variable in a Python list comprehension. On my first attempt I used the name i inside the comprehension:

timeLabel = sorted([(np.sum(closest_y == i), i) for i in set(closest_y)])[-1]

In Python 2, the comprehension variable leaks into the enclosing scope, so this i clobbered the i of the outer loop:

for i in xrange(num_test):

and the computed accuracy never came out right. It took me half a day of digging to find the cause +_+
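
Here is a minimal reproduction of the pitfall. Under Python 2 semantics the comprehension variable leaks into the enclosing scope; Python 3 gives comprehensions their own scope, so this bug cannot happen there:

for i in range(3):
    squares = [i * i for i in [7, 8, 9]]  # reuses the name i
    print(i)  # Python 2 prints 9 on every iteration; Python 3 prints 0, 1, 2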

3. Evaluating accuracy

Combining the two steps above and comparing the predictions with the ground-truth labels of the test set gives the accuracy of the kNN model:

dists = classifier.compute_distances_two_loops(X_test)
# Now implement the function predict_labels and run the code below:
# We use k = 1 (which is Nearest Neighbor).
y_test_pred = classifier.predict_labels(dists, k=1)

# Compute and print the fraction of correctly predicted examples
num_correct = np.sum(y_test_pred == y_test)
accuracy = float(num_correct) / num_test
print 'Got %d / %d correct => accuracy: %f' % (num_correct, num_test, accuracy)

The accuracy comes out to about 27%:

Got 137 / 500 correct => accuracy: 0.274000

Trying again with k = 5 raises the accuracy slightly, to about 28.6%:

Got 143 / 500 correct => accuracy: 0.286000

4. Speeding up the distance computation

The exercise then has us implement two more versions of the distance computation, to demonstrate how dramatically efficiency can differ between implementations of the same calculation.

  1. compute_distances_one_loop
    Implementation with a single for loop:
  def compute_distances_one_loop(self, X):
    """
    Compute the distance between each test point in X and each training point
    in self.X_train using a single loop over the test data.

    Input / Output: Same as compute_distances_two_loops
    """
    num_test = X.shape[0]
    num_train = self.X_train.shape[0]
    dists = np.zeros((num_test, num_train))
    for i in xrange(num_test):
      #######################################################################
      # TODO:                                                               #
      # Compute the l2 distance between the ith test point and all training #
      # points, and store the result in dists[i, :].                        #
      #######################################################################
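      # X[i] has shape (D,) while self.X_train has shape (num_train, D), so
      # numpy broadcasting turns the subtraction into a (num_train, D) array
      # of differences; summing the squares over axis=1 then yields all
      # num_train distances in a single vectorized expression.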
      dists[i, :] = np.sqrt(np.sum(np.square(X[i] - self.X_train), axis=1))
      #######################################################################
      #                         END OF YOUR CODE                            #
      #######################################################################
    return dists
  2. compute_distances_no_loops
    The pure numpy matrix-operation version:
  def compute_distances_no_loops(self, X):
    """
    Compute the distance between each test point in X and each training point
    in self.X_train using no explicit loops.

    Input / Output: Same as compute_distances_two_loops
    """
    num_test = X.shape[0]
    num_train = self.X_train.shape[0]
    dists = np.zeros((num_test, num_train)) 
    #########################################################################
    # TODO:                                                                 #
    # Compute the l2 distance between all test points and all training      #
    # points without using any explicit loops, and store the result in      #
    # dists.                                                                #
    #                                                                       #
    # You should implement this function using only basic array operations; #
    # in particular you should not use functions from scipy.                #
    #                                                                       #
    # HINT: Try to formulate the l2 distance using matrix multiplication    #
    #       and two broadcast sums.                                         #
    #########################################################################

    # Reference: http://blog.csdn.net/zhyh1435589631/article/details/54236643

    dists = np.sqrt(
      self.getNormMatrix(X, num_train).T + self.getNormMatrix(self.X_train, num_test) - 2 * np.dot(X, self.X_train.T))
    #########################################################################
    #                         END OF YOUR CODE                              #
    #########################################################################
    return dists

  def getNormMatrix(self, x, lines_num):
    """
    Return a (lines_num, x.shape[0]) matrix in which every row is the vector
    of squared L2 norms of the rows of x.
    """
    return np.ones((lines_num, 1)) * np.sum(np.square(x), axis=1)

The no-loop version uses a matrix identity to compute the distance matrix between the test set and the training set. The derivation goes as follows.

Let P be an M×D matrix and C an N×D matrix, where each row is one sample vector:

$$P = \begin{bmatrix} P_1 \\ \vdots \\ P_M \end{bmatrix}, \qquad C = \begin{bmatrix} C_1 \\ \vdots \\ C_N \end{bmatrix}, \qquad P_i, C_j \in \mathbb{R}^D$$

First, expand the squared distance between row i of P and row j of C, then regroup the terms:

$$d(P_i, C_j) = \lVert P_i - C_j \rVert = \sqrt{\lVert P_i \rVert^2 + \lVert C_j \rVert^2 - 2\, P_i C_j^{\mathsf T}}$$

Finally, generalize this to the full matrices:

$$\mathrm{dists}_{ij} = \sqrt{\lVert P_i \rVert^2 + \lVert C_j \rVert^2 - 2\, (P C^{\mathsf T})_{ij}}$$

Here the two squared-norm terms are broadcast down the columns and across the rows respectively; in the code above, getNormMatrix materializes each of them as a full (num_test, num_train) matrix.
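
It is worth checking the identity against an explicit computation on random data before trusting it; a quick self-contained check (toy shapes, my own snippet):

import numpy as np

P = np.random.randn(4, 6)   # M x D "test" matrix
C = np.random.randn(5, 6)   # N x D "training" matrix

# Vectorized form: ||P_i||^2 + ||C_j||^2 - 2 P_i . C_j, via two broadcast sums
sq = np.sum(P**2, axis=1)[:, None] + np.sum(C**2, axis=1)[None, :] - 2 * P.dot(C.T)
fast = np.sqrt(np.maximum(sq, 0))  # clamp tiny negatives caused by rounding

# Reference: explicit pairwise differences
slow = np.sqrt(((P[:, None, :] - C[None, :, :]) ** 2).sum(axis=2))
print(np.allclose(fast, slow))     # True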

Finally, timing the three versions of the distance computation gives:

Two loop version took 29.605000 seconds
One loop version took 67.478000 seconds
No loop version took 0.277000 seconds

As we can see, numpy's vectorized matrix arithmetic is dramatically faster. (Interestingly, the one-loop version is even slower than the two-loop version here, presumably because every iteration materializes a large (num_train, D) intermediate array.)
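
The notebook wraps each call in a small timing helper to produce the numbers above; a sketch of such a helper (classifier and X_test are the notebook's objects):

import time

def time_function(f, *args):
    """Return how many seconds f(*args) takes to run."""
    tic = time.time()
    f(*args)
    toc = time.time()
    return toc - tic

# e.g. time_function(classifier.compute_distances_two_loops, X_test)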

5. Cross validation

Next, we use cross validation to choose the value of the hyper-parameter k.

Cross validation works like this: split the training set into n parts (5 in the figure below), each called a fold; then iterate over the n folds, each time taking one fold as the validation set and the union of the remaining n-1 folds as the training set, train, and compute the accuracy.
Repeat this whole procedure for each candidate value of k, and finally pick the most suitable k based on the accuracies.

[Figure: 5-fold cross validation, with each fold taking a turn as the validation set]

num_folds = 5
k_choices = [1, 3, 5, 8, 10, 12, 15, 20, 50, 100]

X_train_folds = []
y_train_folds = []
################################################################################
# TODO:                                                                        #
# Split up the training data into folds. After splitting, X_train_folds and    #
# y_train_folds should each be lists of length num_folds, where                #
# y_train_folds[i] is the label vector for the points in X_train_folds[i].     #
# Hint: Look up the numpy array_split function.                                #
################################################################################
X_train_folds = np.array_split(X_train, num_folds)
y_train_folds = np.array_split(y_train, num_folds)
################################################################################
#                                 END OF YOUR CODE                             #
################################################################################
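
np.array_split was new to me; it chops an array along axis 0 into a list of sub-arrays (and, unlike np.split, tolerates sizes that don't divide evenly). A quick demo:

import numpy as np

folds = np.array_split(np.arange(10), 5)    # a list of 5 sub-arrays
print([f.tolist() for f in folds])          # [[0, 1], [2, 3], [4, 5], [6, 7], [8, 9]]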

# A dictionary holding the accuracies for different values of k that we find
# when running cross-validation. After running cross-validation,
# k_to_accuracies[k] should be a list of length num_folds giving the different
# accuracy values that we found when using that value of k.
k_to_accuracies = {}


################################################################################
# TODO:                                                                        #
# Perform k-fold cross validation to find the best value of k. For each        #
# possible value of k, run the k-nearest-neighbor algorithm num_folds times,   #
# where in each case you use all but one of the folds as training data and the #
# last fold as a validation set. Store the accuracies for all fold and all     #
# values of k in the k_to_accuracies dictionary.                               #
################################################################################
for k in k_choices:
    for f in range(num_folds):
        # Stack every fold except fold f back into one training set.
        X_train_tmp = np.array(X_train_folds[:f] + X_train_folds[f + 1:])
        y_train_tmp = np.array(y_train_folds[:f] + y_train_folds[f + 1:])
        X_train_tmp = X_train_tmp.reshape(-1, X_train_tmp.shape[2])
        y_train_tmp = y_train_tmp.reshape(-1)

        # Fold f serves as the validation set in this round.
        X_va = np.array(X_train_folds[f])
        y_va = np.array(y_train_folds[f])

        classifier.train(X_train_tmp, y_train_tmp)
        dists = classifier.compute_distances_no_loops(X_va)

        y_test_pred = classifier.predict_labels(dists, k)

        # Record the fraction of correctly predicted validation examples.
        num_correct = np.sum(y_test_pred == y_va)
        accuracy = float(num_correct) / y_va.shape[0]
        k_to_accuracies.setdefault(k, []).append(accuracy)
################################################################################
#                                 END OF YOUR CODE                             #
################################################################################

# Print out the computed accuracies
for k in sorted(k_to_accuracies):
    for accuracy in k_to_accuracies[k]:
        print 'k = %d, accuracy = %f' % (k, accuracy)

The output is:

k = 1, accuracy = 0.263000
k = 1, accuracy = 0.257000
k = 1, accuracy = 0.264000
k = 1, accuracy = 0.278000
k = 1, accuracy = 0.266000
k = 3, accuracy = 0.252000
k = 3, accuracy = 0.281000
k = 3, accuracy = 0.266000
k = 3, accuracy = 0.290000
k = 3, accuracy = 0.281000
k = 5, accuracy = 0.266000
k = 5, accuracy = 0.285000
k = 5, accuracy = 0.290000
k = 5, accuracy = 0.303000
k = 5, accuracy = 0.284000
k = 8, accuracy = 0.270000
k = 8, accuracy = 0.310000
k = 8, accuracy = 0.281000
k = 8, accuracy = 0.290000
k = 8, accuracy = 0.291000
k = 10, accuracy = 0.276000
k = 10, accuracy = 0.298000
k = 10, accuracy = 0.296000
k = 10, accuracy = 0.289000
k = 10, accuracy = 0.288000
k = 12, accuracy = 0.268000
k = 12, accuracy = 0.302000
k = 12, accuracy = 0.287000
k = 12, accuracy = 0.280000
k = 12, accuracy = 0.280000
k = 15, accuracy = 0.269000
k = 15, accuracy = 0.299000
k = 15, accuracy = 0.294000
k = 15, accuracy = 0.291000
k = 15, accuracy = 0.283000
k = 20, accuracy = 0.265000
k = 20, accuracy = 0.291000
k = 20, accuracy = 0.290000
k = 20, accuracy = 0.282000
k = 20, accuracy = 0.282000
k = 50, accuracy = 0.274000
k = 50, accuracy = 0.289000
k = 50, accuracy = 0.276000
k = 50, accuracy = 0.264000
k = 50, accuracy = 0.273000
k = 100, accuracy = 0.265000
k = 100, accuracy = 0.274000
k = 100, accuracy = 0.265000
k = 100, accuracy = 0.259000
k = 100, accuracy = 0.265000

From these results, k = 8 gives the highest accuracy here.
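
Rather than reading the best k off the printed list, the per-k mean accuracy can be computed directly from the k_to_accuracies dict built above; a small helper along these lines (my own addition, not required by the notebook):

mean_accs = {k: sum(v) / len(v) for k, v in k_to_accuracies.items()}
best_k = max(mean_accs, key=mean_accs.get)
print('best k = %d, mean accuracy = %f' % (best_k, mean_accs[best_k]))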

Then, with k = 8, train on the full training set once more and finally evaluate the accuracy on the test set:

best_k = 8

classifier = KNearestNeighbor()
classifier.train(X_train, y_train)
y_test_pred = classifier.predict(X_test, k=best_k)

# Compute and display the accuracy
num_correct = np.sum(y_test_pred == y_test)
accuracy = float(num_correct) / num_test
print 'Got %d / %d correct => accuracy: %f' % (num_correct, num_test, accuracy)

The final accuracy is:

Got 145 / 500 correct => accuracy: 0.290000

As we can see, even after tuning, kNN reaches only about 29% accuracy, which makes it clear that kNN is not well suited to image classification tasks.
