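1. Outliers in regression

Taking the outliers into account, and given that linear regression minimizes the sum of squared errors, which line is the best-fit linear regression?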
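To see why outliers matter for a least-squares fit, here is a minimal sketch on made-up data (the points, the corrupted value, and the variable names are all assumptions for illustration, not course data). Because errors are squared, one far-off point can pull the fitted line noticeably.

```python
import numpy as np

# Hypothetical data: ten points exactly on the line y = 2x.
x = np.arange(10, dtype=float)
y_clean = 2.0 * x

# Same data, but the last point is corrupted into an outlier.
y_outlier = y_clean.copy()
y_outlier[9] = 100.0

# np.polyfit with degree 1 returns [slope, intercept] of the least-squares line.
slope_clean, intercept_clean = np.polyfit(x, y_clean, 1)
slope_outlier, intercept_outlier = np.polyfit(x, y_outlier, 1)

print(slope_clean, slope_outlier)
```

The second slope comes out much steeper than the first, even though only one of the ten points changed.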


2. What causes outliers

Sensor malfunction: ignore
Data entry error: ignore
Freak event: pay attention

3. Choosing outliers


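4. Outlier detection/removal algorithm

Steps:

Train on all the data
Find the points in the training set with the largest errors and remove them; these typically make up about 10% of the data
Train again on the reduced dataset
Repeat the steps above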
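The steps above can be sketched roughly as follows. This is only an illustration under stated assumptions: the helper name remove_outliers_once, the use of np.polyfit as the model, and the synthetic data are not from the course.

```python
import numpy as np

def remove_outliers_once(x, y, fraction=0.1):
    """Fit a least-squares line, then drop the `fraction` of points
    with the largest squared residuals. Returns the kept x and y."""
    slope, intercept = np.polyfit(x, y, 1)
    squared_error = (y - (slope * x + intercept)) ** 2
    n_keep = len(x) - int(round(len(x) * fraction))
    keep = np.argsort(squared_error)[:n_keep]  # indices of the smallest errors
    return x[keep], y[keep]

# Synthetic data (assumption): true slope 6.25, small noise,
# and every tenth point shifted far above the line.
rng = np.random.RandomState(0)
x = np.linspace(20, 65, 100)
y = 6.25 * x + rng.normal(0, 5, size=100)
y[::10] += 300

slope_before = np.polyfit(x, y, 1)[0]
x_kept, y_kept = remove_outliers_once(x, y)   # one train / remove round
slope_after = np.polyfit(x_kept, y_kept, 1)[0]  # retrain on the reduced set

print(slope_before, slope_after)
```

Calling remove_outliers_once again on x_kept and y_kept would give the "repeat" step; how many rounds to run is a judgment call.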

5. Outlier detection using residuals

Residual (residual error):
the error that remains for a data point after a model has been fitted to the data
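Written out, with y_i for the actual value of point i and \hat{y}_i for the value the fitted model predicts (the symbols are chosen here for illustration), the residual and the sum of squared errors that linear regression minimizes are:

```latex
r_i = y_i - \hat{y}_i, \qquad \mathrm{SSE} = \sum_{i=1}^{n} r_i^2
```

Points with a large r_i^2 are the candidates for removal.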

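6. Effect of removing outliers on regression

7. Summary of outlier removal strategy

If you want a clean fit, remove the outliers
If you are doing anomaly detection or fraud detection, do the opposite: discard the good data points and keep the outliers
Either way, a good procedure that works with almost any machine learning algorithm is:

Train on the data
Remove the points with the largest error, generally called the residual error
Repeat the steps above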

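8. Outliers mini-project introduction

Obvious outliers can have a large effect on a regression result
This project removes the roughly 10% of data points with the largest residuals from the regression line, then refits the regression
The mini-project also discusses whether, in the Enron dataset, we should remove outliers or pay special attention to them

9. Outliers mini-project

This project has two parts. In the first part, you run a regression, then identify and remove the 10% of points that have the largest residuals. Then, as Sebastian suggested in the lesson videos, you remove those outliers from the dataset and refit the regression.
In the second part, you get acquainted with some of the outliers in the Enron finance data, and learn if and how to remove them.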

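10. Slope of regression with outliers

Sebastian described an algorithm for improving a regression, which you will implement in this project and use in the next few quizzes. In short: fit the regression on all training points, then discard the 10% of points that have the largest error between the actual y value and the y value predicted by the regression.
Start by running the starter code (outliers/outlier_removal_regression.py) and visualizing the points. A few outliers should clearly pop out. Deploy a linear regression where net worth is the target and the feature used to predict it is a person's age (remember to train on the training data!).
The "correct" slope for the main body of data points is 6.25 (we know this because we used this value to generate the data); what slope does your regression have?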

from sklearn import linear_model
reg = linear_model.LinearRegression()
# fit on the training split only
reg.fit(ages_train, net_worths_train)
print(reg.coef_)              # slope: 5.08

11. Score of regression with outliers

What score does your regression get when it makes predictions on the test data?

from sklearn import linear_model
reg = linear_model.LinearRegression()
reg.fit(ages_train, net_worths_train)
print(reg.coef_)
# score() returns R^2 on the data it is given
print(reg.score(ages_test, net_worths_test))     # 0.88
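The score that LinearRegression.score() reports is the coefficient of determination, R^2. As a rough sketch of what that number means, it can be computed by hand with numpy (the helper name r_squared and the toy values are assumptions for illustration):

```python
import numpy as np

def r_squared(y_true, y_pred):
    """R^2 = 1 - (residual sum of squares) / (total sum of squares)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

y_true = [1.0, 2.0, 3.0, 4.0]
perfect = r_squared(y_true, y_true)                 # predictions match exactly
baseline = r_squared(y_true, [2.5, 2.5, 2.5, 2.5])  # always predict the mean

print(perfect, baseline)  # 1.0 0.0
```

A score near 1 means the predictions explain most of the variation in the test targets; a model that always predicts the mean scores 0.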

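12. Slope after cleaning

In outliers/outlier_cleaner.py you will find the skeleton of a function called outlierCleaner() that you will fill in with a cleaning algorithm. It takes three arguments: predictions is a list of predicted targets that come from your regression, ages is a list of ages in the training set, and net_worths is the actual values of the net worths in the training set. There should be 90 elements in each of these lists (because there are 90 points in the training set). Your job is to return a list called cleaned_data that has only 81 elements in it, which are the 81 training points where the predictions and the actual values (net_worths) have the smallest errors (90 * 0.9 = 81). The format of cleaned_data should be a list of tuples, where each tuple has the form (age, net_worth, error).

Once this cleaning function is working, you should see the regression result change. Now that the outliers have been cleaned, what is the new slope of your regression? Is it closer to the "correct" result of 6.25?

Note: in the part of outliers/outlier_removal_regression.py that performs the outlier cleaning (it starts with the comment ### identify and remove the most outlier-y points), make sure that the input to reg.predict is ages_train rather than ages, so that you clean based on the training data only. The arguments passed to the cleaner should also be the *_train variables.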

def outlierCleaner(predictions, ages, net_worths):
    """
        Clean away the 10% of points that have the largest
        residual errors (difference between the prediction
        and the actual net worth).

        Return a list of tuples named cleaned_data where 
        each tuple is of the form (age, net_worth, error).
    """

    # squared residual for every training point (numpy array)
    errors = (net_worths - predictions) ** 2
    # zip() packs the corresponding elements of its arguments into tuples
    data = zip(ages, net_worths, errors)
    # sort by error, smallest first
    sorted_data = sorted(data, key=lambda tup: tup[2])
    # keep the 90% of points with the smallest error (81 of the 90 training points)
    n_keep = int(len(sorted_data) * 0.9)
    cleaned_data = sorted_data[:n_keep]
    return cleaned_data
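To make the (age, net_worth, error) tuple format concrete, here is the same zip-and-sort idea on four made-up points with plain Python lists. The numbers and the 75% keep fraction are assumptions chosen so the example stays tiny; the project itself keeps 90%.

```python
# Hypothetical training points and the predictions of some fitted line.
ages = [20, 30, 40, 50]
net_worths = [125.0, 400.0, 250.0, 312.0]
predictions = [125.0, 187.5, 250.0, 312.5]

# Squared error between the actual value and the prediction, per point.
errors = [(nw - p) ** 2 for nw, p in zip(net_worths, predictions)]

# Pack each point into an (age, net_worth, error) tuple, smallest error first.
data = sorted(zip(ages, net_worths, errors), key=lambda tup: tup[2])

# Keep 75% of the points here, which drops the single worst one.
n_keep = int(len(data) * 0.75)
cleaned = data[:n_keep]

print(cleaned)
```

The point at age 30, whose net worth is far from its prediction, is the one that gets dropped.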

#!/usr/bin/python

import random
import numpy
import matplotlib.pyplot as plt
import pickle

from outlier_cleaner import outlierCleaner


### load up some practice data with outliers in it
### pickle files must be opened in binary mode ("rb")
ages = pickle.load( open("practice_outliers_ages.pkl", "rb") )
net_worths = pickle.load( open("practice_outliers_net_worths.pkl", "rb") )


### ages and net_worths need to be reshaped into 2D numpy arrays
### second argument of reshape command is a tuple of integers: (n_rows, n_columns)
### by convention, n_rows is the number of data points
### and n_columns is the number of features
ages       = numpy.reshape( numpy.array(ages), (len(ages), 1))
net_worths = numpy.reshape( numpy.array(net_worths), (len(net_worths), 1))
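### train_test_split lives in sklearn.model_selection (sklearn.cross_validation was removed in newer releases)
from sklearn.model_selection import train_test_split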
ages_train, ages_test, net_worths_train, net_worths_test = train_test_split(ages, net_worths, test_size=0.1, random_state=42)

### fill in a regression here!  Name the regression object reg so that
### the plotting code below works, and you can see what your regression looks like

from sklearn import linear_model
reg = linear_model.LinearRegression()
reg.fit(ages_train,net_worths_train)
print(reg.coef_)
print(reg.score(ages_test, net_worths_test))



try:
    plt.plot(ages, reg.predict(ages), color="blue")
except NameError:
    pass
plt.scatter(ages, net_worths)
plt.show()


### identify and remove the most outlier-y points
cleaned_data = []
try:
    predictions = reg.predict(ages_train)
    cleaned_data = outlierCleaner( predictions, ages_train, net_worths_train )
except NameError:
    print("your regression object doesn't exist, or isn't named reg")
    print("can't make predictions to use in identifying outliers")




### only run this code if cleaned_data is returning data
if len(cleaned_data) > 0:
    ages, net_worths, errors = zip(*cleaned_data)
    ages       = numpy.reshape( numpy.array(ages), (len(ages), 1))
    net_worths = numpy.reshape( numpy.array(net_worths), (len(net_worths), 1))

    ### refit your cleaned data!
    try:
        reg.fit(ages, net_worths)
        plt.plot(ages, reg.predict(ages), color="blue")
        print(reg.coef_)     # 6.37
        print(reg.score(ages_test, net_worths_test))   # 0.98
    except NameError:
        print("you don't seem to have regression imported/created,")
        print("   or else your regression object isn't named reg")
        print("   either way, only draw the scatter plot of the cleaned data")
    plt.scatter(ages, net_worths)
    plt.xlabel("ages")
    plt.ylabel("net worths")
    plt.show()


else:
    print("outlierCleaner() is returning an empty list, no refitting to be done")

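13. Score after cleaning

What is the new score when the regression makes predictions on the test set?

14. Enron outliers

In the mini-project for the regressions lesson, you used a regression to predict the bonuses of Enron employees. As you saw, even a single outlier can make a big difference to the regression result. What we did not tell you at the time is that the dataset used in that project had already been cleaned of an obvious outlier. Identifying and cleaning away outliers is something you should always think about when you look at a dataset for the first time, and now you will get some hands-on experience with the Enron data.
You can find the starter code in outliers/enron_outliers.py, which reads in the data (in dictionary form) and converts it into a sklearn-ready numpy array. Since two features are extracted from the dictionary ("salary" and "bonus"), the resulting numpy array has dimensions N x 2, where N is the number of data points and 2 is the number of features. This is perfect input for a scatterplot; we will use the matplotlib.pyplot module to draw it. (We use pyplot for all the visualizations in this course.) Add these lines to the bottom of the script to make the scatterplot: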

for point in data:
    salary = point[0]
    bonus = point[1]
    matplotlib.pyplot.scatter( salary, bonus )

matplotlib.pyplot.xlabel("salary")
matplotlib.pyplot.ylabel("bonus")
matplotlib.pyplot.show()

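As you can see, visualization is one of the most powerful tools for finding outliers!

15. Identifying the biggest Enron outlier

One outlier should pop out immediately. The question now is identifying its source. We found the original data source very helpful for this identification; you can find that PDF in final_project/enron61702insiderpay.pdf.
What is the name of the dictionary key of this data point? (For example, if it is Ken Lay, the answer would be "LAY KENNETH L".)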

data_dict = pickle.load( open("../final_project/final_project_dataset.pkl", "rb") )
def find_outlier(data_dict):
    # return the key of the record with the largest numeric bonus
    max_bonus = 0
    max_name = None
    for i in data_dict:
        # check for the 'NaN' string first so it is never compared with a number
        if data_dict[i]['bonus'] != 'NaN' and data_dict[i]['bonus'] > max_bonus:
            max_bonus = data_dict[i]['bonus']
            max_name = i
    return max_name
print(find_outlier(data_dict))  # TOTAL
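The same lookup can be written more compactly with max() and a key function. The sketch below runs on a tiny stand-in dictionary; the names, the values, and the helper key_with_largest are made up for illustration and only mimic the shape of the real data, where missing values are stored as the string 'NaN'.

```python
# Hypothetical records in the same shape as the Enron dictionary.
sample = {
    "PERSON A": {"bonus": 100000, "salary": 50000},
    "PERSON B": {"bonus": "NaN", "salary": 60000},
    "TOTAL": {"bonus": 900000, "salary": 110000},
}

def key_with_largest(data, field):
    """Return the key whose `field` is largest, skipping 'NaN' entries."""
    valid = {name: rec[field] for name, rec in data.items() if rec[field] != "NaN"}
    return max(valid, key=valid.get) if valid else None

biggest = key_with_largest(sample, "bonus")
print(biggest)  # TOTAL
```

Filtering out 'NaN' before comparing avoids mixing strings and numbers in the comparison.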

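16. Remove the Enron outlier?

Do you think this outlier should be cleaned out, or left in as a data point?

Clean it out; it is a spreadsheet bug (the TOTAL row).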

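17. Any more outliers?

One quick way to remove a key-value pair from a dictionary is the following line:

dictionary.pop( key, 0 )

Write a line like this (you will have to modify the dictionary and key names) and remove the outlier before calling featureFormat(). Then rerun the code, and the scatterplot will no longer show this outlier.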
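As a quick illustration of how pop() with a default behaves, here it is on a small stand-in dictionary (the keys and values are made up; in the project you would use the real dictionary and the key of the outlier you found).

```python
# Hypothetical dictionary standing in for the real data.
records = {"PERSON A": 1, "TOTAL": 99}

# pop() removes the key and returns its value.
removed = records.pop("TOTAL", 0)

# With a default as the second argument, a missing key does not raise KeyError;
# the default is returned instead.
missing = records.pop("NOT THERE", 0)

print(removed, missing, records)
```

Passing the default makes the line safe to rerun even after the key has already been removed.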


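Are all the outliers gone, or are there still outliers in the Enron data?

Possibly four more.

18. Identifying two more outliers

We would argue that there are 4 more outliers to investigate; let's look at a couple of them. Two people received bonuses of at least 5 million dollars and salaries of over 1 million dollars; in other words, they made out like bandits.
What are the names associated with those points?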

def find_more_outliers(data_dict):
    # print everyone with a bonus of at least 5 million and a salary over 1 million
    for i in data_dict:
        # skip records where either value is missing ('NaN')
        if data_dict[i]['bonus'] != 'NaN' and data_dict[i]['salary'] != 'NaN':
            if data_dict[i]['bonus'] >= 5e6 and data_dict[i]['salary'] > 1e6:
                print("%s %s %s" % (i, data_dict[i]['bonus'], data_dict[i]['salary']))

find_more_outliers(data_dict)  # LAY KENNETH L 7000000 1072321
                               # SKILLING JEFFREY K 5600000 1111258

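19. Remove these outliers?

Would you guess that these are typos or weird spreadsheet rows that we should remove, or do you know of an important reason why these points are different? (In other words, should they be removed before we try to build a POI identifier?)

Do you think these outliers should be cleaned out, or left in as data points?

Leave them in; they are valid data points.
Yes! They're two of Enron's biggest bosses, and definitely people of interest.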
