References:
1. The first five chapters of the 2014 Stanford machine learning course videos. Excellent; after the postgraduate entrance exam, the math really was much less of a hurdle.
2. Learned the NumPy and matplotlib libraries.
At the end there is also a method called the normal equation, which solves for the parameters directly in closed form via matrix inversion.
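For reference, the closed-form solution the normal equation computes is the standard least-squares result (a well-known formula, not spelled out in the original post), where X is the m*n design matrix (including the constant-1 column) and y is the vector of outputs:

```latex
\theta = (X^{\top} X)^{-1} X^{\top} y
```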
First, the printed output; the full source code is at the end.
Looking at the parameters, apart from the first one, which is somewhat off, the other two are quite close.
The accuracy for predictions within ±0.5 (the data range is 0~100) is 86%; the long list at the end is a side-by-side comparison of the two sets of results, which looks decent.
```
The parameters (model) actually used to generate the data: 1, 2, -3
The parameters (model) computed by gradient descent: [0.13278973], [2.00740646], [-2.99257226]
Accuracy within 1% of the overall range: 86%
Predicted results vs. expected results:
[[-1.35539348e+02 -1.36000000e+02]
[-2.80146762e+02 -2.80000000e+02]
[-2.91997353e+01 -2.90000000e+01]
[-7.64876940e+01 -7.70000000e+01]
[-1.71280696e+02 -1.71000000e+02]
[-6.47769289e+01 -6.50000000e+01]
[-7.91253728e+01 -7.90000000e+01]
[ 1.56888486e+01 1.60000000e+01]
[ 1.79814205e+02 1.80000000e+02]
[-1.57303000e+02 -1.57000000e+02]
[-1.39310492e+02 -1.39000000e+02]
[ 7.08188028e+00 7.00000000e+00]
[ 4.03179937e+01 4.10000000e+01]
[-3.85482006e+01 -3.80000000e+01]
[-1.84850568e+02 -1.85000000e+02]
[-2.22664999e+01 -2.20000000e+01]
[-4.71551689e+01 -4.70000000e+01]
[-1.54199203e+02 -1.54000000e+02]
[-7.45915121e+01 -7.50000000e+01]
[ 8.49629088e+01 8.50000000e+01]
[-6.19697308e+01 -6.20000000e+01]
[-1.34850781e+02 -1.35000000e+02]
[-3.87547946e+01 -3.90000000e+01]
[ 6.13378932e+00 6.00000000e+00]
[ 3.33476834e+01 3.40000000e+01]
[-2.31243365e+02 -2.31000000e+02]
[ 9.41259998e+01 9.40000000e+01]
[-1.98902413e+02 -1.99000000e+02]
[-2.25630986e+01 -2.20000000e+01]
[ 1.24629065e+02 1.25000000e+02]
[-1.58362316e+02 -1.58000000e+02]
[-8.12069290e+01 -8.10000000e+01]
[ 7.44951003e-02 0.00000000e+00]
[ 1.42844022e+02 1.43000000e+02]
[-2.12880109e+02 -2.13000000e+02]
[-1.54665321e+02 -1.55000000e+02]
[-1.44310470e+02 -1.44000000e+02]
[-2.04961707e+02 -2.05000000e+02]
[ 1.20577177e+02 1.21000000e+02]
[-7.52588593e+01 -7.50000000e+01]
[-3.45704625e+01 -3.40000000e+01]
[-2.82080019e+02 -2.82000000e+02]
[-1.83902477e+02 -1.84000000e+02]
[-2.15295338e+02 -2.15000000e+02]
[-1.25369871e+02 -1.25000000e+02]
[-7.96444207e+01 -7.90000000e+01]
[-1.62080530e+02 -1.62000000e+02]
[-5.72292761e+01 -5.70000000e+01]
[-1.96446761e+01 -1.90000000e+01]
[ 1.22955332e+02 1.23000000e+02]
[ 1.28210999e+00 1.00000000e+00]
[-1.51932273e+02 -1.52000000e+02]
[-1.25873064e+02 -1.26000000e+02]
[ 6.40149030e+01 6.40000000e+01]
[ 2.63773731e+01 2.70000000e+01]
[ 1.06525332e+02 1.07000000e+02]
[-2.43250729e+02 -2.43000000e+02]
[-7.73404155e+01 -7.70000000e+01]
[-5.07992334e+01 -5.10000000e+01]
[ 8.48146094e+01 8.50000000e+01]
[-2.35909671e+02 -2.36000000e+02]
[-1.77006316e+02 -1.77000000e+02]
[-1.34050989e+02 -1.34000000e+02]
[-1.15295764e+02 -1.15000000e+02]
[-8.86804321e+01 -8.90000000e+01]
[-2.41256069e+01 -2.40000000e+01]
[-1.04088192e+02 -1.04000000e+02]
[ 3.57026016e+00 4.00000000e+00]
[ 1.54925536e+02 1.55000000e+02]
[-1.16132630e+02 -1.16000000e+02]
[ 8.19703365e+01 8.20000000e+01]
[-3.20810831e+01 -3.20000000e+01]
[-1.04310641e+02 -1.04000000e+02]
[-1.25110347e+02 -1.25000000e+02]
[-1.91783803e+02 -1.92000000e+02]
[ 3.43328492e+01 3.50000000e+01]
[-1.48139909e+02 -1.48000000e+02]
[ 1.34777322e+02 1.35000000e+02]
[ 6.83846093e+01 6.90000000e+01]
[ 4.40891378e+01 4.40000000e+01]
[-9.61697903e+01 -9.60000000e+01]
[ 5.88666248e+01 5.90000000e+01]
[-6.44220355e+01 -6.40000000e+01]
[ 9.56885081e+01 9.60000000e+01]
[-2.82080019e+02 -2.82000000e+02]
[ 4.37183893e+01 4.40000000e+01]
[-2.62071631e+01 -2.60000000e+01]
[-7.36434211e+01 -7.40000000e+01]
[-1.09477030e+01 -1.10000000e+01]
[-7.94961213e+01 -7.90000000e+01]
[ 8.20815611e+01 8.20000000e+01]
[ 2.02068543e+01 2.10000000e+01]
[ 3.28951644e+00 3.00000000e+00]
[ 5.48147371e+01 5.50000000e+01]
[ 2.74737634e+01 2.80000000e+01]
[ 1.48977466e+02 1.49000000e+02]
[-1.38510700e+02 -1.38000000e+02]
[ 5.00001326e+01 5.00000000e+01]
[-2.19243529e+00 -2.00000000e+00]
[-8.08361805e+01 -8.10000000e+01]]
```
You can see that once the number of iterations is large enough, the cost basically stops changing.

[figure: the cost curve over iterations]
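Given that flattening, one option (not in the original code, which simply caps training at 10000 iterations) is to stop once the cost stops changing. A minimal sketch, where the tolerance `tol` is an arbitrary choice of mine:

```python
def converged(cost_list, tol=1e-9):
    # True once the relative change between the two most recent costs
    # recorded in cost_list falls below tol.
    if len(cost_list) < 2:
        return False
    prev, curr = cost_list[-2], cost_list[-1]
    return abs(prev - curr) <= tol * max(abs(prev), 1.0)
```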
A rough walkthrough of the code:
The main function has three parts: generating random data, training the model, and testing the model. Training is the key part: the parameters are computed iteratively with the formula shown in the top-left of the figure below. The crux of that formula is its right-hand side, the partial derivative, and even that is not really hard; the computation is simply vectorized. See the code at the end for the details.

[figure: the gradient descent update formula, reconstructed below]
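Since the image is lost, here is the standard linear-regression update rule that figure showed, which matches the `partial_diff` and `theta` lines in the code. With learning rate α (called `step` in the code) and m samples (`numOfSamples`):

```latex
\theta_j := \theta_j - \frac{\alpha}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)}
\quad\Longleftrightarrow\quad
\theta := \theta - \frac{\alpha}{m} X^{\top} (X\theta - y)
```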
Summary:
The accuracy is affected by several factors: the number of samples (numOfSamples); the step size of each parameter update (step), where too large a step usually makes the cost blow up; the number of iterations (i); and the starting point of the iteration (all coefficients set to 0).
Note: the code generates new random data on every run, so to check whether a particular variable affects the accuracy, the data must first be made identical across runs, as shown below.
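A minimal way to pin the data down with NumPy (the seed value 42 is arbitrary); call this once before generate_data():

```python
from numpy import random

# Fixing the seed makes random.randint in generate_data() return the
# same samples on every run, so single-variable comparisons are fair.
random.seed(42)
```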
To improve:
1. The dynamic ranges of my features here are roughly the same; the videos point out that if feature ranges differ greatly, the features need to be normalized (see the sketch below).
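A sketch of the feature scaling the videos describe (mean normalization, applied column-wise; skipping the constant-1 column is my own arrangement, not from the original code):

```python
from numpy import mean, std

def normalize_features(input):
    # Scale each feature column to zero mean and unit variance.
    # Column 0 is the constant 1 added in generate_data(), so keep it.
    normalized = input.copy().astype(float)
    mu = mean(normalized[:, 1:], axis=0)
    sigma = std(normalized[:, 1:], axis=0)
    normalized[:, 1:] = (normalized[:, 1:] - mu) / sigma
    return normalized, mu, sigma
```

Note that the learned theta then lives in the scaled space, so the same mu and sigma must be applied to any data you predict on.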
Full code:
```python
from numpy import *
import matplotlib.pyplot as plt

theta_pre_setting = [1, 2, -3]

# Train a multivariate linear regression model with gradient descent.
# input is an m*n array: m samples, n features.
# Returns the linear coefficients.
def train_linear_regression_model(input, output):
    numOfSamples = input.shape[0]
    numOfFeatures = input.shape[1]
    # Starting point of the iteration: all coefficients set to 0
    theta = zeros((numOfFeatures, 1))
    # Partial derivatives
    partial_diff = zeros(numOfFeatures)
    # Step size of each update
    step = 0.0001
    # Record how the cost changes during training
    cost_list = []
    i = 0
    while True:
        # Update theta
        partial_diff = dot(input.T, predict_result(input, theta) - output) / numOfSamples
        theta = theta - step * partial_diff
        # Compute the new cost
        cost = sum(power((predict_result(input, theta) - output), 2))
        # print(theta)
        # print(cost)
        cost_list.append(cost)
        # if cost < 1:
        #     break
        i += 1
        if i > 10000:
            break
    plt.plot(cost_list)
    plt.show()
    return theta

def predict_result(input, theta):
    return dot(input, theta)

def generate_data(numOfSamples):
    area = random.randint(0, 100, size=[numOfSamples, ])
    age = random.randint(0, 100, size=[numOfSamples, ])
    price = calc_price(area, age)
    # reshape((-1, 1)) turns the 1-D vectors into column vectors
    input = concatenate([area.reshape((-1, 1)), age.reshape((-1, 1))], axis=1)
    # Prepend a constant feature of 1 to simplify the computation
    temp = ones(numOfSamples)
    input = concatenate([temp.reshape((-1, 1)), input], axis=1)
    output = price.reshape((-1, 1))
    return input, output

# Test the trained model
def test_data(theta):
    input, output = generate_data(100)
    ret = predict_result(input, theta)
    # With 100 test samples the hit count equals a percentage;
    # +-0.5 spans 1% of the 0~100 feature range.
    print('Accuracy within 1% of the overall range: {a}%'.format(a=len(where(abs(output - ret) <= 0.5)[0])))
    # Compare the predictions with the expected results
    print('Predicted results vs. expected results:')
    ret = concatenate([ret.reshape((-1, 1)), output.reshape((-1, 1))], axis=1)
    print(ret)

def calc_price(area, age, theta=theta_pre_setting):
    return theta[0] + theta[1] * area + theta[2] * age

def main():
    print('The parameters (model) actually used to generate the data: {theta0}, {theta1}, {theta2}'.format(
        theta0=theta_pre_setting[0], theta1=theta_pre_setting[1], theta2=theta_pre_setting[2]))
    # Generate simulated data
    numOfSamples = 500
    input, output = generate_data(numOfSamples)
    # Train the model to obtain the coefficients theta
    theta = train_linear_regression_model(input, output)
    print('The parameters (model) computed by gradient descent: {theta0}, {theta1}, {theta2}'.format(
        theta0=theta[0], theta1=theta[1], theta2=theta[2]))
    test_data(theta)

if __name__ == '__main__':
    main()
```
- normal equation
```python
import numpy as np

# Solve by setting the derivative to zero (closed form)
def normal_equation(input, output):
    # Least squares solves the system via a pseudo-inverse;
    # lstsq returns (solution, residuals, rank, singular values)
    return np.linalg.lstsq(input, output, rcond=None)[0]
    # inv() requires a square matrix, so this does not work on the m*n input:
    # return np.dot(np.linalg.inv(input), output)

def main():
    input, output = generate_data(100)
    theta = normal_equation(input, output)
    print(theta)
```
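For comparison, a sketch of the textbook normal equation with an explicit inversion; note that it inverts the square n*n matrix X^T X, which is why inverting the m*n `input` matrix itself (the commented-out line above) cannot work:

```python
import numpy as np

def normal_equation_inv(input, output):
    # theta = (X^T X)^(-1) X^T y; X^T X is square and invertible as long
    # as the feature columns are linearly independent.
    xtx = np.dot(input.T, input)
    return np.dot(np.linalg.inv(xtx), np.dot(input.T, output))
```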