Linear Regression

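Concepts

label: the variable we are trying to predict
feature: the variable the prediction is based on
example: a paired feature and label, used to train the model
Training

Feeding examples to the model; the model adjusts itself, example by example, to capture the relationship between feature and label.
Inference

Once the model is trained, predicting the label that corresponds to a given feature.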
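Three kinds of variables
- Continuous variable
  - Any value is possible for the variable.
  - A continuous variable is a numeric variable that has infinitely many possible values between any two values. It can be a number or a date/time, for example the length of a part, or the date and time a payment is received.
  - A quick way to tell it from a discrete variable: a continuous variable is obtained by measuring, and its increments can be subdivided into ever finer units.
- Discrete variable
  - Can only take on a certain number of values.
  - A discrete variable is a numeric variable that has a countable number of values between any two values. It is always numeric, for example the number of customer complaints or the number of flaws or defects.
  - It is obtained by counting.
- Categorical variable
  - Takes on one of a limited, and usually fixed, number of possible values.
  - A categorical variable contains a finite number of categories or distinct groups. The categories may have no logical order. Examples include gender, material type, and payment method.
    
    Take a weather variable with the values sunny, cloudy, and rainy: it can only be one of them at a time, with nothing in between, i.e. it cannot be both sunny and rainy.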

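Typing math formulas in LaTeX

Enter operators such as max and min with \max and \min; do not type plain max or min
To place a constraint a<x<b directly beneath max, use \max\limits_{a<x<b}
Subscripts and superscripts: a_{1} b^{2}
Fractions (division): \frac{}{}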
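
As a small illustration of the commands listed above, the following snippet combines them in one display formula. The function f and the particular expressions are made up for the example.

```latex
\[
  \max\limits_{a<x<b} f(x), \qquad
  \min\limits_{a<x<b} f(x), \qquad
  a_{1} + b^{2}, \qquad
  \frac{a_{1}}{b^{2}}
\]
```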

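Linear regression

Regression is a method for determining, from data, the quantitative relationship by which two or more variables depend on each other.

Linear regression uses a best-fit straight line (the regression line) to establish a relationship between the dependent variable (Y) and one or more independent variables (X).
Training a regression model means

using the training data to find the parameters that minimize the error between the model's output and the true values.

The trained model is then used to predict the target value.
Linear regression has a precondition
- the feature matrix has full rank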
Simple Linear Regression

There is only one feature.
Multiple Linear Regression

There are several features.
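
The full-rank condition and the multiple-feature case can be sketched with NumPy. This is a minimal illustration, not the author's code: the data are made up, and `np.linalg.lstsq` stands in for whichever solver is actually used.

```python
import numpy as np

# Made-up data: 6 examples, 2 features; y = 3 + 2*x1 - 1*x2 exactly.
X = np.array([[1.0, 2.0],
              [2.0, 1.0],
              [3.0, 5.0],
              [4.0, 3.0],
              [5.0, 8.0],
              [6.0, 4.0]])
y = 3.0 + 2.0 * X[:, 0] - 1.0 * X[:, 1]

# Design matrix: a leading column of ones models the intercept.
A = np.column_stack([np.ones(len(X)), X])

# Full column rank means the least-squares solution is unique.
rank = np.linalg.matrix_rank(A)
has_full_rank = rank == A.shape[1]

# Least-squares fit: coef = [intercept, weight_x1, weight_x2].
coef, residuals, _, _ = np.linalg.lstsq(A, y, rcond=None)
print(has_full_rank, np.round(coef, 6))
```

With a single feature column the same code reduces to simple linear regression.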
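loss function, or cost function
- The error between the predicted values and the actual values
- Objective function
  
  Ideally every point would lie on the line.
  
  Writing the line as $\hat{y} = ax + b$, the fit is best when the sum of squared errors $J(a,b) = \sum_{i=1}^{N} (y_i - a x_i - b)^2$ is smallest.
  
  Training determines the values of $a$ and $b$ that give the minimum of the objective function.
  
  There are two ways to find this line
  - Ordinary least squares (OLS)
    
    Take the first-order partial derivatives with respect to $a$ and $b$:
    
    $$\frac{\partial J}{\partial a} = -2\sum_{i=1}^{N} x_i\,(y_i - a x_i - b)$$
    
    $$\frac{\partial J}{\partial b} = -2\sum_{i=1}^{N} (y_i - a x_i - b)$$
    
    Setting both derivatives to 0 gives the parameter values at which the objective function is minimal:
    
    $$a = \frac{\sum_{i=1}^{N}(x_i-\bar{x})(y_i-\bar{y})}{\sum_{i=1}^{N}(x_i-\bar{x})^2}, \qquad b = \bar{y} - a\,\bar{x}$$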
    - Manual implementation
      
      ```python
      def classic_lstsqr(x_list, y_list):
          """Closed-form simple least squares: return (slope, intercept)."""
          N = len(x_list)
          x_avg = sum(x_list) / N
          y_avg = sum(y_list) / N
          var_x, cov_xy = 0, 0
          for x, y in zip(x_list, y_list):
              temp = x - x_avg
              var_x += temp ** 2               # sum of squared deviations of x
              cov_xy += temp * (y - y_avg)     # sum of cross-deviations of x and y
          slope = cov_xy / var_x
          y_interc = y_avg - slope * x_avg
          return (slope, y_interc)
      ```
    - Gradient descent
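
The notes name gradient descent without showing it. Below is a minimal sketch for the one-feature case, assuming a mean-squared-error loss; the function name, the learning rate `lr`, the iteration count, and the data are all illustrative choices, not taken from the original.

```python
def gradient_descent_fit(x_list, y_list, lr=0.05, n_iter=2000):
    """Fit y ≈ a*x + b by batch gradient descent on the mean squared error."""
    a, b = 0.0, 0.0
    n = len(x_list)
    for _ in range(n_iter):
        # Gradients of (1/n) * sum((y - a*x - b)^2) with respect to a and b.
        grad_a = sum(-2 * x * (y - a * x - b) for x, y in zip(x_list, y_list)) / n
        grad_b = sum(-2 * (y - a * x - b) for x, y in zip(x_list, y_list)) / n
        a -= lr * grad_a
        b -= lr * grad_b
    return a, b

# Made-up points lying exactly on y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]
a, b = gradient_descent_fit(xs, ys)
print(round(a, 4), round(b, 4))
```

Unlike the closed-form solution, the result depends on the step size: too large a learning rate makes the iteration diverge, too small a one makes it slow.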
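  Handling categorical variables
  - "Dummy Variables".
    
    A dummy variable, also called an indicator variable or a discrete-feature encoding, can represent a categorical variable or the possible effect of a non-quantitative factor. It is sometimes also called a Boolean indicator variable.
    
    Dummy variables are introduced to quantify variables that cannot be handled quantitatively, for example occupation, gender, or season.
    
    According to the attribute type of such a factor, we construct artificial variables that take only the value 0 or 1. These are usually called dummy variables and denoted D.
    
    For example, suppose the variable "occupation" takes five values: worker, farmer, student, office employee, and other. We can replace "occupation" with four dummy variables: D1 (1 = worker / 0 = not a worker), D2 (1 = farmer / 0 = not a farmer), D3 (1 = student / 0 = not a student), and D4 (1 = office employee / 0 = not an office employee). The last option, "other", is already implied by these four variables (all four are 0), so there is no need to add a D5 (1 = other / 0 = not other). This is the process of introducing dummy variables; conjoint analysis, in fact, uses dummy variables to analyze the utility of each attribute.
  - How to handle them
    - The values of the discrete feature have a meaningful order
      - Example: sizes (L, XL, XXL)
      - Use the map function: pandas.Series.map(dict)
      - Parameter dict: a dictionary that defines the mapping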
    - Example
      
      ```python
      import pandas as pd

      # Ordinal encoding: a higher number means a higher education level.
      educationLevelDict = {
          "Post-Doc": 9,
          "Doctorate": 8,
          "Master's Degree": 7,
          "Bachelor's Degree": 6,
          "Associate's Degree": 5,
          "Some College": 4,
          "Trade School": 3,
          "High School": 2,
          "Grade School": 1,
      }

      # Illustrative stand-in for the original `data` DataFrame.
      data = pd.DataFrame(
          {"Education Level": ["High School", "Doctorate", "Master's Degree"]}
      )

      data["Education Level Map"] = data["Education Level"].map(educationLevelDict)
      ```
    - A dictionary is another mutable container type, and it can store objects of any type.
      
      Each key/value pair is written key: value, pairs are separated by commas, and the whole dictionary is enclosed in curly braces {}, in the following form
      
      ```
      d = {key1: value1, key2: value2}
      ```
      - Keys are unique: if a key is repeated, the last key/value pair replaces the earlier ones. Values do not need to be unique.
        
        ```python
        d = {'a': 1, 'b': 2, 'b': '3'}
        print(d)  # {'a': 1, 'b': '3'}
        ```
      - Accessing values in a dictionary
        
        Put the key in square brackets
        
        ```python
        # Named `info` rather than `dict` so the built-in type is not shadowed.
        info = {'Name': 'Zara', 'Age': 7, 'Class': 'First'}
        print("info['Name']: ", info['Name'])
        print("info['Age']: ", info['Age'])
        ```
      - Modifying a dictionary
        
        To add content to a dictionary, assign a new key/value pair; an existing pair is changed by assigning to its key, as in this example:
        
        ```python
        info = {'Name': 'Zara', 'Age': 7, 'Class': 'First'}
        info['Age'] = 8                 # update an existing entry
        info['School'] = "DPS School"   # add a new entry
        print("info['Age']: ", info['Age'])
        print("info['School']: ", info['School'])
        ```
      - Deleting from a dictionary
        
        The del statement removes a single entry, or the dictionary itself,
        
        and clear() empties the dictionary.
        
        ```python
        info = {'Name': 'Zara', 'Age': 7, 'Class': 'First'}
        del info['Name']   # remove the entry whose key is 'Name'
        info.clear()       # remove all entries; info is now {}
        del info           # delete the dictionary object itself
        try:
            print("info['Age']: ", info['Age'])
        except NameError as err:
            # After `del info` the name no longer exists.
            print("info no longer exists:", err)
        ```
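      - Use the np.where() function for binary categorical variables
        
        e.g. np.where(wdbc.Diagnosis == 'B', 1, 0)
      - The values of the discrete feature have no meaningful order
        
        Example: colors (Red, Blue, Green)
        
        pd.get_dummies(data, prefix=None, prefix_sep="_", dummy_na=False, columns=None, drop_first=False)
        
        ① data: the DataFrame to process ② columns: the names of the columns to encode; if omitted, all columns of object, string, or category dtype are encoded ③ drop_first: whether to drop the first level of each variable, which is done when modeling to avoid multicollinearity
        
        Multicollinearity means that the explanatory variables of a linear regression model are exactly or highly correlated with one another, so that the model's estimates are distorted or hard to determine accurately.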
        
        ```python
        import pandas as pd
        import numpy as np

        s = pd.Series(list('YNNY'))
        print(s)
        # One indicator column per level; select the column for 'Y'.
        print(pd.get_dummies(s).Y)

        # Same encoding as a NumPy array: 1 where the value is 'Y', else 0.
        a = np.where(s == 'Y', 1, 0)
        print(a)
        ```
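
To make the `columns` and `drop_first` parameters concrete, here is a small sketch on a made-up DataFrame; the column names and values are assumptions for illustration only.

```python
import pandas as pd

# Made-up data with one nominal column.
df = pd.DataFrame({"color": ["Red", "Blue", "Green", "Blue"],
                   "price": [10, 12, 9, 11]})

# k = 3 colors become k - 1 = 2 indicator columns; the first level
# in sorted order ("Blue") is dropped and serves as the baseline.
encoded = pd.get_dummies(df, columns=["color"], drop_first=True)
print(list(encoded.columns))
```

A row where both remaining indicators are 0 is a "Blue" row, which is why dropping one level loses no information.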
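    Linear regression with library tools
    - Choosing features
      - Choose the features that predict the label best
        
        Draw a scatter plot of each feature against the dependent variable
        
        Compute the linear correlation between each independent variable and the dependent variable
        
        A correlation score reflects only the linear correlation between variables. A strong nonlinear relationship may be missed.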
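      - Computing the correlation score
        
        Pearson correlation coefficient
        
        Definition
        
        $$\rho_{XY} = \frac{\mathrm{Cov}(X,Y)}{\sqrt{\mathrm{Var}[X]\,\mathrm{Var}[Y]}}$$
        
        where Cov(X,Y) is the covariance of X and Y, Var[X] is the variance of X, and Var[Y] is the variance of Y
        
        Properties
        
        (1) $|\rho_{XY}| \le 1$
        
        (2) $|\rho_{XY}| = 1$ if and only if there exist constants a and b such that $P\{Y = a + bX\} = 1$
        
        The correlation coefficient quantifies how strongly X and Y are related: the larger $|\rho_{XY}|$, the stronger the relation, and $\rho_{XY} = 0$ corresponds to the weakest. X and Y being perfectly correlated means that a linear relationship holds with probability 1, so $\rho_{XY}$ measures how close the relationship between X and Y is to linear.
        
        When $|\rho_{XY}|$ is large, X and Y are usually said to be strongly correlated; when $|\rho_{XY}|$ is small, they are usually said to be weakly correlated. When X and Y are uncorrelated, we usually conclude that there is no linear relationship between them, but this does not rule out some other kind of relationship.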
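      - DataFrame.corr()
        
        Uses the Pearson method by default
        
        Returns the correlation score between every pair of columns
        
        The larger the absolute value, the stronger the linear correlation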
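
The following sketch checks `DataFrame.corr()` against the definition of the Pearson coefficient, and shows a nonlinear relationship that the score misses. The data and column names are made up for the example.

```python
import numpy as np
import pandas as pd

# Made-up data: y_lin is roughly linear in x1, y_sq is exactly quadratic in x2.
df = pd.DataFrame({
    "x1": [1.0, 2.0, 3.0, 4.0, 5.0],
    "x2": [-2.0, -1.0, 0.0, 1.0, 2.0],
})
df["y_lin"] = 2.0 * df["x1"] + np.array([0.1, -0.1, 0.0, 0.1, -0.1])
df["y_sq"] = df["x2"] ** 2

corr = df.corr()  # Pearson by default

# The same score from the definition Cov(X, Y) / sqrt(Var[X] * Var[Y]).
x, y = df["x1"].to_numpy(), df["y_lin"].to_numpy()
rho = np.cov(x, y)[0, 1] / np.sqrt(np.var(x, ddof=1) * np.var(y, ddof=1))

print(round(corr.loc["x1", "y_lin"], 4), round(rho, 4))
print(corr.loc["x2", "y_sq"])  # symmetric quadratic: no linear correlation
```

Here `y_sq` is fully determined by `x2`, yet its Pearson score is zero, which is the case the notes warn about.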
    - Drawing a scatter plot
      - plt.scatter(x, y, s=None, c=None, marker=None, cmap=None, norm=None, vmin=None, vmax=None, alpha=None, linewidths=None, edgecolors=None, data=None)
      - s: the size of the points; a single number, or a list of the same length as x giving the size of each point
      - c: the color of the points; a single color string, or a list of the same length as x giving the color of each point
      - marker: the shape of each point
      - alpha: transparency
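    - Calling the libraries
      - import statsmodels.api as sm
        
        A statistical modeling library
        
        sm.OLS: linear regression model fitted by ordinary least squares
      - from sklearn.linear_model import LinearRegression
      - Modify the feature matrix by adding a column of 1s in front of the features
        
        statsmodels.tools.add_constant(data, prepend=True, has_constant='skip')
        
        data (array-like) – data is the column-ordered design matrix
        
        prepend (bool) – If true, the constant is in the first column. Else the constant is appended (last column).
        
        Why use add_constant
        
        Because sm.OLS does not include an intercept in the model by default
        
        Our model has both an intercept and a slope. If we pass a single feature without the constant column, the fitted result has only a slope, i.e. a line forced through the origin
        
        A mathematical proof can be found online
        
        Adding a constant can be expressed as multiplying X by an n×n matrix Z of rank n. Z is obtained by taking the identity matrix and adding the constant (for example x = 2, although x cannot be -1) to the row that corresponds to the intercept column i
        
        Why not just add a value to every x
        
        Because that value might turn some entry of the matrix into 0, and the relationship would be lost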
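
The effect of the constant column can be seen without statsmodels. The sketch below uses NumPy's `np.linalg.lstsq` as a stand-in for the OLS solver and builds the column of 1s by hand, which is an assumption about what `add_constant` amounts to for a single feature; the data are made up.

```python
import numpy as np

# Made-up points lying exactly on y = 2x + 5.
x = np.array([1.0, 2.0, 3.0, 4.0])
y = 2.0 * x + 5.0

# Without a constant column the only parameter is a slope,
# so the fitted line is forced through the origin.
slope_only, *_ = np.linalg.lstsq(x.reshape(-1, 1), y, rcond=None)

# With a leading column of 1s the fit has an intercept and a slope.
X = np.column_stack([np.ones_like(x), x])
params, *_ = np.linalg.lstsq(X, y, rcond=None)

print(slope_only, params)
```

The first fit cannot recover the true slope of 2 because it has no parameter to absorb the offset of 5; the second recovers both.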
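      - Build the model
        
        model = sm.OLS(label, feature)
        
        Ordinary least squares
        
        Creates a linear regression model. The first argument is the variable to predict, the second is the variables the model is built on. Returns a model object
      - Train the model
        
        results = model.fit()
        
        Returns a results object
      - Get the fitted parameters
        
        results.params
        
        Returns the fitted parameters (the intercept and the coefficients)
      - Get the model summary
        
        results.summary()
        
        The summary shows the coefficient values (β) and statistics such as R-squared and the p-values.
      - Predict the target value
        
        results.predict(feature)
        
        Takes a feature matrix, which again needs the column of 1s as its first column
        
        Returns the predicted label values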
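      - Define the model with the statsmodels formula interface (R style)
        
        formula = 'str'
        
        The label goes to the left of "~" and the features go to the right
        
        Here there is no need to add the column of 1s to the features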
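
A minimal sketch of the formula interface, assuming the model is built with `statsmodels.formula.api.ols`; the DataFrame, its column names, and the formula string are made up for the example.

```python
import pandas as pd
import statsmodels.formula.api as smf

# Made-up data lying exactly on y = 1 + 2*x1 + 3*x2.
df = pd.DataFrame({
    "x1": [0.0, 1.0, 2.0, 3.0, 4.0, 5.0],
    "x2": [1.0, 0.0, 2.0, 1.0, 3.0, 0.0],
})
df["y"] = 1.0 + 2.0 * df["x1"] + 3.0 * df["x2"]

# Label on the left of "~", features on the right; the intercept is implicit.
results = smf.ols(formula="y ~ x1 + x2", data=df).fit()
print(results.params)   # Intercept, x1, x2

# Prediction takes the raw feature columns, with no constant column.
new_data = pd.DataFrame({"x1": [6.0], "x2": [2.0]})
pred = results.predict(new_data)
print(pred)
```

The formula interface adds the intercept itself, which is why neither the training data nor `new_data` carries a column of 1s.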
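      - Plotting the regression line
        
        np.linspace(start, stop, num)
        
        Returns an array of evenly spaced numbers
        
        start: the first value
        
        stop: the last value (included by default)
        
        num: how many numbers to generate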
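
As a sketch of how `np.linspace` feeds a regression plot, the snippet below evaluates a line on an evenly spaced grid. The intercept and slope are hypothetical values standing in for fitted parameters.

```python
import numpy as np

# Hypothetical fitted parameters of a line y = intercept + slope * x.
intercept, slope = 1.0, 2.0

# 5 evenly spaced x values from 0 to 10, both endpoints included.
x_line = np.linspace(0, 10, 5)
y_line = intercept + slope * x_line

print(x_line)   # values: 0, 2.5, 5, 7.5, 10
print(y_line)   # values: 1, 6, 11, 16, 21
```

Passing `x_line` and `y_line` to a line-plotting call such as `plt.plot` then draws the regression line over the scatter plot.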
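      - Model analysis
        
        Goodness of fit
        
        The Root Mean Square Error (RMSE)
        
        Measures the overall error: take the distance between the regression line and the actual value of the dependent variable at each training point, square it, average over all training points, and take the square root.
        
        R-squared
        
        The R-squared value measures how much of the variance the model explains.
        
        The larger R-squared the better; a value of 1 means the model explains all of the variance. In most cases, however, that would be regarded as overfitting.
        
        R² is the ratio of the regression sum of squares to the total sum of squares. The larger the ratio, the larger the share of the total sum of squares that the regression explains, and the better the model fits.
        
        Influence of each variable
        
        The p-value is the probability, under a probability model in which the null hypothesis holds, that a summary statistic (such as the difference between two sample means) is equal to or more extreme than the value actually observed. If the p-value is smaller than the chosen significance level (0.05 or 0.01), the null hypothesis is rejected.
        
        The threshold is usually 0.05
        
        p < 0.05 suggests the factor has a statistically significant effect on the result, so it is kept; p > 0.05 means there is no significant evidence that the factor affects the result
        
        The p-values reported in a regression also refer to hypothesis tests.
        
        Suppose your regression model is $Y = aX_1 + bX_2 + c$.
        
        In the hypothesis test for $a$, the null hypothesis is $a = 0$ given that $b$ and $c$ take their correct values, and the alternative hypothesis is $a \neq 0$ given that $b$ and $c$ take their correct values. A two-sided t-test is normally used. The p-value reported for $a$ is the p-value of this test.
        
        Coefficient value: the magnitude of the coefficient/parameter. The larger it is, the more the variable contributes to shifting the value of the dependent variable. A large coefficient suggests the variable is practically significant, provided the features are on comparable scales.
        
        Prediction
        
        Use the fitted model to predict the result for the given inputs
        
        Each row to be predicted also needs the column of 1s in front
        
        Formula method
        
        The simplest way is to pass a dictionary of lists
        
        or a DataFrame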
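
The two goodness-of-fit measures can be computed directly from their definitions. In this sketch the actual and predicted values are made up, and R-squared is written as 1 - SS_res / SS_tot, which equals the regression-to-total ratio described above for a least-squares fit that includes an intercept.

```python
import numpy as np

# Made-up actual values and model predictions.
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.5, 7.0, 9.0])

residuals = y_true - y_pred

# RMSE: square root of the mean squared residual.
rmse = np.sqrt(np.mean(residuals ** 2))

# R-squared as 1 - SS_res / SS_tot.
ss_res = np.sum(residuals ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r_squared = 1.0 - ss_res / ss_tot

print(rmse, r_squared)
```

RMSE is in the units of the label, while R-squared is unitless, so the two are usually reported together.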
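      - Newton-Raphson method
        
        An algorithm for finding the roots of a function
        
        np.inf: the floating-point value for positive infinity
        
        Why learn it:
        
        the goal of regression is to make the loss function as small as possible.
        
        First we differentiate the loss function,
        
        then we look for the extremum, where the derivative equals 0,
        
        which means finding a root of the derivative of the loss function.
        
        Very often, optimizing a function g(x) comes down to finding a root of g'(x) = 0.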
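
A minimal sketch of the iteration x ← x - f(x) / f'(x), applied both to plain root finding and to minimizing a loss through the root of its derivative. The function name, tolerance, and the example loss g(x) = (x - 3)² + 1 are illustrative assumptions.

```python
def newton_raphson(f, f_prime, x0, tol=1e-10, max_iter=50):
    """Find a root of f, starting from x0, by Newton-Raphson iteration."""
    x = x0
    for _ in range(max_iter):
        step = f(x) / f_prime(x)
        x -= step
        if abs(step) < tol:
            break
    return x

# Hypothetical loss g(x) = (x - 3)^2 + 1, minimized where g'(x) = 0.
def g_prime(x):
    return 2.0 * (x - 3.0)

def g_double_prime(x):
    return 2.0

x_min = newton_raphson(g_prime, g_double_prime, x0=0.0)

# Root of f(x) = x^2 - 2, i.e. the square root of 2.
root = newton_raphson(lambda x: x * x - 2.0, lambda x: 2.0 * x, x0=1.0)
print(x_min, root)
```

To minimize g, the root finder is applied to g', so it needs the second derivative g'' in the role of the derivative.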
