Hi everyone,
Today, let's talk about logistic regression.
In short, logistic regression is an algorithm that assumes the data follow a Bernoulli distribution and uses maximum likelihood estimation (MLE), solved with gradient descent (GD), to find the optimal parameters for binary classification.
First, where does the name logistic regression come from? To answer that, we need to break the name into several pieces and explain them one by one.
Odds
The odds of an event are defined as the probability that the event occurs divided by the probability that it does not occur.
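Written as a formula, with p denoting the probability that the event occurs:

$$\text{odds} = \frac{p}{1 - p}$$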
Logit and the logit model
In statistics, the logit function, or log-odds, is the natural logarithm of the odds, where p is the probability that the event occurs.
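In symbols:

$$\operatorname{logit}(p) = \ln\frac{p}{1 - p}$$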
Intuitively, the logit is the logarithm of the ratio between the probability that an event occurs and the probability that it does not. It describes how likely the event is to happen versus not happen, with a logarithmic rescaling applied on top (since the logarithm is monotonically increasing, the relative ordering of the original values is unaffected). What we want to do is fit this quantity with a generalized linear regression model.
Generalized Linear Regression
Suppose we want to build a model that takes the output of a linear regression model as input and produces a categorical outcome. Intuitively, we could simply append a unit-step function to the linear regression model to turn the continuous value into a discrete one. The problem is that the unit-step function is not continuously differentiable, so we could not update the parameters through optimization algorithms such as gradient descent during back propagation. What we actually do instead is use the linear model to approximate the true log-odds, which is exactly what the name logistic (log-odds) regression reflects.
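Concretely, the linear part is asked to fit the log-odds of the positive class (this equation is a reconstruction of the formula the text refers to, using w for the weight vector and b for the bias):

$$\ln\frac{p}{1 - p} = \boldsymbol{w}^{T}\boldsymbol{x} + b$$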
So why approximate things this way? Intuitively, we are fitting the log-odds with a linear model while constraining the probability p to the (0, 1) interval: as the linear part tends to positive infinity, p approaches 1; as it tends to negative infinity, p approaches 0, which matches the range of a probability. From another perspective, we are really using the logistic distribution function (its CDF) in place of the original linear probability model to fit the true distribution, because unlike the unit-step function it is smooth, continuous, and differentiable at any order, which solves the parameter-update problem during back propagation. The derivation below verifies this statement.
Taking the exponential with base e on both sides and solving for p, we have
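(reconstructing the omitted algebra from the log-odds model above)

$$\frac{p}{1 - p} = e^{\boldsymbol{w}^{T}\boldsymbol{x} + b}
\quad\Longrightarrow\quad
p = \frac{e^{\boldsymbol{w}^{T}\boldsymbol{x} + b}}{1 + e^{\boldsymbol{w}^{T}\boldsymbol{x} + b}} = \frac{1}{1 + e^{-(\boldsymbol{w}^{T}\boldsymbol{x} + b)}}$$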
As we can see, the formula above recovers the logistic function, which shows that logistic regression is in essence a generalized linear regression model: a non-linear classification model obtained by attaching a non-linear, differentiable link function to the linear model. The ln(z) used here in logistic regression is a special case of the link function g(z), and the output value is exactly the probability that the event occurs.
Sigmoid function
By now, as the picture shows, the sigmoid (logistic) function takes any input in (-inf, inf) and maps it to a value within (0, 1). Moreover, as a smooth function that is infinitely differentiable, the sigmoid is a good mapping function for probabilities. By default, when h(z) >= 0.5 we predict y = 1, otherwise y = 0. This also shows that logistic regression in effect first computes a decision boundary and then makes its prediction from the position of a data point relative to that boundary.
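A minimal sketch of the sigmoid and this default 0.5 decision rule (the function names and the clipping bound are illustrative choices, not taken from any particular library):

```python
import numpy as np

def sigmoid(z):
    """Map any real-valued input to the (0, 1) interval."""
    # clip the input so np.exp never overflows for very negative z
    return 1.0 / (1.0 + np.exp(-np.clip(z, -500, 500)))

def predict(z, threshold=0.5):
    """Default decision rule: predict 1 when sigmoid(z) >= threshold, else 0."""
    return (sigmoid(z) >= threshold).astype(int)
```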
However, the sigmoid function also has some drawbacks. For example, its gradient becomes very small when the magnitude of the input is large, which can cause vanishing gradients and dramatically slow down training in deep learning. We will discuss remedies such as the ReLU function and batch normalization in the deep learning section.
Bernoulli Distribution:
Since the label y (given x) follows a Bernoulli distribution, we can define the likelihood function:
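A reconstruction of the likelihood for m independent samples, writing $p_i = \sigma(\boldsymbol{w}^{T}\boldsymbol{x}_i + b)$ for the predicted probability that $y_i = 1$:

$$L(\boldsymbol{w}, b) = \prod_{i=1}^{m} p_i^{\,y_i}\,(1 - p_i)^{\,1 - y_i}$$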
Maximum likelihood estimation
To prevent arithmetic underflow, where the result of a calculation has a smaller absolute value than the computer can actually represent, we take the logarithm of the likelihood function, turning the continued product into a continued sum, and then carry out maximum likelihood estimation, i.e. solve for the optimal parameters by gradient descent. Because the log function is monotonically increasing, this does not affect the optimization result.
Log-likelihood function
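Taking the logarithm of the likelihood above gives (again a reconstruction of the formula the original referred to):

$$\ell(\boldsymbol{w}, b) = \sum_{i=1}^{m}\Big[\, y_i \ln p_i + (1 - y_i)\ln(1 - p_i) \,\Big]$$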
Since we want to maximize this log-likelihood function, we can add a negative sign to the right-hand side to obtain the negative log-likelihood, which is the loss function of logistic regression. It is also the binary-classification special case of the cross-entropy function common in deep learning (except that it is usually averaged over the samples). In this way, the original maximization problem is turned into a minimization problem, and the optimal parameters can be obtained by gradient descent.
Loss Function
Cross-Entropy Loss Function
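Written out, averaged over the m samples (a reconstruction of the standard form):

$$J(\boldsymbol{w}, b) = -\frac{1}{m}\sum_{i=1}^{m}\Big[\, y_i \ln p_i + (1 - y_i)\ln(1 - p_i) \,\Big]$$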
The optimal parameters are generally solved with gradient descent, the normal-equation method, Newton's method, and so on; a minimal gradient-descent sketch follows the list of strategies below.
Gradient descent strategies
- Batch Gradient Descent (BGD)
- Stochastic Gradient Descent (SGD)
- Mini-batch Gradient Descent (MBGD)
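A minimal NumPy sketch of batch gradient descent on the averaged cross-entropy loss (the function names, learning rate, and iteration count are illustrative assumptions, not taken from any library):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic_regression(X, y, lr=0.1, n_iters=1000):
    """Batch gradient descent for logistic regression.

    X: (m, n) feature matrix; y: (m,) array of 0/1 labels.
    Returns the learned weight vector w and bias b.
    """
    m, n = X.shape
    w = np.zeros(n)
    b = 0.0
    for _ in range(n_iters):
        p = sigmoid(X @ w + b)        # predicted probabilities for all m samples
        grad_w = X.T @ (p - y) / m    # gradient of the averaged loss w.r.t. w
        grad_b = np.mean(p - y)       # gradient of the averaged loss w.r.t. b
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b
```

Switching to SGD or mini-batch gradient descent only changes which rows of X and y are used in each update step.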
Regularization
- L1 (Lasso) regularization: penalizes the L1 norm of the parameter vector and tends to yield sparse solutions
- L2 (Ridge) regularization: penalizes the L2 norm of the parameter vector, shrinking the weights and discouraging overly large parameters
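For example, adding an L2 penalty with strength λ to the cross-entropy loss above gives (for L1, the squared norm is replaced by $\lambda\lVert\boldsymbol{w}\rVert_1$):

$$J_{\text{reg}}(\boldsymbol{w}, b) = J(\boldsymbol{w}, b) + \frac{\lambda}{2}\lVert \boldsymbol{w} \rVert_2^2$$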
Multi-class classification
- One vs One
- One vs All
- Softmax
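The softmax route generalizes the sigmoid to K classes by normalizing per-class scores into a probability distribution:

$$P(y = k \mid \boldsymbol{x}) = \frac{e^{\boldsymbol{w}_k^{T}\boldsymbol{x} + b_k}}{\sum_{j=1}^{K} e^{\boldsymbol{w}_j^{T}\boldsymbol{x} + b_j}}$$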
**Reflections:**
This is the first time in my life I have sat down in my spare time to write a blog post. I planned to pick a relatively simple model to start with, but the time I spent writing this article turned out to be far beyond my expectations... The biggest difficulty was finding a smooth line of reasoning to connect all the pieces of knowledge the model involves. Although different references all explain the same model, their starting points and angles differ. Some articles explain things too simplistically, so there is not much to learn from them; others explain things in more detail, but involve so much background knowledge that the cost of reading and learning from them becomes too high. Likewise, some people choose to start from logistic regression and derive concepts such as odds, while others start from the linear-approximation part and work backwards to the logistic function. So I spent quite some time writing what I hope is a relatively easy-to-understand article, explaining the ins and outs of the model as much as possible while keeping it accessible. Limited by space, however, this article does not go deep into some of the underlying linear-model and data-distribution statistics, in particular the relationship between the Bernoulli distribution and probability, the expectation of the linear model and homoscedasticity issues, and some of the derivations. For now I will leave that as a gap, to be filled in some future post...
Author: Calvin Cao
Reference:
Machine Learning - Zhi-Hua Zhou
Machine Learning - Andrew Ng