Logistic Regression (对数几率回归)

Hi everyone,

Today, let's talk about logistic regression.

In short, logistic regression assumes that the labels follow a Bernoulli distribution and uses maximum likelihood estimation (MLE), typically optimized with gradient descent (GD), to solve for the parameters and perform binary classification of the data.

First, where does the name logistic regression (log-odds regression) come from? We need to break the name into several pieces and explain them one by one.

odds (几率)
The odds of an event are defined as the probability that the event occurs divided by the probability that it does not occur.
odds = \frac{p}{1-p}

logit and logit model (对数几率和对数几率模型)
In statistics, the logit function, or log-odds, is the logarithm of the odds, where p is the probability that the event occurs.

logit(p) = \log_e(odds) = \ln(\frac{p}{1-p}), \;\;\;\;\; p \in (0,1)

Intuitively, the logit is the logarithm of the ratio between the probability of an event occurring and the probability of it not occurring. You can think of it as describing how likely something is to happen versus not happen, with a logarithmic scaling applied on top. (Since the logarithm is monotonically increasing, it does not change the relative ordering of the original values.) What we want to do is fit this quantity with a generalized linear regression model.
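As a quick numerical illustration (a minimal NumPy sketch; the variable names are just for this example), here is how the odds and the log-odds behave for a few probabilities:

```python
import numpy as np

# A few example probabilities of an event occurring
p = np.array([0.1, 0.5, 0.9])

# odds: probability of occurring divided by probability of not occurring
odds = p / (1 - p)        # -> [0.111..., 1.0, 9.0]

# logit (log-odds): natural logarithm of the odds
logit = np.log(odds)      # -> [-2.197..., 0.0, 2.197...], symmetric around p = 0.5

print(odds, logit)
```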

Generalized Linear Regression (广义线性回归)
Suppose we want a model that takes the output of a linear regression model as input and produces a categorical outcome. Intuitively, we could simply attach a unit-step function to the linear regression model to turn the continuous value into a discrete one. The problem is that the unit-step function is not continuously differentiable, which makes it impossible to update the parameters with optimization algorithms such as gradient descent during backpropagation. Instead, we use the linear model to approximate the true log-odds, which is exactly what the name log-odds (logistic) regression reflects.

\ln(\frac{p}{1-p}) = \mathbf{W}^\mathrm{T}X + b

So why approximate it this way? Intuitively, we use the linear model to approximate this logarithmically scaled quantity, which constrains the probability p to the interval (0,1): if the linear part tends to positive infinity, p approaches 1; if it tends to negative infinity, p approaches 0. This matches the range of a probability. From another perspective, we are really replacing the original linear probability model with the logistic cumulative distribution function (CDF) to fit the true distribution: unlike the unit-step function, it is smooth, continuous, and differentiable to any order, which solves the parameter-update problem during backpropagation. The derivation below verifies this.

Taking the exponential with base e on both sides (i.e., applying the inverse function), we have
e^{\ln(\frac{p}{1-p})} = e^{\mathbf{W}^\mathrm{T}X+b}
\frac{p}{1-p} = e^{\mathbf{W}^\mathrm{T}X+b}
p = \frac{e^{\mathbf{W}^\mathrm{T}X+b}}{1+e^{\mathbf{W}^\mathrm{T}X+b}} = \frac{1}{1+e^{-(\mathbf{W}^\mathrm{T}X+b)}}
y = p = g^{-1}(\mathbf{W}^\mathrm{T}X+b)
As we can see, the formula above yields the logistic function, showing that logistic regression is in fact a generalized linear model: a non-linear classification model obtained by attaching a non-linear, differentiable link function g to a linear model. Here the log-odds (logit) function is the particular choice of g used by logistic regression, and the output value is precisely the probability that the event occurs.

sigmoid function (sigmoid函数)

h(z) = \frac{1}{1 + e^{-z}}

(Figure: the sigmoid function curve)

By now, as the figure shows, the sigmoid (logistic) function takes any input in (-inf, inf) and maps it to a value in (0, 1). Moreover, it is a smooth, infinitely differentiable function, which makes it a good mapping to probabilities. By default, when h(z) >= 0.5 we predict y = 1, otherwise y = 0. This also shows that logistic regression effectively first computes a decision boundary and then makes predictions based on which side of that boundary a data point falls.
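Here is a minimal sketch of the sigmoid mapping and the default 0.5 decision rule (the helper names `sigmoid` and `predict` and the threshold argument are my own choices for illustration):

```python
import numpy as np

def sigmoid(z):
    """Map any real value to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def predict(X, w, b, threshold=0.5):
    """Predict 1 when the estimated probability reaches the threshold, else 0."""
    p = sigmoid(X @ w + b)          # P(y = 1 | x)
    return (p >= threshold).astype(int)

z = np.array([-5.0, 0.0, 5.0])
print(sigmoid(z))                   # approx [0.0067, 0.5, 0.9933]
```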

However, the sigmoid function also has some disadvantages. For example, its gradient becomes very small when the absolute value of the input is large (at both tails), which can cause vanishing gradients and dramatically slow down training in deep learning. We will discuss remedies such as the ReLU activation and batch normalization in the deep learning section.
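To see this numerically, a small sketch using the standard identity sigma'(z) = sigma(z)(1 - sigma(z)) shows how quickly the gradient shrinks at the tails:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_grad(z):
    s = sigmoid(z)
    return s * (1.0 - s)   # derivative of the sigmoid

# The gradient peaks at 0.25 when z = 0 and decays rapidly at both tails
for z in [0.0, 5.0, 10.0]:
    print(z, sigmoid_grad(z))
# 0.0 -> 0.25, 5.0 -> ~6.6e-3, 10.0 -> ~4.5e-5
```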

Bernoulli Distribution (伯努利分布):
f(k) = P(k) = p^k(1-p)^{1-k}, \;\;\;\;\; k \in \{0, 1\}

Since the label Y, conditioned on X, follows a Bernoulli distribution, we can define the likelihood function L(\theta):
{\begin{aligned} L(\theta \mid x)&=P(Y\mid X;\theta )\\&=\prod _{i=1}^{m}P(y_{i}\mid x_{i};\theta )\\&=\prod _{i=1}^{m}h_{\theta }(x_{i})^{y_{i}}(1-h_{\theta }(x_{i}))^{(1-y_{i})} \end{aligned}}

maximum likelihood estimation (极大似然估计)
To prevent arithmetic underflow (where the result of a computation has a smaller absolute value than the machine can represent), we take the logarithm of the likelihood function, turning the product into a sum, and then perform maximum likelihood estimation, for example by running gradient descent on its negative. Because the log function is monotonically increasing, this does not change the location of the optimum.
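A tiny numerical illustration of why the log is needed (the per-sample values are made up purely for demonstration):

```python
import numpy as np

# Per-sample likelihood terms h(x_i)^{y_i} (1 - h(x_i))^{1 - y_i}; assume each is about 0.1
per_sample = np.full(500, 0.1)

likelihood = np.prod(per_sample)              # 10^-500 underflows to 0.0 in float64
log_likelihood = np.sum(np.log(per_sample))   # fine: 500 * ln(0.1) ~ -1151.3

print(likelihood, log_likelihood)
```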

log-likelihood function (对数似然函数)
\ell(\theta \mid x) = \log P(Y \mid X; \theta) = \sum_{i=1}^{m} \left[ y_i \log(h_{\theta}(x_i)) + (1-y_i) \log(1 - h_{\theta}(x_i)) \right]
Since we want to maximize this log-likelihood, negating the right-hand side gives the negative log-likelihood, which is the loss function of logistic regression. It is also the binary case of the cross-entropy loss used in deep learning (except that the latter usually averages over the samples). In this way the original maximization problem becomes a minimization problem, and the optimal parameters can be found by gradient descent.

Loss Function (损失函数)
J(\theta) = -\sum_{i=1}^{m} \left[ y_i \log(\hat{y}_i) + (1-y_i) \log(1-\hat{y}_i) \right]
Cross-Entropy Loss Function (交叉熵损失函数)
\mathrm{CrossEntropy}(Y, \hat{Y}) = -\frac{1}{m}\sum_{i=1}^{m} \sum_{c=1}^{N_c} y_{i,c} \log(\hat{y}_{i,c})
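A minimal sketch of the binary cross-entropy / negative log-likelihood loss (the epsilon clipping is my own addition for numerical safety; note that this version averages like the cross-entropy form, while J(\theta) above sums):

```python
import numpy as np

def binary_cross_entropy(y, y_hat, eps=1e-12):
    """Negative log-likelihood of Bernoulli labels, averaged over the m samples."""
    y_hat = np.clip(y_hat, eps, 1.0 - eps)   # avoid log(0)
    return -np.mean(y * np.log(y_hat) + (1.0 - y) * np.log(1.0 - y_hat))

y = np.array([1, 0, 1, 1])
y_hat = np.array([0.9, 0.2, 0.8, 0.6])
print(binary_cross_entropy(y, y_hat))        # approx 0.266
```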

Common ways to solve for the optimal parameters include gradient descent, Newton's method, quasi-Newton methods, and so on (unlike linear regression, logistic regression has no closed-form normal-equation solution).

gradient descent types (梯度下降法的策略)

  • Batch Gradient Descent (BGD): uses all training samples for each update (see the training sketch after this list)
  • Stochastic Gradient Descent (SGD): uses a single randomly chosen sample per update
  • Mini-batch Gradient Descent (MBGD): uses a small random batch per update, the usual compromise between the two
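Below is a minimal batch gradient descent sketch for logistic regression on synthetic data (the data generation and hyperparameters are my own assumptions; it uses the standard gradient X^T(\hat{y} - y)/m of the averaged loss):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
m, n = 200, 2
X = rng.normal(size=(m, n))
true_w, true_b = np.array([2.0, -1.0]), 0.5
y = (sigmoid(X @ true_w + true_b) > rng.random(m)).astype(float)  # synthetic Bernoulli labels

w, b = np.zeros(n), 0.0
lr = 0.1
for _ in range(1000):                   # batch GD: all m samples contribute to every step
    y_hat = sigmoid(X @ w + b)
    grad_w = X.T @ (y_hat - y) / m      # gradient of the averaged loss w.r.t. w
    grad_b = np.mean(y_hat - y)         # gradient w.r.t. b
    w -= lr * grad_w
    b -= lr * grad_b

print(w, b)   # should land roughly near true_w and true_b
```

SGD and MBGD follow the same update rule; only the slice of data fed into each step changes (one sample, or a small batch).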

Regularization (正则化)

  • L1 (Lasso) regularization: penalizes the 1-norm of the parameter vector, which tends to produce sparse solutions
  • L2 (Ridge) regularization: penalizes the 2-norm, shrinking the weights smoothly toward zero (see the sketch after this list)
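As a sketch of how L2 regularization enters the objective (lam is a hypothetical hyperparameter; the bias term is typically left unpenalized):

```python
import numpy as np

def l2_regularized_loss_and_grad(X, y, w, b, lam=0.1, eps=1e-12):
    """Binary cross-entropy plus an L2 penalty on w (the bias b is not penalized)."""
    m = len(y)
    y_hat = 1.0 / (1.0 + np.exp(-(X @ w + b)))
    y_hat = np.clip(y_hat, eps, 1.0 - eps)           # avoid log(0)
    loss = (-np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))
            + lam / (2 * m) * np.sum(w ** 2))        # L2 penalty term
    grad_w = X.T @ (y_hat - y) / m + lam / m * w     # extra lam/m * w from the penalty
    grad_b = np.mean(y_hat - y)
    return loss, grad_w, grad_b
```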

multi-classification (多分类)

  • One vs One
  • One vs All
  • Softmax (see the sketch after this list)
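And a minimal softmax sketch for the multi-class case (the max-subtraction is the usual numerical-stability trick, not specific to this article):

```python
import numpy as np

def softmax(z):
    """Turn a vector of class scores into a probability distribution over classes."""
    z = z - np.max(z)          # subtract the max for numerical stability
    e = np.exp(z)
    return e / np.sum(e)

print(softmax(np.array([2.0, 1.0, 0.1])))   # approx [0.659, 0.242, 0.099]
```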

**Feeling (感想):**
This is the first time in my life I have sat down to write a blog post in my free time. I planned to start with a relatively simple model, but the time I spent writing this article far exceeded my expectations... The biggest reason was figuring out how to connect all the pieces of knowledge the model involves into one smooth line of thought. Although different references explain the same model, their starting points and angles differ. Some articles explain things too simplistically, so there is not much to learn from them; others are more detailed, but involve so much background knowledge that the cost of reading and learning from them becomes too high. For example, some people choose to start from logistic regression and derive concepts such as odds from it, while others start from the linear approximation and work backwards to the logistic function. So I spent quite some time writing what I consider a relatively easy-to-understand article, explaining the ins and outs of the model as much as possible while keeping it accessible. Limited by space, however, this article does not go into some of the underlying linear-model and data-distribution statistics, in particular the relationship between the Bernoulli distribution and probability, the expectation of the linear model and homoscedasticity issues, and some of the derivations. I will leave that as a gap for now and maybe fill it in someday...

Author: Calvin Cao
References:
Machine Learning - Zhi-Hua Zhou
Machine Learning - Andrew Ng
