Contents
0 Introduction
1 Discriminative Classifiers
1.1 Logistic Regression
1.2 Support Vector Machines (SVM)
1.3 Classification Tree
1.4 K-Nearest Neighbor Classifier (KNN)
2 Generative classifiers
2.1 Naive Bayes
2.2 Bayesian networks
3 Classification Model Evaluation
3.1 Confusion Matrix
3.2 ROC & AUC
4 Data Processing
4.1 Imbalanced Dataset
0 Introduction
Classification has two major branches: Discriminative Modeling and Generative Modeling. For example, Logistic Regression is a Discriminative Model: we directly estimate the posterior probability of the class given X, without assuming anything about the marginal distribution of X.
In general, a Discriminative Model models the decision boundary between the classes, while a Generative Model explicitly models the actual distribution of each class. In the end both of them predict the conditional probability P(Y|X), but they learn different probabilities along the way.
Training a classifier involves estimating f: X -> Y, or P(Y|X).
Generative classifiers
- Assume some functional form for P(Y), P(X|Y)
- Estimate the parameters of P(X|Y), P(Y) directly from the training data
- Use Bayes' rule to calculate P(Y|X)
Discriminative Classifiers
- Assume some functional form for P(Y|X)
- Estimate the parameters of P(Y|X) directly from the training data
1 Discriminative Classifiers
Discriminative Classifiers model the decision boundary directly; this boundary can be linear or non-linear.
1.1 Logistic Regression
About linear classifiers
Logistic Regression is a linear classifier. A linear classifier is one for which the hyperplane separating the classes is spanned by a linear combination of X.
A linear classifier makes a linear assumption about the decision boundary, i.e. the linear discriminant function has the form f(x) = β₀ + βᵀx.
We can expand the feature space by adding new variables that are functions of the original variables. Linear decision boundaries in the expanded feature space may be non-linear in the original feature space.
Different linear classifiers give different linear decision boundaries because they optimize different objective functions when estimating the parameters.
How it works
For the binary case, the set of points where β₀ + βᵀx = 0, in the vector space spanned by the variables, is the linear decision boundary we estimate.
- First, we fix the form of P(Y=1|X). The underlying assumption is that the log odds, log[P(Y=1|X) / P(Y=0|X)], is a linear combination of X, i.e. equals β₀ + βᵀX.
- Then, we estimate the parameter values by Maximum Likelihood Estimation (the objective being maximized is the likelihood function).
- Finally, we can give the probability that each observation belongs to a given class (see the sketch below).
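A minimal scikit-learn sketch of these steps (the names train_data, train_labels, and test_data are placeholder arrays, as in the SVM snippet later in this post):
# Logistic regression: fit by (regularized) maximum likelihood, then output class probabilities
from sklearn.linear_model import LogisticRegression
clf = LogisticRegression()
clf.fit(train_data, train_labels)
probs = clf.predict_proba(test_data)   # P(Y = k | X) for each observation
pred = clf.predict(test_data)          # class with the largest probability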
Multinomial logistic regression
Suppose we have K + 1 classes. Multinomial logistic regression picks one class as the reference (baseline) and models the log odds of each remaining class against it, which gives K classification boundaries.
We still assume that the log odds of belonging to class k versus the reference class is a linear combination of X, i.e. log[P(Y=k|X) / P(Y=K+1|X)] = β_k0 + β_kᵀX. Working through the algebra, the probability that an observation belongs to class k is
P(Y=k|X=x) = exp(β_k0 + β_kᵀx) / (1 + Σ_{l=1}^{K} exp(β_l0 + β_lᵀx)).
By comparing these probabilities, we choose the class k with the largest probability as the predicted class.
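As an illustrative numpy sketch of this probability formula (the coefficient values here are made up, not estimated from data):
import numpy as np
# Hypothetical coefficients for K = 2 non-reference classes and p = 2 features
betas = np.array([[0.5, 1.0, -0.3],   # beta_10, beta_11, beta_12
                  [-0.2, 0.4, 0.8]])  # beta_20, beta_21, beta_22
x = np.array([1.0, 2.0])
scores = betas[:, 0] + betas[:, 1:] @ x            # beta_k0 + beta_k' x for each class k
p_k = np.exp(scores) / (1 + np.exp(scores).sum())  # P(Y = k | x) for k = 1..K
p_ref = 1 / (1 + np.exp(scores).sum())             # probability of the reference class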
An application of MLR: Discrete Choice Models
One application of multinomial logistic regression is Discrete Choice Models, which appear mostly in consumer-choice research. That is, when a consumer chooses among different products (alternatives), the products' attributes affect, to different degrees, the consumer's perceived utility of each product, and hence the final choice.
Assumptions
DCM rests on two assumptions:
- Independent and identical Gumbel distribution: the random terms in the consumer's choice are i.i.d. Gumbel distributed.
- Independence of Irrelevant Alternatives (IIA): the ratio Prob_i / Prob_j depends only on the two alternatives' own net utilities u_i and u_j, not on any other alternatives; that is, the alternatives should be on the same level, with no hierarchical structure among them.
How it works
- The consumer perceives a net utility for each alternative, which is a linear combination of its attributes: Vᵢ = βᵀxᵢ.
- The true utility is the net utility plus a random term: Uᵢ = Vᵢ + εᵢ.
- When one product's total utility exceeds another's, the consumer chooses the former: product i is chosen over product j if Uᵢ > Uⱼ.
- Thanks to the distributional assumption on the random terms, the probability of choosing product i can be derived mathematically: P(i) = exp(Vᵢ) / Σⱼ exp(Vⱼ).
- Given a training set, we can estimate all the β values and plug them into the equation above to compute the probability that the consumer chooses each product (see the numpy sketch below).
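A small numpy sketch of this calculation, with made-up attribute values and made-up β coefficients:
import numpy as np
# Hypothetical attributes of 3 alternatives (one row each) and estimated coefficients beta
X = np.array([[2.0, 1.0],
              [1.5, 3.0],
              [0.5, 2.0]])
beta = np.array([0.8, 0.3])
V = X @ beta                      # net utility V_i = beta' x_i of each alternative
P = np.exp(V) / np.exp(V).sum()   # logit choice probabilities P(i) = exp(V_i) / sum_j exp(V_j)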
1.2 Support Vector Machines (SVM)
Different SVM kernels produce different decision boundaries.
How it works
For a linear SVM, in one sentence: the goal is to find a decision boundary such that the margin between the support vectors is as large as possible, while any point that falls on the wrong side incurs a penalty proportional to how far onto the wrong side it lies.
Conceptually, the support vectors are the essential or critical training instances (data points); they lie closest to the decision boundary.
The principle in formulas
The objective function to be minimized is:
minimize over β₀, β:  Σᵢ max(0, 1 − yᵢ(β₀ + βᵀxᵢ)) + λ‖β‖²
The first term is the penalty term: the more points fall on the wrong side (or inside the margin), the larger it gets. The second term is proportional to the inverse of the margin: the larger the margin, the smaller this term. Here λ is a tuning parameter: the smaller λ is, the less tolerant the model is of mis-classified examples, and so the more easily it over-fits.
Kernel
If the decision boundary is non-linear, a kernel function maps the existing observations into a higher-dimensional space, and a linear SVM is applied in that space.
The value of the kernel function is that, although it also corresponds to transforming the features from a low-dimensional to a high-dimensional space, the computation is carried out in the low-dimensional space beforehand, while the classification effect is realized in the high-dimensional space. This avoids the expensive computation that working directly in the high-dimensional space would require. (A small numerical illustration follows the list of common kernels below.)
• Linear Kernel
• Polynomial Kernel
• Sigmoid Kernel
• Gaussian Radial Basis Function (RBF) Kernel
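As a small numerical illustration of the kernel trick (independent of any particular SVM library): for the degree-2 polynomial kernel K(x, z) = (x·z + 1)², evaluating the kernel in the original low-dimensional space gives the same number as an inner product after an explicit map φ into the higher-dimensional space.
import numpy as np
x = np.array([1.0, 2.0])
z = np.array([0.5, -1.0])
k_val = (x @ z + 1) ** 2   # kernel evaluated in the original space
# Explicit degree-2 feature map phi(v) = (v1^2, v2^2, sqrt(2) v1 v2, sqrt(2) v1, sqrt(2) v2, 1)
def phi(v):
    return np.array([v[0]**2, v[1]**2, np.sqrt(2)*v[0]*v[1],
                     np.sqrt(2)*v[0], np.sqrt(2)*v[1], 1.0])
k_explicit = phi(x) @ phi(z)   # same value as k_val, up to floating-point error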
Multi-class SVM
There are two ways to handle the multi-class case:
- One-versus-All: If the number of classes is K > 2, then K different 2-class SVM classifiers are fitted, each comparing one class with the remaining classes combined. A new observation is assigned to the class whose classifier value is largest. This approach has some issues, such as imbalanced classes during training and scores that are not on a common scale.
- One-versus-One: All pairwise classifiers are fitted, and a test observation is assigned to the class that wins the majority of the pairwise comparisons.
The second approach is generally used, but when K is large the first is preferable, since One-versus-One requires fitting K(K−1)/2 classifiers (see the scikit-learn note below).
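For reference, scikit-learn's SVC handles the multi-class case internally using the one-versus-one scheme; the decision_function_shape parameter ('ovo' or 'ovr') only changes the shape of the returned decision values, not how training is done.
from sklearn import svm
# Multi-class SVC: one-versus-one classifiers are fitted internally when there are K > 2 classes
clf = svm.SVC(kernel='rbf', gamma='scale', decision_function_shape='ovr')
clf.fit(train_data, train_labels)   # train_data / train_labels as in the snippet below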
Python code
# Linear kernel SVM
from sklearn import svm
from sklearn import metrics
from sklearn.metrics import accuracy_score
classifier = svm.SVC(kernel='linear', gamma='scale')
classifier.fit(train_data, train_labels)
# RBF kernel SVM
classifier = svm.SVC(kernel='rbf', gamma='scale')
classifier.fit(train_data, train_labels)
# Predict on the test set and evaluate accuracy
pred = classifier.predict(test_data)
accuracy = accuracy_score(test_labels, pred)
# Confusion matrix (uses the metrics module imported above)
cm = metrics.confusion_matrix(test_labels, pred)
1.3 Classification Tree
Compared with logistic regression and SVM, which use hyperplanes as classification boundaries, classification trees partition the feature space in a hierarchical way.
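A minimal scikit-learn sketch (train_data, train_labels, and test_data are placeholders as in the SVM snippet above):
from sklearn.tree import DecisionTreeClassifier
# Each split partitions the feature space along a single feature, producing
# axis-aligned, hierarchical regions rather than one separating hyperplane
tree = DecisionTreeClassifier(max_depth=3)
tree.fit(train_data, train_labels)
pred = tree.predict(test_data)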
1.4 K-Nearest Neighbor Classifier (KNN)
The idea behind KNN is that, to predict the class of a new point x, we look at the classes of its K nearest neighbors and assign x to the majority class among them.
Distance computation
There are several ways to measure the distance between points in the feature space, such as Manhattan distance and Euclidean distance. KNN usually uses Euclidean distance, computed as d(x, y) = sqrt(Σᵢ (xᵢ − yᵢ)²).
Characteristics
KNN is a non-parametric, lazy algorithm.
Non-parametric does not mean the algorithm has no parameters; it means the model makes no assumptions about the underlying data, in contrast with, say, linear regression (where we always assume a straight-line relationship). In other words, the structure of the model KNN builds is determined by the data, which matches reality better, since real data often do not satisfy theoretical assumptions.
And what does lazy mean? Compare with logistic regression, another classification algorithm, which must first be trained extensively on the data before producing a model. KNN does not need this: it has no explicit training phase, or rather that phase is very fast.
Python code
Note: for KNN, the features must be normalized during preprocessing; otherwise features with larger scales will dominate the distance and hence the classification.
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
# Create standardizer
standardizer = StandardScaler()
# Standardize features
X_std = standardizer.fit_transform(X)
# Train a KNN classifier with 5 neighbors
knn = KNeighborsClassifier(n_neighbors=5, n_jobs=-1).fit(X_std, y)
# Test data must be scaled with the same standardizer before predicting
X_test_std = standardizer.transform(X_test)
knn.predict(X_test_std)
knn.predict_proba(X_test_std)
Using a pipeline to search for the best K:
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
# Create standardizer
standardizer = StandardScaler()
# Create a KNN classifier
knn = KNeighborsClassifier(n_neighbors=5, n_jobs=-1)
# Create a pipeline (standardization is then applied inside each cross-validation fold)
pipe = Pipeline([("standardizer", standardizer), ("knn", knn)])
# Create space of candidate values for K
search_space = [{"knn__n_neighbors": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]}]
# Grid search over K with 5-fold cross-validation
classifier = GridSearchCV(pipe, search_space, cv=5, verbose=0).fit(features, target)
# Best neighborhood size (k)
classifier.best_estimator_.get_params()["knn__n_neighbors"]
2 Generative classifiers
2.1 Naive Bayes
2.1.1 Naive Bayes Classifier
The Naive Bayes classifier is an application of Bayes' theorem to the classification task.
Assumption
Conditional independence between the features, given the class.
Essentially, the Naive Bayes classifier first estimates the prior probability P(Y) and the likelihood P(X|Y) (the likelihood is estimated from the training data by MLE), then computes the posterior probability of each class via Bayes' theorem, P(Y=k|X) ∝ P(Y=k) · Πᵢ P(Xᵢ|Y=k), and finally chooses the optimal class by MAP (maximum a posteriori).
Variants
The essential difference between the two NB variants lies in the assumed distribution of the features, which changes how the likelihood is computed:
- Gaussian NB
- Multinomial NB
Parameter settings
Additive smoothing: a small count (e.g. 1, for Laplace smoothing) is added to every feature count so that feature values never seen with a class in the training data do not force the likelihood, and hence the posterior, to zero.
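A minimal scikit-learn sketch of the two variants; the alpha parameter of MultinomialNB is the additive (Laplace/Lidstone) smoothing amount, and X_train, y_train, X_counts, X_counts_test are placeholder arrays:
from sklearn.naive_bayes import GaussianNB, MultinomialNB
# Gaussian NB: continuous features, each assumed Gaussian within a class
gnb = GaussianNB().fit(X_train, y_train)
# Multinomial NB: count features (e.g. word counts); alpha = 1.0 is Laplace smoothing
mnb = MultinomialNB(alpha=1.0).fit(X_counts, y_train)
posterior = mnb.predict_proba(X_counts_test)   # posterior probability of each class
pred = mnb.predict(X_counts_test)              # MAP class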
2.2 Bayesian networks
3 Classification Model Evaluation
3.1 Confusion Matrix
The evaluation metrics that can be derived from the confusion matrix also include the following:
- Recall: How many of the positive cases did the model catch? As high as possible. Recall = TP / (TP + FN).
- Precision: What percent of positive predictions were correct? As high as possible. Precision = TP / (TP + FP).
- Accuracy: Out of all the cases, how many were predicted correctly? As high as possible. Accuracy = (TP + TN) / Total.
- F1 score (F-measure): It is difficult to compare two models when one has low precision and high recall, or vice versa. The F1-score measures recall and precision at the same time; it uses the harmonic mean instead of the arithmetic mean, which punishes extreme values more: F1 = 2 · Precision · Recall / (Precision + Recall). (These metrics can be computed as in the sketch below.)
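A minimal scikit-learn sketch for a binary classifier (test_labels and pred as in the SVM snippet above):
from sklearn.metrics import confusion_matrix, precision_score, recall_score, f1_score, accuracy_score
cm = confusion_matrix(test_labels, pred)        # rows = true class, columns = predicted class
precision = precision_score(test_labels, pred)  # TP / (TP + FP)
recall = recall_score(test_labels, pred)        # TP / (TP + FN)
accuracy = accuracy_score(test_labels, pred)    # (TP + TN) / Total
f1 = f1_score(test_labels, pred)                # harmonic mean of precision and recall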
3.2 ROC & AUC
A ROC (Receiver Operating Characteristic) curve is used to visualize the performance of a binary classifier.
AUC (Area Under Curve) is used to summarize performance in a single number. It ranges from 0 to 1; the closer to 1, the stronger the model's ability to separate the classes.
- It tells how well the model can distinguish between classes.
- The higher the AUC, the better the model is at predicting 0s as 0s and 1s as 1s. For example, the higher the AUC, the better the model is at distinguishing patients with the disease from those without it (see the sketch below).
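A minimal scikit-learn sketch, assuming scores holds the predicted probability of the positive class (e.g. classifier.predict_proba(test_data)[:, 1]):
from sklearn.metrics import roc_curve, roc_auc_score
fpr, tpr, thresholds = roc_curve(test_labels, scores)   # points of the ROC curve
auc = roc_auc_score(test_labels, scores)                 # area under the ROC curve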
4 Data Processing
4.1 Imbalanced Dataset
An imbalanced dataset will affect the accuracy of the classification.
Remedies
- Random undersampling: Reduce the majority class to match the minority class count.
- Random oversampling: Increase the minority class by randomly re-sampling (with replacement) from the minority class until the counts of both classes match.
- Synthetic Minority Over-Sampling Technique (SMOTE): Increase the minority class by introducing synthetic examples, generated by interpolating between each minority sample and its k (default = 5) nearest minority-class neighbors, with similarity measured in feature space (Euclidean distance). A code sketch of all three remedies follows this list.
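A minimal sketch of these three remedies, assuming the third-party imbalanced-learn (imblearn) package is available; X and y are placeholder feature and label arrays:
from imblearn.under_sampling import RandomUnderSampler
from imblearn.over_sampling import RandomOverSampler, SMOTE
# Random undersampling: shrink the majority class to the minority class count
X_under, y_under = RandomUnderSampler().fit_resample(X, y)
# Random oversampling: duplicate minority samples until both classes match
X_over, y_over = RandomOverSampler().fit_resample(X, y)
# SMOTE: synthesize minority samples by interpolating between each sample and its k nearest minority neighbors
X_smote, y_smote = SMOTE(k_neighbors=5).fit_resample(X, y)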