Contents
0 Introduction
1 Discriminative Classifiers
1.1 Logistic Regression
1.2 Support Vector Machines (SVM)
1.3 Classification Tree
1.4 K-Nearest Neighbor Classifier (KNN)
2 Generative classifiers
2.1 Naive Bayes
2.2 Bayesian networks
3 Classification Model Evaluation
3.1 Confusion Matrix
3.2 ROC & AUC
4 Data Processing
4.1 Imbalanced Dataset
0 Introduction
Classification has two major branches: Discriminative Modeling and Generative Modeling. For example, Logistic Regression is a Discriminative Model: we directly estimate the posterior probability of the class given X, without assuming anything about the marginal distribution of X.
In general, a Discriminative Model models the decision boundary between the classes, while a Generative Model explicitly models the actual distribution of each class. In the end both of them predict the conditional probability P(Y|X), but they learn different probabilities along the way.
Training a classifier involves estimating f: X -> Y, or P(Y|X).
Generative classifiers
- Assume some functional form for P(Y), P(X|Y)
- Estimate the parameters of P(X|Y), P(Y) directly from the training data
- Use Bayes' rule to calculate P(Y|X)
Discriminative Classifiers
- Assume some functional form for P(Y|X)
- Estimate the parameters of P(Y|X) directly from the training data
1 Discriminative Classifiers
Discriminative Classifiers model the decision boundary directly; this boundary can be linear or non-linear.
1.1 Logistic Regression
About linear classifiers
Logistic Regression is a linear classifier. A linear classifier is one for which the hyperplane separating the classes is spanned by a linear combination of X.
A linear classifier makes a linear assumption about the decision boundary, i.e. the linear discriminant function has the form f(x) = β₀ + βᵀx.
We can expand the feature space by adding new variables that are functions of the original variables. Linear decision boundaries in the expanded feature space may be non-linear in the original feature space.
Different linear classifiers give different linear decision boundaries because they optimize different objective functions when estimating the parameters.
How it works
For the binary case, the set of points where β₀ + βᵀx = 0, in the vector space spanned by the variables, is the linear decision boundary we estimate.
- First, we fix the form of P(Y=1|X). The underlying assumption is that the log odds, log[P(Y=1|X) / P(Y=0|X)], is a linear combination of X, i.e. equals β₀ + βᵀX.
- Then, we estimate the parameter values by Maximum Likelihood Estimation (the objective being maximized is the likelihood function).
- Finally, we can give the probability that each observation belongs to a given class (see the sketch below).
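A minimal scikit-learn sketch of these steps (the names train_data, train_labels, and test_data are placeholder arrays, as in the SVM snippet later in this post):
# Logistic regression: fit by (regularized) maximum likelihood, then output class probabilities
from sklearn.linear_model import LogisticRegression
clf = LogisticRegression()
clf.fit(train_data, train_labels)
probs = clf.predict_proba(test_data)   # P(Y = k | X) for each observation
pred = clf.predict(test_data)          # class with the largest probability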
Multinomial logistic regression
Suppose we have K + 1 classes. Multinomial logistic regression picks one class as the reference (baseline) and models the log odds of each remaining class against it, which gives K classification boundaries.
We still assume that the log odds of belonging to class k versus the reference class is a linear combination of X, i.e. log[P(Y=k|X) / P(Y=K+1|X)] = β_k0 + β_kᵀX. Working through the algebra, the probability that an observation belongs to class k is
P(Y=k|X=x) = exp(β_k0 + β_kᵀx) / (1 + Σ_{l=1}^{K} exp(β_l0 + β_lᵀx)).
By comparing these probabilities, we choose the class k with the largest probability as the predicted class.
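As an illustrative numpy sketch of this probability formula (the coefficient values here are made up, not estimated from data):
import numpy as np
# Hypothetical coefficients for K = 2 non-reference classes and p = 2 features
betas = np.array([[0.5, 1.0, -0.3],   # beta_10, beta_11, beta_12
                  [-0.2, 0.4, 0.8]])  # beta_20, beta_21, beta_22
x = np.array([1.0, 2.0])
scores = betas[:, 0] + betas[:, 1:] @ x            # beta_k0 + beta_k' x for each class k
p_k = np.exp(scores) / (1 + np.exp(scores).sum())  # P(Y = k | x) for k = 1..K
p_ref = 1 / (1 + np.exp(scores).sum())             # probability of the reference class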
An application of MLR: Discrete Choice Models
One application of multinomial logistic regression is Discrete Choice Models, which appear mostly in consumer-choice research. That is, when a consumer chooses among different products (alternatives), the products' attributes affect, to different degrees, the consumer's perceived utility of each product, and hence the final choice.
Assumptions
DCM rests on two assumptions:
- Independent and identical Gumbel distribution: the random terms in the consumer's choice are i.i.d. Gumbel distributed.
- Independence of Irrelevant Alternatives (IIA): the ratio Prob_i / Prob_j depends only on the two alternatives' own net utilities u_i and u_j, not on any other alternatives; that is, the alternatives should be on the same level, with no hierarchical structure among them.
How it works
- The consumer perceives a net utility for each alternative, which is a linear combination of its attributes: Vᵢ = βᵀxᵢ.
- The true utility is the net utility plus a random term: Uᵢ = Vᵢ + εᵢ.
- When one product's total utility exceeds another's, the consumer chooses the former: product i is chosen over product j if Uᵢ > Uⱼ.
- Thanks to the distributional assumption on the random terms, the probability of choosing product i can be derived mathematically: P(i) = exp(Vᵢ) / Σⱼ exp(Vⱼ).
- Given a training set, we can estimate all the β values and plug them into the equation above to compute the probability that the consumer chooses each product (see the numpy sketch below).
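A small numpy sketch of this calculation, with made-up attribute values and made-up β coefficients:
import numpy as np
# Hypothetical attributes of 3 alternatives (one row each) and estimated coefficients beta
X = np.array([[2.0, 1.0],
              [1.5, 3.0],
              [0.5, 2.0]])
beta = np.array([0.8, 0.3])
V = X @ beta                      # net utility V_i = beta' x_i of each alternative
P = np.exp(V) / np.exp(V).sum()   # logit choice probabilities P(i) = exp(V_i) / sum_j exp(V_j)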
1.2 Support Vector Machines (SVM)
Different SVM kernels produce different decision boundaries.
How it works
For a linear SVM, in one sentence: the goal is to find a decision boundary such that the margin between the support vectors is as large as possible, while any point that falls on the wrong side incurs a penalty proportional to how far onto the wrong side it lies.
Conceptually, the support vectors are the essential or critical training instances (data points); they lie closest to the decision boundary.
The principle in formulas
The objective function to be minimized is:
minimize over β₀, β:  Σᵢ max(0, 1 − yᵢ(β₀ + βᵀxᵢ)) + λ‖β‖²
The first term is the penalty term: the more points fall on the wrong side (or inside the margin), the larger it gets. The second term is proportional to the inverse of the margin: the larger the margin, the smaller this term. Here λ is a tuning parameter: the smaller λ is, the less tolerant the model is of mis-classified examples, and so the more easily it over-fits.
Kernel
If the decision boundary is non-linear, a kernel function maps the existing observations into a higher-dimensional space, and a linear SVM is applied in that space.
The value of the kernel function is that, although it also corresponds to transforming the features from a low-dimensional to a high-dimensional space, the computation is carried out in the low-dimensional space beforehand, while the classification effect is realized in the high-dimensional space. This avoids the expensive computation that working directly in the high-dimensional space would require. (A small numerical illustration follows the list of common kernels below.)
• Linear Kernel
• Polynomial Kernel
• Sigmoid Kernel
• Gaussian Radial Basis Function (RBF) Kernel
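As a small numerical illustration of the kernel trick (independent of any particular SVM library): for the degree-2 polynomial kernel K(x, z) = (x·z + 1)², evaluating the kernel in the original low-dimensional space gives the same number as an inner product after an explicit map φ into the higher-dimensional space.
import numpy as np
x = np.array([1.0, 2.0])
z = np.array([0.5, -1.0])
k_val = (x @ z + 1) ** 2   # kernel evaluated in the original space
# Explicit degree-2 feature map phi(v) = (v1^2, v2^2, sqrt(2) v1 v2, sqrt(2) v1, sqrt(2) v2, 1)
def phi(v):
    return np.array([v[0]**2, v[1]**2, np.sqrt(2)*v[0]*v[1],
                     np.sqrt(2)*v[0], np.sqrt(2)*v[1], 1.0])
k_explicit = phi(x) @ phi(z)   # same value as k_val, up to floating-point error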
Multi-class SVM
There are two ways to handle the multi-class case:
- One-versus-All: If the number of classes is K > 2, then K different 2-class SVM classifiers are fitted, each comparing one class with the remaining classes combined. A new observation is assigned to the class whose classifier value is largest. This approach has some issues, such as imbalanced classes during training and scores that are not on a common scale.
- One-versus-One: All pairwise classifiers are fitted, and a test observation is assigned to the class that wins the majority of the pairwise comparisons.
The second approach is generally used, but when K is large the first is preferable, since One-versus-One requires fitting K(K−1)/2 classifiers (see the scikit-learn note below).
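For reference, scikit-learn's SVC handles the multi-class case internally using the one-versus-one scheme; the decision_function_shape parameter ('ovo' or 'ovr') only changes the shape of the returned decision values, not how training is done.
from sklearn import svm
# Multi-class SVC: one-versus-one classifiers are fitted internally when there are K > 2 classes
clf = svm.SVC(kernel='rbf', gamma='scale', decision_function_shape='ovr')
clf.fit(train_data, train_labels)   # train_data / train_labels as in the snippet below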
Python code
# Linear kernel SVM
from sklearn import svm
from sklearn import metrics
from sklearn.metrics import accuracy_score
classifier = svm.SVC(kernel='linear', gamma='scale')
classifier.fit(train_data, train_labels)
# RBF kernel SVM
classifier = svm.SVC(kernel='rbf', gamma='scale')
classifier.fit(train_data, train_labels)
# Predict on the test set and evaluate accuracy
pred = classifier.predict(test_data)
accuracy = accuracy_score(test_labels, pred)
# Confusion matrix (uses the metrics module imported above)
cm = metrics.confusion_matrix(test_labels, pred)
1.3 Classification Tree
Compared with logistic regression and SVM, which use hyperplanes as classification boundaries, classification trees partition the feature space in a hierarchical way.
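A minimal scikit-learn sketch (train_data, train_labels, and test_data are placeholders as in the SVM snippet above):
from sklearn.tree import DecisionTreeClassifier
# Each split partitions the feature space along a single feature, producing
# axis-aligned, hierarchical regions rather than one separating hyperplane
tree = DecisionTreeClassifier(max_depth=3)
tree.fit(train_data, train_labels)
pred = tree.predict(test_data)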
1.4 K-Nearest Neighbor Classifier (KNN)
The idea behind KNN is that, to predict the class of a new point x, we look at the classes of its K nearest neighbors and assign x to the majority class among them.
Distance computation
There are several ways to measure the distance between points in the feature space, such as Manhattan distance and Euclidean distance. KNN usually uses Euclidean distance, computed as d(x, y) = sqrt(Σᵢ (xᵢ − yᵢ)²).
Characteristics
KNN is a non-parametric, lazy algorithm.
Non-parametric does not mean the algorithm has no parameters; it means the model makes no assumptions about the underlying data, in contrast with, say, linear regression (where we always assume a straight-line relationship). In other words, the structure of the model KNN builds is determined by the data, which matches reality better, since real data often do not satisfy theoretical assumptions.
And what does lazy mean? Compare with logistic regression, another classification algorithm, which must first be trained extensively on the data before producing a model. KNN does not need this: it has no explicit training phase, or rather that phase is very fast.
Python code
Note: for KNN, the features must be normalized during preprocessing; otherwise features with larger scales will dominate the distance and hence the classification.
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
# Create standardizer
standardizer = StandardScaler()
# Standardize features
X_std = standardizer.fit_transform(X)
# Train a KNN classifier with 5 neighbors
knn = KNeighborsClassifier(n_neighbors=5, n_jobs=-1).fit(X_std, y)
# Test data must be scaled with the same standardizer before predicting
X_test_std = standardizer.transform(X_test)
knn.predict(X_test_std)
knn.predict_proba(X_test_std)
Using a pipeline to search for the best K:
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
# Create standardizer
standardizer = StandardScaler()
# Create a KNN classifier
knn = KNeighborsClassifier(n_neighbors=5, n_jobs=-1)
# Create a pipeline (standardization is then applied inside each cross-validation fold)
pipe = Pipeline([("standardizer", standardizer), ("knn", knn)])
# Create space of candidate values for K
search_space = [{"knn__n_neighbors": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]}]
# Grid search over K with 5-fold cross-validation
classifier = GridSearchCV(pipe, search_space, cv=5, verbose=0).fit(features, target)
# Best neighborhood size (k)
classifier.best_estimator_.get_params()["knn__n_neighbors"]
2 Generative classifiers
2.1 Naive Bayes
2.1.1 Naive Bayes Classifier
The Naive Bayes classifier is an application of Bayes' theorem to the classification task.
Assumption
Conditional independence between the features, given the class.
Essentially, the Naive Bayes classifier first estimates the prior probability P(Y) and the likelihood P(X|Y) (the likelihood is estimated from the training data by MLE), then computes the posterior probability of each class via Bayes' theorem, P(Y=k|X) ∝ P(Y=k) · Πᵢ P(Xᵢ|Y=k), and finally chooses the optimal class by MAP (maximum a posteriori).
Variants
The essential difference between the two NB variants lies in the assumed distribution of the features, which changes how the likelihood is computed:
- Gaussian NB
- Multinomial NB
Parameter settings
Additive smoothing: a small count (e.g. 1, for Laplace smoothing) is added to every feature count so that feature values never seen with a class in the training data do not force the likelihood, and hence the posterior, to zero.
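A minimal scikit-learn sketch of the two variants; the alpha parameter of MultinomialNB is the additive (Laplace/Lidstone) smoothing amount, and X_train, y_train, X_counts, X_counts_test are placeholder arrays:
from sklearn.naive_bayes import GaussianNB, MultinomialNB
# Gaussian NB: continuous features, each assumed Gaussian within a class
gnb = GaussianNB().fit(X_train, y_train)
# Multinomial NB: count features (e.g. word counts); alpha = 1.0 is Laplace smoothing
mnb = MultinomialNB(alpha=1.0).fit(X_counts, y_train)
posterior = mnb.predict_proba(X_counts_test)   # posterior probability of each class
pred = mnb.predict(X_counts_test)              # MAP class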
2.2 Bayesian networks
3 Classification Model Evaluation
3.1 Confusion Matrix
The evaluation metrics that can be derived from the confusion matrix also include the following:
- Recall: How many of the positive cases did the model catch? As high as possible. Recall = TP / (TP + FN).
- Precision: What percent of positive predictions were correct? As high as possible. Precision = TP / (TP + FP).
- Accuracy: Out of all the cases, how many were predicted correctly? As high as possible. Accuracy = (TP + TN) / Total.
- F1 score (F-measure): It is difficult to compare two models when one has low precision and high recall, or vice versa. The F1-score measures recall and precision at the same time; it uses the harmonic mean instead of the arithmetic mean, which punishes extreme values more: F1 = 2 · Precision · Recall / (Precision + Recall). (These metrics can be computed as in the sketch below.)
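A minimal scikit-learn sketch for a binary classifier (test_labels and pred as in the SVM snippet above):
from sklearn.metrics import confusion_matrix, precision_score, recall_score, f1_score, accuracy_score
cm = confusion_matrix(test_labels, pred)        # rows = true class, columns = predicted class
precision = precision_score(test_labels, pred)  # TP / (TP + FP)
recall = recall_score(test_labels, pred)        # TP / (TP + FN)
accuracy = accuracy_score(test_labels, pred)    # (TP + TN) / Total
f1 = f1_score(test_labels, pred)                # harmonic mean of precision and recall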
3.2 ROC & AUC
A ROC (Receiver Operating Characteristic) curve is used to visualize the performance of a binary classifier.
AUC (Area Under Curve) is used to summarize performance in a single number. It ranges from 0 to 1; the closer to 1, the stronger the model's ability to separate the classes.
- It tells how well the model can distinguish between classes.
- The higher the AUC, the better the model is at predicting 0s as 0s and 1s as 1s. For example, the higher the AUC, the better the model is at distinguishing patients with the disease from those without it (see the sketch below).
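A minimal scikit-learn sketch, assuming scores holds the predicted probability of the positive class (e.g. classifier.predict_proba(test_data)[:, 1]):
from sklearn.metrics import roc_curve, roc_auc_score
fpr, tpr, thresholds = roc_curve(test_labels, scores)   # points of the ROC curve
auc = roc_auc_score(test_labels, scores)                 # area under the ROC curve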
4 Data Processing
4.1 Imbalanced Dataset
An imbalanced dataset will affect the accuracy of the classification.
Remedies
- Random undersampling: Reduce the majority class to match the minority class count.
- Random oversampling: Increase the minority class by randomly re-sampling (with replacement) from the minority class until the counts of both classes match.
- Synthetic Minority Over-Sampling Technique (SMOTE): Increase the minority class by introducing synthetic examples, generated by interpolating between each minority sample and its k (default = 5) nearest minority-class neighbors, with similarity measured in feature space (Euclidean distance). A code sketch of all three remedies follows this list.
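A minimal sketch of these three remedies, assuming the third-party imbalanced-learn (imblearn) package is available; X and y are placeholder feature and label arrays:
from imblearn.under_sampling import RandomUnderSampler
from imblearn.over_sampling import RandomOverSampler, SMOTE
# Random undersampling: shrink the majority class to the minority class count
X_under, y_under = RandomUnderSampler().fit_resample(X, y)
# Random oversampling: duplicate minority samples until both classes match
X_over, y_over = RandomOverSampler().fit_resample(X, y)
# SMOTE: synthesize minority samples by interpolating between each sample and its k nearest minority neighbors
X_smote, y_smote = SMOTE(k_neighbors=5).fit_resample(X, y)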