1. Data Dimensionality
PCA: principal component analysis
PCA is a general-purpose method used across many kinds of data analysis, including feature set compression.
Whenever you are visualizing data, you can apply principal component analysis.
Two-dimensional data
One-dimensional data
The data is not strictly one-dimensional; it deviates in places, but to make sense of it I am happy to treat those deviations as noise and regard it as a one-dimensional data set:
PCA is particularly good at handling shifts and rotations of the coordinate system.
6. PCA for Data Transformation
Given data of any shape:
- PCA finds a new coordinate system that's obtained from the old one by translation and rotation only
- PCA moves the center of the coordinate system to the center of the data
- PCA moves the x-axis onto the principal axis of variation, the direction where you see the most variation relative to all the data points
- PCA moves the y-axis to an orthogonal, less important direction of variation
PCA finds these axes for you and tells you how important each one is.
7. The Center of the New Coordinate System
(2, 3)
8. The Principal Axis of the New Coordinate System
Δx = 1
Δy = 1
9. The Second Principal Component of the New System
Δx = -1
Δy = 1
When writing these vectors in PCA, the convention is to normalize them so that each output vector has length 1.
After normalizing the PCA component vectors:
Δx (black) = 1/√2
Δy (black) = 1/√2   # the new x-axis
The vectors for the new x-axis and the new y-axis are orthogonal.
Δx (red) = -1/√2
Δy (red) = 1/√2   # the new y-axis
11. Practice Finding the New Axes
PCA also gives you one more important number for each axis: its spread value.
If the data has little spread off the principal axis, the spread value tends to be a large number for the principal axis and a much smaller one for the second principal component axis.
12. Which Data Can Be Used with PCA
Part of the beauty of PCA is that the data doesn't have to be perfectly 1D in order to find the principal axis!
13. When Does an Axis Dominate
Does the long axis dominate?
The long axis dominates when its importance value, i.e. the eigenvalue of the long axis, is larger than the eigenvalue of the short axis.
14. Measurable vs. Latent Features Quiz
Given some parameters of a house, which of the following algorithms would you use to predict its price?
□ Decision tree classifier
□ SVC
□ √ Linear regression
Because the output we want to predict is continuous, using a classifier is not appropriate.
15. From Four Features to Two
Given some parameters of a house, predict its price.
Measurable features:
square footage
no. of rooms
school ranking
neighborhood safety
Latent features:
size
neighborhood
16. Compressing While Preserving Information
What is the best way to condense our four features down to two, so that we really capture the core information?
What we actually want to probe are the two latent features, size and neighborhood.
Which is the most suitable feature selection tool?
□ √ SelectKBest (K = the number of features to keep)
□ SelectPercentile (specify the percentage of features you want to keep)
Because we already know we want two features, we use SelectKBest, which keeps the two strongest features and discards all the others.
If we knew how many candidate features there were to begin with, as well as how many we need in the end, we could also use SelectPercentile.
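A minimal sketch of both tools, assuming a hypothetical 100×4 feature matrix standing in for the four house features and a synthetic price target; the scoring function (f_regression here) is a choice for illustration, not something specified in the lesson:

import numpy as np
from sklearn.feature_selection import SelectKBest, SelectPercentile, f_regression

# Hypothetical data: 4 measurable features per house, continuous price target.
rng = np.random.RandomState(42)
X = rng.rand(100, 4)          # square footage, no. of rooms, school ranking, safety
y = X[:, 0] * 3 + X[:, 2] * 2 + rng.rand(100) * 0.1   # price driven mostly by two features

# Keep the 2 strongest features when we know the exact number we want...
X_k = SelectKBest(score_func=f_regression, k=2).fit_transform(X, y)

# ...or keep the top 50% when we think in terms of a fraction of the features.
X_p = SelectPercentile(score_func=f_regression, percentile=50).fit_transform(X, y)

print(X_k.shape, X_p.shape)   # (100, 2) (100, 2)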
17. Composite Features
I have many features available, but suppose only a small number of them are driving the patterns in the data. I will then construct a composite feature from them in order to get at the underlying phenomenon.
This composite (combined) feature is also called a principal component. PCA is a very powerful algorithm; in this lesson we mainly discuss it in the context of dimensionality reduction, shrinking a large set of features down to just a few.
PCA is also a very powerful standalone algorithm for unsupervised learning.
Example: turn square footage and no. of rooms into size.
The figure above looks a bit like linear regression, but PCA is not linear regression. Linear regression tries to predict an output value from an input value, while PCA does not predict anything; instead, it finds the dominant direction of the data, so that the data can be projected onto that direction while losing as little information as possible.
Once I have found the principal component, i.e. the direction of this vector, I apply an operation called a projection to every data point. The data starts out two-dimensional, but after I project it onto the principal component it becomes one-dimensional.
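A small sketch of that projection using sklearn's PCA, on hypothetical correlated data standing in for square footage and number of rooms:

import numpy as np
from sklearn.decomposition import PCA

# Hypothetical 2-D data: square footage and number of rooms, strongly correlated,
# so one latent "size" direction carries most of the information.
rng = np.random.RandomState(0)
sqft = rng.normal(2000, 500, size=200)
rooms = sqft / 500.0 + rng.normal(0, 0.3, size=200)
X = np.column_stack([sqft, rooms])

pca = PCA(n_components=1)
size = pca.fit_transform(X)        # each 2-D point projected onto the principal component

print(pca.components_[0])          # direction of maximal variance (the "size" axis)
print(size.shape)                  # (200, 1) -- the data is now one-dimensional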
18. Maximal Variance
variance
- the willingness/flexibility of an algorithm to learn
- technical term in statistics -- roughly the 'spread' of a data distribution (similar to standard deviation)
A feature with large variance has samples spread over a very wide range of values; with small variance, the samples are usually tightly clustered together.
In the figure above, draw an ellipse around the data so that it contains most of the points. The ellipse can be parameterized by two numbers: the length of its long axis and the length of its short axis. Of these two lines, which one points in the direction of the data's maximal variance, i.e. in which direction is the data more spread out?
The line along the long axis is the direction of maximal variance of the data.
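A quick numerical illustration on hypothetical data (not from the lesson): generate an elliptical cloud and compare the variance of the data projected onto the long axis versus the short axis:

import numpy as np

# Hypothetical elliptical cloud of 2-D points: long axis along (1, 1), short axis along (-1, 1).
rng = np.random.RandomState(3)
long_dir = np.array([1.0, 1.0]) / np.sqrt(2)
short_dir = np.array([-1.0, 1.0]) / np.sqrt(2)
X = rng.normal(0, 3.0, size=(500, 1)) * long_dir + rng.normal(0, 0.5, size=(500, 1)) * short_dir

# Variance of the data projected onto each axis of the ellipse.
print("variance along long axis: ", np.var(X @ long_dir))   # ~9   -> direction of maximal variance
print("variance along short axis:", np.var(X @ short_dir))  # ~0.25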
19. The Advantage of Maximal Variance
The principal component of a data set is defined as the direction that has the largest variance. Why?
Why do you think we define the principal component this way?
What's the advantage of looking for the direction that has the largest variance?
When we project this two-dimensional feature space down onto one dimension, why do we project all the data points down onto the heavy red line instead of projecting them onto the shorter line?
□ Lower computational complexity
□ √ It retains the maximum amount of information from the original data
□ It's just a convention; there is no real reason for it
When we project along the dimension of maximal variance, we retain the most information from the original data.
20. Maximal Variance and Information Loss
safety problems + school ranking → (PCA) → neighborhood quality
find the direction of maximal variance
The direction of maximal variance is the direction that minimizes the loss of information.
When I project these two-dimensional points onto the one-dimensional line, I lose information; the amount lost for a given point equals the distance between that point and its new position on the line.
21. Information Loss and Principal Components
Information loss: the sum of the distances between each point and its newly projected position on the line.
When we maximize the variance, we are actually minimizing the distance between each point and its projection onto the line.
projection onto the direction of maximal variance minimizes the distance from the old (higher-dimensional) data point to its new transformed value
→ minimizes information loss
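A sketch of measuring that information loss with sklearn on hypothetical 2-D data: transform down to one component, map back with inverse_transform (which places each point at its projection on the principal axis), and sum the distances:

import numpy as np
from sklearn.decomposition import PCA

# Hypothetical 2-D data with one dominant direction of variance.
rng = np.random.RandomState(1)
t = rng.normal(size=300)
X = np.column_stack([t, t * 0.5]) + rng.normal(scale=0.1, size=(300, 2))

pca = PCA(n_components=1).fit(X)
X_projected = pca.inverse_transform(pca.transform(X))   # points moved onto the principal axis

# Information loss = total distance between each original point and its projection.
loss = np.sum(np.linalg.norm(X - X_projected, axis=1))
print("total information loss:", loss)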
23. PCA for Feature Transformation
PCA as a general algorithm for feature transformation
We feed all four features into PCA together; it automatically combines them into new features and ranks the relative power of those new features. If, as in our case, there are two hidden features driving most of the variation in the data, PCA will pick them out and make them the first and second principal components, the first principal component being the one with the most influence.
Because the first principal component is a mixture, it may contain bits of every feature to varying degrees. Still, this unsupervised learning algorithm is very powerful and helps you understand the hidden features in your data at a fundamental level. Even if you knew nothing about housing prices, PCA would still let you reach insights of your own, such as: overall, two factors drive the variation in housing prices. Whether those two factors are neighborhood and size is up to you to interpret. So besides performing dimensionality reduction, you also learn important information about the patterns of variation in the data.
25. Review/Definition of PCA
review/definition of PCA
- systematized way to transform input features into principal components
- use principal components as new features in regression/classification
- you can also rank the principal components: the more variance the data has along a given principal component, the higher that principal component is ranked. So the one with the most variance is the first principal component, the one with the second most is the second principal component, and so on.
- the principal components are all perpendicular to each other in a sense, so the second principal component is mathematically guaranteed not to overlap at all with the first, the third will not overlap with the first or the second, and so on. So you can treat them as independent features in a sense.
- there is a maximum number of principal components you can find: it's equal to the number of input features in your data set. Usually you'll only use the first handful of principal components, but you could go all the way out and use the maximum number. In that case, though, you're not really gaining anything; you're just representing your features in a different way. PCA won't give you the wrong answer, but it gives you no advantage over using the original input features if you put all of the principal components together into a regression or classification task.
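A small sketch illustrating the last three points with sklearn on hypothetical data with four input features: the components come back ranked by explained variance, they are orthonormal, and their number is capped at the number of input features:

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(2)
X = rng.rand(200, 4)                      # hypothetical data with 4 input features

pca = PCA().fit(X)                        # no n_components -> keeps the maximum number

print(pca.n_components_)                  # 4: at most as many components as input features
print(pca.explained_variance_ratio_)      # sorted from largest to smallest variance
print(np.round(pca.components_ @ pca.components_.T, 3))  # identity matrix: components are orthonormal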
26. Applying PCA to Real Data
In the next few videos, Katie and Sebastian explore some of Enron's financial data and look at applications of PCA.
Remember, to get the repository containing the project code and this data set, visit:
https://github.com/udacity/ud120-projects
The Enron data is located in: final_project/
28. PCA in sklearn
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

def doPCA(data):
    pca = PCA(n_components=2)
    pca.fit(data)
    return pca

pca = doPCA(data)   # data: the 2-D points used earlier in the lesson
# Explained variance ratio -- the concrete form of the eigenvalues; it tells you
# what fraction of the variation in the data the first/second principal component accounts for.
print pca.explained_variance_ratio_
first_pc = pca.components_[0]
second_pc = pca.components_[1]

transformed_data = pca.transform(data)
for ii, jj in zip(transformed_data, data):
    # each point drawn along the first (red) and second (cyan) principal components
    plt.scatter(first_pc[0] * ii[0], first_pc[1] * ii[0], color='r')
    plt.scatter(second_pc[0] * ii[1], second_pc[1] * ii[1], color='c')
    # the original data point (blue)
    plt.scatter(jj[0], jj[1], color='b')
29. When to Use PCA
- latent features driving the patterns in data (big shots at Enron)
if you want access to latent features that you think might be showing up in the patterns in your data. Maybe the entire point of what you're trying to do is to figure out whether there is a latent feature; in other words, you just want to know the size of the first principal component, for example to measure who the big shots are at Enron.
- dimensionality reduction
-- visualize high dimensional data
sometimes you will have more than two features; you have to represent three, four, or many numbers about a data point when you only have two dimensions in which to draw. What you can do is project the data down onto the first two principal components, plot just that, and draw that scatter plot.
-- reduce noise
the hope is that your first and second, i.e. your strongest, principal components capture the actual patterns in the data, while the smaller principal components just represent noisy variations about those patterns. By throwing away the less important principal components, you get rid of that noise.
-- make other algorithms (regression, classification) work better with fewer inputs (eigenfaces)
using PCA as pre-processing before you use another algorithm, say a regression or a classification task. If you have very high dimensionality and a complex classification algorithm, the algorithm can have very high variance, it can end up fitting to noise in the data, or it can end up running really slow. Lots of things can happen when you feed very high input dimensionality into some of these algorithms, even though the algorithm might work really well for the problem at hand. So one thing you can do is use PCA to reduce the dimensionality of your input features, so that your classification algorithm, say, works better.
in the example of eigenfaces, a method of applying PCA to pictures of people: this is a very high-dimensional space, with many, many pixels in each picture. Say you want to identify who is pictured in an image, i.e. you are running some kind of facial identification. With PCA you can reduce the very high input dimensionality to something maybe a factor of ten lower and feed that into an SVM, which can then do the actual classification of figuring out who is pictured. The inputs, instead of being the original pixels of the images, are now the principal components.
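A minimal sketch of that pre-processing idea with a sklearn Pipeline, assuming a recent sklearn version and synthetic high-dimensional data; the eigenfaces code in the next section does the same thing with a grid search and real images:

from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

# Hypothetical high-dimensional classification data.
X, y = make_classification(n_samples=500, n_features=200, n_informative=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# PCA reduces 200 noisy inputs to 20 components; the SVM then classifies in that space.
clf = Pipeline([
    ("pca", PCA(n_components=20, whiten=True)),
    ("svc", SVC(kernel="rbf", class_weight="balanced")),
])
clf.fit(X_train, y_train)
print("test accuracy:", clf.score(X_test, y_test))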
30. PCA for Facial Recognition
PCA for facial recognition
what makes facial recognition in pictures well suited to PCA?
□ √ pictures of faces generally have high input dimensionality (many pixels)
In this case, reduction really matters, because an SVM struggles with a million features.
□ √ faces have general patterns that could be captured in a smaller number of dimensions (two eyes on top, mouth/chin on bottom, etc.)
Two portraits don't differ in all million pixels; there are only a few main points of difference, and PCA may be able to pick those out and make the most of them.
□ × facial recognition is simple using machine learning (humans do it easily)
Facial recognition is actually hard; it would be very difficult to implement with, say, a decision tree.
31. Eigenfaces Code
Combining PCA with an SVM is very powerful for facial recognition.
"""
===================================================
Faces recognition example using eigenfaces and SVMs
===================================================
The dataset used in this example is a preprocessed excerpt of the
"Labeled Faces in the Wild", aka LFW_:
http://vis-www.cs.umass.edu/lfw/lfw-funneled.tgz (233MB)
.. _LFW: http://vis-www.cs.umass.edu/lfw/
original source: http://scikit-learn.org/stable/auto_examples/applications/face_recognition.html
"""
print __doc__
from time import time
import logging
import pylab as pl
import numpy as np
from sklearn.cross_validation import train_test_split
from sklearn.datasets import fetch_lfw_people
from sklearn.grid_search import GridSearchCV
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.decomposition import RandomizedPCA
from sklearn.svm import SVC
# Display progress logs on stdout
logging.basicConfig(level=logging.INFO, format='%(asctime)s %(message)s')
###############################################################################
# Download the data, if not already on disk and load it as numpy arrays
lfw_people = fetch_lfw_people(min_faces_per_person=70, resize=0.4)
# introspect the images arrays to find the shapes (for plotting)
n_samples, h, w = lfw_people.images.shape
np.random.seed(42)
# for machine learning we use the data directly (as relative pixel
# position info is ignored by this model)
X = lfw_people.data
n_features = X.shape[1]
# the label to predict is the id of the person
y = lfw_people.target
target_names = lfw_people.target_names
n_classes = target_names.shape[0]
print "Total dataset size:"
print "n_samples: %d" % n_samples
print "n_features: %d" % n_features
print "n_classes: %d" % n_classes
###############################################################################
# Split into a training and testing set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)
###############################################################################
# Compute a PCA (eigenfaces) on the face dataset (treated as unlabeled
# dataset): unsupervised feature extraction / dimensionality reduction
n_components = 150
print "Extracting the top %d eigenfaces from %d faces" % (n_components, X_train.shape[0])
t0 = time()
pca = RandomizedPCA(n_components=n_components, whiten=True).fit(X_train)  # figuring out what the principal components are
print "the ratio is ", pca.explained_variance_ratio_  # explained variance of each principal component: 0.19346534, 0.15116844, ...
print "done in %0.3fs" % (time() - t0)
eigenfaces = pca.components_.reshape((n_components, h, w)) #asks for the eigenfaces
print "Projecting the input data on the eigenfaces orthonormal basis"
t0 = time()
X_train_pca = pca.transform(X_train)  # transform the data into the principal-components representation
X_test_pca = pca.transform(X_test)
print "done in %0.3fs" % (time() - t0)
###############################################################################
# Train a SVM classification model
print "Fitting the classifier to the training set"
t0 = time()
param_grid = {
    'C': [1e3, 5e3, 1e4, 5e4, 1e5],
    'gamma': [0.0001, 0.0005, 0.001, 0.005, 0.01, 0.1],
}
# for sklearn version 0.16 or prior, the class_weight parameter value is 'auto'
clf = GridSearchCV(SVC(kernel='rbf', class_weight='balanced'), param_grid)
clf = clf.fit(X_train_pca, y_train)  # an SVC using the principal components as the features
print "done in %0.3fs" % (time() - t0)
print "Best estimator found by grid search:"
print clf.best_estimator_
###############################################################################
# Quantitative evaluation of the model quality on the test set
print "Predicting the people names on the testing set"
t0 = time()
y_pred = clf.predict(X_test_pca)  # the SVC tries to identify who appears in each picture in the test set
print "done in %0.3fs" % (time() - t0)
print classification_report(y_test, y_pred, target_names=target_names)
print confusion_matrix(y_test, y_pred, labels=range(n_classes))
###############################################################################
# Qualitative evaluation of the predictions using matplotlib
def plot_gallery(images, titles, h, w, n_row=3, n_col=4):
    """Helper function to plot a gallery of portraits"""
    pl.figure(figsize=(1.8 * n_col, 2.4 * n_row))
    pl.subplots_adjust(bottom=0, left=.01, right=.99, top=.90, hspace=.35)
    for i in range(n_row * n_col):
        pl.subplot(n_row, n_col, i + 1)
        pl.imshow(images[i].reshape((h, w)), cmap=pl.cm.gray)
        pl.title(titles[i], size=12)
        pl.xticks(())
        pl.yticks(())
# plot the result of the prediction on a portion of the test set
def title(y_pred, y_test, target_names, i):
    pred_name = target_names[y_pred[i]].rsplit(' ', 1)[-1]
    true_name = target_names[y_test[i]].rsplit(' ', 1)[-1]
    return 'predicted: %s\ntrue: %s' % (pred_name, true_name)

prediction_titles = [title(y_pred, y_test, target_names, i)
                     for i in range(y_pred.shape[0])]
plot_gallery(X_test, prediction_titles, h, w)
# plot the gallery of the most significative eigenfaces
eigenface_titles = ["eigenface %d" % i for i in range(eigenfaces.shape[0])]
plot_gallery(eigenfaces, eigenface_titles, h, w)
pl.show()
The eigenfaces are basically the principal components of the face data.
At the end, the algorithm shows you the eigenfaces.
The composite images produced by PCA are used as features in the SVM, and they turn out to be very useful for predicting the identity of the face in a picture.
33. PCA Mini-Project
We spent a lot of time on theory while discussing PCA, so in this mini-project we will ask you to write some sklearn code. The eigenfaces code is interesting and rich enough to serve as the testbed for this entire mini-project.
The starter code can be found in pca/eigenfaces.py. It is mostly taken from the example in the sklearn documentation.
Note that when running the code, one argument of the SVC function called on line 94 of pca/eigenfaces.py has changed. For the 'class_weight' parameter, the string 'auto' is a valid value for sklearn version 0.16 and earlier, but it is deprecated as of 0.19. If you are running sklearn 0.17 or later, the expected string is 'balanced'. If you get an error or warning when running pca/eigenfaces.py, make sure line 98 has the argument that matches your installed version of sklearn.
sklearn 0.16 or earlier: class_weight='auto'
sklearn 0.17 or later: class_weight='balanced'
34. Explained Variance of Each Principal Component
We mentioned that PCA orders the principal components, with the first having the largest variance, the second the second-largest, and so on. How much of the variance is explained by the first principal component? By the second?
print "the ratio is ", pca.explained_variance_ratio_  # explained variance of each principal component: 0.19346534 0.15116844
How much of the variation does the first principal component explain? 0.19346534
And the second? 0.15116844
We have found that the Pillow module (used in this example) can sometimes cause trouble. If you get an error related to the fetch_lfw_people() command, try the following:
pip install --upgrade PILLOW
35. How Many Principal Components to Use?
Now you will experiment with keeping different numbers of principal components. In a multi-class classification problem like this one (more than two labels to apply), accuracy is a less intuitive metric than it is in the two-class case. Instead, a more common metric is the F1 score.
We will learn about the F1 score in the lesson on evaluation metrics, but for now, figure out for yourself whether a good classifier is characterized by a high or a low F1 score. You will do this by varying the number of principal components and watching how the F1 score changes in response.
as you add more principal components as features for training your classifier, do you expect it to get better or worse performance?
□ √ could go either way
While ideally, adding components should provide us additional signal to improve our performance, it is possible that we end up at a complexity where we overfit.
36. F1 Score vs. Number of Principal Components Used
Change n_components to the following values: [10, 15, 25, 50, 100, 250]. For each number of principal components, note the F1 score for Ariel Sharon. (With 10 principal components, the plotting functionality in the code will break, but you should still be able to see the F1 scores.)
Ariel Sharon F1 scores:
n_components = 150: F1 = 0.65
n_components = 10: F1 = 0.11
n_components = 15: F1 = 0.33
n_components = 50: F1 = 0.67
n_components = 100: F1 = 0.67
n_components = 250: F1 = 0.62
if you see a higher F1 score, does it mean the classifier is doing better, or worse?
□ √ better
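A sketch of this experiment written against the current sklearn API (PCA with svd_solver='randomized' instead of the removed RandomizedPCA, model_selection instead of cross_validation), with fixed SVM hyperparameters standing in for the grid search; the F1 values you get should be in the same ballpark as those listed above:

from sklearn.datasets import fetch_lfw_people
from sklearn.decomposition import PCA
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

lfw_people = fetch_lfw_people(min_faces_per_person=70, resize=0.4)
X_train, X_test, y_train, y_test = train_test_split(
    lfw_people.data, lfw_people.target, test_size=0.25, random_state=42)
sharon_idx = list(lfw_people.target_names).index("Ariel Sharon")

for n_components in [10, 15, 25, 50, 100, 250]:
    pca = PCA(n_components=n_components, whiten=True, svd_solver="randomized").fit(X_train)
    # Fixed hyperparameters used for illustration instead of the script's grid search.
    clf = SVC(kernel="rbf", class_weight="balanced", C=1000, gamma=0.005)
    clf.fit(pca.transform(X_train), y_train)
    y_pred = clf.predict(pca.transform(X_test))
    # Per-class F1 scores; pick out Ariel Sharon's.
    f1 = f1_score(y_test, y_pred, average=None,
                  labels=range(len(lfw_people.target_names)))[sharon_idx]
    print(n_components, "components -> Ariel Sharon F1 = %.2f" % f1)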
37. Dimensionality Reduction and Overfitting
Did you see any evidence of overfitting when using a large number of PCs? Did PCA dimensionality reduction help improve performance?
□ √ yes, performance starts to drop with many PCs.
38. Selecting the Number of Principal Components
selecting a number of principal components
Think about how to select the number of principal components you should look at.
There is no cut-and-dried answer for how many principal components you should use; you kind of have to figure it out.
what's a good way to figure out how many PCs to use?
□ × just take the top 10%
□ √ train on different numbers of PCs and see how the accuracy responds; cut off when it becomes apparent that adding more PCs doesn't buy you much more discrimination
□ × perform feature selection on the input features before putting them into PCA, then use as many PCs as you have input features.
PCA is going to find a way to combine information from potentially many different input features, so if you throw out input features before you do PCA, you are throwing away information that PCA might be able to rescue, in a sense. It's fine to do feature selection on the principal components after you have made them, but you want to be very careful about throwing out information before performing PCA.
PCA can be fairly computationally expensive, so if you have a very large input feature space and you know that a lot of the features are potentially completely irrelevant, go ahead and try tossing them out, but proceed with caution.