Table of Contents
1 Dimension reduction
1.1 Principal Component Analysis (PCA)
1.2 Nearest Neighbors
1.3 Discriminant Analysis
2 Anomaly detection
1 Dimension reduction
Because of the curse of dimensionality, we need to reduce the number of dimensions so that models produce more accurate results, while also cutting computation time and cost.
There are two broad categories of dimension reduction methods:
- Feature selection methods: specific features are selected from the original list of features for each data sample, and the other features are discarded. No new features are generated in this process.
- Feature extraction methods: engineer or extract new features from the original list of features in the data. The reduced subset of features will contain newly generated features that were not part of the original feature set, e.g. PCA.
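To make the distinction concrete, here is a minimal sketch of feature selection using a simple variance-threshold rule (the threshold 0.1 and the variable names are my own illustrative choices, not a standard recipe):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
X[:, 3] *= 0.001  # make the last feature nearly constant (tiny variance)

# Feature selection: keep original columns, drop low-variance ones.
# The surviving features are unchanged columns of the original data.
keep = X.var(axis=0) > 0.1
X_selected = X[:, keep]
print(X_selected.shape)  # (100, 3)
```

Feature extraction, by contrast, would replace the columns with new ones built as combinations of the originals, as PCA does in the next section.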
1.1 Principal Component Analysis (PCA)
In short, Principal Component Analysis (PCA) is a linear dimension reduction technique: it finds the directions of greatest variance in high-dimensional data, thereby preserving most of the information, and projects the data onto a lower-dimensional subspace.
Principle
PCA applies an orthogonal transformation to the observed data, converting a set of highly correlated variables into a set of linearly uncorrelated variables (the principal components, which are linear combinations of the original variables). The transformation is chosen so that the first component has the largest possible variance.
- The first principal component is that linear combination of the original variables whose variance is greatest among all possible linear combinations.
- The second principal component is that linear combination of the original variables that accounts for a maximum proportion of the remaining variance, subject to being uncorrelated with the first principal component.
- Subsequent components are defined similarly.
In effect, PCA quantifies the relationships among the variables by finding a set of principal axes that best explain the variance in the data.
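The sequential definition above can be sketched via an eigendecomposition of the covariance matrix (a minimal illustration, assuming numpy; function and variable names are my own, not a library API):

```python
import numpy as np

def pca(X, n_components):
    """Minimal PCA: project X onto its top principal axes."""
    # Center the data so the covariance is taken about the mean.
    X_centered = X - X.mean(axis=0)
    # Covariance matrix of the features (symmetric).
    cov = np.cov(X_centered, rowvar=False)
    # eigh returns eigenvalues in ascending order; eigenvectors of a
    # symmetric matrix are orthogonal, matching the uncorrelatedness
    # requirement between successive components.
    eigvals, eigvecs = np.linalg.eigh(cov)
    # Sort descending so the first axis explains the most variance.
    order = np.argsort(eigvals)[::-1]
    components = eigvecs[:, order[:n_components]]
    return X_centered @ components, components

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
Z, components = pca(X, 2)
print(Z.shape)  # (100, 2)
```

The projected coordinates in `Z` are the scores along the first two principal axes, ordered by the variance they explain.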
Practical tips
- Scale the data
Standardize the data before running PCA so that each variable contributes according to its actual range rather than its unit of measurement; otherwise the principal components will be biased toward the variable with the largest scale.
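A quick demonstration of why scaling matters (a sketch with synthetic data; the helper function is my own, not a library call — the eigenvalues of the covariance matrix are the variances along the principal axes):

```python
import numpy as np

rng = np.random.default_rng(0)
# Two independent features on very different scales.
X = np.column_stack([rng.normal(0, 1, 500), rng.normal(0, 1000, 500)])

def explained_variance_ratio(X):
    # Eigenvalues of the covariance matrix = variances along the PCs.
    eigvals = np.linalg.eigvalsh(np.cov(X, rowvar=False))[::-1]
    return eigvals / eigvals.sum()

# Without scaling, the large-scale feature dominates the first PC.
print(explained_variance_ratio(X))     # roughly [1.0, 0.0]

# Standardize: zero mean, unit variance per feature.
X_std = (X - X.mean(axis=0)) / X.std(axis=0)
print(explained_variance_ratio(X_std)) # roughly [0.5, 0.5]
```

After standardization, both features contribute comparably, so neither one biases the leading principal component.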