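Principal Component Analysis
In statistics, principal component analysis (PCA) is a dimensionality-reduction technique that simplifies a data set: it reduces the number of dimensions while keeping the directions that contribute most to the variance. Quantitative studies often call for fewer variables that carry more information, which is exactly what PCA provides.
PCA reduces dimensionality through an orthogonal transformation: the original random vector is converted into a new random vector whose components are uncorrelated, and the multivariate system is then reduced to a few of those components. In other words, the original variables are recombined into a new set of mutually uncorrelated variables, and a small number of them are kept so as to reflect as much of the original information as possible. Mathematically, among all normalized linear combinations of the original variables, the one with the largest variance, F1, is called the first principal component. If F1 is not enough to represent the original information, a second linear combination F2, uncorrelated with F1, is chosen to capture what remains, and so on.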
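The definition of F1 and F2 can be checked numerically. The following is a minimal sketch in base R; it uses the built-in USArrests data purely as a stand-in (it is not the data analyzed in this article), and the names X, e, F1 and F2 are illustrative.

```r
# Stand-in data: USArrests ships with R and is NOT the data set used in this article
X <- scale(USArrests)            # standardize each variable (mean 0, sd 1)
e <- eigen(cor(USArrests))       # eigen-decomposition of the correlation matrix

# Each principal component is a linear combination of the standardized variables,
# with coefficients given by a unit-length eigenvector
F1 <- X %*% e$vectors[, 1]       # first principal component
F2 <- X %*% e$vectors[, 2]       # second principal component

var_F1    <- var(as.vector(F1))  # variance of F1: equals the largest eigenvalue
cor_F1_F2 <- cor(F1, F2)[1, 1]   # correlation between F1 and F2: numerically zero

c(var_F1 = var_F1, largest_eigenvalue = e$values[1], cor_F1_F2 = cor_F1_F2)
```

The variance of each component equals the corresponding eigenvalue, which is why eigenvalues are used below to decide how many components to keep.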
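Exploratory Factor Analysis
The main idea of exploratory factor analysis (EFA) is to find common factors and thereby reduce the dimensionality of the data. Unlike principal component analysis, EFA starts without knowing the factors in advance and derives them from the sample data by analyzing the variables. The main steps of EFA are as follows:
Collect the observed sample data; construct the correlation or covariance matrix; determine the number of factors; extract the factors; rotate the factors; interpret the factors; compute the factor scores.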
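Parallel Analysis
Parallel analysis compares each eigenvalue of the real data with the average of the corresponding eigenvalues from random data matrices, and chooses the number of principal components from the point where the two curves cross.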
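The comparison can be sketched in a few lines of base R. This is a simplified illustration of the idea only, not the procedure implemented by any particular package; the USArrests data are a stand-in, and n_iter, rand_eig and n_keep are illustrative names.

```r
set.seed(1)                                   # make the random draws reproducible
X <- as.matrix(USArrests)                     # stand-in data, not the article's data
n <- nrow(X)
p <- ncol(X)

obs_eig <- eigen(cor(X))$values               # eigenvalues of the real data

n_iter <- 200                                 # number of random data sets (assumption)
rand_eig <- replicate(n_iter, {
  Z <- matrix(rnorm(n * p), nrow = n)         # uncorrelated random data of the same shape
  eigen(cor(Z))$values
})
mean_rand <- rowMeans(rand_eig)               # average random eigenvalue at each position

# Keep the components that come before the first position where the
# real eigenvalue no longer exceeds the random average
below  <- which(obs_eig <= mean_rand)
n_keep <- if (length(below) > 0) below[1] - 1 else p

rbind(observed = obs_eig, random_mean = mean_rand)
n_keep
```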
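Scree Plot
A scree plot shows the eigenvalues associated with the components or factors in descending order, plotted against the component or factor number. It is used in PCA and factor analysis to judge visually which components or factors account for most of the variability in the data.
The ideal pattern is a steep curve, followed by a bend, and then a flat or horizontal line. Retain the components or factors on the steep part of the curve, before the first point where the flat trend begins. In practice a scree plot can be hard to interpret; use knowledge of the data and the results of other selection methods to help decide how many components or factors matter.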
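A scree plot is simply the sorted eigenvalues drawn against their index, so it can be produced with base graphics. A minimal sketch, again using the built-in USArrests data as a stand-in; the dashed reference line at 1 is an optional aid, not part of the definition.

```r
eig <- eigen(cor(USArrests))$values      # eigenvalues, already in descending order

plot(seq_along(eig), eig, type = "b",
     xlab = "Component number", ylab = "Eigenvalue",
     main = "Scree plot (stand-in data)")
abline(h = 1, lty = 2)                   # reference line for the eigenvalue > 1 rule
```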
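Diagram of the analysis steps for choosing a factor model
Data Preparation
This example examines the main structure of a city's industry. The data cover 13 industries and 8 indicators for the industrial sector of a certain city. The 13 industries are metallurgy, electric power, coal, chemicals, machinery, building materials, forestry, food, textiles, garments, leather, papermaking, and cultural, educational and art supplies. The 8 indicators are year-end net value of fixed assets (X1), number of employees (X2), gross industrial output value (X3), overall labor productivity (X4), output value per 100 yuan of original fixed-asset value (X5), profit-and-tax rate on funds (X6), standard fuel consumption (X7), and energy utilization efficiency (X8).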
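> options(stringsAsFactors=FALSE)
> # read the raw lines of the CSV file
> test <- readLines("http://labfile.oss.aliyuncs.com/courses/931/test.csv")
> # split every line on commas and reshape the values into 8 columns
> test <- unlist(strsplit(test, split=","))
> test <- matrix(test, ncol=8, byrow=TRUE)
> # the first row holds the column names
> colnames(test) <- test[1,]
> test <- as.data.frame(test[-1,])
> # convert the character columns to numeric
> test <- as.data.frame(sapply(test, as.numeric))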
> head(test)
X1 X2 X3 X4 X5 X6 X7 X8
1 90342 52455 101091 19272 82.0 16.1 197435 0.172
2 4903 1973 2035 10313 34.2 7.1 592077 0.003
3 6735 21139 3767 1780 36.1 8.2 726396 0.003
4 49454 36241 81557 22504 98.1 25.9 348226 0.985
5 139190 203505 215898 10609 93.2 12.6 139572 0.628
6 12215 16219 10351 6382 62.5 8.7 145818 0.066
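If the file is an ordinary comma-separated file with a header row, as the printed column names suggest, read.csv could parse it in one step. The sketch below demonstrates this on inline text copied from the first three rows shown above, so it runs without network access; csv_text and demo are illustrative names, and whether the remote file parses equally cleanly is an assumption.

```r
# Inline sample: header plus the first three rows printed by head(test)
csv_text <- "X1,X2,X3,X4,X5,X6,X7,X8
90342,52455,101091,19272,82.0,16.1,197435,0.172
4903,1973,2035,10313,34.2,7.1,592077,0.003
6735,21139,3767,1780,36.1,8.2,726396,0.003"

demo <- read.csv(text = csv_text)   # header row becomes column names, columns are numeric
str(demo)
```

With a real file, the same call would take the path or URL as its first argument instead of text.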
Principal Component Analysis
The main function for principal component analysis is princomp.
> test.pr <- princomp(test, cor=TRUE)
> summary(test.pr, loadings=TRUE)
Importance of components:
Comp.1 Comp.2 Comp.3 Comp.4
Standard deviation 1.7619819 1.7017737 0.9640911 0.80175884
Proportion of Variance 0.3880725 0.3620042 0.1161839 0.08035216
Cumulative Proportion 0.3880725 0.7500767 0.8662607 0.94661285
Comp.5 Comp.6 Comp.7 Comp.8
Standard deviation 0.55549344 0.28777809 0.182388563 0.0494213026
Proportion of Variance 0.03857162 0.01035203 0.004158198 0.0003053081
Cumulative Proportion 0.98518446 0.99553649 0.999694692 1.0000000000
Loadings:
Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6 Comp.7 Comp.8
X1 0.478 0.294 0.104 0.178 0.761 0.243
X2 0.474 0.276 0.164 -0.175 -0.300 -0.519 0.528
X3 0.426 0.376 0.156 -0.177 -0.780
X4 -0.210 0.452 0.519 0.537 -0.289 -0.248 0.221
X5 -0.387 0.332 0.321 -0.202 -0.454 -0.583 0.222
X6 -0.351 0.405 0.147 0.278 -0.314 0.714
X7 0.213 -0.379 0.140 0.756 -0.424 -0.192
X8 0.273 -0.891 -0.325 -0.119
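The summary function shows the main results of the PCA: the Standard deviation row gives the standard deviation of each principal component, the Proportion of Variance row gives each component's share of the total variance, and the Cumulative Proportion row gives the running total of those shares. Because the first 3 components already account for more than 85% of the variance (86.6%), the remaining components can be dropped, which achieves the dimensionality reduction.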
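The two proportion rows follow directly from the standard deviations. As a small worked check, the sketch below recomputes them from the values printed in the summary output; sdev, proportion, cumulative and k are illustrative names.

```r
# Standard deviations copied from the summary output above
sdev <- c(1.7619819, 1.7017737, 0.9640911, 0.80175884,
          0.55549344, 0.28777809, 0.182388563, 0.0494213026)

variance   <- sdev^2                     # variance (eigenvalue) of each component
proportion <- variance / sum(variance)   # Proportion of Variance
cumulative <- cumsum(proportion)         # Cumulative Proportion

round(rbind(proportion, cumulative), 4)

# Smallest number of components whose cumulative share reaches 85%
k <- which(cumulative >= 0.85)[1]
k
```

Because the analysis is based on the correlation matrix, the variances sum to the number of variables (8), up to rounding in the printed values.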
> screeplot(test.pr, type="lines")
> p <- predict(test.pr)
> p
Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6 Comp.7 Comp.8
[1,] 1.5383166 0.78329878 0.55914834 0.51447981 1.0939484872 -0.0187808278 4.214247e-01 0.004341591
[2,] 0.5058551 -2.69970060 0.23469505 0.88712153 0.1600083295 -0.3019824994 -1.327024e-01 0.070443862
[3,] 1.0828155 -3.36157747 0.42584055 0.60061666 -0.9731163100 0.0678706653 8.022067e-02 -0.025708566
[4,] 0.4824792 1.23193366 -1.03794614 1.66277167 -0.0004448157 0.0749140127 -4.020261e-03 -0.053876926
[5,] 4.7220539 2.33445776 0.49001510 -0.79294924 -0.5148616738 0.0219852173 -1.291664e-01 0.023661090
[6,] 0.3335154 -1.84639307 0.03320975 -0.97490048 0.3886327205 0.2126692286 -2.315539e-02 -0.069531318
[7,] -1.1528228 -0.32238541 0.29771274 -0.72051586 0.0979704754 0.3091870079 -6.784565e-05 -0.036518510
[8,] -2.2807730 2.35196023 1.15228786 0.57465434 -0.6021969592 -0.0004185042 -4.292176e-02 -0.054086746
[9,] -0.8366965 0.90656114 0.33778417 0.15505987 0.5876067643 -0.4389057657 -3.212697e-01 -0.001972479
[10,] -2.1176434 0.87407151 0.24834391 -0.54171795 -0.6801139694 -0.1965720948 2.853128e-01 0.075335379
[11,] -0.7496483 -0.78017371 -0.12474862 -1.15720353 0.2431065198 -0.4037748082 1.583684e-02 -0.030013111
[12,] -1.2531170 0.04010349 0.30197644 0.08694461 0.3886764718 0.6626465187 -1.633114e-01 0.081970107
[13,] -0.2743347 0.48784369 -2.91831915 -0.29436143 -0.1892160403 0.0111618497 1.382012e-02 0.015955626
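The principal function can perform principal component analysis from either the raw data matrix or a correlation matrix.
The number of components is mainly determined with the fa.parallel function in the psych package, which brings together three eigenvalue criteria: the scree test, the mean eigenvalues computed from random matrices, and the eigenvalue-greater-than-1 rule.
> library(psych)
> fa.parallel(test, fa="pc", n.iter=100, show.legend=FALSE)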
Parallel analysis suggests that the number of factors = NA and the number of components = 2
Warning message:
In fa.stats(r = r, f = f, phi = phi, n.obs = n.obs, np.obs = np.obs, :
The estimated weights for the factor scores are probably incorrect. Try a different factor score estimation method.
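The three criteria suggest retaining 2 principal components. The criteria do not always agree, however, so the number of components has to be chosen according to the situation at hand.
Extracting Principal Components
Call the principal function to extract the components. Following the suggestion of the three criteria, we first extract 2 principal components.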
> pc <- principal(test, nfactors=2, rotate="none")
> pc
Principal Components Analysis
Call: principal(r = test, nfactors = 2, rotate = "none")
Standardized loadings (pattern matrix) based upon correlation matrix
PC1 PC2 h2 u2 com
X1 0.84 0.50 0.96 0.041 1.6
X2 0.84 0.47 0.92 0.082 1.6
X3 0.75 0.64 0.97 0.028 2.0
X4 -0.37 0.77 0.73 0.271 1.4
X5 -0.68 0.57 0.78 0.216 1.9
X6 -0.62 0.69 0.86 0.143 2.0
X7 0.38 -0.64 0.56 0.444 1.6
X8 0.10 0.46 0.23 0.775 1.1
PC1 PC2
SS loadings 3.10 2.90
Proportion Var 0.39 0.36
Cumulative Var 0.39 0.75
Proportion Explained 0.52 0.48
Cumulative Proportion 0.52 1.00
Mean item complexity = 1.7
Test of the hypothesis that 2 components are sufficient.
The root mean square of the residuals (RMSR) is 0.09
with the empirical chi square 5.64 with prob < 0.96
Fit based upon off diagonal values = 0.96
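The output shows that the cumulative proportion of variance is only 75%, which is not satisfactory. Next we check the case of 3 principal components.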
> pc <- principal(test, nfactors=3, rotate="none")
> pc
Principal Components Analysis
Call: principal(r = test, nfactors = 3, rotate = "none")
Standardized loadings (pattern matrix) based upon correlation matrix
PC1 PC2 PC3 h2 u2 com
X1 0.84 0.50 0.10 0.97 0.0307 1.7
X2 0.84 0.47 0.16 0.94 0.0575 1.7
X3 0.75 0.64 0.15 0.99 0.0056 2.0
X4 -0.37 0.77 -0.01 0.73 0.2713 1.4
X5 -0.68 0.57 0.31 0.88 0.1195 2.4
X6 -0.62 0.69 0.14 0.88 0.1224 2.1
X7 0.38 -0.64 0.13 0.57 0.4258 1.7
X8 0.10 0.46 -0.86 0.96 0.0370 1.6
PC1 PC2 PC3
SS loadings 3.10 2.90 0.93
Proportion Var 0.39 0.36 0.12
Cumulative Var 0.39 0.75 0.87
Proportion Explained 0.45 0.42 0.13
Cumulative Proportion 0.45 0.87 1.00
Mean item complexity = 1.8
Test of the hypothesis that 3 components are sufficient.
The root mean square of the residuals (RMSR) is 0.06
with the empirical chi square 2.95 with prob < 0.89
Fit based upon off diagonal values = 0.98
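The cumulative proportion of variance now reaches 87%, which is better. The PC columns are the component loadings, that is, the correlations between the variables and the principal components; they are used to interpret the components. h2 is the communality, the proportion of each variable's variance explained by the retained components, and u2 is the uniqueness, the proportion that is not explained (1 - h2). SS loadings is the sum of squared loadings of each component, which equals the eigenvalue associated with that component (the amount of standardized variance it explains).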
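These quantities can be recomputed from the loadings themselves. The sketch below re-enters the three-component loadings printed above (rounded to two decimals, so the results agree with the printed h2, u2 and SS loadings only up to rounding); L, h2, u2 and ss are illustrative names.

```r
# Unrotated loadings copied from the 3-component output above (two decimals)
L <- matrix(c( 0.84,  0.50,  0.10,
               0.84,  0.47,  0.16,
               0.75,  0.64,  0.15,
              -0.37,  0.77, -0.01,
              -0.68,  0.57,  0.31,
              -0.62,  0.69,  0.14,
               0.38, -0.64,  0.13,
               0.10,  0.46, -0.86),
            ncol = 3, byrow = TRUE,
            dimnames = list(paste0("X", 1:8), c("PC1", "PC2", "PC3")))

h2 <- rowSums(L^2)       # communality: variance of each variable explained by the 3 components
u2 <- 1 - h2             # uniqueness: variance left unexplained
ss <- colSums(L^2)       # SS loadings: variance explained by each component

round(cbind(h2, u2), 2)
round(ss, 2)
round(ss / nrow(L), 2)   # Proportion Var: SS loadings divided by the number of variables
```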
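Rotating Principal Components
Rotation makes the component loadings easier to interpret. Methods include orthogonal rotation (the components stay uncorrelated) and oblique rotation (the components are allowed to correlate). Here we use an orthogonal rotation, namely varimax.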
> rc <- principal(test, nfactors=3, rotate="varimax")
> rc
Principal Components Analysis
Call: principal(r = test, nfactors = 3, rotate = "varimax")
Standardized loadings (pattern matrix) based upon correlation matrix
RC1 RC2 RC3 h2 u2 com
X1 0.98 -0.08 0.11 0.97 0.0307 1.0
X2 0.97 -0.09 0.04 0.94 0.0575 1.0
X3 0.99 0.09 0.09 0.99 0.0056 1.0
X4 0.12 0.82 0.21 0.73 0.2713 1.2
X5 -0.17 0.90 -0.18 0.88 0.1195 1.2
X6 -0.09 0.93 0.02 0.88 0.1224 1.0
X7 -0.02 -0.70 -0.29 0.57 0.4258 1.3
X8 0.14 0.14 0.96 0.96 0.0370 1.1
RC1 RC2 RC3
SS loadings 2.93 2.89 1.10
Proportion Var 0.37 0.36 0.14
Cumulative Var 0.37 0.73 0.87
Proportion Explained 0.42 0.42 0.16
Cumulative Proportion 0.42 0.84 1.00
Mean item complexity = 1.1
Test of the hypothesis that 3 components are sufficient.
The root mean square of the residuals (RMSR) is 0.06
with the empirical chi square 2.95 with prob < 0.89
Fit based upon off diagonal values = 0.98
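An orthogonal rotation redistributes the explained variance among the components without changing how much of each variable is explained, which is why h2 and u2 are identical before and after rotation in the outputs above. The sketch below illustrates this with the base R function stats::varimax applied to the rounded unrotated loadings; it is not claimed to reproduce the psych output digit for digit (column order, signs and rounding may differ), and L, rot and L_rot are illustrative names.

```r
# Unrotated 3-component loadings copied from the earlier output (two decimals)
L <- matrix(c( 0.84,  0.50,  0.10,
               0.84,  0.47,  0.16,
               0.75,  0.64,  0.15,
              -0.37,  0.77, -0.01,
              -0.68,  0.57,  0.31,
              -0.62,  0.69,  0.14,
               0.38, -0.64,  0.13,
               0.10,  0.46, -0.86),
            ncol = 3, byrow = TRUE,
            dimnames = list(paste0("X", 1:8), c("PC1", "PC2", "PC3")))

rot   <- varimax(L)                 # base R varimax rotation (stats package)
L_rot <- unclass(rot$loadings)      # rotated loadings as a plain matrix

h2_before <- rowSums(L^2)           # communalities before rotation
h2_after  <- rowSums(L_rot^2)       # communalities after rotation: unchanged

round(L_rot, 2)
round(cbind(h2_before, h2_after), 3)
round(crossprod(rot$rotmat), 3)     # rotation matrix is orthogonal: gives the identity
```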
Principal Component Scores
Obtain the component scores from the raw data:
> pc <- principal(test, nfactors=3, rotate="varimax", scores=TRUE)
> score <- pc$scores
> score
RC1 RC2 RC3
[1,] 1.04364745 -0.03678306 -0.34512037
[2,] -0.55875261 -1.31380572 -0.64465715
[3,] -0.46663927 -1.75504604 -0.91245535
[4,] 0.35978428 0.19045559 1.20712572
[5,] 2.90692027 -0.38189511 0.09594469
[6,] -0.41617365 -0.91925769 -0.32025745
[7,] -0.53962165 0.28399672 -0.38006970
[8,] -0.01787588 1.99714514 -0.79973098
[9,] -0.01209566 0.73557256 -0.20543438
[10,] -0.60535196 1.11405463 -0.17474919
[11,] -0.59851283 -0.13012917 -0.03742436
[12,] -0.47080587 0.47771113 -0.32879003
[13,] -0.62452261 -0.26201898 2.84561855
Obtain the component score coefficients:
> test.cov <- cov(test)
> rc <- principal(test.cov, nfactors=3, rotate="varimax")
> round(unclass(rc$weights), 3)
RC1 RC2 RC3
X1 0.338 -0.003 -0.034
X2 0.344 0.002 -0.096
X3 0.352 0.063 -0.074
X4 0.047 0.277 0.079
X5 0.005 0.347 -0.277
X6 0.004 0.334 -0.091
X7 0.008 -0.218 -0.194
X8 -0.095 -0.073 0.931
Therefore, with X1 to X8 taken as the standardized variables, the scores of the three rotated components can be written as:
PC1 = 0.338 * X1 + 0.344 * X2 + 0.352 * X3 + 0.047 * X4 + 0.005 * X5 + 0.004 * X6 + 0.008 * X7 - 0.095 * X8
PC2 = -0.003 * X1 + 0.002 * X2 + 0.063 * X3 + 0.277 * X4 + 0.347 * X5 + 0.334 * X6 - 0.218 * X7 - 0.073 * X8
PC3 = -0.034 * X1 - 0.096 * X2 - 0.074 * X3 + 0.079 * X4 - 0.277 * X5 - 0.091 * X6 - 0.194 * X7 + 0.931 * X8
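Formulas of this kind amount to multiplying the standardized data matrix by a weight matrix. The sketch below shows the mechanics with base R's prcomp on the built-in USArrests data, used here as a stand-in; it covers the unrotated case only, and Z, W and scores are illustrative names.

```r
Z   <- scale(USArrests)                   # standardized variables (stand-in data)
pca <- prcomp(USArrests, scale. = TRUE)   # PCA on the correlation matrix

W      <- pca$rotation[, 1:2]             # weight (coefficient) matrix for two components
scores <- Z %*% W                         # each score is a weighted sum of standardized variables

max_diff <- max(abs(scores - pca$x[, 1:2]))   # matches the scores stored by prcomp
max_diff
head(round(scores, 3))
```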
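Exploratory Factor Analysis
EFA and PCA differ in the following way. In PCA, the principal components are linear combinations of the observed variables, chosen to maximize variance subject to the components being uncorrelated. In EFA, the factors are latent variables that influence the observed variables; the part of a variable that the factors cannot explain is called the error, and neither the factors nor the errors can be observed directly. Although EFA and PCA are fundamentally different, their analysis workflows are similar.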
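The factor model can be made concrete with a small simulation: generate latent factors and errors, build the observed variables from them, and then try to recover the structure. The sketch below uses base R's factanal (maximum likelihood) rather than the psych functions used in this article; the loadings in lambda, the sample size and the error standard deviation are all made-up values chosen for illustration.

```r
set.seed(42)
n <- 500                                          # number of simulated observations

f <- matrix(rnorm(n * 2), ncol = 2)               # two latent factors (never observed in practice)
lambda <- matrix(c(0.8, 0.7, 0.9, 0,   0,   0,    # loadings on factor 1 (filled column-wise)
                   0,   0,   0,   0.8, 0.7, 0.6), # loadings on factor 2
                 ncol = 2)
e <- matrix(rnorm(n * 6, sd = 0.5), ncol = 6)     # errors specific to each variable

x <- f %*% t(lambda) + e                          # observed variables = factors * loadings + error
colnames(x) <- paste0("V", 1:6)

fit <- factanal(x, factors = 2, rotation = "varimax")
print(fit$loadings, cutoff = 0.3)                 # estimated loadings (small ones hidden)
round(fit$uniquenesses, 2)                        # estimated share of error variance per variable
```

Only x is passed to the fitting function; the factors f and the errors e play the role of the unobservable quantities described above.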
> library(psych)
> fa.parallel(test, fa="both", n.iter=30)
Parallel analysis suggests that the number of factors = 2 and the number of components = 2
There were 28 warnings (use warnings() to see them)
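With fa = "both", the results for both principal components and factor analysis are shown. From the plot, the three criteria suggest retaining 2 factors (and 2 components).
Call the fa function in the psych package to extract the factors. The nfactors argument sets the number of factors to 2, rotate = "none" requests no rotation, and fm selects the factoring method. Because maximum likelihood sometimes fails to converge, the iterated principal axis method (fm = "pa") is used here.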
> fa <- fa(test, nfactors=2, rotate="none", fm="pa")
Warning messages:
1: In fa.stats(r = r, f = f, phi = phi, n.obs = n.obs, np.obs = np.obs, :
The estimated weights for the factor scores are probably incorrect. Try a different factor score estimation method.
2: In fac(r = r, nfactors = nfactors, n.obs = n.obs, rotate = rotate, :
An ultra-Heywood case was detected. Examine the results carefully
> fa
Factor Analysis using method = pa
Call: fa(r = test, nfactors = 2, rotate = "none", fm = "pa")
Standardized loadings (pattern matrix) based upon correlation matrix
PA1 PA2 h2 u2 com
X1 0.95 0.26 0.96 0.038 1.2
X2 0.91 0.23 0.87 0.126 1.1
X3 0.91 0.44 1.02 -0.017 1.4
X4 -0.14 0.78 0.63 0.366 1.1
X5 -0.47 0.71 0.72 0.278 1.7
X6 -0.41 0.85 0.90 0.102 1.4
X7 0.16 -0.60 0.39 0.614 1.1
X8 0.15 0.30 0.11 0.886 1.5
PA1 PA2
SS loadings 3.00 2.60
Proportion Var 0.38 0.33
Cumulative Var 0.38 0.70
Proportion Explained 0.54 0.46
Cumulative Proportion 0.54 1.00
Mean item complexity = 1.3
Test of the hypothesis that 2 factors are sufficient.
The degrees of freedom for the null model are 28 and the objective function was 11.4 with Chi Square of 96.93
The degrees of freedom for the model are 13 and the objective function was 3.23
The root mean square of the residuals (RMSR) is 0.06
The df corrected root mean square of the residuals is 0.09
The harmonic number of observations is 13 with the empirical chi square 2.83 with prob < 1
The total number of observations was 13 with Likelihood Chi Square = 23.14 with prob < 0.04
Tucker Lewis Index of factoring reliability = 0.593
RMSEA index = 0.232 and the 90 % confidence intervals are 0.054 0.421
BIC = -10.2
Fit based upon off diagonal values = 0.98
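The two factors explain 70% of the total variance. The unrotated factor loadings are hard to interpret, so a rotation is applied to help interpret the factors.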
> fa.varimax <- fa(test, nfactors=2, rotate="varimax", fm="pa")
Warning messages:
1: In fa.stats(r = r, f = f, phi = phi, n.obs = n.obs, np.obs = np.obs, :
The estimated weights for the factor scores are probably incorrect. Try a different factor score estimation method.
2: In fac(r = r, nfactors = nfactors, n.obs = n.obs, rotate = rotate, :
An ultra-Heywood case was detected. Examine the results carefully
> fa.varimax
Factor Analysis using method = pa
Call: fa(r = test, nfactors = 2, rotate = "varimax", fm = "pa")
Standardized loadings (pattern matrix) based upon correlation matrix
PA1 PA2 h2 u2 com
X1 0.98 -0.09 0.96 0.038 1.0
X2 0.93 -0.11 0.87 0.126 1.0
X3 1.00 0.09 1.02 -0.017 1.0
X4 0.14 0.78 0.63 0.366 1.1
X5 -0.19 0.83 0.72 0.278 1.1
X6 -0.08 0.94 0.90 0.102 1.0
X7 -0.06 -0.62 0.39 0.614 1.0
X8 0.25 0.23 0.11 0.886 2.0
PA1 PA2
SS loadings 2.95 2.65
Proportion Var 0.37 0.33
Cumulative Var 0.37 0.70
Proportion Explained 0.53 0.47
Cumulative Proportion 0.53 1.00
Mean item complexity = 1.2
Test of the hypothesis that 2 factors are sufficient.
The degrees of freedom for the null model are 28 and the objective function was 11.4 with Chi Square of 96.93
The degrees of freedom for the model are 13 and the objective function was 3.23
The root mean square of the residuals (RMSR) is 0.06
The df corrected root mean square of the residuals is 0.09
The harmonic number of observations is 13 with the empirical chi square 2.83 with prob < 1
The total number of observations was 13 with Likelihood Chi Square = 23.14 with prob < 0.04
Tucker Lewis Index of factoring reliability = 0.593
RMSEA index = 0.232 and the 90 % confidence intervals are 0.054 0.421
BIC = -10.2
Fit based upon off diagonal values = 0.98
Call factor.plot() and fa.diagram() to plot the results of an orthogonal or oblique solution.
> factor.plot(fa.varimax, labels=rownames(fa.varimax$loadings))
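X1, X2 and X3 load heavily on PA1; X4, X5, X6 and X7 load heavily on PA2; X8 loads about equally, and only weakly, on both factors.
> fa.diagram(fa.varimax, simple=FALSE)
With simple = TRUE, only the largest loading of each variable is shown, together with the correlations between factors.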
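Compared with PCA, EFA is less concerned with factor scores: principal component scores are computed exactly, whereas factor scores can only be estimated. They can still be inspected briefly.
> fa <- fa(test, nfactors=2, rotate="none", fm="pa", scores=TRUE)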
Warning messages:
1: In fa.stats(r = r, f = f, phi = phi, n.obs = n.obs, np.obs = np.obs, :
The estimated weights for the factor scores are probably incorrect. Try a different factor score estimation method.
2: In fac(r = r, nfactors = nfactors, n.obs = n.obs, rotate = rotate, :
An ultra-Heywood case was detected. Examine the results carefully
> fa$weights
PA1 PA2
X1 -1.62406748 -2.30332697
X2 -4.64153359 -5.75603873
X3 7.27471660 8.56053149
X4 -2.05179521 -2.35273597
X5 -0.28531209 -0.19186435
X6 0.16648968 1.14450904
X7 -0.32570711 -0.58966470
X8 0.05073991 0.07664009
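Application: Analysis of Bank Financial Data
Bank financial data are collected to analyze the financial factors that influence stock prices. The observed variables include the current ratio (流动比率), net asset-liability ratio (净资产负债比率), asset to fixed-asset ratio (资产固定资产比率), earnings per share (每股收益), net profit (净利润), growth rate (增长率), stock price (股价), and the announcement date.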
> dataf <- readLines("http://labfile.oss.aliyuncs.com/courses/931/bank.csv")
> dataf <- unlist(strsplit(dataf, split=","))
> dataf <- matrix(dataf, ncol=7, byrow=T)
> colnames(dataf) <- dataf[1,]
> dataf <- as.data.frame(dataf[-1,])
> dataf <- as.data.frame(sapply(dataf, as.numeric))
Warning message:
In lapply(X = X, FUN = FUN, ...) : NAs introduced by coercion
> dataf
流动比率 净资产负债比率 资产固定资产比率 每股收益 净利润 增长率 股价
1 1.0716 0.020515 27.04 0.1925 17.77 -3.942 18.56
2 1.0181 0.009379 113.22 0.1300 14.77 46.914 18.86
3 1.0469 0.013588 85.34 0.2230 14.30 25.433 13.65
4 1.0398 0.013137 93.34 0.2752 14.72 30.732 15.21
5 1.0216 0.013970 88.40 0.1197 14.10 30.578 13.73
6 0.9607 0.013284 93.59 0.1850 16.60 14.550 12.43
7 0.9256 0.011708 102.88 0.2365 14.98 13.879 13.89
8 0.9424 0.011860 103.24 0.3040 13.52 21.894 11.10
9 0.9164 0.011641 103.53 0.0915 14.45 27.488 11.42
10 0.8754 0.010129 112.47 0.1720 13.29 19.168 12.14
11 0.9008 0.009532 127.28 0.2605 12.81 21.915 10.43
12 0.8814 0.009450 133.40 0.3240 11.51 23.687 8.56
13 0.8907 0.008080 128.08 0.1116 12.96 44.718 10.24
14 0.8629 0.009338 236.14 0.1902 11.82 37.542 9.02
15 0.8634 0.009430 117.57 0.2846 11.62 35.189 7.55
16 0.8494 0.010992 107.08 0.3470 10.31 21.157 6.65
17 0.8637 0.010824 105.04 0.1119 12.08 14.875 6.49
18 0.8577 0.011688 110.31 0.1849 10.91 10.622 6.14
19 0.8743 0.009964 98.12 0.3066 11.58 22.350 6.12
20 0.8848 0.010763 116.04 0.3744 10.42 26.894 5.67
21 0.8962 0.009194 97.98 0.1158 11.39 28.249 6.98
22 0.7740 0.009581 105.11 0.2456 11.51 64.610 6.68
23 NA 0.006989 156.92 0.3900 11.79 52.344 6.39
> is.na(dataf)
流动比率 净资产负债比率 资产固定资产比率 每股收益 净利润 增长率 股价
[1,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[2,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[3,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[4,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[5,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[6,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[7,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[8,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[9,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[10,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[11,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[12,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[13,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[14,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[15,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[16,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[17,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[18,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[19,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[20,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[21,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[22,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[23,] TRUE FALSE FALSE FALSE FALSE FALSE FALSE
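Because factor analysis is very sensitive to missing values, the data are checked for missing values before the analysis. The 23rd value of the current ratio (流动比率) is missing, so that case is deleted listwise: all 7 values in row 23 are excluded from the factor analysis.
Factor analysis is then used to extract the factors that most clearly influence the stock prices of listed banks, in order to analyze what determines those prices.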
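For larger data sets, scanning the full is.na() matrix by eye is impractical. A minimal sketch of a more compact check and of listwise deletion in base R follows; the small data frame df is made up for illustration and only mimics the situation above (one missing value in the last row).

```r
# Made-up example data with one missing value in the last row
df <- data.frame(a = c(1.07, 0.77, NA),
                 b = c(0.0205, 0.0096, 0.0070))

na_per_column   <- colSums(is.na(df))          # number of missing values in each column
incomplete_rows <- which(!complete.cases(df))  # row numbers containing at least one NA

clean <- na.omit(df)                           # listwise deletion: drop every incomplete row

na_per_column
incomplete_rows
nrow(clean)
```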
> fa.parallel(dataf[-23, -7])
Parallel analysis suggests that the number of factors = 1 and the number of components = 1
There were 25 warnings (use warnings() to see them)
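Parallel analysis above suggests 1 factor; judging from the scree plot, however, 2 factors are taken as appropriate for the factor analysis here. The fa function is applied to the selected variables, extracting the common factors by maximum likelihood (ml) with varimax rotation, to find the 2 factors.
> fa(dataf[-23, -7], nfactors=2, fm="ml", rotate="varimax", scores=T)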
Factor Analysis using method = ml
Call: fa(r = dataf[-23, -7], nfactors = 2, rotate = "varimax",
scores = T, fm = "ml")
Standardized loadings (pattern matrix) based upon correlation matrix
ML1 ML2 h2 u2 com
流动比率 0.60 0.56 0.67 0.331 2.0
净资产负债比率 0.98 0.21 1.00 0.005 1.1
资产固定资产比率 -0.65 -0.17 0.45 0.547 1.1
每股收益 0.05 -0.47 0.23 0.773 1.0
净利润 0.53 0.84 1.00 0.005 1.7
增长率 -0.63 0.02 0.39 0.608 1.0
ML1 ML2
SS loadings 2.41 1.32
Proportion Var 0.40 0.22
Cumulative Var 0.40 0.62
Proportion Explained 0.65 0.35
Cumulative Proportion 0.65 1.00
Mean item complexity = 1.3
Test of the hypothesis that 2 factors are sufficient.
The degrees of freedom for the null model are 15 and the objective function was 3.13 with Chi Square of 56.79
The degrees of freedom for the model are 4 and the objective function was 0.02
The root mean square of the residuals (RMSR) is 0.02
The df corrected root mean square of the residuals is 0.04
The harmonic number of observations is 22 with the empirical chi square 0.22 with prob < 0.99
The total number of observations was 22 with Likelihood Chi Square = 0.38 with prob < 0.98
Tucker Lewis Index of factoring reliability = 1.361
RMSEA index = 0 and the 90 % confidence intervals are 0 0
BIC = -11.99
Fit based upon off diagonal values = 1
Measures of factor score adequacy
ML1
Correlation of (regression) scores with factors 1.00
Multiple R square of scores with factors 0.99
Minimum correlation of possible factor scores 0.99
ML2
Correlation of (regression) scores with factors 0.99
Multiple R square of scores with factors 0.99
Minimum correlation of possible factor scores 0.98
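Interpretation of the results: the cumulative variance (Cumulative Var) of the two factors is 62%, meaning that the two factors explain 62% of the information in all the variables. The variables relate to the two factors as follows:
Current ratio (流动比率) = 0.60 × Factor A + 0.56 × Factor B
Net asset-liability ratio (净资产负债比率) = 0.98 × Factor A + 0.21 × Factor B
Asset to fixed-asset ratio (资产固定资产比率) = -0.65 × Factor A - 0.17 × Factor B
Earnings per share (每股收益) = 0.05 × Factor A - 0.47 × Factor B
Net profit (净利润) = 0.53 × Factor A + 0.84 × Factor B
Growth rate (增长率) = -0.63 × Factor A + 0.02 × Factor B
Factor A mainly influences the current ratio, the net asset-liability ratio, the asset to fixed-asset ratio and the growth rate. Its effect on the current ratio and the net asset-liability ratio is positive, while its effect on the asset to fixed-asset ratio and the growth rate is negative. It is called the asset factor.
Factor B mainly influences earnings per share and net profit. Its effect on net profit is positive, while its effect on earnings per share is negative. It is called the earnings factor.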
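With orthogonal factors, these loading equations also imply how strongly any two variables should correlate: the model-implied correlation is the sum of the products of their loadings. The sketch below works this out from the loadings printed above (two decimals, so results are approximate); the English row names are illustrative labels, not the column names in the data.

```r
# Rotated ML loadings copied from the output above; row names are illustrative English labels
L <- matrix(c( 0.60,  0.56,
               0.98,  0.21,
              -0.65, -0.17,
               0.05, -0.47,
               0.53,  0.84,
              -0.63,  0.02),
            ncol = 2, byrow = TRUE,
            dimnames = list(c("current_ratio", "net_asset_liability_ratio",
                              "asset_fixed_asset_ratio", "eps",
                              "net_profit", "growth_rate"),
                            c("ML1", "ML2")))

h2      <- rowSums(L^2)      # communality of each variable
implied <- L %*% t(L)        # model-implied correlations (off-diagonal entries)

round(h2, 2)
# Implied correlation between current ratio and net profit:
# 0.60 * 0.53 + 0.56 * 0.84 = 0.7884
implied["current_ratio", "net_profit"]
```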