关联分析是发现交易数据内有趣联系的一种方法,比如著名的“啤酒-尿布”。频繁序列模式挖掘,可以预测购买行为,生物序列等等。
10.2 数据转换成事务
链表、矩阵和数据框架转换成事务
# 数据转换成事务
install.packages("arules")
library(arules)
tr_list <- list(c("Apple", "Bread", "Cake"),
c("Apple", "Bread", "Milk"),
c("Apple", "Cake", "Milk"))
names(tr_list) <- paste("Tr",c(1:3), sep = "")
trans <- as(tr_list, "transactions");trans
tr_matrix <- matrix(
c(1,1,1,0,
1,1,0,1,
0,1,1,1),ncol = 4
)
dimnames(tr_matrix) <- list(
paste("Tr",c(1:3), sep = ""),
c("Apple", "Bread", "Cake", "Milk")
)
trans2 <- as(tr_matrix, "transactions");trans2
Tr_df <- data.frame(
TrID = as.factor(c(1,2,1,1,2,3,2,3,2,3)),
Item = as.factor(c("Apple", "Milk", "Cake", "Bread",
"Cake", "Milk", "Apple", "Cake",
"Bread", "Bread"))
)
trans3 <- as(split(Tr_df[,"Item"], Tr_df[,"TrID"]),
"transactions");trans3
调用as函数给每次事务都加上一个id就完成了数据向事务的类型转换。“transactions"类型来代表规则或频繁项集的事务型数据,是itemMatrix类型的延伸。
10.3 展示事务及关联
R的arule包使用自带的transactions类型来存储事务数据类型。
LIST(trans) # 列表形式展示数据
$Tr1
[1] "Apple" "Bread" "Cake"
$Tr2
[1] "Apple" "Bread" "Milk"
$Tr3
[1] "Apple" "Cake" "Milk"
summary(trans)
transactions as itemMatrix in sparse format with
3 rows (elements/itemsets/transactions) and
4 columns (items) and a density of 0.75
most frequent items:
Apple Bread Cake Milk (Other)
3 2 2 2 0
element (itemset/transaction) length distribution:
sizes
3
3
Min. 1st Qu. Median Mean 3rd Qu. Max.
3 3 3 3 3 3
includes extended item information - examples:
labels
1 Apple
2 Bread
3 Cake
includes extended transaction information - examples:
transactionID
1 Tr1
2 Tr2
3 Tr3
inspect(trans)
items transactionID
[1] {Apple, Bread, Cake} Tr1
[2] {Apple, Bread, Milk} Tr2
[3] {Apple, Cake, Milk} Tr3
filter_trans <- trans[size(trans) >=3]
inspect(filter_trans)
items transactionID
[1] {Apple, Bread, Cake} Tr1
[2] {Apple, Bread, Milk} Tr2
[3] {Apple, Cake, Milk} Tr3
image(trans)
itemFrequencyPlot(trans)
itemFrequency(trans) # 支持度的分布
Apple Bread Cake Milk
1.0000000 0.6666667 0.6666667 0.6666667
10.4 Apriori规则完成关联挖掘
首先找到频繁个体项集,然后再通过广度优先搜索策略生成更大的频繁项集。
data("Groceries")
summary(Groceries)
itemFrequencyPlot(Groceries,support=0.1, cex.names=0.8, topN=10)
rules <- apriori(Groceries, parameter = list(supp=0.001, conf=0.5,
target="rules"))
summary(rules)
inspect(head(rules))
lhs rhs support confidence coverage lift count
[1] {honey} => {whole milk} 0.001118454 0.7333333 0.001525165 2.870009 11
[2] {tidbits} => {rolls/buns} 0.001220132 0.5217391 0.002338587 2.836542 12
[3] {cocoa drinks} => {whole milk} 0.001321810 0.5909091 0.002236909 2.312611 13
[4] {pudding powder} => {whole milk} 0.001321810 0.5652174 0.002338587 2.212062 13
[5] {cooking chocolate} => {whole milk} 0.001321810 0.5200000 0.002541942 2.035097 13
[6] {cereals} => {whole milk} 0.003660397 0.6428571 0.005693950 2.515917 36
rules <- sort(rules, by="confidence", decreasing = TRUE)
inspect(head(rules))
lhs rhs support confidence coverage lift count
[1] {rice,
sugar} => {whole milk} 0.001220132 1 0.001220132 3.913649 12
[2] {canned fish,
hygiene articles} => {whole milk} 0.001118454 1 0.001118454 3.913649 11
[3] {root vegetables,
butter,
rice} => {whole milk} 0.001016777 1 0.001016777 3.913649 10
[4] {root vegetables,
whipped/sour cream,
flour} => {whole milk} 0.001728521 1 0.001728521 3.913649 17
[5] {butter,
soft cheese,
domestic eggs} => {whole milk} 0.001016777 1 0.001016777 3.913649 10
[6] {citrus fruit,
root vegetables,
soft cheese} => {other vegetables} 0.001016777 1 0.001016777 5.168156 10
>
可以通过支持度和关联度两个值来评估规则的强弱,前者表示规则的频率代表两个项集同时出现在一个事务中的概率。这两个指标仅对规则强弱判断有效,一些规则也可能是冗余的,提升度可以评估规则的质量。支持度代表了特定项集地事务数据库中的所占比例,置信度是规则的正确率,提升度是响应目标关联规则与平均响应的比值。
Apriori是最广为人知的关联规则挖掘算法,依靠逐层地广度优先策略来生成候选项集。
还可以调用intersectMeasure函数来获得其他有趣的指标。
head(interestMeasure(rules,c("support", "chiSquare", "confidence",
+ "conviction", "cosine", "coverage",
+ "leverage", "lift", "oddsRatio"),
+ Groceries))
support chiSquared confidence conviction cosine coverage leverage lift
1 0.001220132 35.00650 1 Inf 0.06910260 0.001220132 0.0009083689 3.913649
2 0.001118454 32.08603 1 Inf 0.06616070 0.001118454 0.0008326715 3.913649
3 0.001016777 29.16615 1 Inf 0.06308175 0.001016777 0.0007569741 3.913649
4 0.001728521 49.61780 1 Inf 0.08224854 0.001728521 0.0012868559 3.913649
5 0.001016777 29.16615 1 Inf 0.06308175 0.001016777 0.0007569741 3.913649
6 0.001016777 41.72398 1 Inf 0.07249042 0.001016777 0.0008200380 5.168156
oddsRatio
1 Inf
2 Inf
3 Inf
4 Inf
5 Inf
6 Inf
10.5 去掉冗余规则
关联规则挖掘的两个主要限制是在支持度和置信度之间的选择,去冗余,发现这些规则中真正有意义的信息。初接触这个领域,不是太懂,先放这。
# derep
rules.sorted <- sort(rules, by="lift")
subset.matrix <- is.subset(rules.sorted, rules.sorted)
subset.matrix[lower.tri(subset.matrix, diag=T)] <- NA
redundant <- colSums(subset.matrix, na.rm=TRUE) >= 8
rules.pruned <- rules.sorted[!redundant]
inspect(head(rules.pruned))
lhs rhs support confidence coverage lift count
[1] {citrus fruit,
pip fruit,
bottled water} => {whole milk} 0.001118454 0.5 0.002236909 1.956825 11
[2] {citrus fruit,
pip fruit,
yogurt} => {whole milk} 0.001626843 0.5 0.003253686 1.956825 16
[3] {root vegetables,
rolls/buns,
soda} => {whole milk} 0.002440264 0.5 0.004880529 1.956825 24
[4] {tropical fruit,
other vegetables,
rolls/buns,
soda} => {whole milk} 0.001118454 0.5 0.002236909 1.956825 11
发现按原书>=1就没有规则幸存了,就看了一下,设置成4以上会有,就设置了个8。发现小内存的电脑在这步可能崩溃,因为矩阵很大,比如小云主机,在16G内存的电脑上是能成功运行的。
10.6 关联规则的可视化
重新把上面设置成了1800,为了可视化好看
install.packages("arulesViz")
library(arulesViz)
install.packages('Rcpp')
library(Rcpp) #发现要安装这个包,否则报:Error in stress_major(xinit, W, D, iter, tol) :
function 'Rcpp_precious_remove' not provided by package 'Rcpp'
plot(rules.pruned)
#避免重叠,抖动下
plot(rules.pruned,shading='order', control =list(jitter=6))
# Apriori算法生成左边为soda的新规则
soda_rule <- apriori(data = Groceries, parameter = list(supp=0.001,conf=0.1,minlen=2),
appearance = list(default='rhs',lhs='soda'))
plot(sort(soda_rule,by="lift"), method = 'graph',
control = list(type='items'))
plot(rules.pruned[c(1:66)],method="group") # 这里报:
Error in stats::hclust(stats::dist(m_clust)) :
must have n >= 2 objects to cluster
于是用上一步的代替啦
交互式图形
plot(rules.pruned[c(1:66)],interactive = TRUE)
10.7 使用Eclat挖掘频繁项集
Apriori算法采用广度优先策略来遍历数据库,整体耗时较长;如果数据库可以整个装入内存中,可以使用深度优先的Eclat算法,效率比前者高。
# Eclat
frequentest <- eclat(Groceries, parameter = list(support=0.05,maxlen=10))
summary(frequentest)
inspect(sort(frequentest, by="support")[1:10])
items support count
[1] {whole milk} 0.25551601 2513
[2] {other vegetables} 0.19349263 1903
[3] {rolls/buns} 0.18393493 1809
[4] {soda} 0.17437722 1715
[5] {yogurt} 0.13950178 1372
[6] {bottled water} 0.11052364 1087
[7] {root vegetables} 0.10899847 1072
[8] {tropical fruit} 0.10493137 1032
[9] {shopping bags} 0.09852567 969
[10] {sausage} 0.09395018 924
Apriori算法直接也易于理解,缺点是需要多遍扫描数据库因而会产生大量候选集,支持度的计算很耗时。Eclat算法采用了等价类、深度优先遍历、求次等策略,支持度计算效率有很大改善。前者采用水平数据结构来存放事务,后者采用垂直数据结构来存放每个事务的交易ID,也从频繁项集中生成关联规则。
FP-Growth也是应用非常广的一种关联规则挖掘算法,与Eclat算法相似,也是采用深度优先搜索策略来计算项集支持度,暂时没包支持?2021了,或许有了吧。
10.8 生成时态事务数据
# 时态
install.packages("arulesSequences")
library(arulesSequences)
tmp_data <- list(
c("A"),
c("A","B","C"),
c("D"),
c("c","F"),
c("A","D"),
c("C"),
c("B", "C"),
c("A","E"),
c("E","F"),
c("A","B"),
c("D","F"),
c("C"),
c("B"),
c("E"),
c("G"),
c("A","F"),
c("C"),
c("B"),
c("C")
)
# 转换为事务
names(tmp_data) <- paste0("Tr", c(1:19), seq="")
trans <- as(tmp_data, "transactions")
transactionInfo(trans)$SequenceID <- c(1,1,1,1,1,2,2,2,2,3,3,3,3,3,4,4,4,4,4)
transactionInfo(trans)$eventID<- c(10,20,30,40,50,10,20,30,40,10,20,30,40,50,
10,20,30,40,50)
trans
transactions in sparse format with
19 transactions (rows) and
8 items (columns)
inspect(head(trans))
items transactionID eventID
[1] {A} Tr1 10
[2] {A,B,C} Tr2 20
[3] {D} Tr3 30
[4] {c,F} Tr4 40
[5] {A,D} Tr5 50
[6] {C} Tr6 10
summary(trans)
zaki <- read_baskets(con = system.file("misc", "zaki.txt",
package = "arulesSequences"),
info = c("sequenceID", 'eventID', "SIZE"))
as(zaki,"data.frame")
items sequenceID eventID SIZE
1 {C,D} 1 10 2
2 {A,B,C} 1 15 3
3 {A,B,F} 1 20 3
4 {A,C,D,F} 1 25 4
5 {A,B,F} 2 15 3
6 {E} 2 20 1
7 {A,B,F} 3 10 3
8 {D,G,H} 4 10 3
9 {B,F} 4 20 2
10 {A,G,H} 4 25 3
arulesSequences提供了两个新的数据结构,sequences和timedsequences,用来表示纯序列和时态序列数据。
10.9 cSPADE挖掘频繁时序模式
等价类序列模式挖掘,是广为人知的一种频繁序列模式挖掘算法,利用垂直数据库的特性,通过ID表的交集及有效的搜索策略完成频繁序列模式的挖掘,支持对挖掘到的序列添加约束。
s_result <- cspade(trans, parameter = list(support=0.75), control = list(verbose=TRUE))
# Error in cspade(trans, parameter = list(support = 0.75), control = list(verbose = TRUE)) :
# transactionInfo: missing 'sequenceID' and/or 'eventID'
错报复有点奇怪,不得法呀!这章就到这啦!