Eclat
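The Eclat algorithm is used for frequent itemset mining. Here we look for patterns of similar behavior, as opposed to looking for irregular patterns (as in other data mining approaches). The algorithm uses intersections in the data to compute the support of candidates that frequently occur together, such as shopping-cart items. The frequent candidates are then tested to confirm the patterns in the dataset.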
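The intersection idea can be sketched in base R without any package. This is a minimal illustration on hypothetical toy transactions, not the arules implementation: each item keeps the list of transaction IDs that contain it, and the support of an itemset is the size of the intersection of those lists divided by the number of transactions.

```r
# Toy transactions (hypothetical data, not taken from the Adult dataset)
transactions <- list(
  t1 = c("bread", "milk"),
  t2 = c("bread", "butter", "milk"),
  t3 = c("butter", "milk"),
  t4 = c("bread", "butter"),
  t5 = c("bread", "butter", "milk")
)

# Vertical layout: for each item, the IDs of the transactions that contain it
items <- sort(unique(unlist(transactions)))
tid_lists <- lapply(items, function(item) {
  contains <- vapply(transactions, function(basket) item %in% basket, logical(1))
  names(transactions)[contains]
})
names(tid_lists) <- items

# Support of an itemset = size of the intersection of its items' ID lists,
# divided by the number of transactions
itemset_support <- function(itemset) {
  tids <- Reduce(intersect, tid_lists[itemset])
  length(tids) / length(transactions)
}

supp_bread_milk <- itemset_support(c("bread", "milk"))
supp_all_three  <- itemset_support(c("bread", "butter", "milk"))
print(supp_bread_milk)
print(supp_all_three)
```

Adding an item to an itemset can only shrink the intersection, which is why support never increases as itemsets grow.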
1. Using eclat to find similarities in adult behavior
> library(Matrix)
Warning message:
package ‘Matrix’ was built under R version 3.6.3
> library(arules)
Attaching package: ‘arules’
The following objects are masked from ‘package:base’:
abbreviate, write
Warning message:
package ‘arules’ was built under R version 3.6.3
> data("Adult")
> dim(Adult)
[1] 48842 115
> summary(Adult)
transactions as itemMatrix in sparse format with
48842 rows (elements/itemsets/transactions) and
115 columns (items) and a density of 0.1089939
most frequent items:
capital-loss=None capital-gain=None
46560 44807
native-country=United-States race=White
43832 41762
workclass=Private (Other)
33906 401333
element (itemset/transaction) length distribution:
sizes
9 10 11 12 13
19 971 2067 15623 30162
Min. 1st Qu. Median Mean 3rd Qu. Max.
9.00 12.00 13.00 12.53 13.00 13.00
includes extended item information - examples:
labels variables levels
1 age=Young age Young
2 age=Middle-aged age Middle-aged
3 age=Senior age Senior
includes extended transaction information - examples:
transactionID
1 1
2 2
3 3
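Looking at the results, we notice the following details:
The summary reports 48842 rows and 115 columns.
The most frequent items are listed, for example race=White.
There are many descriptors, such as age=Young.
2. Finding frequent items in the dataset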
> data("Adult")
> itemsets <- eclat(Adult)
Eclat
parameter specification:
tidLists support minlen maxlen target ext
FALSE 0.1 1 10 frequent itemsets TRUE
algorithmic control:
sparse sort verbose
7 -2 TRUE
Absolute minimum support count: 4884
create itemset ...
set transactions ...[115 item(s), 48842 transaction(s)] done [0.04s].
sorting and recoding items ... [31 item(s)] done [0.01s].
creating bit matrix ... [31 row(s), 48842 column(s)] done [0.00s].
writing ... [2616 set(s)] done [0.01s].
Creating S4 object ... done [0.00s].
With the default parameters, 2616 frequent itemsets were found. To look at the top five sets, sort the itemsets and take the first five:
> itemsets.sorted <- sort(itemsets)
> itemsets.sorted[1:5]
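From the preceding output we can observe:
Most people in the census data did not claim a capital loss or a capital gain (such tax events are not the norm).
Most people are from the United States.
Most are white.
3. Focusing on the most frequent example
To further confirm the data, we can narrow the result down to the most frequent long itemset in the dataset (by raising the minlen parameter until only one set remains):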
> itemsets <- eclat(Adult, parameter=list(minlen=9))
Eclat
parameter specification:
tidLists support minlen maxlen target ext
FALSE 0.1 9 10 frequent itemsets TRUE
algorithmic control:
sparse sort verbose
7 -2 TRUE
Absolute minimum support count: 4884
create itemset ...
set transactions ...[115 item(s), 48842 transaction(s)] done [0.04s].
sorting and recoding items ... [31 item(s)] done [0.01s].
creating bit matrix ... [31 row(s), 48842 column(s)] done [0.00s].
writing ... [1 set(s)] done [0.00s].
Creating S4 object ... done [0.00s].
> inspect(itemsets)
items support transIdenticalToItemsets count
[1] {age=Middle-aged,
workclass=Private,
marital-status=Married-civ-spouse,
relationship=Husband,
race=White,
sex=Male,
capital-gain=None,
capital-loss=None,
native-country=United-States} 0.1056673 5161 5161
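As expected, the most common census record describes a married, working man born in the United States. This single itemset covers 5161 of the 48842 records (support 0.1057).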
arulesNBMiner
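arulesNBMiner is a package for finding co-occurrences of two or more items in a set. The underlying model, the negative binomial model, allows for highly skewed frequency distributions, for which it would otherwise be hard to determine a minimum itemset size. We are looking for frequent itemsets within the larger dataset being mined. When deciding whether to use arulesNBMiner, you should see some indication that itemset frequency is concentrated in subsets of the data.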
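To get a feel for what "highly skewed" means here, the following base R sketch compares the upper tail of a Poisson distribution with that of a negative binomial distribution with the same mean. The mean of 5 and the size of 0.5 are assumptions chosen only for illustration; they are not parameters estimated by arulesNBMiner.

```r
# Assumed parameters for illustration only: both distributions have mean 5
mean_count <- 5
nb_size <- 0.5   # a small 'size' gives strong overdispersion

# Probability of observing a count of 20 or more
pois_tail <- ppois(19, lambda = mean_count, lower.tail = FALSE)
nb_tail   <- pnbinom(19, size = nb_size, mu = mean_count, lower.tail = FALSE)

# Variance of the negative binomial: mu + mu^2 / size
# (the Poisson variance equals its mean)
nb_variance <- mean_count + mean_count^2 / nb_size

print(c(poisson = pois_tail, negbin = nb_tail))
print(nb_variance)
```

With the same mean, the negative binomial puts far more probability on very large counts, which is the kind of behavior a single fixed support threshold handles poorly.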
> library(rJava)
> library(arulesNBMiner)
Warning message:
package ‘arulesNBMiner’ was built under R version 3.6.3
> data(Agrawal)
> summary(Agrawal.db)
transactions as itemMatrix in sparse format with
20000 rows (elements/itemsets/transactions) and
1000 columns (items) and a density of 0.0099933
most frequent items:
item446 item938 item818 item457 item401 (Other)
1638 1514 1450 1397 1389 192478
element (itemset/transaction) length distribution:
sizes
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
16 68 215 427 763 1234 1813 2215 2341 2437 2320 1896 1457 1045 739
16 17 18 19 20 21 22 23 24
447 260 171 74 25 16 15 2 4
Min. 1st Qu. Median Mean 3rd Qu. Max.
1.000 8.000 10.000 9.993 12.000 24.000
includes extended item information - examples:
labels
1 item1
2 item2
3 item3
includes extended transaction information - examples:
transactionID
1 trans1
2 trans2
3 trans3
> summary(Agrawal.pat)
set of 2000 itemsets
most frequent items:
item938 item446 item457 item615 item594 (Other)
38 37 34 29 28 3844
element (itemset/transaction) length distribution:sizes
1 2 3 4 5 6
721 759 353 132 26 9
Min. 1st Qu. Median Mean 3rd Qu. Max.
1.000 1.000 2.000 2.005 3.000 6.000
summary of quality measures:
pWeights pCorrupts
Min. :8.742e-07 Min. :0.0000
1st Qu.:1.476e-04 1st Qu.:0.2748
Median :3.392e-04 Median :0.4881
Mean :5.000e-04 Mean :0.4920
3rd Qu.:6.899e-04 3rd Qu.:0.7085
Max. :3.150e-03 Max. :1.0000
includes transaction ID lists: FALSE
1. Mining the Agrawal data for frequent sets
> mynbparameters <- NBMinerParameters(Agrawal.db)
> mynbminer <- NBMiner(Agrawal.db, parameter=mynbparameters)
> summary(mynbminer)
set of 3462 itemsets
most frequent items:
item594 item446 item818 item938 item208 (Other)
58 54 53 53 52 7138
element (itemset/transaction) length distribution:sizes
1 2 3 4 5
1000 1364 772 266 60
Min. 1st Qu. Median Mean 3rd Qu. Max.
1.00 1.00 2.00 2.14 3.00 5.00
summary of quality measures:
precision
Min. :0.9901
1st Qu.:1.0000
Median :1.0000
Mean :0.9997
3rd Qu.:1.0000
Max. :1.0000
includes transaction ID lists: FALSE
From the preceding output we can observe:
The items are nearly evenly distributed.
Itemset lengths are heavily skewed toward 1 and 2.
Apriori
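Apriori is an algorithm for learning association rules. It operates on transaction data. The algorithm tries to find subsets of items that are common in the dataset; a minimum threshold must be met for an association to be confirmed. The apriori method returns the associations of interest in your dataset, such as a rule X => Y stating that Y tends to appear when X appears. Support is the fraction of transactions that contain both X and Y. Confidence is the fraction of the transactions containing X that also contain Y. In arules the default minimum support is 0.1 (10%) and the default minimum confidence is 0.8 (80%).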
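For a rule X => Y, the standard definitions are support = P(X and Y), confidence = P(X and Y) / P(X), and lift = confidence / P(Y). A minimal base R sketch on hypothetical toy baskets (the data and the helper name `support` are made up for illustration) shows how the three numbers relate:

```r
# Toy baskets (hypothetical data)
baskets <- list(
  c("X", "Y"),
  c("X", "Y", "Z"),
  c("X"),
  c("Y", "Z"),
  c("X", "Y")
)

# Fraction of baskets that contain every item in 'itemset'
support <- function(itemset) {
  mean(vapply(baskets, function(b) all(itemset %in% b), logical(1)))
}

supp_xy    <- support(c("X", "Y"))     # baskets with both X and Y
confidence <- supp_xy / support("X")   # of the baskets with X, the share that also have Y
lift       <- confidence / support("Y")  # confidence relative to Y's baseline frequency

print(c(support = supp_xy, confidence = confidence, lift = lift))
```

A lift below 1, as in this toy data, means Y is slightly less likely in baskets containing X than in baskets overall.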
1. Evaluating associations in a shopping basket
> library(Matrix)
> library(arules)
> tr <- read.transactions("http://labfile.oss.aliyuncs.com/courses/887/retail.dat", format="basket")
> summary(tr)
transactions as itemMatrix in sparse format with
88162 rows (elements/itemsets/transactions) and
16470 columns (items) and a density of 0.0006257289
most frequent items:
39 48 38 32 41 (Other)
50675 42135 15596 15167 14945 770058
element (itemset/transaction) length distribution:
sizes
1 2 3 4 5 6 7 8 9 10 11 12 13 14
3016 5516 6919 7210 6814 6163 5746 5143 4660 4086 3751 3285 2866 2620
15 16 17 18 19 20 21 22 23 24 25 26 27 28
2310 2115 1874 1645 1469 1290 1205 981 887 819 684 586 582 472
29 30 31 32 33 34 35 36 37 38 39 40 41 42
480 355 310 303 272 234 194 136 153 123 115 112 76 66
43 44 45 46 47 48 49 50 51 52 53 54 55 56
71 60 50 44 37 37 33 22 24 21 21 10 11 10
57 58 59 60 61 62 63 64 65 66 67 68 71 73
9 11 4 9 7 4 5 2 2 5 3 3 1 1
74 76
1 1
Min. 1st Qu. Median Mean 3rd Qu. Max.
1.00 4.00 8.00 10.31 14.00 76.00
includes extended item information - examples:
labels
1 0
2 1
3 10
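From the preceding output we can observe:
There are 88162 baskets covering 16470 distinct items.
A few items are very popular (item 39 appears in 50675 baskets).
Let us look at the most frequent items: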
> itemFrequencyPlot(tr, support=0.1)
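The plot is not reproduced here, but the relative frequencies it would show for the top items can be recomputed from the counts in the summary above. This is a small arithmetic sketch; only the five items listed in the summary are covered, so other items may also clear the 0.1 cut-off.

```r
# Item counts and basket total copied from the summary(tr) output above
n_baskets <- 88162
item_counts <- c("39" = 50675, "48" = 42135, "38" = 15596,
                 "32" = 15167, "41" = 14945)

# Relative frequency = number of baskets containing the item / number of baskets
rel_freq <- item_counts / n_baskets
print(round(rel_freq, 4))

# Of these five, the items that clear the support = 0.1 cut-off
above_cutoff <- names(rel_freq)[rel_freq >= 0.1]
print(above_cutoff)
```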
Build some rules for the associations:
> rules <- apriori(tr, parameter=list(supp=0.5, conf=0.5))
Apriori
Parameter specification:
confidence minval smax arem aval originalSupport maxtime support minlen
0.5 0.1 1 none FALSE TRUE 5 0.5 1
maxlen target ext
10 rules TRUE
Algorithmic control:
filter tree heap memopt load sort verbose
0.1 TRUE TRUE FALSE TRUE 2 TRUE
Absolute minimum support count: 44081
set item appearances ...[0 item(s)] done [0.00s].
set transactions ...[16470 item(s), 88162 transaction(s)] done [0.47s].
sorting and recoding items ... [1 item(s)] done [0.00s].
creating transaction tree ... done [0.01s].
checking subsets of size 1 done [0.00s].
writing ... [1 rule(s)] done [0.00s].
creating S4 object ... done [0.01s].
> summary(rules)
set of 1 rules
rule length distribution (lhs + rhs):sizes
1
1
Min. 1st Qu. Median Mean 3rd Qu. Max.
1 1 1 1 1 1
summary of quality measures:
support confidence coverage lift
Min. :0.5748 Min. :0.5748 Min. :1 Min. :1
1st Qu.:0.5748 1st Qu.:0.5748 1st Qu.:1 1st Qu.:1
Median :0.5748 Median :0.5748 Median :1 Median :1
Mean :0.5748 Mean :0.5748 Mean :1 Mean :1
3rd Qu.:0.5748 3rd Qu.:0.5748 3rd Qu.:1 3rd Qu.:1
Max. :0.5748 Max. :0.5748 Max. :1 Max. :1
count
Min. :50675
1st Qu.:50675
Median :50675
Mean :50675
3rd Qu.:50675
Max. :50675
mining info:
data ntransactions support confidence
tr 88162 0.5 0.5
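The rule has high support (0.5748), but its confidence is the same 0.5748, which is modest, and its lift is 1.
The specific rule: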
> inspect(rules)
lhs rhs support confidence coverage lift count
[1] {} => {39} 0.5747941 0.5747941 1 1 50675
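As we guessed, most baskets (about 57%) contain item 39. The left-hand side of the rule is empty, so the rule says nothing beyond how frequent item 39 is.
Looking for more information about the rule: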
> interestMeasure(rules, c("support", "chiSquare", "confidence", "conviction", "cosine", "leverage", "lift", "oddsRatio"), tr)
support chiSquared confidence conviction cosine leverage lift oddsRatio
1 0.5747941 NA 0.5747941 1 0.7581518 0 1 NA
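Several of these measures can be checked by hand, because the rule has an empty left-hand side. The sketch below uses the textbook formulas for confidence, lift, leverage, conviction, and cosine, under the assumption that the empty itemset has support 1; it is a plausibility check, not a reimplementation of interestMeasure.

```r
# Counts taken from the output above
n        <- 88162   # transactions
count_39 <- 50675   # baskets containing item 39

supp_rhs  <- count_39 / n   # support of {39}
supp_lhs  <- 1              # assumption: the empty itemset {} is in every basket
supp_rule <- supp_rhs       # {} together with {39} is just {39}

confidence <- supp_rule / supp_lhs
lift       <- confidence / supp_rhs
leverage   <- supp_rule - supp_lhs * supp_rhs
conviction <- (1 - supp_rhs) / (1 - confidence)
cosine     <- supp_rule / sqrt(supp_lhs * supp_rhs)

print(round(c(support = supp_rule, confidence = confidence, lift = lift,
              leverage = leverage, conviction = conviction, cosine = cosine), 7))
```

Lift 1 and leverage 0 mean the rule carries no information beyond the frequency of item 39, which matches the reported values.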
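Determining sequences using TraMineR
The TraMineR package is used to mine and visualize sequences; the idea is to discover sequences. Graphical functions that plot sequence distribution, sequence frequency, turbulence, and more are built into the package. Many naturally occurring datasets contain repeated sequences; in many social science settings, for example, the data naturally cycles through a set of states. In this document I give you an overview of TraMineR by producing a series of sequence discovery tools. Which of these tools you use in your own mining work is up to you.
1. Determining sequences in training and careers
In this example we look at the sequence of events over time as people move from training to work. We expect to see a progression from being unemployed and untrained, to being trained, and finally to full-time employment.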
> library(TraMineR)
TraMineR stable version 2.2-0.1 (Built: 2020-09-07)
Website: http://traminer.unige.ch
Please type 'citation("TraMineR")' for citation information.
Warning message:
package ‘TraMineR’ was built under R version 3.6.3
> data(mvad)
> summary(mvad)
id weight male catholic Belfast N.Eastern
Min. : 1.0 Min. :0.1300 no :342 no :368 no :624 no :503
1st Qu.:178.8 1st Qu.:0.4500 yes:370 yes:344 yes: 88 yes:209
Median :356.5 Median :0.6900
Mean :356.5 Mean :0.9994
3rd Qu.:534.2 3rd Qu.:1.0700
Max. :712.0 Max. :4.4600
Southern S.Eastern Western Grammar funemp gcse5eq fmpr livboth
no :497 no :629 no :595 no :583 no :595 no :452 no :537 no :261
yes:215 yes: 83 yes:117 yes:129 yes:117 yes:260 yes:175 yes:451
Jul.93 Aug.93 Sep.93 Oct.93
school :135 school :135 school :179 school :175
FE : 97 FE : 98 FE :275 FE :276
employment :173 employment :178 employment : 83 employment : 88
training :122 training :127 training :158 training :158
joblessness:185 joblessness:174 joblessness: 17 joblessness: 15
HE : 0 HE : 0 HE : 0 HE : 0
Nov.93 Dec.93 Jan.94 Feb.94
school :174 school :172 school :171 school :172
FE :272 FE :271 FE :263 FE :259
employment : 95 employment : 98 employment :100 employment :100
training :157 training :156 training :158 training :154
joblessness: 14 joblessness: 15 joblessness: 20 joblessness: 27
HE : 0 HE : 0 HE : 0 HE : 0
Mar.94 Apr.94 May.94 Jun.94
school :171 school :171 school :170 school :165
FE :257 FE :251 FE :247 FE :232
employment :106 employment :112 employment :117 employment :130
training :154 training :153 training :150 training :151
joblessness: 24 joblessness: 25 joblessness: 28 joblessness: 34
HE : 0 HE : 0 HE : 0 HE : 0
Jul.94 Aug.94 Sep.94 Oct.94
school :140 school :139 school :143 school :144
FE :196 FE :196 FE :221 FE :222
employment :178 employment :184 employment :167 employment :172
training :142 training :144 training :146 training :137
joblessness: 56 joblessness: 49 joblessness: 35 joblessness: 37
HE : 0 HE : 0 HE : 0 HE : 0
Nov.94 Dec.94 Jan.95 Feb.95
school :144 school :143 school :144 school :143
FE :220 FE :219 FE :218 FE :211
employment :176 employment :181 employment :182 employment :185
training :137 training :133 training :128 training :127
joblessness: 35 joblessness: 36 joblessness: 40 joblessness: 46
HE : 0 HE : 0 HE : 0 HE : 0
Mar.95 Apr.95 May.95 Jun.95
school :143 school :142 school :142 school :139
FE :210 FE :203 FE :200 FE :189
employment :190 employment :199 employment :205 employment :215
training :124 training :120 training :118 training :112
joblessness: 45 joblessness: 48 joblessness: 47 joblessness: 57
HE : 0 HE : 0 HE : 0 HE : 0
Jul.95 Aug.95 Sep.95 Oct.95
school :149 school :149 school : 58 school : 30
FE :140 FE :138 FE :152 FE :137
employment :269 employment :273 employment :305 employment :294
training : 93 training : 88 training : 84 training : 81
joblessness: 58 joblessness: 61 joblessness: 61 joblessness: 57
HE : 3 HE : 3 HE : 52 HE :113
Nov.95 Dec.95 Jan.96 Feb.96
school : 29 school : 29 school : 27 school : 27
FE :136 FE :135 FE :132 FE :132
employment :296 employment :296 employment :301 employment :300
training : 79 training : 80 training : 81 training : 80
joblessness: 56 joblessness: 56 joblessness: 57 joblessness: 60
HE :116 HE :116 HE :114 HE :113
Mar.96 Apr.96 May.96 Jun.96
school : 27 school : 27 school : 27 school : 27
FE :125 FE :125 FE :124 FE :122
employment :308 employment :313 employment :315 employment :324
training : 78 training : 78 training : 78 training : 74
joblessness: 61 joblessness: 56 joblessness: 55 joblessness: 53
HE :113 HE :113 HE :113 HE :112
Jul.96 Aug.96 Sep.96 Oct.96
school : 18 school : 17 school : 8 school : 0
FE : 83 FE : 83 FE : 82 FE : 79
employment :388 employment :392 employment :387 employment :379
training : 58 training : 55 training : 51 training : 51
joblessness: 58 joblessness: 59 joblessness: 59 joblessness: 56
HE :107 HE :106 HE :125 HE :147
Nov.96 Dec.96 Jan.97 Feb.97
school : 0 school : 0 school : 0 school : 0
FE : 80 FE : 80 FE : 79 FE : 79
employment :378 employment :380 employment :382 employment :385
training : 50 training : 49 training : 46 training : 43
joblessness: 56 joblessness: 56 joblessness: 59 joblessness: 59
HE :148 HE :147 HE :146 HE :146
Mar.97 Apr.97 May.97 Jun.97
school : 0 school : 0 school : 0 school : 0
FE : 76 FE : 75 FE : 74 FE : 72
employment :386 employment :392 employment :394 employment :400
training : 42 training : 40 training : 38 training : 37
joblessness: 61 joblessness: 60 joblessness: 61 joblessness: 60
HE :147 HE :145 HE :145 HE :143
Jul.97 Aug.97 Sep.97 Oct.97
school : 0 school : 0 school : 0 school : 0
FE : 44 FE : 44 FE : 37 FE : 29
employment :429 employment :431 employment :435 employment :434
training : 26 training : 22 training : 24 training : 23
joblessness: 78 joblessness: 80 joblessness: 75 joblessness: 73
HE :135 HE :135 HE :141 HE :153
Nov.97 Dec.97 Jan.98 Feb.98
school : 0 school : 0 school : 0 school : 0
FE : 29 FE : 29 FE : 27 FE : 26
employment :441 employment :443 employment :443 employment :444
training : 22 training : 22 training : 21 training : 17
joblessness: 67 joblessness: 66 joblessness: 70 joblessness: 74
HE :153 HE :152 HE :151 HE :151
Mar.98 Apr.98 May.98 Jun.98
school : 0 school : 0 school : 0 school : 0
FE : 26 FE : 26 FE : 25 FE : 25
employment :447 employment :449 employment :450 employment :454
training : 17 training : 17 training : 16 training : 15
joblessness: 72 joblessness: 71 joblessness: 72 joblessness: 71
HE :150 HE :149 HE :149 HE :147
Jul.98 Aug.98 Sep.98 Oct.98
school : 0 school : 0 school : 0 school : 0
FE : 14 FE : 14 FE : 14 FE : 9
employment :477 employment :482 employment :479 employment :482
training : 11 training : 11 training : 13 training : 13
joblessness: 81 joblessness: 80 joblessness: 85 joblessness: 82
HE :129 HE :125 HE :121 HE :126
Nov.98 Dec.98 Jan.99 Feb.99
school : 0 school : 0 school : 0 school : 0
FE : 8 FE : 8 FE : 9 FE : 9
employment :484 employment :481 employment :484 employment :485
training : 12 training : 13 training : 13 training : 10
joblessness: 83 joblessness: 85 joblessness: 82 joblessness: 85
HE :125 HE :125 HE :124 HE :123
Mar.99 Apr.99 May.99 Jun.99
school : 0 school : 0 school : 0 school : 0
FE : 9 FE : 9 FE : 9 FE : 9
employment :483 employment :483 employment :482 employment :484
training : 9 training : 9 training : 8 training : 8
joblessness: 88 joblessness: 89 joblessness: 93 joblessness: 93
HE :123 HE :122 HE :120 HE :118
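We can see the standard identifiers such as weight (the sample weight), sex, and religion. Next we extract the sequence data (columns 17 through 86, which hold each person's state at the successive survey months) and pass that part of the data to the sequence definition function.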
> myseq <- seqdef(mvad, 17:86)
[>] 6 distinct states appear in the data:
1 = employment
2 = FE
3 = HE
4 = joblessness
5 = school
6 = training
[>] state coding:
[alphabet] [label] [long label]
1 employment employment employment
2 FE FE FE
3 HE HE HE
4 joblessness joblessness joblessness
5 school school school
6 training training training
[>] 712 sequences in the data set
[>] min/max sequence length: 70/70
This looks right: we get the row sequence data we need, coded with the relevant states (joblessness, school, training, employment, and so on).
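As a quick check on the column range, the sketch below rebuilds the column names from the summary(mvad) output (14 covariates followed by monthly columns from Jul.93 to Jun.99) and looks at what positions 17 to 86 select. It assumes the columns are stored in the order in which summary prints them.

```r
# The 14 covariate names shown in summary(mvad) above
covariates <- c("id", "weight", "male", "catholic", "Belfast", "N.Eastern",
                "Southern", "S.Eastern", "Western", "Grammar", "funemp",
                "gcse5eq", "fmpr", "livboth")

# Rebuild the 72 monthly labels from Jul.93 to Jun.99
offset <- 0:71 + 6    # months elapsed since January 1993
months <- paste(month.abb[offset %% 12 + 1],
                sprintf("%02d", as.integer(93 + offset %/% 12)),
                sep = ".")

col_names <- c(covariates, months)
selected  <- col_names[17:86]

print(length(col_names))                  # total number of columns
print(selected[c(1, length(selected))])   # first and last selected month
```

Under that assumption the range skips Jul.93 and Aug.93 and keeps 70 months, which agrees with the min/max sequence length of 70 reported by seqdef.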
> seqiplot(myseq)
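The index plot shows training across successive months, with clearly defined transitions between each individual's states. You should verify that what the data displays agrees with your understanding of the sequence data.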
> seqfplot(myseq)
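Now we look at the frequency of the sequences over time. Looking across them, we see groups of people who share the same sequence, such as a period of training followed by employment.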
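The counting step behind a sequence frequency plot can be sketched in base R. The toy sequences below are hypothetical, and this is not how TraMineR implements seqfplot; it only shows what "same sequence" means when people are grouped.

```r
# Toy state sequences, one row per person (hypothetical data)
toy <- rbind(
  c("training", "training", "employment"),
  c("school",   "school",   "employment"),
  c("training", "training", "employment"),
  c("school",   "FE",       "employment"),
  c("training", "training", "employment")
)

# Collapse each row into one string and count identical sequences
seq_strings <- apply(toy, 1, paste, collapse = "-")
seq_freq <- sort(table(seq_strings), decreasing = TRUE)
print(seq_freq)

# The most frequent sequence and the number of people who follow it
most_common <- names(seq_freq)[1]
most_common_count <- seq_freq[[1]]
```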
> seqdplot(myseq)
Here we look at how the sequence states are distributed over time. In general, people start working after school or training.
> seqHtplot(myseq)
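The entropy changes over time: a marked decrease is followed by a slight rise. This is consistent with different groups of people making different choices at first (many states), such as school or training, and then moving into the workforce (one state).
Another interesting idea is the turbulence of the data. Turbulence conveys how many different subsequent sequences can be derived from a particular instance visible in the data.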
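The overall drop in entropy can be checked roughly from the state counts in the summary(mvad) output for the first and last months of the sequence range. The sketch below computes a Shannon entropy normalised by the log of the number of states; the helper name `state_entropy` is made up here, and the normalisation is an assumption about how the plotted values are scaled.

```r
# State counts copied from the summary(mvad) output above
# order: school, FE, employment, training, joblessness, HE
sep93 <- c(179, 275, 83, 158, 17, 0)
jun99 <- c(0, 9, 484, 8, 93, 118)

# Shannon entropy of a count vector, normalised by log(number of states)
# so that 0 = everyone in one state and 1 = all states equally frequent
state_entropy <- function(counts) {
  p <- counts / sum(counts)
  p <- p[p > 0]                    # treat 0 * log(0) as 0
  -sum(p * log(p)) / log(length(counts))
}

h_start <- state_entropy(sep93)
h_end   <- state_entropy(jun99)
print(c(Sep.93 = h_start, Jun.99 = h_end))
```

The later month is dominated by employment, so its entropy is lower than that of the first month, where people are spread across school, FE, and training.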
> myturbulence <- seqST(myseq)
> hist(myturbulence)
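We see a roughly normal distribution with a long tail. Most sequences pass through only a handful of subsequent states, with a few outliers that have noticeably more or fewer.
Sequence similarities
Longest common prefix (LCP): we can measure similarity by comparing the longest prefix that two sequences share.
Longest common subsequence (LCS): we can also measure similarity by the longest subsequence, internal to the two sequences, that they have in common.
Optimal matching (OM) distance: the optimal edit distance, that is, the minimal total cost of the insertions, deletions, and substitutions needed to turn one sequence into the other.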
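The OM idea can be sketched as a classic edit distance computed by dynamic programming. The costs below (1 for an insertion or deletion, 2 for a substitution) and the short example sequences are assumptions for this sketch, and the function `om_distance` is hypothetical, not part of TraMineR.

```r
# Edit distance between two state vectors.
# Costs are assumptions for this sketch: insertion/deletion = 1, substitution = 2
om_distance <- function(a, b, indel = 1, sub = 2) {
  n <- length(a)
  m <- length(b)
  d <- matrix(0, n + 1, m + 1)
  d[, 1] <- (0:n) * indel          # delete every element of a
  d[1, ] <- (0:m) * indel          # insert every element of b
  for (i in seq_len(n)) {
    for (j in seq_len(m)) {
      cost <- if (a[i] == b[j]) 0 else sub
      d[i + 1, j + 1] <- min(d[i, j + 1] + indel,   # deletion
                             d[i + 1, j] + indel,   # insertion
                             d[i, j] + cost)        # match or substitution
    }
  }
  d[n + 1, m + 1]
}

# One insertion turns S-U into S-U-M
d1 <- om_distance(c("S", "U"), c("S", "U", "M"))
# Deleting S and inserting MC turns S-U-M into U-M-MC
d2 <- om_distance(c("S", "U", "M"), c("U", "M", "MC"))
print(c(d1, d2))
```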
Defining a usable sequence object
> data(famform)
> seq <- seqdef(famform)
[>] found missing values ('NA') in sequence data
[>] preparing 5 sequences
[>] coding void elements with '%' and missing values with '*'
[>] 5 distinct states appear in the data:
1 = M
2 = MC
3 = S
4 = SC
5 = U
[>] state coding:
[alphabet] [label] [long label]
1 M M M
2 MC MC MC
3 S S S
4 SC SC SC
5 U U U
[>] 5 sequences in the data set
[>] min/max sequence length: 2/5
> seq
Sequence
[1] S-U
[2] S-U-M
[3] S-U-M-MC
[4] S-U-M-MC-SC
[5] U-M-MC
Determine the length of the LCP of sequences 3 and 4; the LCS metric can be computed directly in the same way:
> seqLLCP(seq[3,], seq[4,])
[1] 4
> seqLLCS(seq[1,], seq[2,])
[1] 2
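Both results can be cross-checked with a small base R sketch that works on plain character vectors. The helper functions `lcp_length` and `lcs_length` are hypothetical names written for this check; they are not TraMineR functions.

```r
# Sequences 1 to 4 as printed above
s1 <- c("S", "U")
s2 <- c("S", "U", "M")
s3 <- c("S", "U", "M", "MC")
s4 <- c("S", "U", "M", "MC", "SC")

# Length of the longest common prefix
lcp_length <- function(a, b) {
  k <- min(length(a), length(b))
  same <- a[seq_len(k)] == b[seq_len(k)]
  if (all(same)) k else which(!same)[1] - 1
}

# Length of the longest common subsequence (dynamic programming)
lcs_length <- function(a, b) {
  d <- matrix(0, length(a) + 1, length(b) + 1)
  for (i in seq_along(a)) {
    for (j in seq_along(b)) {
      if (a[i] == b[j]) {
        d[i + 1, j + 1] <- d[i, j] + 1
      } else {
        d[i + 1, j + 1] <- max(d[i, j + 1], d[i + 1, j])
      }
    }
  }
  d[length(a) + 1, length(b) + 1]
}

lcp_34 <- lcp_length(s3, s4)
lcs_12 <- lcs_length(s1, s2)
print(c(lcp_34 = lcp_34, lcs_12 = lcs_12))
```

Sequence 3 is a complete prefix of sequence 4, and sequence 1 is contained in sequence 2, so the values agree with the seqLLCP and seqLLCS results above.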