Chapter 4 Data Mining

Process mining builds on two pillars: (a) process modeling and analysis and (b) data mining.

A data set

4.1 Classification of Data Mining Techniques

Data mining is defined as "the analysis of data sets to find unsuspected relationships and to summarize the data in novel ways that are both understandable and useful to the data owner".

4.1.1 Data Sets: Instances and Variables

  • Categorical variables have a limited set of possible values and can easily be enumerated.
  • Numerical variables have an ordering and cannot be enumerated easily.

Categorical variables are typically subdivided into ordinal variables and nominal variables.

  • Nominal variables have no logical ordering.
  • Ordinal variables have an ordering associated with them.

Data mining techniques make fewer assumptions about the format of the input data than process mining techniques.

We classify data mining techniques into two main categories: supervised learning and unsupervised learning.

4.1.2 Supervised Learning: Classification and Regression

Supervised learning assumes labeled data, i.e., there is a response variable (the dependent variable) that labels each instance.

Techniques for supervised learning can be further subdivided into classification and regression depending on the type of response variable (categorical or numerical).

  • Classification techniques
    Classification techniques assume a categorical response variable and the goal is to classify instances based on the predictor variables.

  • Regression techniques
    Regression techniques assume a numerical response variable. The goal is to find a function that fits the data with the least error.
    The most frequently used regression technique is linear regression.

Classification requires a categorical response variable. In some cases it makes sense to transform a numerical response variable into a categorical one.
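
As a hedged illustration of the regression case, the sketch below fits a least-squares line with numpy and then thresholds the numerical response into a categorical one; the age/weight data and the threshold of 80 are made up for this example, not taken from the text.

```python
import numpy as np

# Hypothetical example data: predictor (age) and numerical response (weight).
ages = np.array([25, 32, 40, 47, 55, 61], dtype=float)
weights = np.array([68, 74, 79, 83, 88, 90], dtype=float)

# Linear regression: find a, b minimizing the squared error of weight ~ a*age + b.
a, b = np.polyfit(ages, weights, deg=1)
print(f"weight ~ {a:.2f} * age + {b:.2f}")

# Transforming the numerical response into a categorical one (e.g., "heavy" vs.
# "not heavy") would turn the same data into a classification problem.
heavy = weights > 80
print(heavy)
```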

4.1.3 Unsupervised Learning: Clustering and Pattern Discovery

Unsupervised learning assumes unlabeled data, i.e., the variables are not split into response and predictor variables.

  • Clustering algorithms
    Clustering algorithms examine the data to find groups of instances that are similar. Well-known techniques for clustering are k-means clustering and agglomerative hierarchical clustering.

  • Pattern Discovery
    There are many techniques to discover patterns in data. Often the goal is to find rules of the form
    IF X THEN Y where X and Y relate values of different variables. The most well-known technique is association rule mining.

My current understanding of Section 4.1:
Section 4.1.1 mainly introduces basic data mining concepts.

Data mining analyzes data sets to find hidden relationships among the data.
Variables are divided into categorical variables and numerical variables.
Depending on whether there is a logical ordering, categorical variables are further divided into nominal variables and ordinal variables.
Data mining makes fewer assumptions about the input format than process mining. (This point needs more thought.)
Data mining techniques fall into two main categories: supervised learning and unsupervised learning.

Section 4.1.2 mainly introduces supervised learning.

Supervised learning requires labeled data (each instance carries a response variable).
Depending on the type of the response variable (categorical or numerical), it is further divided into classification techniques and regression techniques.
Classification aims to group instances into classes according to a categorical response variable.
Regression aims to find a function, with as little error as possible, that describes the relationship between the response variable and the predictor variables.

Section 4.1.3 mainly introduces unsupervised learning.

Unsupervised learning works on unlabeled data (no distinction between predictor and response variables).
Clustering aims to partition the data into a number of (homogeneous?) groups.
Pattern discovery aims to find patterns in the data (relationships among the data).


4.2 Decision Tree Learning

Decision tree learning is a supervised learning technique aiming at the classification of instances based on predictor variables.
There is one categorical response variable labeling the data and the result is arranged in the form of a tree.

Leaf nodes correspond to possible values of the response variable.
Non-leaf nodes correspond to predictor variables.

In the context of decision tree learning, predictor variables are referred to as attributes.
Every attribute node splits a set of instances into two or more subsets. The root node corresponds to all instances.

A decision tree

All leaf nodes have two numbers. The first number indicates the number of instances classified as such.
The second number indicates the number of instances corresponding to the leaf node but wrongly classified.

An attribute may appear multiple times in a tree

Based on an attribute, a set of instances may also be split into three (or even more) subsets. An attribute may appear multiple times in a tree but not twice on the same path.

Decision trees such as the ones shown can be obtained using a variety of techniques. Most of the techniques use a recursive top-down algorithm that works as follows:

    1. Create the root node r and associate all instances to the root node. X := {r} is the set of nodes to be traversed.
    2. If X = ∅, then return the tree with root r and end.
    3. Select x ∈ X and remove it from X, i.e., X := X \ {x}. Determine the "score" s_old(x) of node x before splitting, e.g., based on entropy.
    4. Determine if splitting is possible/needed. If not, go to step 2, otherwise continue with the next step.
    5. For all possible attributes a ∈ A, evaluate the effects of splitting on the attribute. Select the attribute a providing the best improvement. The same attribute should not appear multiple times on the same path from the root. Also note that for numerical attributes, so-called "cut values" need to be determined.
    6. If the improvement is substantial enough, create a set of child nodes Y, add Y to X (i.e., X := X ∪ Y), and connect x to all child nodes in Y.
    7. Associate each node in Y to its corresponding set of instances and go to step 2.
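
The steps above can be sketched as a small recursive function, assuming categorical attributes only and using entropy reduction as the score; the smoker/drinker/young data set is hypothetical and many ingredients (cut values for numerical attributes, stopping thresholds, pruning) are left out.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Entropy of a multi-set of response values, in bits."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def build_tree(instances, attributes, response):
    """Recursively split instances on the attribute giving the largest entropy reduction."""
    labels = [inst[response] for inst in instances]
    # Stop if the node is pure or no attributes are left to split on.
    if len(set(labels)) == 1 or not attributes:
        return Counter(labels).most_common(1)[0][0]  # leaf: majority class
    score_old = entropy(labels)
    best_attr, best_gain = None, 0.0
    for a in attributes:
        values = set(inst[a] for inst in instances)
        # Weighted entropy of the subsets obtained by splitting on attribute a.
        score_new = sum(
            (len(sub) / len(instances)) * entropy([i[response] for i in sub])
            for v in values
            for sub in [[i for i in instances if i[a] == v]]
        )
        if score_old - score_new > best_gain:
            best_attr, best_gain = a, score_old - score_new
    if best_attr is None:  # no improvement: make a leaf
        return Counter(labels).most_common(1)[0][0]
    remaining = [a for a in attributes if a != best_attr]  # not twice on the same path
    return {
        (best_attr, v): build_tree([i for i in instances if i[best_attr] == v], remaining, response)
        for v in set(i[best_attr] for i in instances)
    }

# Hypothetical data set: 'young' is the categorical response variable.
data = [
    {"smoker": "yes", "drinker": "yes", "young": "no"},
    {"smoker": "no",  "drinker": "yes", "young": "yes"},
    {"smoker": "no",  "drinker": "no",  "young": "yes"},
    {"smoker": "yes", "drinker": "no",  "young": "no"},
]
print(build_tree(data, ["smoker", "drinker"], "young"))
```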

These are just a few of the many ingredients that determine a complete decision tree learning algorithm.
The crucial thing to see is that, by splitting the set of instances into subsets, the variation within each subset becomes smaller. This can best be illustrated using the notion of entropy.

Entropy: Encoding uncertainty

Entropy is an information-theoretic measure for the uncertainty in a multi-set of elements. If the multi-set contains many different elements and each element is unique, then variation is maximal and it takes many "bits" to encode the individual elements. Hence, the entropy is "high". If all elements in the multi-set are the same, then no bits are needed to encode the individual elements. In this case the entropy is "low". For example, the entropy of the multi-set [a, b, c, d, e] is much higher than the entropy of the multi-set [a^5], even though both multi-sets have the same number of elements (5).
Assume that there is a multi-set X with n elements and there are k possible values, say v_1, v_2, ..., v_k, i.e., X is a multi-set over V = {v_1, v_2, ..., v_k} with |X| = n. Each value v_i appears c_i times in X, i.e., X = [(v_1)^{c_1}, (v_2)^{c_2}, ..., (v_k)^{c_k}]. Without loss of generality, we can assume that c_i ≥ 1 for all i, because values that do not appear in X can be removed from V upfront. The proportion of elements having value v_i is p_i, i.e., p_i = c_i/n. The entropy of X is measured in bits of information and is defined by the formula:
E = -\sum_{i=1}^{k} p_i \log_2 p_i
If all elements in X have the same value, i.e., k = 1 and p_1 = 1, then E = -\log_2 1 = 0. This means that no bits are needed to encode the value of an individual element; they are all the same anyway. If all elements in X are different, i.e., k = n and p_i = 1/k, then E = -\sum_{i=1}^{k} (1/k)\log_2(1/k) = \log_2 k. For instance, if there are 4 possible values, then E = \log_2 4 = 2 bits are needed to encode each individual element. If there are 16 possible values, then E = \log_2 16 = 4 bits are needed to encode each individual element.

Entropy is just one of several measures that can be used to measure the diversity in a leaf node. Another measure is the Gini index of diversity that measures the “impurity” of a data set: G=1-\sum _{i=1}^{k}(p_{i})^{2}. If all classifications are the same, then G = 0. G approaches 1 as there is more and more diversity. Hence, an approach can be to select the attribute that maximizes the reduction of the G value (rather than the E value).
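
A small sketch computing both measures for a multi-set of values (my own illustration, not from the book):

```python
from collections import Counter
from math import log2

def entropy(multiset):
    """E = -sum p_i * log2(p_i), measured in bits."""
    n = len(multiset)
    return -sum((c / n) * log2(c / n) for c in Counter(multiset).values())

def gini(multiset):
    """G = 1 - sum p_i^2, the Gini index of diversity."""
    n = len(multiset)
    return 1 - sum((c / n) ** 2 for c in Counter(multiset).values())

print(entropy(list("abcde")), gini(list("abcde")))  # maximal diversity: E = log2(5), G = 0.8
print(entropy(list("aaaaa")), gini(list("aaaaa")))  # no diversity: E = 0, G = 0
```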


4.3 k-Means Clustering

Clustering is concerned with grouping instances into clusters. Instances in one cluster should be similar to each other and dissimilar to instances in other clusters. Clustering uses unlabeled data and, hence, requires an unsupervised learning technique.

Clustering instances in three clusters using k-means

Assume we have a data set with only two variables: age and weight. Using a clustering technique like k-means, the three clusters shown on the right-hand side can be discovered. Ideally, the instances in one cluster are close to one another while being further away from instances in other clusters.
Each of the clusters has a centroid, denoted by a + symbol. The centroid denotes the "center" of the cluster and can be computed by taking the average of the coordinates of the instances in the cluster.

Distance-based clustering algorithms like k-means and agglomerative hierarchical clustering assume a notion of distance. The most common approach is to consider each instance to be an n-dimensional vector, where n is the number of variables, and to simply take the Euclidean distance. For this purpose, ordinal and binary values need to be made numeric.

Step-by-step evolution of k-means

This shows the basic idea of k-means clustering. Here, things are simplified as much as possible, i.e., k = 2 and there are only 10 instances. The approach starts with a random initialization of the two centroids, denoted by the two + symbols. (I do not quite understand this figure: why would the instance points move? It should be the centroids that move.)

The result depends on the initialization. Therefore, it is good to repeatedly execute the algorithm with different initializations and select the best one.
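
A minimal numpy sketch of the k-means loop with repeated random initializations; the ten age/weight instances are made up, and the fixed random seed is only there to make the run reproducible. The run with the smallest within-cluster sum of squares is kept as the best one.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical instances with two variables: age and weight.
X = np.array([[25, 60], [27, 63], [30, 65], [45, 80], [48, 85], [50, 82],
              [62, 70], [65, 72], [60, 75], [23, 58]], dtype=float)

def kmeans(X, k, iterations=100):
    """One run of k-means: random centroids, then alternate assign/update."""
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iterations):
        # Assign each instance to the closest centroid (Euclidean distance).
        labels = np.argmin(np.linalg.norm(X[:, None] - centroids[None, :], axis=2), axis=1)
        # Move each centroid to the average of the instances assigned to it.
        centroids = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                              else centroids[j] for j in range(k)])
    labels = np.argmin(np.linalg.norm(X[:, None] - centroids[None, :], axis=2), axis=1)
    cost = np.sum((X - centroids[labels]) ** 2)  # within-cluster sum of squares
    return labels, centroids, cost

# The result depends on the initialization, so run several times and keep the best.
best = min((kmeans(X, k=2) for _ in range(10)), key=lambda r: r[2])
print(best[1])  # final centroids
```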

Another popular clustering technique is Agglomerative Hierarchical Clustering (AHC).
The approach works as follows. Assign each instance to a dedicated singleton cluster. Now search for the two clusters that are closest to one another. Merge these two clusters into a new cluster. For example, the initial clusters consisting of just a and just b are merged into a new cluster ab. Now search again for the two clusters that are closest to one another and merge them. This is repeated until all instances are in the same cluster.

Agglomerative hierarchical clustering: (a) clusters and (b) dendrogram
Any horizontal line in the dendrogram corresponds to a concrete clustering at a particular level of abstraction.
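
A short sketch of agglomerative hierarchical clustering, assuming scipy is available; the five 2-dimensional points are hypothetical.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster, dendrogram

# Hypothetical instances a, b, c, ... as 2-dimensional points.
points = np.array([[1.0, 1.0], [1.2, 1.1], [5.0, 5.0], [5.1, 4.8], [9.0, 1.0]])

# Repeatedly merge the two closest clusters until one cluster remains.
Z = linkage(points, method="average")

# Cutting the dendrogram with a horizontal line yields a concrete clustering,
# e.g., the clustering obtained at distance threshold 2.0:
print(fcluster(Z, t=2.0, criterion="distance"))

# dendrogram(Z) would draw the tree when a plotting backend is available.
```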

Clustering is only indirectly related to process discovery. Nevertheless, clustering can be used as a preprocessing step for process mining.
If the process model discovered for all cases is too complex to comprehend, then it may be useful to first identify clusters and then discover simpler models per cluster.


4.4 Association Rule Learning

Decision trees can be used to predict the value of some response variable that has been identified as being important. Driven by the response variable, rules like “people who drink and smoke die before 70” can be found. Association rule learning aims at finding similar rules but now without focusing on a particular response variable. The goal is to find rules of the form IF X THEN Y where X is often called the antecedent and Y the consequent. Such rules are also denoted as X⇒Y. X and Y can be any conjunction of “variable = value” terms. The only requirement is that X and Y are nonempty and any variable appears at most once in X and Y.

When learning association rules of the form X⇒Y , three metrics are frequently used: support, confidence, and lift. Let NX be the number of instances for which X holds. NY is the number of instances for which Y holds. NX∧Y is the number of instances for which both X and Y hold. N is the total number of instances.
The support of a rule X⇒Y is defined as support(X⇒Y) = N_{X\wedge Y}/N
The support indicates the applicability of the approach, i.e., the fraction of instances for which both antecedent and consequent hold. Typically, a rule with high support is more useful than a rule with low support.

The confidence of a rule X⇒Y is defined as confidence(X⇒Y) = N_{X\wedge Y}/N_{X}
A rule with high confidence, i.e., a value close to 1, indicates that the rule is very reliable, i.e., if X holds, then Y will also hold. A rule with high confidence is definitely more useful than a rule with low confidence.

The lift of a rule X⇒Y is defined as lift(X⇒Y) = \frac{N_{X\wedge Y}/N}{(N_{X}/N)(N_{Y}/N)} = \frac{N_{X\wedge Y}N}{(N_{X})(N_{Y})}
If X and Y are independent, then the lift will be close to 1. If lift(X⇒Y) > 1, then X and Y correlate positively. For example, lift(X⇒Y) = 5 means that X and Y occur together five times more often than would be the case if they were independent. If lift(X⇒Y) < 1, then X and Y correlate negatively (i.e., the occurrence of X makes Y less likely and vice versa). Rules with a higher lift value are generally considered to be more interesting. However, typically lift values are only considered if certain thresholds with respect to support and confidence are met.
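
These three metrics are straightforward to compute once N_X, N_Y, and N_{X∧Y} have been counted; the sketch below does so for a single hypothetical rule tea ∧ latte ⇒ muffin over a small made-up set of transactions.

```python
# Hypothetical transactions: which items each customer bought.
instances = [
    {"tea", "latte", "muffin"},
    {"tea", "latte"},
    {"tea", "muffin"},
    {"latte", "muffin"},
    {"tea", "latte", "muffin"},
    {"espresso"},
]

def metrics(X, Y, instances):
    """Support, confidence, and lift for the rule X => Y over the given instances."""
    N = len(instances)
    N_X = sum(1 for i in instances if X <= i)
    N_Y = sum(1 for i in instances if Y <= i)
    N_XY = sum(1 for i in instances if (X | Y) <= i)
    support = N_XY / N
    confidence = N_XY / N_X
    lift = (N_XY * N) / (N_X * N_Y)
    return support, confidence, lift

print(metrics({"tea", "latte"}, {"muffin"}, instances))
```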

Association rules can now be generated as follows:

    1. Generate all frequent item-sets, i.e., all sets Z such that N_Z/N ≥ minsup and |Z| ≥ 2.
    2. For each frequent item-set Z, consider all partitionings of Z into two non-empty subsets X and Y.
       If confidence(X⇒Y) ≥ minconf, then keep the rule X⇒Y. If confidence(X⇒Y) < minconf, then discard the rule.
    3. Output the rules found.
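
A brute-force sketch of these three steps using itertools (over the same kind of hypothetical transactions); it enumerates every candidate item-set, which is exactly the computational problem discussed below.

```python
from itertools import combinations

instances = [
    {"tea", "latte", "muffin"},
    {"tea", "latte"},
    {"tea", "muffin"},
    {"latte", "muffin"},
    {"tea", "latte", "muffin"},
    {"espresso"},
]
items = sorted(set().union(*instances))
N, minsup, minconf = len(instances), 0.3, 0.6

def support(Z):
    return sum(1 for i in instances if Z <= i) / N

# Step 1: all frequent item-sets Z with support >= minsup and |Z| >= 2.
frequent = [set(Z) for r in range(2, len(items) + 1)
            for Z in combinations(items, r) if support(set(Z)) >= minsup]

# Steps 2 and 3: split each frequent item-set into antecedent X and consequent Y
# and keep the rule if its confidence is high enough.
for Z in frequent:
    for r in range(1, len(Z)):
        for X in map(set, combinations(sorted(Z), r)):
            Y = Z - X
            conf = support(Z) / support(X)
            if conf >= minconf:
                print(f"{X} => {Y}  (support={support(Z):.2f}, confidence={conf:.2f})")
```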

This simple algorithm has two problems. First of all, there is a computational problem related to the first step. If there are m variables, then there are 2^m − m − 1 possible item-sets. Hence, for 100 products (m = 100) there are 1,267,650,600,228,229,401,496,703,205,275 candidate frequent item-sets. The second problem is that many uninteresting rules are generated. For example, after presenting the rule tea ∧ latte⇒muffin, there is no point in also showing tea⇒latte ∧ muffin, even when it meets the minsup and minconf thresholds. Many techniques have been developed to speed up the generation of association rules and to select the most interesting rules. Here we only sketch the seminal Apriori algorithm.

Apriori: Efficiently generating frequent item-sets

The Apriori algorithm is one of the best known algorithms in computer science. The algorithm, initially developed by Agrawal and Srikant, is able to speed up the generation of association rules by exploiting the following two observations:

    1. If an item-set is frequent (i.e., an item-set with a support above the threshold), then all of its non-empty subsets are also frequent. Formally, for any pair of non-empty item-sets X, Y: if Y ⊆ X and N_{X}/N ≥ minsup, then N_{Y}/N ≥ minsup.
    2. If, for any k, I_k is the set of all frequent item-sets with cardinality k and I_l = ∅ for some l, then I_k = ∅ for all k ≥ l.
       I did not fully understand this point, so I looked up a description in Chinese:

https://www.cnblogs.com/junyuhuang/p/5572364.html
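
My reading of the two observations, as a level-wise sketch: candidates of size k are only built from frequent item-sets of size k−1 (observation 1), and the search stops as soon as a level is empty (observation 2). The transactions are again hypothetical.

```python
from itertools import combinations

instances = [
    {"tea", "latte", "muffin"},
    {"tea", "latte"},
    {"tea", "muffin"},
    {"latte", "muffin"},
    {"tea", "latte", "muffin"},
    {"espresso"},
]
N, minsup = len(instances), 0.3

def support(Z):
    return sum(1 for i in instances if Z <= i) / N

items = sorted(set().union(*instances))
# I_1: frequent item-sets of cardinality 1.
level = [frozenset([i]) for i in items if support(frozenset([i])) >= minsup]
frequent = list(level)

k = 2
while level:  # observation 2: once a level is empty, all larger levels are empty too
    prev = set(level)
    # Candidates of size k whose (k-1)-subsets are all frequent (observation 1).
    candidates = {a | b for a in level for b in level if len(a | b) == k}
    candidates = {c for c in candidates
                  if all(frozenset(s) in prev for s in combinations(c, k - 1))}
    level = [c for c in candidates if support(c) >= minsup]
    frequent.extend(level)
    k += 1

print(frequent)
```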

Association rules are related to process discovery. Recall that the α-algorithm also traverses the event log looking for patterns. However, association rules do not consider the ordering of activities and do not aim to build an overall process model.


4.5 Sequence and Episode Mining

The Apriori algorithm uses the monotonicity property that all subsets of a frequent item-set are also frequent. Many other pattern or rule discovery problems have similar monotonicity properties, thus enabling efficient implementations. A well-known example is the mining of sequential patterns.

4.5.1 Sequence Mining

The Apriori algorithm does not consider the ordering of events. Sequence mining overcomes this problem by analyzing sequences of item-sets.
A sequence is frequent if the pattern is contained in a predefined proportion of the customer sequences in the data set.


Definition of a subsequence
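
Assuming the usual definition from sequential pattern mining, ⟨a_1, ..., a_n⟩ is a subsequence of ⟨b_1, ..., b_m⟩ if each item-set a_j is contained in some item-set b_{i_j} with i_1 < i_2 < ... < i_n. A small sketch checking this by greedy left-to-right matching (my own illustration):

```python
def is_subsequence(pattern, sequence):
    """Check whether a sequence of item-sets is a subsequence of another:
    every item-set of `pattern` must be contained in a distinct item-set of
    `sequence`, in the same order (greedy left-to-right matching)."""
    pos = 0
    for a in pattern:
        while pos < len(sequence) and not a <= sequence[pos]:
            pos += 1
        if pos == len(sequence):
            return False
        pos += 1  # the next pattern item-set must match strictly later
    return True

# Hypothetical customer sequence: item-sets bought on consecutive visits.
customer = [{"tea"}, {"latte", "muffin"}, {"tea", "muffin"}]
print(is_subsequence([{"tea"}, {"muffin"}], customer))           # True
print(is_subsequence([{"muffin"}, {"tea", "latte"}], customer))  # False
```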

4.5.2 Episode Mining

Episode mining and sequence mining can be seen as variants of association rule learning. Because they take into account the ordering of events, they are related to process discovery. However, there are many differences with process mining algorithms.

4.5.3 Other Approaches

  • neural networks
  • hidden Markov models

4.6 Quality of Resulting Models

4.6.1 Measuring the Performance of a Classifier

As in data mining, it is non-trivial to analyze the quality of process mining results.
A basic tool for measuring the performance of a classifier is the confusion matrix.


Confusion matrix for two classes and some performance measures for classifiers
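
A sketch computing some typical performance measures from the four cells of a two-class confusion matrix; the counts (tp, fn, fp, tn) are made up.

```python
# Hypothetical confusion matrix for two classes (positive / negative).
tp, fn = 80, 20   # actual positives: predicted positive / predicted negative
fp, tn = 10, 90   # actual negatives: predicted positive / predicted negative

N = tp + fn + fp + tn
error = (fp + fn) / N                   # fraction of wrongly classified instances
accuracy = (tp + tn) / N
precision = tp / (tp + fp)              # how reliable a positive prediction is
recall = tp / (tp + fn)                 # fraction of positives that are found
f1 = 2 * precision * recall / (precision + recall)

print(f"error={error:.2f} accuracy={accuracy:.2f} "
      f"precision={precision:.2f} recall={recall:.2f} F1={f1:.2f}")
```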

4.6.2 Cross-Validation

Cross-validation using a test and training set

It is important to realize that cross-validation is not limited to classification but can be used for any data mining technique.
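
A minimal sketch of k-fold cross-validation, assuming scikit-learn is installed; the classifier (a decision tree) and the iris data set are just placeholders, and any other technique could be plugged in.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Split the data into k folds; each fold serves once as the test set while
# the remaining folds form the training set.
scores = cross_val_score(DecisionTreeClassifier(), X, y,
                         cv=KFold(n_splits=10, shuffle=True, random_state=0))
print(scores.mean(), scores.std())
```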

4.6.3 Occam’s Razor

I am setting this aside for now and will come back to it later.
