Chapter 4 Data Mining

Process mining builds on two pillars: (a) process modeling and analysis and (b) data mining.

A data set

4.1 Classification of Data Mining Techniques

Data mining is defined as "the analysis of data sets to find unsuspected relationships and to summarize the data in novel ways that are both understandable and useful to the data owner".

4.1.1 Data Sets: Instances and Variables

  • Categorical variables have a limited set of possible values and can easily be enumerated.
  • Numerical variables have an ordering and cannot be enumerated easily.

Categorical variables are typically subdivided into ordinal variables and nominal variables.

  • Nominal variables have no logical ordering.
  • Ordinal variables have an ordering associated with them.

Data mining techniques make fewer assumptions about the format of the input data than process mining techniques.

We classify data mining techniques into two main categories: supervised learning and unsupervised learning.

4.1.2 Supervised Learning: Classification and Regression

Supervised learning assumes labeled data, i.e., there is a response variable (the dependent variable) that labels each instance.

Techniques for supervised learning can be further subdivided into classification and regression depending on the type of response variable (categorical or numerical).

  • Classification techniques
    Classification techniques assume a categorical response variable and the goal is to classify instances based on the predictor variables.

  • Regression techniques
    Regression techniques assume a numerical response variable. The goal is to find a function that fits the data with the least error.
    The most frequently used regression technique is linear regression.

Classification requires a categorical response variable. In some cases it makes sense to transform a numerical response variable into a categorical one.
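
As a hedged illustration of the regression case, the sketch below fits a least-squares line with numpy and then thresholds the numerical response into a categorical one; the age/weight data and the threshold of 80 are made up for this example, not taken from the text.

```python
import numpy as np

# Hypothetical example data: predictor (age) and numerical response (weight).
ages = np.array([25, 32, 40, 47, 55, 61], dtype=float)
weights = np.array([68, 74, 79, 83, 88, 90], dtype=float)

# Linear regression: find a, b minimizing the squared error of weight ~ a*age + b.
a, b = np.polyfit(ages, weights, deg=1)
print(f"weight ~ {a:.2f} * age + {b:.2f}")

# Transforming the numerical response into a categorical one (e.g., "heavy" vs.
# "not heavy") would turn the same data into a classification problem.
heavy = weights > 80
print(heavy)
```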

4.1.3 Unsupervised Learning: Clustering and Pattern Discovery

Unsupervised learning assumes unlabeled data, i.e., the variables are not split into response and predictor variables.

  • Clustering algorithms
    Clustering algorithms examine the data to find groups of instances that are similar. Well-known techniques for clustering are k-means clustering and agglomerative hierarchical clustering.

  • Pattern Discovery
    There are many techniques to discover patterns in data. Often the goal is to find rules of the form
    IF X THEN Y where X and Y relate values of different variables. The most well-known technique is association rule mining.

My current understanding of Section 4.1:
Section 4.1.1 mainly introduces basic data mining concepts.

Data mining analyzes data sets to find hidden relationships among the data.
Variables are divided into categorical variables and numerical variables.
Depending on whether there is a logical ordering, categorical variables are further divided into nominal variables and ordinal variables.
Data mining makes fewer assumptions about the input format than process mining. (This point needs more thought.)
Data mining techniques fall into two main categories: supervised learning and unsupervised learning.

Section 4.1.2 mainly introduces supervised learning.

Supervised learning requires labeled data (each instance carries a response variable).
Depending on the type of the response variable (categorical or numerical), it is further divided into classification techniques and regression techniques.
Classification aims to group instances into classes according to a categorical response variable.
Regression aims to find a function, with as little error as possible, that describes the relationship between the response variable and the predictor variables.

Section 4.1.3 mainly introduces unsupervised learning.

Unsupervised learning works on unlabeled data (no distinction between predictor and response variables).
Clustering aims to partition the data into a number of (homogeneous?) groups.
Pattern discovery aims to find patterns in the data (relationships among the data).


4.2 Decision Tree Learning

Decision tree learning is a supervised learning technique aiming at the classification of instances based on predictor variables.
There is one categorical response variable labeling the data and the result is arranged in the form of a tree.

Leaf nodes correspond to possible values of the response variable.
Non-leaf nodes correspond to predictor variables.

In the context of decision tree learning, predictor variables are referred to as attributes.
Every attribute node splits a set of instances into two or more subsets. The root node corresponds to all instances.

A decision tree

All leaf nodes have two numbers. The first number indicates the number of instances classified as such.
The second number indicates the number of instances corresponding to the leaf node but wrongly classified.

An attribute may appear multiple times in a tree

Based on an attribute, a set of instances may also be split into three (or even more) subsets. An attribute may appear multiple times in a tree but not twice on the same path.

Decision trees such as the ones shown can be obtained using a variety of techniques. Most of the techniques use a recursive top-down algorithm that works as follows:

    1. Create the root node r and associate all instances to the root node. X := {r} is the set of nodes to be traversed.
    2. If X = ∅, then return the tree with root r and end.
    3. Select x ∈ X and remove it from X, i.e., X := X \ {x}. Determine the "score" s_old(x) of node x before splitting, e.g., based on entropy.
    4. Determine if splitting is possible/needed. If not, go to step 2, otherwise continue with the next step.
    5. For all possible attributes a ∈ A, evaluate the effects of splitting on the attribute. Select the attribute a providing the best improvement. The same attribute should not appear multiple times on the same path from the root. Also note that for numerical attributes, so-called "cut values" need to be determined.
    6. If the improvement is substantial enough, create a set of child nodes Y, add Y to X (i.e., X := X ∪ Y), and connect x to all child nodes in Y.
    7. Associate each node in Y to its corresponding set of instances and go to step 2.
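
The steps above can be sketched as a small recursive function, assuming categorical attributes only and using entropy reduction as the score; the smoker/drinker/young data set is hypothetical and many ingredients (cut values for numerical attributes, stopping thresholds, pruning) are left out.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Entropy of a multi-set of response values, in bits."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def build_tree(instances, attributes, response):
    """Recursively split instances on the attribute giving the largest entropy reduction."""
    labels = [inst[response] for inst in instances]
    # Stop if the node is pure or no attributes are left to split on.
    if len(set(labels)) == 1 or not attributes:
        return Counter(labels).most_common(1)[0][0]  # leaf: majority class
    score_old = entropy(labels)
    best_attr, best_gain = None, 0.0
    for a in attributes:
        values = set(inst[a] for inst in instances)
        # Weighted entropy of the subsets obtained by splitting on attribute a.
        score_new = sum(
            (len(sub) / len(instances)) * entropy([i[response] for i in sub])
            for v in values
            for sub in [[i for i in instances if i[a] == v]]
        )
        if score_old - score_new > best_gain:
            best_attr, best_gain = a, score_old - score_new
    if best_attr is None:  # no improvement: make a leaf
        return Counter(labels).most_common(1)[0][0]
    remaining = [a for a in attributes if a != best_attr]  # not twice on the same path
    return {
        (best_attr, v): build_tree([i for i in instances if i[best_attr] == v], remaining, response)
        for v in set(i[best_attr] for i in instances)
    }

# Hypothetical data set: 'young' is the categorical response variable.
data = [
    {"smoker": "yes", "drinker": "yes", "young": "no"},
    {"smoker": "no",  "drinker": "yes", "young": "yes"},
    {"smoker": "no",  "drinker": "no",  "young": "yes"},
    {"smoker": "yes", "drinker": "no",  "young": "no"},
]
print(build_tree(data, ["smoker", "drinker"], "young"))
```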

These are just a few of the many ingredients that determine a complete decision tree learning algorithm.
The crucial thing to see is that, by splitting the set of instances into subsets, the variation within each subset becomes smaller. This can best be illustrated using the notion of entropy.

Entropy: Encoding uncertainty

Entropy is an information-theoretic measure for the uncertainty in a multi-set of elements. If the multi-set contains many different elements and each element is unique, then variation is maximal and it takes many "bits" to encode the individual elements. Hence, the entropy is "high". If all elements in the multi-set are the same, then no bits are needed to encode the individual elements. In this case the entropy is "low". For example, the entropy of the multi-set [a, b, c, d, e] is much higher than the entropy of the multi-set [a^5], even though both multi-sets have the same number of elements (5).
Assume that there is a multi-set X with n elements and there are k possible values, say v_1, v_2, ..., v_k, i.e., X is a multi-set over V = {v_1, v_2, ..., v_k} with |X| = n. Each value v_i appears c_i times in X, i.e., X = [(v_1)^{c_1}, (v_2)^{c_2}, ..., (v_k)^{c_k}]. Without loss of generality, we can assume that c_i ≥ 1 for all i, because values that do not appear in X can be removed from V upfront. The proportion of elements having value v_i is p_i, i.e., p_i = c_i/n. The entropy of X is measured in bits of information and is defined by the formula:
E = -\sum_{i=1}^{k} p_i \log_2 p_i
If all elements in X have the same value, i.e., k = 1 and p_1 = 1, then E = -\log_2 1 = 0. This means that no bits are needed to encode the value of an individual element; they are all the same anyway. If all elements in X are different, i.e., k = n and p_i = 1/k, then E = -\sum_{i=1}^{k} (1/k)\log_2(1/k) = \log_2 k. For instance, if there are 4 possible values, then E = \log_2 4 = 2 bits are needed to encode each individual element. If there are 16 possible values, then E = \log_2 16 = 4 bits are needed to encode each individual element.

Entropy is just one of several measures that can be used to measure the diversity in a leaf node. Another measure is the Gini index of diversity that measures the “impurity” of a data set: G=1-\sum _{i=1}^{k}(p_{i})^{2}. If all classifications are the same, then G = 0. G approaches 1 as there is more and more diversity. Hence, an approach can be to select the attribute that maximizes the reduction of the G value (rather than the E value).
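
A small sketch computing both measures for a multi-set of values (my own illustration, not from the book):

```python
from collections import Counter
from math import log2

def entropy(multiset):
    """E = -sum p_i * log2(p_i), measured in bits."""
    n = len(multiset)
    return -sum((c / n) * log2(c / n) for c in Counter(multiset).values())

def gini(multiset):
    """G = 1 - sum p_i^2, the Gini index of diversity."""
    n = len(multiset)
    return 1 - sum((c / n) ** 2 for c in Counter(multiset).values())

print(entropy(list("abcde")), gini(list("abcde")))  # maximal diversity: E = log2(5), G = 0.8
print(entropy(list("aaaaa")), gini(list("aaaaa")))  # no diversity: E = 0, G = 0
```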


4.3 k-Means Clustering

Clustering is concerned with grouping instances into clusters. Instances in one cluster should be similar to each other and dissimilar to instances in other clusters. Clustering uses unlabeled data and, hence, requires an unsupervised learning technique.

Clustering instances in three clusters using k-means

Assume we have a data set with only two variables: age and weight. Using a clustering technique like k-means, the three clusters shown on the right-hand side can be discovered. Ideally, the instances in one cluster are close to one another while being further away from instances in other clusters.
Each of the clusters has a centroid, denoted by a + symbol. The centroid denotes the "center" of the cluster and can be computed by taking the average of the coordinates of the instances in the cluster.

Distance-based clustering algorithms like k-means and agglomerative hierarchical clustering assume a notion of distance. The most common approach is to consider each instance to be an n-dimensional vector, where n is the number of variables, and to simply take the Euclidean distance. For this purpose, ordinal and binary values need to be made numeric.

Step-by-step evolution of k-means

This shows the basic idea of k-means clustering. Here, things are simplified as much as possible, i.e., k = 2 and there are only 10 instances. The approach starts with a random initialization of the two centroids, denoted by the two + symbols. (I do not quite understand this figure: why would the instance points move? It should be the centroids that move.)

The result depends on the initialization. Therefore, it is good to repeatedly execute the algorithm with different initializations and select the best one.
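
A minimal numpy sketch of the k-means loop with repeated random initializations; the ten age/weight instances are made up, and the fixed random seed is only there to make the run reproducible. The run with the smallest within-cluster sum of squares is kept as the best one.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical instances with two variables: age and weight.
X = np.array([[25, 60], [27, 63], [30, 65], [45, 80], [48, 85], [50, 82],
              [62, 70], [65, 72], [60, 75], [23, 58]], dtype=float)

def kmeans(X, k, iterations=100):
    """One run of k-means: random centroids, then alternate assign/update."""
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iterations):
        # Assign each instance to the closest centroid (Euclidean distance).
        labels = np.argmin(np.linalg.norm(X[:, None] - centroids[None, :], axis=2), axis=1)
        # Move each centroid to the average of the instances assigned to it.
        centroids = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                              else centroids[j] for j in range(k)])
    labels = np.argmin(np.linalg.norm(X[:, None] - centroids[None, :], axis=2), axis=1)
    cost = np.sum((X - centroids[labels]) ** 2)  # within-cluster sum of squares
    return labels, centroids, cost

# The result depends on the initialization, so run several times and keep the best.
best = min((kmeans(X, k=2) for _ in range(10)), key=lambda r: r[2])
print(best[1])  # final centroids
```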

Another popular clustering technique is Agglomerative Hierarchical Clustering (AHC).
The approach works as follows. Assign each instance to a dedicated singleton cluster. Now search for the two clusters that are closest to one another. Merge these two clusters into a new cluster. For example, the initial clusters consisting of just a and just b are merged into a new cluster ab. Now search again for the two clusters that are closest to one another and merge them. This is repeated until all instances are in the same cluster.

Agglomerative hierarchical clustering: (a) clusters and (b) dendrogram
Any horizontal line in the dendrogram corresponds to a concrete clustering at a particular level of abstraction.
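
A short sketch of agglomerative hierarchical clustering, assuming scipy is available; the five 2-dimensional points are hypothetical.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster, dendrogram

# Hypothetical instances a, b, c, ... as 2-dimensional points.
points = np.array([[1.0, 1.0], [1.2, 1.1], [5.0, 5.0], [5.1, 4.8], [9.0, 1.0]])

# Repeatedly merge the two closest clusters until one cluster remains.
Z = linkage(points, method="average")

# Cutting the dendrogram with a horizontal line yields a concrete clustering,
# e.g., the clustering obtained at distance threshold 2.0:
print(fcluster(Z, t=2.0, criterion="distance"))

# dendrogram(Z) would draw the tree when a plotting backend is available.
```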

Clustering is only indirectly related to process discovery. Nevertheless, clustering can be used as a preprocessing step for process mining.
If the process model discovered for all cases is too complex to comprehend, then it may be useful to first identify clusters and then discover simpler models per cluster.


4.4 Association Rule Learning

Decision trees can be used to predict the value of some response variable that has been identified as being important. Driven by the response variable, rules like “people who drink and smoke die before 70” can be found. Association rule learning aims at finding similar rules but now without focusing on a particular response variable. The goal is to find rules of the form IF X THEN Y where X is often called the antecedent and Y the consequent. Such rules are also denoted as X⇒Y. X and Y can be any conjunction of “variable = value” terms. The only requirement is that X and Y are nonempty and any variable appears at most once in X and Y.

When learning association rules of the form X⇒Y , three metrics are frequently used: support, confidence, and lift. Let NX be the number of instances for which X holds. NY is the number of instances for which Y holds. NX∧Y is the number of instances for which both X and Y hold. N is the total number of instances.
The support of a rule X⇒Y is defined as support(X⇒Y) = N_{X\wedge Y}/N
The support indicates the applicability of the approach, i.e., the fraction of instances for which both antecedent and consequent hold. Typically, a rule with high support is more useful than a rule with low support.

The confidence of a rule X⇒Y is defined as confidence(X⇒Y) = N_{X\wedge Y}/N_{X}
A rule with high confidence, i.e., a value close to 1, indicates that the rule is very reliable, i.e., if X holds, then Y will also hold. A rule with high confidence is definitely more useful than a rule with low confidence.

The lift of a rule X⇒Y is defined as lift(X⇒Y) = \frac{N_{X\wedge Y}/N}{(N_{X}/N)(N_{Y}/N)} = \frac{N_{X\wedge Y}N}{(N_{X})(N_{Y})}
If X and Y are independent, then the lift will be close to 1. If lift(X⇒Y) > 1, then X and Y correlate positively. For example, lift(X⇒Y) = 5 means that X and Y occur together five times more often than would be the case if they were independent. If lift(X⇒Y) < 1, then X and Y correlate negatively (i.e., the occurrence of X makes Y less likely and vice versa). Rules with a higher lift value are generally considered to be more interesting. However, typically lift values are only considered if certain thresholds with respect to support and confidence are met.
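
These three metrics are straightforward to compute once N_X, N_Y, and N_{X∧Y} have been counted; the sketch below does so for a single hypothetical rule tea ∧ latte ⇒ muffin over a small made-up set of transactions.

```python
# Hypothetical transactions: which items each customer bought.
instances = [
    {"tea", "latte", "muffin"},
    {"tea", "latte"},
    {"tea", "muffin"},
    {"latte", "muffin"},
    {"tea", "latte", "muffin"},
    {"espresso"},
]

def metrics(X, Y, instances):
    """Support, confidence, and lift for the rule X => Y over the given instances."""
    N = len(instances)
    N_X = sum(1 for i in instances if X <= i)
    N_Y = sum(1 for i in instances if Y <= i)
    N_XY = sum(1 for i in instances if (X | Y) <= i)
    support = N_XY / N
    confidence = N_XY / N_X
    lift = (N_XY * N) / (N_X * N_Y)
    return support, confidence, lift

print(metrics({"tea", "latte"}, {"muffin"}, instances))
```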

Association rules can now be generated as follows:

    1. Generate all frequent item-sets, i.e., all sets Z such that N_Z/N ≥ minsup and |Z| ≥ 2.
    2. For each frequent item-set Z, consider all partitionings of Z into two non-empty subsets X and Y.
       If confidence(X⇒Y) ≥ minconf, then keep the rule X⇒Y. If confidence(X⇒Y) < minconf, then discard the rule.
    3. Output the rules found.
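
A brute-force sketch of these three steps using itertools (over the same kind of hypothetical transactions); it enumerates every candidate item-set, which is exactly the computational problem discussed below.

```python
from itertools import combinations

instances = [
    {"tea", "latte", "muffin"},
    {"tea", "latte"},
    {"tea", "muffin"},
    {"latte", "muffin"},
    {"tea", "latte", "muffin"},
    {"espresso"},
]
items = sorted(set().union(*instances))
N, minsup, minconf = len(instances), 0.3, 0.6

def support(Z):
    return sum(1 for i in instances if Z <= i) / N

# Step 1: all frequent item-sets Z with support >= minsup and |Z| >= 2.
frequent = [set(Z) for r in range(2, len(items) + 1)
            for Z in combinations(items, r) if support(set(Z)) >= minsup]

# Steps 2 and 3: split each frequent item-set into antecedent X and consequent Y
# and keep the rule if its confidence is high enough.
for Z in frequent:
    for r in range(1, len(Z)):
        for X in map(set, combinations(sorted(Z), r)):
            Y = Z - X
            conf = support(Z) / support(X)
            if conf >= minconf:
                print(f"{X} => {Y}  (support={support(Z):.2f}, confidence={conf:.2f})")
```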

This simple algorithm has two problems. First of all, there is a computational problem related to the first step. If there are m variables, then there are 2^m − m − 1 possible item-sets. Hence, for 100 products (m = 100) there are 1,267,650,600,228,229,401,496,703,205,275 candidate frequent item-sets. The second problem is that many uninteresting rules are generated. For example, after presenting the rule tea ∧ latte⇒muffin, there is no point in also showing tea⇒latte ∧ muffin, even when it meets the minsup and minconf thresholds. Many techniques have been developed to speed up the generation of association rules and to select the most interesting rules. Here we only sketch the seminal Apriori algorithm.

Apriori: Efficiently generating frequent item-sets

The Apriori algorithm is one of the best known algorithms in computer science. The algorithm, initially developed by Agrawal and Srikant, is able to speed up the generation of association rules by exploiting the following two observations:

    1. If an item-set is frequent (i.e., an item-set with a support above the threshold), then all of its non-empty subsets are also frequent. Formally, for any pair of non-empty item-sets X, Y: if Y ⊆ X and N_{X}/N ≥ minsup, then N_{Y}/N ≥ minsup.
    2. If, for any k, I_k is the set of all frequent item-sets with cardinality k and I_l = ∅ for some l, then I_k = ∅ for all k ≥ l.
       I did not fully understand this point, so I looked up a description in Chinese:

https://www.cnblogs.com/junyuhuang/p/5572364.html
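
My reading of the two observations, as a level-wise sketch: candidates of size k are only built from frequent item-sets of size k−1 (observation 1), and the search stops as soon as a level is empty (observation 2). The transactions are again hypothetical.

```python
from itertools import combinations

instances = [
    {"tea", "latte", "muffin"},
    {"tea", "latte"},
    {"tea", "muffin"},
    {"latte", "muffin"},
    {"tea", "latte", "muffin"},
    {"espresso"},
]
N, minsup = len(instances), 0.3

def support(Z):
    return sum(1 for i in instances if Z <= i) / N

items = sorted(set().union(*instances))
# I_1: frequent item-sets of cardinality 1.
level = [frozenset([i]) for i in items if support(frozenset([i])) >= minsup]
frequent = list(level)

k = 2
while level:  # observation 2: once a level is empty, all larger levels are empty too
    prev = set(level)
    # Candidates of size k whose (k-1)-subsets are all frequent (observation 1).
    candidates = {a | b for a in level for b in level if len(a | b) == k}
    candidates = {c for c in candidates
                  if all(frozenset(s) in prev for s in combinations(c, k - 1))}
    level = [c for c in candidates if support(c) >= minsup]
    frequent.extend(level)
    k += 1

print(frequent)
```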

Association rules are related to process discovery. Recall that the α-algorithm also traverses the event log looking for patterns. However, association rules do not consider the ordering of activities and do not aim to build an overall process model.


4.5 Sequence and Episode Mining

The Apriori algorithm uses the monotonicity property that all subsets of a frequent item-set are also frequent. Many other pattern or rule discovery problems have similar monotonicity properties, thus enabling efficient implementations. A well-known example is the mining of sequential patterns.

4.5.1 Sequence Mining

The Apriori algorithm does not consider the ordering of events. Sequence mining overcomes this problem by analyzing sequences of item-sets.
A sequence is frequent if the pattern is contained in a predefined proportion of the customer sequences in the data set.


Definition of a subsequence
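
Assuming the usual definition from sequential pattern mining, ⟨a_1, ..., a_n⟩ is a subsequence of ⟨b_1, ..., b_m⟩ if each item-set a_j is contained in some item-set b_{i_j} with i_1 < i_2 < ... < i_n. A small sketch checking this by greedy left-to-right matching (my own illustration):

```python
def is_subsequence(pattern, sequence):
    """Check whether a sequence of item-sets is a subsequence of another:
    every item-set of `pattern` must be contained in a distinct item-set of
    `sequence`, in the same order (greedy left-to-right matching)."""
    pos = 0
    for a in pattern:
        while pos < len(sequence) and not a <= sequence[pos]:
            pos += 1
        if pos == len(sequence):
            return False
        pos += 1  # the next pattern item-set must match strictly later
    return True

# Hypothetical customer sequence: item-sets bought on consecutive visits.
customer = [{"tea"}, {"latte", "muffin"}, {"tea", "muffin"}]
print(is_subsequence([{"tea"}, {"muffin"}], customer))           # True
print(is_subsequence([{"muffin"}, {"tea", "latte"}], customer))  # False
```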

4.5.2 Episode Mining

Episode mining and sequence mining can be seen as variants of association rule learning. Because they take into account the ordering of events, they are related to process discovery. However, there are many differences with process mining algorithms.

4.5.3 Other Approaches

  • neural networks
  • hidden Markov models

4.6 Quality of Resulting Models

4.6.1 Measuring the Performance of a Classifier

As in data mining, it is non-trivial to analyze the quality of process mining results.
A basic tool for measuring the performance of a classifier is the confusion matrix.


Confusion matrix for two classes and some performance measures for classifiers
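
A sketch computing some typical performance measures from the four cells of a two-class confusion matrix; the counts (tp, fn, fp, tn) are made up.

```python
# Hypothetical confusion matrix for two classes (positive / negative).
tp, fn = 80, 20   # actual positives: predicted positive / predicted negative
fp, tn = 10, 90   # actual negatives: predicted positive / predicted negative

N = tp + fn + fp + tn
error = (fp + fn) / N                   # fraction of wrongly classified instances
accuracy = (tp + tn) / N
precision = tp / (tp + fp)              # how reliable a positive prediction is
recall = tp / (tp + fn)                 # fraction of positives that are found
f1 = 2 * precision * recall / (precision + recall)

print(f"error={error:.2f} accuracy={accuracy:.2f} "
      f"precision={precision:.2f} recall={recall:.2f} F1={f1:.2f}")
```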

4.6.2 Cross-Validation

Cross-validation using a test and training set

It is important to realize that cross-validation is not limited to classification but can be used for any data mining technique.
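
A minimal sketch of k-fold cross-validation, assuming scikit-learn is installed; the classifier (a decision tree) and the iris data set are just placeholders, and any other technique could be plugged in.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Split the data into k folds; each fold serves once as the test set while
# the remaining folds form the training set.
scores = cross_val_score(DecisionTreeClassifier(), X, y,
                         cv=KFold(n_splits=10, shuffle=True, random_state=0))
print(scores.mean(), scores.std())
```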

4.6.3 Occam’s Razor

I am setting this aside for now and will come back to it later.
