说明:该知识大纲是根据电子科技大学计算机学院研究生学位课《Data Mining》的授课内容整理而成。该课程由邵俊明老师进行讲授,且是英文授课。
Chapter 1
Definition of data mining
Data mining consists of applying data analysis and discovery algorithms that, under acceptable computational efficiency limitations, produce a particular enumeration of patterns over the data
Key factors:
Data storage
Data availability
Computation power
Application:
Target marketing: Find clusters of “model” customers who share the same characteristics: interest, income level, spending habits, etc.
Cross-market analysis: Find associations/co-relations between product sales, & predict based on such association.
Resource planning: summarize and compare the resources and spending
Fraud detection
Task:
Association rule mining
Cluster analysis
Classification
Outlier detection
Direction:
Volume (Scale of Data)
Velocity (Data Stream)
Variety (Different format of data, difference sources)
Veracity (Uncertainty, missing value)
Chapter 2
Nearest Neighbor
Predict class label of test instance with major vote strategy
SVM Kernel
Ensemble learning
bagging: random forest
boosting: adaboost
stacking
Chapter 3
Why do we need Hashing?
Challenge in big data applications:
Curse of dimensionality
Storage cost
Query speed
Examples:
Information retrieval
Storage cost
Fast nearest neighbor search
Three steps for similar documents:
shingling
Min hashing
Locality-sensitive hashing
Min-hashing
Compute signatures of columns = small summaries of columns.
Examine pairs of signatures to find similar signatures.
(Optional) check that columns with similar signatures are really similar.
Use several (e.g., 100) independent hash functions to create a signature.
Locality-sensitive hashing
General idea: Use a function f(x,y) that tells whether or not x and y is a candidate pair : a pair of elements whose similarity must be evaluated.
For minhash matrices: Hash columns to many buckets, and make elements of the same bucket candidate pairs.
LSH for min-hash signatures
Matrix M is the matrix of signatures.
For each band, hash its portion of each column to a hash table with k buckets.
Tradeoff
Pick the number of minhashes, the number of bands, and the number of rows per band to balance false positives/negatives.
Learn to hash
PCA hashing: The basic idea is rotating the data to minimize quantization loss.
Spectral hashing
Chapter 4
Definition of sampling
Giving a p(x), we want to draw some samples to represent p(x).
Inverse transform sampling
Drawbacks: Usually, it’s hard to get the inverse function
Rejection sampling
Adaptive reject sampling: only if p(x) is log-concave
Importance sampling
Markov chain Monte Carlo(MCMC)
Detailed balance condition: π(i)Pij = π(j)Pij
Acceptance ratio
Drawbacks: acceptance ratio is too small
Metropolis–Hastings (MH) Sampling
Based on MCMC rewriting the acceptance ratio
But acceptance ratio still isn’t 100%
Gibbs sampling (based on MCMC)
Idea: Gibbs sampling further make acceptance ratio being 100%
other features of Gibbs:
- Do not need p(x)
Sampling on data stream
Bernoulli Sampling
-
Reservoir Sampling: not need to know stream size;
Chapter 5
What is data stream?
A data stream is a massive sequence of data objects which have some unique features: One by One; Potentially Unbounded; Concept Drift
Challenges:
Single Pass Handling
Memory Limitation
Low Time Complexity
Concept Drift
What is concept drift?
The probability distribution changes.
Concept drift detection
Distribution-based detector
Error-rate based detector
Data stream classification
Typical algorithm
VFDT (very fast decision tree): A decision-tree learning system based on the Hoeffding tree algorithm
CVFDT (Concept-adapting Very Fast Decision Tree learner)
VFDT
CVFDT
Data stream clustering
Online phase: Summarize the data into memory-efficient data structures
Offline phase: Use a clustering algorithm to find the data partition (k-means, decision tree)
Framework
Chapter 6
Key node identification
PageRank
Community detection (graph clustering)
-
Minimum cut: find a graph partition such that the number of edges between the two sets is minimized.
But minimum cut always return an imbalanced partition.
-
Normalized cut & ratio cut
,prefer a balanced partition.
-
modularity
Random walk
Multi-level clustering
-
Dynamic community detection: a new viewpoint for community detection, the basic idea is Simulate the change of edge distances.
View network as dynamical system (Dynamic vs. Static)
Simulate the distance dynamics based on different interaction patterns (Distance dynamics vs. Node dynamics)
All edge distances will converge, and the community structure is intuitively identified.
Chapter 7
What is hadoop?
Hadoop is a software framework for distributed processing of large datasets across large clusters of computers.
Hadoop is open-source implementation for Google MapReduce
Hadoop is based on a simple programming model called MapReduce
Hadoop is based on a simple data model, any data will fit
Core
Filesystems and I/O:
- Abstraction APIs
- RPC / Persistence
Avro
Cross-language serialization:
- RPC / persistence
- ~ Google ProtoBuf / FB Thrift
MapReduce
Distributed execution (batch)
- Programming model
- Scalability / fault-tolerance
HDFS
Distributed storage (read-opt.)
- Replication / scalability
- ~ Google filesystem (GFS)
Zoo keeper
Coordination service
- Locking / configuration
- ~ Google Chubby
HBase
Column-oriented, sparse store
- Batch & random access
- ~ Google BigTable
Pig
Data flow language
- Procedural SQL-inspired lang.
- Execution environment
Hive
Distributed data warehouse
- SQL-like query language
- Data mgmt / query execution
Limitation of MapReduce
Inefficient for multi-pass algorithm
-
No efficient primitives for data sharing
State between steps goes to distributed file system
Slow due to replication & disk storage
Spark
Apache Spark is a fast and general-purpose cluster computing system. It also supports a rich set of higher-level tools including Spark SQL for SQL and structured data processing, MLlib for machine learning, GraphX for graph processing, and Spark Streaming for streaming processing.
Row-key is unique for a row