Data Mining Knowledge Outline

Note: this outline is compiled from the lectures of the graduate course Data Mining at the School of Computer Science and Engineering, University of Electronic Science and Technology of China. The course is taught in English by Prof. Junming Shao (邵俊明).

Chapter 1

Definition of data mining

Data mining consists of applying data analysis and discovery algorithms that, under acceptable computational efficiency limitations, produce a particular enumeration of patterns over the data.

image.png

Key factors:

  • Data storage

  • Data availability

  • Computation power

Application:

  • Target marketing: Find clusters of “model” customers who share the same characteristics: interest, income level, spending habits, etc.

  • Cross-market analysis: Find associations/correlations between product sales, and predict based on such associations.

  • Resource planning: summarize and compare the resources and spending

  • Fraud detection

Task:

  • Association rule mining

  • Cluster analysis

  • Classification

  • Outlier detection

Direction:

  • Volume (Scale of Data)

  • Velocity (Data Stream)

  • Variety (Different formats of data, different sources)

  • Veracity (Uncertainty, missing value)

Chapter 2

Nearest Neighbor

image.png

Predict the class label of a test instance with a majority-vote strategy
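The majority-vote rule can be sketched as follows (a minimal example; the toy 2-D data, labels, and k = 3 are invented for illustration):

```python
import math
from collections import Counter

def knn_predict(train, test_point, k=3):
    """Predict the label of test_point by majority vote among its k nearest neighbors."""
    # Sort training instances by Euclidean distance to the test point
    neighbors = sorted(train, key=lambda xy: math.dist(xy[0], test_point))[:k]
    # Majority vote over the neighbors' labels
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

# Toy 2-D dataset: (features, label)
train = [((1, 1), "A"), ((1, 2), "A"), ((2, 1), "A"),
         ((8, 8), "B"), ((8, 9), "B"), ((9, 8), "B")]
print(knn_predict(train, (2, 2), k=3))  # → A
```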

SVM Kernel

image.png

Ensemble learning

  1. bagging: random forest

  2. boosting: adaboost

  3. stacking

image.png
image.png
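A minimal sketch of the bagging idea, assuming a trivial threshold base learner and invented toy data (a real random forest uses decision trees plus feature subsampling):

```python
import random
from statistics import mean

def train_base(sample):
    """A trivial base learner: threshold halfway between the two class means."""
    m0 = mean(x for x, y in sample if y == 0)
    m1 = mean(x for x, y in sample if y == 1)
    thr = (m0 + m1) / 2
    return lambda x: 1 if x > thr else 0

def bagging_fit(data, n_estimators=15, seed=0):
    """Bagging: train each base learner on a (class-balanced) bootstrap
    sample, then predict by majority vote over all learners."""
    rng = random.Random(seed)
    class0 = [d for d in data if d[1] == 0]
    class1 = [d for d in data if d[1] == 1]
    models = []
    for _ in range(n_estimators):
        # Bootstrap: sample with replacement, here stratified per class
        sample = ([rng.choice(class0) for _ in class0] +
                  [rng.choice(class1) for _ in class1])
        models.append(train_base(sample))
    def predict(x):
        votes = [m(x) for m in models]
        return 1 if sum(votes) > len(votes) / 2 else 0
    return predict

data = [(0.5, 0), (1.0, 0), (1.5, 0), (3.5, 1), (4.0, 1), (4.5, 1)]
model = bagging_fit(data)
print(model(0.2), model(5.0))  # → 0 1
```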

Chapter 3

Why do we need Hashing?

Challenge in big data applications:

  • Curse of dimensionality

  • Storage cost

  • Query speed

Examples:

  • Information retrieval

  • Storage cost

  • Fast nearest neighbor search

Three steps for similar documents:

  • shingling

  • Min hashing

  • Locality-sensitive hashing

image.png

Min-hashing

  1. Compute signatures of columns = small summaries of columns.

  2. Examine pairs of signatures to find similar signatures.

  3. (Optional) check that columns with similar signatures are really similar.

image.png

Use several (e.g., 100) independent hash functions to create a signature.
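A sketch of signature computation, assuming universal hash functions of the form (a·x + b) mod P over integer shingle IDs; the fraction of positions on which two signatures agree estimates the Jaccard similarity of the underlying sets:

```python
import random

P = 2_000_003  # a prime larger than the shingle-ID universe

def make_hashes(n, seed=42):
    """n random universal hash functions h(x) = (a*x + b) % P."""
    rng = random.Random(seed)
    return [(rng.randrange(1, P), rng.randrange(P)) for _ in range(n)]

def minhash_signature(shingles, hashes):
    """Signature: for every hash function, the minimum hash value over the set."""
    return [min((a * x + b) % P for x in shingles) for a, b in hashes]

def estimated_jaccard(sig1, sig2):
    """The fraction of agreeing signature positions estimates Jaccard similarity."""
    return sum(u == v for u, v in zip(sig1, sig2)) / len(sig1)

hashes = make_hashes(100)
A = set(range(0, 100))    # toy shingle-ID sets for two "documents"
B = set(range(20, 120))   # |A ∩ B| / |A ∪ B| = 80 / 120 ≈ 0.667
sigA, sigB = minhash_signature(A, hashes), minhash_signature(B, hashes)
print(estimated_jaccard(sigA, sigB))  # close to the true Jaccard similarity
```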

Locality-sensitive hashing

  • General idea: Use a function f(x, y) that tells whether x and y are a candidate pair: a pair of elements whose similarity must be evaluated.

  • For minhash matrices: Hash columns to many buckets, and make elements of the same bucket candidate pairs.

LSH for min-hash signatures

Matrix M is the matrix of signatures.

image.png

For each band, hash its portion of each column to a hash table with k buckets.

image.png

Tradeoff

Pick the number of minhashes, the number of bands, and the number of rows per band to balance false positives/negatives.
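With b bands of r rows each, a pair with signature similarity s becomes a candidate in at least one band with probability 1 − (1 − s^r)^b; tuning b and r moves this S-curve and therefore the false positive/negative tradeoff:

```python
def candidate_probability(s, bands, rows):
    """Probability that a pair with signature similarity s lands in the same
    bucket in at least one band: 1 - (1 - s**r)**b."""
    return 1 - (1 - s ** rows) ** bands

# Example: 100 minhashes split into b = 20 bands of r = 5 rows each
for s in (0.2, 0.4, 0.6, 0.8):
    print(f"s={s}: P(candidate) = {candidate_probability(s, 20, 5):.3f}")
```

Dissimilar pairs (s = 0.2) are almost never candidates, while very similar pairs (s = 0.8) almost always are.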

Learn to hash

  • PCA hashing: The basic idea is rotating the data to minimize quantization loss.

  • Spectral hashing

image.png

Chapter 4

Definition of sampling

Given a distribution p(x), we want to draw samples that represent p(x).

Inverse transform sampling

image.png

Drawback: it is usually hard to obtain the inverse function.
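For instance, the exponential distribution has a closed-form inverse CDF, F⁻¹(u) = −ln(1 − u)/λ, so inverse transform sampling is straightforward there:

```python
import math
import random

def sample_exponential(lam, rng):
    """Inverse transform sampling: draw u ~ Uniform(0,1), return F^{-1}(u)
    for the exponential CDF F(x) = 1 - exp(-lam * x)."""
    u = rng.random()
    return -math.log(1 - u) / lam

rng = random.Random(0)
samples = [sample_exponential(2.0, rng) for _ in range(100_000)]
print(sum(samples) / len(samples))  # should be close to 1/lam = 0.5
```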

Rejection sampling

image.png

Adaptive rejection sampling: applicable only if p(x) is log-concave
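A minimal rejection-sampling sketch, assuming a Beta(2,2) target and a uniform proposal with envelope constant M ≥ max p(x) (both choices are invented for illustration):

```python
import random

def rejection_sample(p, q_sample, q_pdf, M, rng):
    """Rejection sampling: propose x ~ q, accept with probability p(x)/(M*q(x))."""
    while True:
        x = q_sample(rng)
        if rng.random() < p(x) / (M * q_pdf(x)):
            return x

# Target: Beta(2,2) density p(x) = 6x(1-x) on [0,1]; proposal: Uniform(0,1).
# The envelope must satisfy p(x) <= M * q(x); here max p(x) = 1.5, so M = 1.5.
p = lambda x: 6 * x * (1 - x)
rng = random.Random(1)
samples = [rejection_sample(p, lambda r: r.random(), lambda x: 1.0, 1.5, rng)
           for _ in range(50_000)]
print(sum(samples) / len(samples))  # mean of Beta(2,2) is 0.5
```

A loose envelope (large M) wastes proposals, which is exactly the drawback rejection sampling suffers from in high dimensions.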

Importance sampling

image.png
image.png
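A sketch of importance sampling: E_p[f(X)] is rewritten as E_q[f(X)·p(X)/q(X)] and estimated with samples from the proposal q (the normal target and uniform proposal are invented for illustration):

```python
import math
import random

def normal_pdf(x, mu=0.0, sigma=1.0):
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

def importance_estimate(f, p_pdf, q_sample, q_pdf, n, rng):
    """Estimate E_p[f(X)] = E_q[f(X) * p(X)/q(X)] using n samples from q."""
    total = 0.0
    for _ in range(n):
        x = q_sample(rng)
        total += f(x) * p_pdf(x) / q_pdf(x)  # weight each sample by p(x)/q(x)
    return total / n

rng = random.Random(0)
# Estimate E[X^2] under N(0,1) (true value 1.0), sampling from Uniform(-5, 5)
est = importance_estimate(lambda x: x * x, normal_pdf,
                          lambda r: r.uniform(-5, 5), lambda x: 0.1,
                          100_000, rng)
print(est)  # should be close to 1.0
```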

Markov chain Monte Carlo(MCMC)

image.png

Detailed balance condition: π(i)Pij = π(j)Pji

Acceptance ratio

image.png

Drawback: the acceptance ratio can be too small

Metropolis–Hastings (MH) Sampling

Based on MCMC; MH rewrites the acceptance ratio

image.png

But the acceptance ratio still isn't 100%
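A minimal random-walk Metropolis sketch (the standard-normal target, step size, and chain length are chosen for illustration; with a symmetric proposal the Hastings correction term cancels):

```python
import math
import random

def metropolis_hastings(log_p, x0, n, step, rng):
    """Random-walk Metropolis: propose x' = x + N(0, step^2) and accept with
    probability min(1, p(x')/p(x)), computed in log space for stability."""
    x, samples = x0, []
    for _ in range(n):
        x_new = x + rng.gauss(0, step)
        if math.log(rng.random()) < log_p(x_new) - log_p(x):
            x = x_new          # accept the move
        samples.append(x)      # on rejection, the old state is repeated
    return samples

rng = random.Random(0)
# Target: standard normal, via its unnormalized log-density -x^2/2
chain = metropolis_hastings(lambda x: -x * x / 2, 0.0, 100_000, 1.0, rng)
burned = chain[1000:]          # discard burn-in
print(sum(burned) / len(burned))  # should be near 0
```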

Gibbs sampling (based on MCMC)

Idea: Gibbs sampling further raises the acceptance ratio to 100%

image.png

Other features of Gibbs:

  • Does not need p(x) directly (only its full conditional distributions)
image.png
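A sketch of Gibbs sampling for a bivariate standard normal with correlation ρ (the target is invented for illustration; each coordinate is drawn from its full conditional, so every move is accepted):

```python
import math
import random

def gibbs_bivariate_normal(rho, n, rng):
    """Gibbs sampling for a bivariate standard normal with correlation rho.
    Each step samples one coordinate from its full conditional:
        x | y ~ N(rho * y, 1 - rho^2),   y | x ~ N(rho * x, 1 - rho^2)
    and every such move is accepted (acceptance ratio 100%)."""
    sd = math.sqrt(1 - rho ** 2)
    x = y = 0.0
    samples = []
    for _ in range(n):
        x = rng.gauss(rho * y, sd)
        y = rng.gauss(rho * x, sd)
        samples.append((x, y))
    return samples

rng = random.Random(0)
samples = gibbs_bivariate_normal(0.8, 100_000, rng)
xs = [x for x, _ in samples]
ys = [y for _, y in samples]
cov = (sum(x * y for x, y in samples) / len(samples)
       - (sum(xs) / len(xs)) * (sum(ys) / len(ys)))
print(cov)  # empirical covariance, ≈ rho for standard normal marginals
```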

Sampling on data stream

  • Bernoulli Sampling

  • Reservoir Sampling: does not need to know the stream size
    image.png
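A sketch of the classic reservoir algorithm: keep the first k items, then let item i (0-based) replace a random reservoir slot with probability k/(i+1), giving a uniform sample without knowing the stream size:

```python
import random

def reservoir_sample(stream, k, rng):
    """Maintain a uniform random sample of k items from a stream of unknown length."""
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)        # fill the reservoir first
        else:
            j = rng.randrange(i + 1)      # uniform index in [0, i]
            if j < k:
                reservoir[j] = item       # replace with probability k/(i+1)
    return reservoir

rng = random.Random(0)
print(reservoir_sample(range(1_000_000), 5, rng))  # 5 uniformly chosen items
```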

Chapter 5

What is data stream?

A data stream is a massive sequence of data objects with some unique features: it arrives one object at a time, is potentially unbounded, and is subject to concept drift.

image.png

Challenges:

  • Single Pass Handling

  • Memory Limitation

  • Low Time Complexity

  • Concept Drift

What is concept drift?

The underlying probability distribution of the data changes over time.

image.png
image.png

Concept drift detection

  • Distribution-based detector

  • Error-rate based detector

image.png
image.png
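The error-rate based approach can be sketched in the spirit of DDM: track the running error rate p and its standard deviation s, and compare p + s against the best p_min + s_min seen so far (the 2σ warning / 3σ drift thresholds are the standard choice; the toy error stream is invented):

```python
import math

class DDMDetector:
    """Error-rate based drift detector (DDM-style sketch). Tracks the running
    error rate p and s = sqrt(p(1-p)/n); signals a warning when
    p + s > p_min + 2*s_min and drift when p + s > p_min + 3*s_min."""
    def __init__(self):
        self.n = 0
        self.errors = 0
        self.p_min = float("inf")
        self.s_min = float("inf")

    def update(self, error):
        """Feed one prediction outcome (1 = misclassified); return a status."""
        self.n += 1
        self.errors += error
        p = self.errors / self.n
        s = math.sqrt(p * (1 - p) / self.n)
        if self.n >= 30:                        # wait for a minimal sample size
            if p + s < self.p_min + self.s_min:
                self.p_min, self.s_min = p, s   # remember the best point so far
            if p + s > self.p_min + 3 * self.s_min:
                return "drift"
            if p + s > self.p_min + 2 * self.s_min:
                return "warning"
        return "stable"

# Toy stream: low error rate, then the concept changes and errors jump
det = DDMDetector()
stream = [0, 0, 0, 1] * 50 + [1, 1, 1, 0] * 50   # 25% error, then 75% error
statuses = [det.update(e) for e in stream]
print("drift detected:", "drift" in statuses)
```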

Data stream classification

image.png

Typical algorithm

  • VFDT (very fast decision tree): A decision-tree learning system based on the Hoeffding tree algorithm

  • CVFDT (Concept-adapting Very Fast Decision Tree learner)

VFDT

image.png
image.png
image.png
image.png
image.png
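The Hoeffding tree algorithm decides when to split using the Hoeffding bound ε = sqrt(R² ln(1/δ) / (2n)): split once the observed gain gap between the best and second-best attribute exceeds ε. A minimal sketch (the gain values in the example are invented):

```python
import math

def hoeffding_bound(value_range, delta, n):
    """With probability 1 - delta, the true mean lies within epsilon of the
    observed mean over n samples: epsilon = sqrt(R^2 * ln(1/delta) / (2n))."""
    return math.sqrt(value_range ** 2 * math.log(1 / delta) / (2 * n))

def can_split(best_gain, second_gain, value_range=1.0, delta=1e-7, n=1000):
    """Split when the gain gap exceeds the Hoeffding bound, i.e. the best
    attribute beats the runner-up with confidence 1 - delta."""
    return best_gain - second_gain > hoeffding_bound(value_range, delta, n)

# Information gain is bounded by R = log2(num_classes); 2 classes -> R = 1
print(round(hoeffding_bound(1.0, 1e-7, 1000), 4))  # → 0.0898
print(can_split(0.30, 0.18, n=1000))               # → True
```

As n grows, ε shrinks, so VFDT needs only enough examples to separate the top two attributes, which is what makes it a single-pass learner.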

CVFDT

image.png
image.png
image.png
image.png

Data stream clustering

  • Online phase: Summarize the data into memory-efficient data structures

  • Offline phase: Use a clustering algorithm to find the data partition (k-means, decision tree)
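The online phase is often built on cluster-feature summaries as in BIRCH/CluStream; a sketch of such a structure (N, LS, SS), assuming numeric features:

```python
import math

class MicroCluster:
    """Cluster feature (CF) summary for the online phase of stream clustering:
    stores N (count), LS (linear sum), SS (squared sum) per dimension, so points
    are absorbed in O(d) without keeping the raw stream in memory."""
    def __init__(self, dim):
        self.n = 0
        self.ls = [0.0] * dim
        self.ss = [0.0] * dim

    def absorb(self, point):
        self.n += 1
        for i, x in enumerate(point):
            self.ls[i] += x
            self.ss[i] += x * x

    def centroid(self):
        return [s / self.n for s in self.ls]

    def radius(self):
        """RMS deviation of points from the centroid, derived from CF alone."""
        var = sum(ss / self.n - (ls / self.n) ** 2
                  for ls, ss in zip(self.ls, self.ss))
        return math.sqrt(max(var, 0.0))

mc = MicroCluster(2)
for p in [(1.0, 2.0), (2.0, 2.0), (3.0, 2.0)]:
    mc.absorb(p)
print(mc.centroid(), round(mc.radius(), 3))  # → [2.0, 2.0] 0.816
```

The offline phase can then run k-means over the micro-cluster centroids instead of the raw points.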

Framework

image.png
image.png
image.png
image.png

Chapter 6

Key node identification

image.png
image.png

PageRank

image.png
image.png
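A power-iteration sketch of PageRank with damping factor d (the toy graph is invented; with no dangling nodes the ranks sum to 1):

```python
def pagerank(links, d=0.85, iterations=100):
    """Power iteration: PR(i) = (1-d)/N + d * sum over pages j linking to i
    of PR(j)/outdegree(j)."""
    nodes = list(links)
    n = len(nodes)
    pr = {v: 1 / n for v in nodes}           # start from the uniform vector
    for _ in range(iterations):
        new = {v: (1 - d) / n for v in nodes}
        for v, outs in links.items():
            share = pr[v] / len(outs)        # v spreads its rank over out-links
            for u in outs:
                new[u] += d * share
        pr = new
    return pr

# Toy web graph: adjacency list, node -> outgoing links
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
ranks = pagerank(links)
print({v: round(r, 3) for v, r in sorted(ranks.items())})
```

C collects links from A, B, and D and therefore ends up with the highest rank.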

Community detection (graph clustering)

  • Minimum cut: find a graph partition such that the number of edges between the two sets is minimized.
    image.png

    But minimum cut tends to return an imbalanced partition.

  • Normalized cut & ratio cut
    image.png

    Both prefer a balanced partition.

  • Modularity
    image.png
  • Random walk

  • Multi-level clustering

  • Dynamic community detection: a new viewpoint for community detection; the basic idea is to simulate the change of edge distances.

    • View network as dynamical system (Dynamic vs. Static)

    • Simulate the distance dynamics based on different interaction patterns (Distance dynamics vs. Node dynamics)

    • All edge distances will converge, and the community structure is intuitively identified.
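The modularity criterion listed above can be computed directly from its definition, Q = (1/2m) Σ_ij [A_ij − k_i k_j / 2m] δ(c_i, c_j); a sketch for an undirected, unweighted edge list (the toy graph of two triangles joined by one edge is invented):

```python
def modularity(edges, communities):
    """Newman modularity for an undirected, unweighted graph given as an
    edge list; communities maps node -> community label."""
    m = len(edges)
    degree = {}
    for u, v in edges:
        degree[u] = degree.get(u, 0) + 1
        degree[v] = degree.get(v, 0) + 1
    # Sum of A_ij over same-community ordered pairs = 2 * (intra edges),
    # so divided by 2m it is simply (intra edges) / m
    q = sum(1.0 for u, v in edges if communities[u] == communities[v]) / m
    # Subtract the expected intra-community fraction k_i*k_j / (2m)^2
    for u in degree:
        for v in degree:
            if communities[u] == communities[v]:
                q -= degree[u] * degree[v] / (4 * m * m)
    return q

edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
two = {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b"}
print(round(modularity(edges, two), 4))  # → 0.3571
```

Putting every node in one community gives Q = 0, as expected.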

Chapter 7

What is hadoop?

  • Hadoop is a software framework for distributed processing of large datasets across large clusters of computers.

  • Hadoop is an open-source implementation of Google MapReduce

  • Hadoop is based on a simple programming model called MapReduce

  • Hadoop is based on a simple data model: any data will fit

image.png
image.png
image.png
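The MapReduce programming model can be illustrated with the classic word count, here simulated in plain Python (this mimics the map → shuffle → reduce flow only; it is not the actual Hadoop API):

```python
from collections import defaultdict
from itertools import chain

def map_phase(document):
    """Mapper: emit (word, 1) for every word in the input split."""
    return [(word, 1) for word in document.split()]

def shuffle(mapped):
    """Shuffle/sort: group all emitted values by key, as the framework does
    between the map and reduce phases."""
    groups = defaultdict(list)
    for key, value in mapped:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reducer: sum the counts for one word."""
    return key, sum(values)

documents = ["big data mining", "data stream mining", "big data"]
mapped = list(chain.from_iterable(map_phase(d) for d in documents))
counts = dict(reduce_phase(k, vs) for k, vs in shuffle(mapped).items())
print(counts)  # → {'big': 2, 'data': 3, 'mining': 2, 'stream': 1}
```

Each phase only sees local data plus the grouped intermediate pairs, which is why state shared between MapReduce jobs must go through the distributed file system.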

Core
Filesystems and I/O:

  • Abstraction APIs
  • RPC / Persistence

Avro
Cross-language serialization:

  • RPC / persistence
  • ~ Google ProtoBuf / FB Thrift

MapReduce
Distributed execution (batch)

  • Programming model
  • Scalability / fault-tolerance

HDFS
Distributed storage (read-opt.)

  • Replication / scalability
  • ~ Google filesystem (GFS)

ZooKeeper
Coordination service

  • Locking / configuration
  • ~ Google Chubby

HBase
Column-oriented, sparse store

  • Batch & random access
  • ~ Google BigTable

Pig
Data flow language

  • Procedural SQL-inspired lang.
  • Execution environment

Hive
Distributed data warehouse

  • SQL-like query language
  • Data mgmt / query execution

image.png
image.png
image.png
image.png
image.png

Limitation of MapReduce

  • Inefficient for multi-pass algorithm

  • No efficient primitives for data sharing

    • State between steps goes to distributed file system

    • Slow due to replication & disk storage

Spark

Apache Spark is a fast and general-purpose cluster computing system. It also supports a rich set of higher-level tools, including Spark SQL for SQL and structured data processing, MLlib for machine learning, GraphX for graph processing, and Spark Streaming for stream processing.

image.png
image.png
image.png
image.png
image.png
image.png
image.png

The row key is unique for each row

image.png