data mining 知识大纲

说明:该知识大纲是根据电子科技大学计算机学院研究生学位课《Data Mining》的授课内容整理而成。该课程由邵俊明老师进行讲授,且是英文授课。

Chapter 1

Definition of data mining

Data mining consists of applying data analysis and discovery algorithms that, under acceptable computational efficiency limitations, produce a particular enumeration of patterns over the data

image.png

Key factors:

  • Data storage

  • Data availability

  • Computation power

Application:

  • Target marketing: Find clusters of “model” customers who share the same characteristics: interest, income level, spending habits, etc.

  • Cross-market analysis: Find associations/co-relations between product sales, & predict based on such association.

  • Resource planning: summarize and compare the resources and spending

  • Fraud detection

Task:

  • Association rule mining

  • Cluster analysis

  • Classification

  • Outlier detection

Direction:

  • Volume (Scale of Data)

  • Velocity (Data Stream)

  • Variety (Different format of data, difference sources)

  • Veracity (Uncertainty, missing value)

Chapter 2

Nearest Neighbor

image.png

Predict class label of test instance with major vote strategy

SVM Kernel

image.png

Ensemble learning

  1. bagging: random forest

  2. boosting: adaboost

  3. stacking

image.png
image.png

Chapter 3

Why do we need Hashing?

Challenge in big data applications:

  • Curse of dimensionality

  • Storage cost

  • Query speed

Examples:

  • Information retrieval

  • Storage cost

  • Fast nearest neighbor search

Three steps for similar documents:

  • shingling

  • Min hashing

  • Locality-sensitive hashing

image.png

Min-hashing

  1. Compute signatures of columns = small summaries of columns.

  2. Examine pairs of signatures to find similar signatures.

  3. (Optional) check that columns with similar signatures are really similar.

image.png

Use several (e.g., 100) independent hash functions to create a signature.

Locality-sensitive hashing

  • General idea: Use a function f(x,y) that tells whether or not x and y is a candidate pair : a pair of elements whose similarity must be evaluated.

  • For minhash matrices: Hash columns to many buckets, and make elements of the same bucket candidate pairs.

LSH for min-hash signatures

Matrix M is the matrix of signatures.

image.png

For each band, hash its portion of each column to a hash table with k buckets.

image.png

Tradeoff

Pick the number of minhashes, the number of bands, and the number of rows per band to balance false positives/negatives.

Learn to hash

  • PCA hashing: The basic idea is rotating the data to minimize quantization loss.

  • Spectral hashing

image.png

Chapter 4

Definition of sampling

Giving a p(x), we want to draw some samples to represent p(x).

Inverse transform sampling

image.png

Drawbacks: Usually, it’s hard to get the inverse function

Rejection sampling

image.png

Adaptive reject sampling: only if p(x) is log-concave

Importance sampling

image.png
image.png

Markov chain Monte Carlo(MCMC)

image.png

Detailed balance condition: π(i)Pij = π(j)Pij

Acceptance ratio

image.png

Drawbacks: acceptance ratio is too small

Metropolis–Hastings (MH) Sampling

Based on MCMC rewriting the acceptance ratio

image.png

But acceptance ratio still isn’t 100%

Gibbs sampling (based on MCMC)

Idea: Gibbs sampling further make acceptance ratio being 100%

image.png

other features of Gibbs:

  • Do not need p(x)
image.png

Sampling on data stream

  • Bernoulli Sampling

  • Reservoir Sampling: not need to know stream size;
    image.png

Chapter 5

What is data stream?

A data stream is a massive sequence of data objects which have some unique features: One by One; Potentially Unbounded; Concept Drift

image.png

Challenges:

  • Single Pass Handling

  • Memory Limitation

  • Low Time Complexity

  • Concept Drift

What is concept drift?

The probability distribution changes.

image.png
image.png

Concept drift detection

  • Distribution-based detector

  • Error-rate based detector

image.png
image.png

Data stream classification

image.png

Typical algorithm

  • VFDT (very fast decision tree): A decision-tree learning system based on the Hoeffding tree algorithm

  • CVFDT (Concept-adapting Very Fast Decision Tree learner)

VFDT

image.png
image.png
image.png
image.png
image.png

CVFDT

image.png
image.png
image.png
image.png

Data stream clustering

  • Online phase: Summarize the data into memory-efficient data structures

  • Offline phase: Use a clustering algorithm to find the data partition (k-means, decision tree)

Framework

image.png
image.png
image.png
image.png

Chapter 6

Key node identification

image.png
image.png

PageRank

image.png
image.png

Community detection (graph clustering)

  • Minimum cut: find a graph partition such that the number of edges between the two sets is minimized.
    image.png

    But minimum cut always return an imbalanced partition.

  • Normalized cut & ratio cut
    image.png

    ,prefer a balanced partition.

  • modularity
    image.png
  • Random walk

  • Multi-level clustering

  • Dynamic community detection: a new viewpoint for community detection, the basic idea is Simulate the change of edge distances.

    • View network as dynamical system (Dynamic vs. Static)

    • Simulate the distance dynamics based on different interaction patterns (Distance dynamics vs. Node dynamics)

    • All edge distances will converge, and the community structure is intuitively identified.

Chapter 7

What is hadoop?

  • Hadoop is a software framework for distributed processing of large datasets across large clusters of computers.

  • Hadoop is open-source implementation for Google MapReduce

  • Hadoop is based on a simple programming model called MapReduce

  • Hadoop is based on a simple data model, any data will fit

image.png
image.png
image.png

Core
Filesystems and I/O:

  • Abstraction APIs
  • RPC / Persistence

Avro
Cross-language serialization:

  • RPC / persistence
  • ~ Google ProtoBuf / FB Thrift

MapReduce
Distributed execution (batch)

  • Programming model
  • Scalability / fault-tolerance

HDFS
Distributed storage (read-opt.)

  • Replication / scalability
  • ~ Google filesystem (GFS)

Zoo keeper
Coordination service

  • Locking / configuration
  • ~ Google Chubby

HBase
Column-oriented, sparse store

  • Batch & random access
  • ~ Google BigTable

Pig
Data flow language

  • Procedural SQL-inspired lang.
  • Execution environment

Hive
Distributed data warehouse

  • SQL-like query language
  • Data mgmt / query execution

image.png
image.png
image.png
image.png
image.png

Limitation of MapReduce

  • Inefficient for multi-pass algorithm

  • No efficient primitives for data sharing

    • State between steps goes to distributed file system

    • Slow due to replication & disk storage

Spark

Apache Spark is a fast and general-purpose cluster computing system. It also supports a rich set of higher-level tools including Spark SQL for SQL and structured data processing, MLlib for machine learning, GraphX for graph processing, and Spark Streaming for streaming processing.

image.png
image.png
image.png
image.png
image.png
image.png
image.png

Row-key is unique for a row

image.png
©著作权归作者所有,转载或内容合作请联系作者
  • 序言:七十年代末,一起剥皮案震惊了整个滨河市,随后出现的几起案子,更是在滨河造成了极大的恐慌,老刑警刘岩,带你破解...
    沈念sama阅读 216,039评论 6 498
  • 序言:滨河连续发生了三起死亡事件,死亡现场离奇诡异,居然都是意外死亡,警方通过查阅死者的电脑和手机,发现死者居然都...
    沈念sama阅读 92,223评论 3 392
  • 文/潘晓璐 我一进店门,熙熙楼的掌柜王于贵愁眉苦脸地迎上来,“玉大人,你说我怎么就摊上这事。” “怎么了?”我有些...
    开封第一讲书人阅读 161,916评论 0 351
  • 文/不坏的土叔 我叫张陵,是天一观的道长。 经常有香客问我,道长,这世上最难降的妖魔是什么? 我笑而不...
    开封第一讲书人阅读 58,009评论 1 291
  • 正文 为了忘掉前任,我火速办了婚礼,结果婚礼上,老公的妹妹穿的比我还像新娘。我一直安慰自己,他们只是感情好,可当我...
    茶点故事阅读 67,030评论 6 388
  • 文/花漫 我一把揭开白布。 她就那样静静地躺着,像睡着了一般。 火红的嫁衣衬着肌肤如雪。 梳的纹丝不乱的头发上,一...
    开封第一讲书人阅读 51,011评论 1 295
  • 那天,我揣着相机与录音,去河边找鬼。 笑死,一个胖子当着我的面吹牛,可吹牛的内容都是我干的。 我是一名探鬼主播,决...
    沈念sama阅读 39,934评论 3 416
  • 文/苍兰香墨 我猛地睁开眼,长吁一口气:“原来是场噩梦啊……” “哼!你这毒妇竟也来了?” 一声冷哼从身侧响起,我...
    开封第一讲书人阅读 38,754评论 0 271
  • 序言:老挝万荣一对情侣失踪,失踪者是张志新(化名)和其女友刘颖,没想到半个月后,有当地人在树林里发现了一具尸体,经...
    沈念sama阅读 45,202评论 1 309
  • 正文 独居荒郊野岭守林人离奇死亡,尸身上长有42处带血的脓包…… 初始之章·张勋 以下内容为张勋视角 年9月15日...
    茶点故事阅读 37,433评论 2 331
  • 正文 我和宋清朗相恋三年,在试婚纱的时候发现自己被绿了。 大学时的朋友给我发了我未婚夫和他白月光在一起吃饭的照片。...
    茶点故事阅读 39,590评论 1 346
  • 序言:一个原本活蹦乱跳的男人离奇死亡,死状恐怖,灵堂内的尸体忽然破棺而出,到底是诈尸还是另有隐情,我是刑警宁泽,带...
    沈念sama阅读 35,321评论 5 342
  • 正文 年R本政府宣布,位于F岛的核电站,受9级特大地震影响,放射性物质发生泄漏。R本人自食恶果不足惜,却给世界环境...
    茶点故事阅读 40,917评论 3 325
  • 文/蒙蒙 一、第九天 我趴在偏房一处隐蔽的房顶上张望。 院中可真热闹,春花似锦、人声如沸。这庄子的主人今日做“春日...
    开封第一讲书人阅读 31,568评论 0 21
  • 文/苍兰香墨 我抬头看了看天上的太阳。三九已至,却和暖如春,着一层夹袄步出监牢的瞬间,已是汗流浃背。 一阵脚步声响...
    开封第一讲书人阅读 32,738评论 1 268
  • 我被黑心中介骗来泰国打工, 没想到刚下飞机就差点儿被人妖公主榨干…… 1. 我叫王不留,地道东北人。 一个月前我还...
    沈念sama阅读 47,583评论 2 368
  • 正文 我出身青楼,却偏偏与公主长得像,于是被迫代替她去往敌国和亲。 传闻我的和亲对象是个残疾皇子,可洞房花烛夜当晚...
    茶点故事阅读 44,482评论 2 352

推荐阅读更多精彩内容