Statistics

CH1 Data mining

Major data mining tasks

  1. Classification and regression

    • Classification predicts categorical attribute values;
    • regression predicts numerical attribute values
  2. Cluster analysis

Given a set of objects, each having a set of attributes, and a
similarity measure among them, find clusters (i.e., groups) such
that

  • objects in one cluster are more similar to one another
  • objects in separate clusters are less similar to one another

Unlike classification, clustering analyzes objects without
consulting a known class label.
  3. Association analysis

Given a transactional database, find the sets of objects that
frequently appear within the same transactions.
Also called frequent pattern mining.
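
As a rough illustration of "sets of objects that frequently appear within the same transactions", the sketch below counts how many transactions contain each itemset and keeps those that reach a minimum count. The function name `frequent_itemsets`, the parameters `min_support` and `max_size`, and the toy transactions are all assumptions made for this example. It enumerates itemsets by brute force, so it is not an efficient miner such as Apriori.

```python
from collections import Counter
from itertools import combinations


def frequent_itemsets(transactions, min_support, max_size=2):
    """Return {itemset: count} for itemsets found in at least min_support transactions.

    Itemsets are sorted tuples of up to max_size items (brute-force enumeration).
    """
    counts = Counter()
    for transaction in transactions:
        items = sorted(set(transaction))  # ignore duplicates inside one transaction
        for size in range(1, max_size + 1):
            for itemset in combinations(items, size):
                counts[itemset] += 1
    return {itemset: n for itemset, n in counts.items() if n >= min_support}


# Hypothetical toy database: each inner list is one transaction.
transactions = [
    ["bread", "milk"],
    ["bread", "diapers", "beer"],
    ["milk", "diapers", "beer"],
    ["bread", "milk", "diapers"],
]

frequent = frequent_itemsets(transactions, min_support=2)
print(frequent[("beer", "diapers")])   # 2: the pair occurs in two transactions
print(("beer", "bread") in frequent)   # False: the pair occurs only once
```

Here the support threshold is an absolute transaction count; a relative threshold (a fraction of all transactions) would work the same way after dividing by `len(transactions)`.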

Various data repositories

  • relational data
  • data warehouses
  • transactional data
  • graph data
  • sequence data
  • time series
  • spatial data
  • text & multimedia data

CH2a Data preprocessing

Real-world data are often:

  • noisy
  • inconsistent
  • redundant

Data preprocessing tasks

  • types of attributes
    • Categorical
      - nominal: provide enough information to distinguish one object from another
        Example: zip codes, employee ID numbers, eye color, gender
      - binary: assume only two values (e.g., yes/no, true/false, 0/1)
      - ordinal: provide enough information to order objects
        Example: grades, {good, better, best}
    • Numeric (continuous)
  • descriptive data summarization
    gives the overall picture of the data
    involves
    • measuring the central tendency
      • mean
        The mean is sensitive to extreme values
      • weighted mean
      • Trimmed mean: disregards the low and high extremes
      • a measure that is not sensitive to extreme values is the
        median, which represents the middle value of an ordered set
        of observations
      • mode: the value that occurs most frequently in the set
      • midrange: average of the largest and smallest values in the
        data
    • measuring the dispersion
      - range: difference between the largest and smallest values
      - kth percentile: value xi with the property that k percent of
        the data are smaller than xi (the median is the 50th percentile)
      - quartiles: 25th percentile (denoted by Q1), 50th percentile
        (the median), and 75th percentile (denoted by Q3)
      - interquartile range: IQR = Q3 - Q1
      - five-number summary: consists of minimum, Q1, median, Q3,
        maximum
      - standard deviation σ: square root of the variance σ^2
    • graphical display of descriptive summaries
      • boxplots
      • histograms
      • scatter plots
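
The central tendency and dispersion measures listed above can be sketched with Python's standard `statistics` module. The helper names `trimmed_mean` and `summarize`, the sample values, and the choice of the "inclusive" quantile method are assumptions for illustration; other quantile conventions give slightly different Q1 and Q3. The population standard deviation (`pstdev`) is used here, which is also an assumption about which variant the notes intend.

```python
import statistics


def trimmed_mean(values, fraction):
    """Mean after dropping `fraction` of the values from each end of the sorted data."""
    if not 0 <= fraction < 0.5:
        raise ValueError("fraction must be in [0, 0.5)")
    ordered = sorted(values)
    k = int(len(ordered) * fraction)  # number of values dropped per end
    return statistics.mean(ordered[k:len(ordered) - k])


def summarize(values):
    """Return common descriptive statistics for a list of numbers."""
    ordered = sorted(values)
    q1, median, q3 = statistics.quantiles(ordered, n=4, method="inclusive")
    return {
        "mean": statistics.mean(ordered),
        "median": statistics.median(ordered),
        "mode": statistics.mode(ordered),
        "midrange": (ordered[0] + ordered[-1]) / 2,
        "range": ordered[-1] - ordered[0],
        "five_number": (ordered[0], q1, median, q3, ordered[-1]),
        "iqr": q3 - q1,
        "stdev": statistics.pstdev(ordered),  # population standard deviation
    }


# Hypothetical sample with one extreme value (100).
data = [2, 4, 4, 5, 7, 9, 10, 12, 100]
summary = summarize(data)
print(summary["mean"], summary["median"])  # 17 7
print(summary["five_number"])              # (2, 4.0, 7.0, 10.0, 100)
print(trimmed_mean(data, 0.2) < summary["mean"])  # True
```

The single extreme value pulls the mean up to 17 while the median stays at 7, which is the sensitivity to extreme values noted in the list.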
  1. Data cleaning
    • fill in missing values
      e.g., Occupation = ""
    • smooth out noisy data, i.e., data containing errors or outliers; common causes:
      - faulty data collection instruments
      - human or computer error at data entry
      - errors in data transmission

      outlier: usually, a value more than 1.5 x IQR above Q3 or below Q1
      e.g., Salary = "-10"
    • correct inconsistencies in the data
      e.g., Age = "42", Birthday = "03/07/2010"
      e.g., discrepancy between duplicate records
Given N tuples, are numerical attributes A and B correlated?
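
One common way to answer this question is Pearson's correlation coefficient, which ranges from -1 to 1 (values near 0 suggest no linear correlation). Whether this is the formula shown in the original figure is an assumption; the function name `pearson_r` and the sample values are hypothetical.

```python
import math


def pearson_r(a, b):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    # Sums of products of deviations; the 1/N factors cancel in the ratio.
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        raise ValueError("correlation is undefined for a constant attribute")
    return cov / math.sqrt(var_a * var_b)


# Hypothetical attribute values for N = 4 tuples.
A = [1, 2, 3, 4]
B = [2, 4, 6, 8]
r = pearson_r(A, B)  # close to 1.0 because B is exactly 2 * A
```

A coefficient near 1 or -1 suggests that one attribute can largely be derived from the other, so one of them may be redundant.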


  2. Data integration
    Data integration combines data from multiple sources into a coherent data store

Entity identification problem
Do two objects from different data sources refer to the same entity?
Example: Is the record that has customer_id = 234 (from one source) equivalent to the one where cust_num = 234 (from the other source)?
Metadata can help: for each attribute, look at the name, meaning, data type, range of permitted values, etc.

Data value conflicts
For the same entity, attribute values from different sources may differ, e.g., weight measured in kilograms or pounds.

Data redundancy

  3. Data transformation
    (Goal: modify the data in order to improve data mining performance)
  4. Data reduction

attribute/feature construction

normalization: attribute values are scaled to fall within a smaller, specified range

min-max normalization

z-score normalization
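
Both normalizations can be sketched as follows. Min-max normalization maps a value v to (v - min) / (max - min) * (new_max - new_min) + new_min, and z-score normalization maps v to (v - mean) / standard deviation. The function names, the default target range [0, 1], and the use of the population standard deviation are assumptions for illustration.

```python
import statistics


def min_max_normalize(values, new_min=0.0, new_max=1.0):
    """Linearly rescale values from [min, max] to [new_min, new_max]."""
    old_min, old_max = min(values), max(values)
    if old_min == old_max:
        raise ValueError("min-max normalization is undefined for constant data")
    return [
        (v - old_min) / (old_max - old_min) * (new_max - new_min) + new_min
        for v in values
    ]


def z_score_normalize(values):
    """Rescale values to zero mean and unit (population) standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    if std == 0:
        raise ValueError("z-score normalization is undefined for constant data")
    return [(v - mean) / std for v in values]


# Hypothetical attribute values.
values = [10, 20, 30]
scaled = min_max_normalize(values)
standardized = z_score_normalize(values)
print(scaled)  # [0.0, 0.5, 1.0]
```

Min-max normalization depends on the observed minimum and maximum, so a single extreme value compresses the remaining values into a narrow band; z-score normalization is often preferred when the true range is unknown or extreme values are present.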

Data reduction
