COMP9318_WEEK2

声明:由于本人也是处于学习阶段,有些理解可能并不深刻,甚至会携带一定错误,因此请以批判的态度来进行阅读,如有错误,请留言或直接联系本人。

本周内容参照Jiawei.Han&Micheline.Kamber&Jian.Pei, DATA MINING: Concepts and Techniques, Third Edition. 版本的部分内容。

本周内容:1)Data Warehouse定义;2)OLAP介绍;3)Data Cube介绍

关键词:Data Warehouse; OLAP; Data Cube; Lattice; Roll-up; Drill-down; Slice and Dice; Pivot

首先提出问题:

  1. 什么是Data warehouse?

2. OLAP是什么?与Data Warehouse有什么关系?

问题一:什么是Data warehouse?

关于Data Warehouse 这里有很多版本的定义,我们这里采用William H. Inmon的定义:

W. H. Inmon对Data Warehouse的定义:A data warehouse is a subject-oriented, integrated, time-variant, and nonvolatile collection of data in support of management’s decision-making process.

特点一,subject-oriented(面向主题):

1)Organized around major subjects, such as customer,product, sales.

2) Focusing on the modeling and analysis of data for decision makers, not on daily operations or transaction processing.

3) Provide a simpleand conciseview around particular subject issues by excluding data that are not useful in the decision support process.

(Data Warehouse 面向的数据是有侧重点的的;且它关注的是数据的模型搭建和分析,以帮助决策者来进行决策;它针对具体的主题通过简洁明了的视图来使决策者能够明了的看见有用的信息)

特点二,integrated(整合):

1)Constructed by integrating multiple, heterogeneous data sources

1.1)relational databases, flat files, on-line transaction records

2)Data cleaning and data integration techniques are applied.

2.1)Ensure consistency in naming conventions, encoding structures, attribute measures, etc. among different data sources.(E.g., Hotel price: currency, tax, breakfast covered, etc.)

2.2)When data is moved to the warehouse, it is converted.

(Data Warehouse 将复杂的异构的数据整合起来; 这里会用到数据清洗和数据整合技术,来确保数据结构合理正确(naming conventions, encoding structures, attribute measures))

特点三,time-variant(时变):

1)The time horizon for the data warehouse is significantly longer than that of operational systems.

1.1)Operational database: current value data.

1.2)Data warehouse data: provide information from a historical perspective (e.g., past 5-10 years)

2)Every key structure in the data warehouse

2.1)Contains an element of time, explicitly or implicitly

2.2)But the key of operational data may or may not contain “time element”.

(Data Warehouse 的存在时间是久于Operational database的,所以它会存储很多历史性数据,因此我们需要在Data Warehouse的关键结构点(key structure)上显式的或隐式的存储时间标记(time element),所以Data Warehouse是具有时间性的。)

特点四,nonvolatile(非易失性):

1)A physically separate store of data transformed from the operational environment.

2)Operational update of data does not occur in the data warehouse environment.

2.1)Does not require transaction processing, recovery, and concurrency control mechanisms

2.2)Requires only two operations in data accessing:

2.2.1)initial loading of data and access of data.

(Data Warehouse 的数据存储是独立的,且它不会实时更新数据(Does not require transaction processing, recovery, and concurrency control mechanisms),它只需要能够初始化加载数据和访问数据就行。因此在DW中不需要考虑数据冗余(data redundancy),因为即使一个个体的汇总数据是在不断更新的,但是DW不会覆盖历史数据,只会不断的存入该汇总数据。例如,Woolworths在Sydney的A2牛奶的销售汇总数据每年都是不一样的,DW会将1990年...,2015年,2016年,2017年的销售汇总数据存储起来,而不是每年都会用新数据覆盖旧数据。所以说,在DW中的data不能被删除也不能被修改。)

总结:

Data Mining:concepts and techniques P.106总结理一下William H. Inmon关于Data Warehouse的定义:In sum, a data warehouse is a semantically consistent data store that serves as a physical implementation of a decision support data model and stores the information on which an enterprise needs to make strategic decisions. A data warehouse is also often viewed as an architecture, constructed by integrating data from multiple heterogeneous sources to support structured and/or ad hoc queries, analytical reporting, and decision making.

在PPT第11页,老师提供了个图,来帮助理解Data Warehouse Architecture


image.png

(1)外部数据通过Extract, Transform, Load, Refresh这些步骤,将数据进行clean , match, aggregate, 然后放入DW。(2) 现在,DW已经是所有subject的data一个统一体(我们还可以进行细分,将DW分成若干个Data Marts(这些mart都是针对于specific subject的))。然而在DW中,那我们用什么工具来帮助分析处理这些数据呢?当然是OLAP。(3)经过OLAP的分析处理后,我们还可以用这些已经比较直观的数据再次进行analysis, query, data Mining 来获取更加深层次的隐藏的信息。

问题二:OLAP是什么?与Data Warehouse 有什么关系?

  1. OLAP是什么:

1.1)(来自Wikipedia)

Online analytical processing, or OLAP , is an approach to answering multi-dimensional analytical (MDA) queries swiftly in computing.[1] OLAP is part of the broader category of business intelligence, which also encompasses relational database, report writing and data mining.[2] Typical applications of OLAP include business reporting for sales, marketing, management reporting, business process management (BPM),[3] budgeting and forecasting, financial reporting and similar areas, with new applications coming up, such as agriculture.[4] The term OLAP was created as a slight modification of the traditional database term online transaction processing (OLTP).[5]

OLAP tools enable users to analyze multidimensional data interactively from multiple perspectives. OLAP consists of three basic analytical operations: consolidation (roll-up), drill-down, and slicing and dicing.[6]

Databases configured for OLAP use a multidimensional data model, allowing for complex analytical and ad hoc queries with a rapid execution time.[7]

1.2)来自DATA MINING

Data warehouse technology includes data cleaning, data integration, and on-line analytical processing (OLAP), that is, analysis techniques with functionalities such as summarization, consolidation, and aggregation as well as the ability to view information from different angles.(P.3)

(注:韩教授书中对于OLAP的操作分类略有不同:Typical OLAP operations include rollup, drill-(down, across, through), slice-and-dice, pivot (rotate), as well as statistical operations such as ranking and computing moving averages and growth rates. OLAP operations can be implemented efficiently using the data cube structure.(DATA MINING P.151) 本人更加倾向于韩教授的分类)

下面是关于Roll-up,Drill-down的解释:

Roll-up, Drill-down是OLAP的操作之一,它们允许用户在不同的汇总级别观察数据。例如,按地级市汇总的销售数据,用户可以Rool-up得到按省汇总的销售数据,可以Drill-down得到按县(区)汇总的销售数据。

如下图可直观的感受到Roll-up和Drill-down的操作。

image.png

2)OLAP与Data Warehouse 有什么关系?

Data warehouses provide on-line analytical processing (OLAP) tools for the interactive analysis of multidimensional data of varied granularities, which facilitates effective data generalization and data mining.(Many other data mining functions, such as association, classification, prediction, and clustering, can be integrated with OLAP operations to enhance interactive mining of knowledge at multiple levels of abstraction. Hence, the data warehouse has become an increasingly important platform for data analysis and on-line analytical processing and will provide an effective platform for data mining. )(DATA MINING P.105)

新的问题产生:

3.Data Warehouse和传统的DBMS有什么不同呢?

4. 相较于OLTP, OLAP具有什么优势?

首先我们给OLTP下个定义:The major task of on-line operational database systems is to perform on-line transaction and query processing. These systems are called on-line transaction processing (OLTP) systems. They cover most of the day-to-day operations of an organization, such as purchasing, inventory, manufacturing, banking, payroll, registration, and accounting.(DATA MINING P.108)

问题三:Data Warehouse和传统的DBMS有什么不同呢?

(1)High performance for both systems

(1.1)DBMS— tuned for OLTP: access methods, indexing, concurrency control, recovery

(1.2)Warehouse—tuned for OLAP: complex OLAP queries, multidimensional view, consolidation.

(2)Different functions and different data:

(2.1)missing data: Decision support requires historical data which operational DBs do not typically maintain

(2.2)data consolidation: DS requires consolidation (aggregation, summarization) of data from heterogeneous sources

(2.3)data quality: different sources typically use inconsistent data representations, codes and formats which have to be reconciled.

课程PPT解释的太过笼统,下面请看韩教授对于它们不同点的详细解释:

1)Users and system orientation: An OLTP system is customer-oriented and is used for transaction and query processing by clerks, clients, and information technology professionals. An OLAP system is market-oriented and is used for data analysis by knowledge workers, including managers, executives, and analysts.

2)Data contents: An OLTP system manages current data that, typically, are too detailed to be easily used for decision making. An OLAP system manages large amounts of historical data, provides facilities for summarization and aggregation, and stores and manages information at different levels of granularity. These features make the data easier to use in informed decision making.

3)Database design: An OLTP system usually adopts an entity-relationship (ER) data model and an application-oriented database design. An OLAP system typically adopts either a star or snowflake model (to be discussed in Section 3.2.2) and a subjectoriented database design.

4)View: An OLTP system focuses mainly on the current data within an enterprise or department, without referring to historical data or data in different organizations. In contrast, an OLAP system often spans multiple versions of a database schema, due to the evolutionary process of an organization. OLAP systems also deal with information that originates from different organizations, integrating information from many data stores. Because of their huge volume, OLAP data are stored on multiple storage media.

5)Access patterns: The access patterns of an OLTP system consist mainly of short, atomic transactions. Such a system requires concurrency control and recovery mechanisms. However, accesses to OLAP systems are mostly read-only operations (because most data warehouses store historical rather than up-to-date information), although many could be complex queries.(DATA MINING P.108-109)

image.png

问题四:相较于OLTP, OLAP具有什么优势?

(1)Different workload:

(1.1)OLTP (on-line transaction processing)

(1.1.1)Major task of traditional relational DBMS

(1.1.2)Day-to-day operations: purchasing, inventory, banking, manufacturing, payroll, registration, accounting, etc.

(1.2)OLAP (on-line analytical processing)

(1.2.1)Major task of data warehouse system

(1.2.2)Data analysis and decision making

(2)Queries hard/infeasible for OLTP, e.g.,

(2.1)Which week we have the largest sales?

(2.2)Does the sales of dairy products increase over time?

(2.3)Generate a spread sheet of total sales by state and by year.

(3)Difficult to represent these queries by using SQL

继续提问:5. 什么是multidimensional Model呢?

问题五:什么是multidimensional Model呢?

Data warehouses and OLAP tools are based on a multidimensional data model. This model views data in the form of a data cube. A data cube allows data to be modeled and viewed in multiple dimensions. It is defined by dimensions and facts.(DATA MINING P.110)

好的,现在我们肯定有疑问,什么是fact?什么是measure?什么是dimension?什么是data cube?让我们一一解答。

什么是fact?:(PPT定义)Facts: the subject it models;(DATA MING P.111定义)Facts are numerical measures. (个人总结:fact就是用数字数据来度量的subject。)

什么是measure?:A data cube measure is a numerical function that can be evaluated at each point in the data cube space. A measure value is computed for a given point by aggregating the data corresponding to the respective dimension-value pairs defining the given point.(DATA MINING P.119)

什么是dimension?:Dimensions: context of the measure(Dimensions are the perspectives or entities with respect to which an organization wants to keep records. (DATA MINING P.110))让我们用实例来说明dimension。

image.png

在这个cube中,它的dimensions其实是product,location,time,那么为什么我们看到的图的三个维度分别是product,month,city呢?因为每个dimension是分hierarchy的(就如这里的city,如果我们使用OLAP工具进行操作的话,它可以roll-up到sate,也可以drill-down到stor。 注意,OLAP操作和herarchy不是一类事物,详见DATA MINING P.121)。因此,我们需要给每个dimension一个类似于总称的东西。

什么是data cube?:A data cube allows data to be modeled and viewed in multiple dimensions. It is defined by dimensions and facts.(DATA MINING P.110)(个人总结:data cube是由Tables and Spreadsheets 转化而来的更为直观立体的数据表现形式。)

在ppt22页有这张图,让我们对其进行探究。

image.png

这里的蓝色区域其实表示所有product的汇总,正如绿色方框内文字所表示的,箭头所指小方框表示“所有product在一月份在纽约的销售量”。

由此,我们可以根据如下图,得到下下一张的汇总图:


image.png

image.png

我们将汇总图局部拆解,显得更加直观


image.png

依此进行组合就可以得到上图的一个汇总图。

好的,我们还可以Lattice of the cuboids,


image.png

这张图我们可以理解为,在1层时(从下往上,0层开始)为点,2层为面,3层为立方体。依此往上,我们可以很方便的将大于3维的cube直观的展现在我们眼前。

下面我们讲一下,分析一下Common OLAP Operations,由于前面我们已经分析过Roll-up和drill-down,所以这里我们不再赘述。(详细说明可参照DATA MINING P.125)

Slice:The slice operation performs a selection on one dimension of the given cube, resulting in a subcube.


image.png

Dice:The dice operation defines a subcube by performing a selection on two or more dimensions.


image.png

Pivot:Pivot (rotate): Pivot (also called rotate) is a visualization operation that rotates the data axes in view in order to provide an alternative presentation of the data. Other examples include rotating the axes in a 3-D cube, or transforming a 3-D cubeinto a series of 2-D planes.


image.png
最后编辑于
©著作权归作者所有,转载或内容合作请联系作者
  • 序言:七十年代末,一起剥皮案震惊了整个滨河市,随后出现的几起案子,更是在滨河造成了极大的恐慌,老刑警刘岩,带你破解...
    沈念sama阅读 203,324评论 5 476
  • 序言:滨河连续发生了三起死亡事件,死亡现场离奇诡异,居然都是意外死亡,警方通过查阅死者的电脑和手机,发现死者居然都...
    沈念sama阅读 85,303评论 2 381
  • 文/潘晓璐 我一进店门,熙熙楼的掌柜王于贵愁眉苦脸地迎上来,“玉大人,你说我怎么就摊上这事。” “怎么了?”我有些...
    开封第一讲书人阅读 150,192评论 0 337
  • 文/不坏的土叔 我叫张陵,是天一观的道长。 经常有香客问我,道长,这世上最难降的妖魔是什么? 我笑而不...
    开封第一讲书人阅读 54,555评论 1 273
  • 正文 为了忘掉前任,我火速办了婚礼,结果婚礼上,老公的妹妹穿的比我还像新娘。我一直安慰自己,他们只是感情好,可当我...
    茶点故事阅读 63,569评论 5 365
  • 文/花漫 我一把揭开白布。 她就那样静静地躺着,像睡着了一般。 火红的嫁衣衬着肌肤如雪。 梳的纹丝不乱的头发上,一...
    开封第一讲书人阅读 48,566评论 1 281
  • 那天,我揣着相机与录音,去河边找鬼。 笑死,一个胖子当着我的面吹牛,可吹牛的内容都是我干的。 我是一名探鬼主播,决...
    沈念sama阅读 37,927评论 3 395
  • 文/苍兰香墨 我猛地睁开眼,长吁一口气:“原来是场噩梦啊……” “哼!你这毒妇竟也来了?” 一声冷哼从身侧响起,我...
    开封第一讲书人阅读 36,583评论 0 257
  • 序言:老挝万荣一对情侣失踪,失踪者是张志新(化名)和其女友刘颖,没想到半个月后,有当地人在树林里发现了一具尸体,经...
    沈念sama阅读 40,827评论 1 297
  • 正文 独居荒郊野岭守林人离奇死亡,尸身上长有42处带血的脓包…… 初始之章·张勋 以下内容为张勋视角 年9月15日...
    茶点故事阅读 35,590评论 2 320
  • 正文 我和宋清朗相恋三年,在试婚纱的时候发现自己被绿了。 大学时的朋友给我发了我未婚夫和他白月光在一起吃饭的照片。...
    茶点故事阅读 37,669评论 1 329
  • 序言:一个原本活蹦乱跳的男人离奇死亡,死状恐怖,灵堂内的尸体忽然破棺而出,到底是诈尸还是另有隐情,我是刑警宁泽,带...
    沈念sama阅读 33,365评论 4 318
  • 正文 年R本政府宣布,位于F岛的核电站,受9级特大地震影响,放射性物质发生泄漏。R本人自食恶果不足惜,却给世界环境...
    茶点故事阅读 38,941评论 3 307
  • 文/蒙蒙 一、第九天 我趴在偏房一处隐蔽的房顶上张望。 院中可真热闹,春花似锦、人声如沸。这庄子的主人今日做“春日...
    开封第一讲书人阅读 29,928评论 0 19
  • 文/苍兰香墨 我抬头看了看天上的太阳。三九已至,却和暖如春,着一层夹袄步出监牢的瞬间,已是汗流浃背。 一阵脚步声响...
    开封第一讲书人阅读 31,159评论 1 259
  • 我被黑心中介骗来泰国打工, 没想到刚下飞机就差点儿被人妖公主榨干…… 1. 我叫王不留,地道东北人。 一个月前我还...
    沈念sama阅读 42,880评论 2 349
  • 正文 我出身青楼,却偏偏与公主长得像,于是被迫代替她去往敌国和亲。 传闻我的和亲对象是个残疾皇子,可洞房花烛夜当晚...
    茶点故事阅读 42,399评论 2 342

推荐阅读更多精彩内容