Granularity
Granularity refers to the level of detail or summarization of the units of data in the data warehouse.
The more detail there is, the lower the level of granularity. The less detail there is, the higher the level of granularity.
Granularity is the single most critical design issue in the data warehouse environment because it profoundly affects the volume of data that resides in the data warehouse and the type of query that can be answered.
The volume of data in a warehouse is traded off against the level of detail of a query.
In almost all cases, data comes into the data warehouse at too high a level of granularity. This means that the developer must spend a lot of design and development resources breaking the data apart before it can be stored in the data warehouse. Occasionally, though, data enters the warehouse at too low a level of granularity.
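The relationship between detail and granularity can be made concrete with a small sketch. This is an illustrative example only, assuming hypothetical phone-call records (the field names `caller`, `hour`, and `minutes` are not from the text): the raw calls sit at a low level of granularity, and rolling them up per caller per hour raises the level of granularity.

```python
# Illustrative sketch: the same phone-call data at two levels of granularity.
# Field names (caller, hour, minutes) are assumptions, not from the text.
from collections import defaultdict

# Low granularity = high detail: one record per individual call.
calls = [
    {"caller": "J. Jones", "hour": 9,  "minutes": 3},
    {"caller": "J. Jones", "hour": 9,  "minutes": 12},
    {"caller": "J. Jones", "hour": 10, "minutes": 5},
]

# Higher granularity = less detail: one summary record per caller per hour.
summary = defaultdict(lambda: {"calls": 0, "minutes": 0})
for c in calls:
    key = (c["caller"], c["hour"])
    summary[key]["calls"] += 1
    summary[key]["minutes"] += c["minutes"]

# The summary still answers "how many calls did Jones make at 9 a.m.?"
# but it can no longer answer "did any single call exceed 10 minutes?"
print(dict(summary))
```

Note how summarizing is a one-way operation: once the detail is discarded, questions about individual calls can no longer be answered, which is why the level of granularity is such a consequential design decision.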
The Benefits of Granularity
The granular data found in the data warehouse is the key to reusability, because it can be used by many people in different ways.
Looking at the data in different ways is only one advantage of having a solid foundation. A related benefit is the ability to reconcile data, if needed.
Another related benefit of a low level of granularity is flexibility.
Another benefit of granular data is that it contains a history of activities and events across the corporation.
Determining the level of granularity is the most important design issue in the data warehouse environment.
An Example of Granularity
There is, then, a very good case for the compaction of data in a data warehouse. When data is compacted, significant savings can be realized in the amount of DASD used, the number of index entries required, and the processor resources required to manipulate data.
Put another way, with a very low level of granularity, you can answer practically any query. But a high level of granularity limits the number of questions that the data can handle.
Another consideration in designing granularity is determining which architectural entities will feed off the data warehouse. Each DSS architectural entity has its own unique considerations. The data warehouse must be designed to feed the lowest level of granularity needed by any architectural entity.
Dual Levels of Granularity
Most of the time, there is a great need for efficiency in storing and accessing data, and for the ability to analyze data in great detail.
The data warehouse in this example contains two types of data—lightly summarized data and “true archival” detail data.
Lightly summarized data is detailed data that has been summarized only to a very small extent. For example, phone call information may be summarized by the hour. Or, bank checking information may be summarized by the day.
There is significantly less volume of data in the lightly summarized database than there is in the detailed database. Of course, there is a limit to the level of detail that can be accessed in the lightly summarized database.
At the true archival level of data, all the detail coming from the operational environment is stored. There is truly a multitude of data at this level. For that reason, it makes sense to store the data on a medium such as magnetic tape or another bulk storage medium because the volume of data is so large.
If a pattern of searching the true archival level of data develops over time, the designer may want to create some new fields of data at the lightly summarized level, so that most of the processing can occur there.
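A dual-level design implies a simple routing decision at query time: serve the query from the lightly summarized level when that level carries every field the query needs, and fall back to the true archival detail otherwise. The sketch below assumes hypothetical field names for daily bank-check summaries; none of them come from the text.

```python
# Hypothetical sketch of dual-level access in a two-tier data warehouse.
# The field names below are illustrative assumptions, not from the text.

# Fields available at the lightly summarized (fast, disk-resident) level.
LIGHT_FIELDS = {"account", "day", "check_count", "total_amount"}

def choose_level(needed_fields):
    """Return which level of the dual-granularity warehouse can satisfy a query."""
    if set(needed_fields) <= LIGHT_FIELDS:
        return "lightly_summarized"   # small volume, efficient to access
    return "true_archival"            # bulk medium such as magnetic tape

# Daily totals can be answered cheaply at the summarized level...
print(choose_level({"account", "day", "total_amount"}))
# ...but a question about one individual check needs the archival detail.
print(choose_level({"account", "check_number"}))
```

This routing also shows why the designer watches for search patterns: if queries keep falling through to the archival level for the same field, adding that field to the lightly summarized level (growing `LIGHT_FIELDS` here) moves most of the processing back to the cheap tier.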
Living Sample Database
The greatest asset of a living sample database is that it is very efficient to access. Because its size is a fraction of the larger database from which it was derived, it is correspondingly much more efficient to access and analyze.
If very high degrees of accuracy are desired, a useful technique is to formulate the request and go through the iterative processing on the living sample database. In doing so, the DSS analyst quickly formulates the request. Then, after several iterations of analysis have been done, when the request is understood, it is run one final time against the large database.
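A minimal sketch of how a living sample database might be built, under the assumption that a simple random sample of the full detail is acceptable for exploratory analysis (the record layout and the 1 percent sampling rate are illustrative, not from the text):

```python
# Illustrative sketch: deriving a "living sample" database by keeping a
# random fraction of the full detail. Record fields and the 1% rate are
# assumptions for the example, not from the text.
import random

random.seed(42)  # fixed seed so the sketch is reproducible

# Stand-in for the large detailed database.
full_database = [{"id": i, "balance": i % 100} for i in range(100_000)]

# Keep roughly 1 record in 100.
living_sample = [row for row in full_database if random.random() < 0.01]

print(len(full_database), "rows sampled down to", len(living_sample))
```

This matches the iterative workflow described above: the DSS analyst refines the request quickly against the small sample, then runs the settled request one final time against `full_database` when high accuracy is required.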
Partitioning as a Design Approach
A second major design issue of data in the warehouse (after granularity) is partitioning.
Partitioning of data refers to the breakup of data into separate physical units that can be handled independently.
It is often said that if both granularity and partitioning are done properly, then almost all other aspects of the data warehouse design and implementation come easily.
Data is partitioned when data of a like structure is divided into more than one physical unit of data. In addition, any given unit of data belongs to one and only one partition.
The choices for partitioning data are strictly up to the developer. In the data warehouse environment, however, it is almost mandatory that one of the criteria for partitioning be by date.
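Partitioning by date can be sketched in a few lines. This is an illustrative example, assuming simple transaction records with a `date` field and yearly partitions (both assumptions; the text prescribes only that date be among the partitioning criteria). Note that each row lands in exactly one partition, as the definition above requires.

```python
# Illustrative sketch: partitioning warehouse rows by date (here, by year).
# Record fields and the yearly scheme are assumptions for the example.
from collections import defaultdict

rows = [
    {"account": "A-100", "date": "2023-07-14", "amount": 25.00},
    {"account": "A-101", "date": "2024-01-03", "amount": 110.50},
    {"account": "A-100", "date": "2024-02-20", "amount": 42.75},
]

# Each row belongs to one and only one partition.
partitions = defaultdict(list)
for row in rows:
    year = row["date"][:4]          # partition key derived from the date
    partitions[year].append(row)

print(sorted(partitions))           # the partition keys, e.g. ['2023', '2024']
```

Because each physical unit can then be handled independently, operations such as archiving 2023 or indexing 2024 touch only their own partition.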
As a rule, it makes sense to partition data warehouse data at the application level.
The acid test for the partitioning of data is to ask the question, “Can an index be added to a partition with no discernible interruption to other operations?” If an index can be added at will, then the partition is fine enough. If an index cannot be added easily, then the partition needs to be broken down more finely.