Welcome to the follow-up of the first part in Data Warehousing Technology Review. Let’s jump right in.
So, the main idea underlying Data Warehousing is that carrying out
the analysis directly on OLTP-systems is ineffective and undesirable.
Instead, the idea is to extract the necessary data from several OLTP-systems,
convert it, and then place it in a single data source — a Data Warehouse. This
process is also known as the ETL (Extract Transform Load)
process.
During the ETL process, the data is:
Cleared (unnecessary or housekeeping information is eliminated);
Aggregated (the sum and average values are calculated);
Transformed (data type conversion and storage structures reorganization
are executed);
Unified (from various data sources); and
Synchronized (resulting in one instant of time).
Data Warehouse is a ready platform for DSS (Decision Support Systems) and OLAP systems construction as it contains data having the following
properties:
Internal integrity
Although the data is preempted from several independent operating data
sources (OLTP-systems), they should be joined by unified naming laws,
values measurement ways, and common attribute sets. It has the highest value in
those organizations where a number of various systems are exploited and/or
similar data is presented in different ways (for example, logical value «True»
may be stored as 1,-1, True or T. Other variants of this include the same metric within
several systems being named differently – for example «balance»
and «totals ».)
At the stage of storage data loading, the above mentioned mismatches should
be eliminated (i.e. the data should be checked out, cleared and presented
in a uniform view). The Mandatory execution of such a procedure allows us to avoid
many potential problems. Also, the analysis of such integrated data can be performed
much easier.
«Single subject» focus
OLTP systems contain gigabytes of data that are rarely interesting for analysis
— addresses, postal codes, record identifiers, etc. Such information
is typically not loaded into a warehouse, which limits the considered data
spectrum to a necessary minimum.
The information in DW-storage is organized according to the primary aspects
of company business activity (customers, sales, etc.); this distinguishes
Data Warehouses from operating databases where the data are organized according
to business processes (invoice shipment, calculations with customers, goods
shipment, etc.). Subject orientation (due to corresponding data structures)
promotes both analysis simplification and faster execution of analytical queries.
Historic nature
OLTP systems usually cover relatively limited time intervals, whereas
Data Warehouses frequently contain data for up to several decades. This
makes them an ideal base for revealing trends and long-term tendencies in business.
From a technical point of view, the historical nature of data means that
tables in storage contain a «temporary key», or data are distributed
under several tables, each of them referring to a fixed-time interval.
«Read-only» availability
Depending of business needs, you choose one of three available methods
to update data in a Warehouse Full Process – Completely restructures
a cube based on its current definition and then its data recalculation;
Incremental Update – Adds new data to a partition in the cube and
aggregations updating. This method does not process changes to a cube’s
structure (measures, dimensions) or changes to its existing source data.
An incremental update creates a temporary partition from the new data and
merges it into an existing partition); and Refresh Data – Clears and
reloads a cube’s data and recalculates its aggregations. This method is used
if the cube’s source data has changed but its structure has not.
The
hand-held data modification in storage is practically never performed (actually
Microsoft SQL Server 7.0 OLAP Services cubes may be updated by the user
(so called «write-back capabilities»), as it results in data
integrity violation. Typically, it is not required to minimize the time of DW
data loading, so the storage structure may be optimized for predefined
queries execution, achieved by relational scheme de-normalization,
preliminary data aggregation and, «of most use», indexes creation.
Thus, the data loaded in DW-storage are organized in a completely integrated
structure, have natural internal connections, and gain new properties that
add them the status of INFORMATION.