OLAP and Data Warehousing (The Problem and Solution)December 24, 2001
All over the world organizations store or have already accumulated during
their activity large volumes of data in corporate databases. These data "sets"
store major potential possibilities for extracting new, analytical information to help
build the business tactics to reveal market evolution tendencies
and to find new solutions stipulating successful development in conditions of competition.
It is obvious that for some companies such analysis is an integral part of their daily activity,
but most of them, at least here in Russia, are only starting to begin taking it seriously.
Historically, in Russia the solution for daily operating
activity automation was extremely advanced (I mine online transaction processing
systems, OLTP). These systems ensure registration and storage of the facts
describing companies' business processes. The base for such kinds of systems are
relational database management systems (RDBMS).
E.F. Codd, the expert of technologies
of management relational systems, said: "... although RDBMS systems are
accessible for users, they never were considered as a resource giving power
functions on synthesis, the analysis and consolidation (so called, multidimensional
data analysis)". The question is synthesis of the information, on converting
the data of operating systems into the information and even in quality
standards.
Automation or computerization, as a rule, happens historically from
"below to upwards". The defined divisions, which occur in mass operations
on data processing (product registration, the warehouse registration, dealing
with suppliers and/or clients), are exposed to the process of target
automation. This means that mass tasks will be made with the help of computers,
which will replace operation with paper media (documents, forms) on the
mode, in which operations are mirrored in databases, or (that more often
and happens) the information will be duplicated both on paper and in databases.
By virtue of that (or to reflection in the database) is exposed to a
computerization each time a small fragment of business processes, and it
happens not always in coordination. Most databases are "badly coordinated"
with one another. The phrase "badly coordinated" means that in various databases
identical "real world" objects may be reflected differently (for example,
Smith D. - at charge of wages and Don Smith - at obtaining material assets).
"Inside" corporate databases target transactions that are executed (for example,
filing the table of credited invoice). Such systems usually are called online
transaction processing systems (or OLTP). Target transactions change a
state of databases and bring them into accord to a current state of that
fragment of the real world which is simulated with the database. Thus,
the main assignment of target databases are transaction processing.
Another big layer of the operations fulfilled with corporate databases
is reports creation for the defined period of existence of the database.
The form of these reports may vary and, as a rule, is determined by external
rules (bookkeeping rules and laws) and internal rules (duty regulations,
orders and instructions) within the framework of which company works.
The databases created during such automation, as a rule, are optimized for
fast execution of necessary transactions and store the data for a small temporary
period. Now it is possible to ask ourselves: "What processes have undergone
automation? What operation of staff was automated and simplified?" The
answer for this question is obvious: operation of performers which fulfill
standard procedures are automated. Automation almost has not affected top-managers
- the staff responsible for decision making.
Creation of data warehouses - "deckhouses" above corporate databases
allow one to proceed to the next stage of business activity automation - creation
the platform for decision making support. The main difference of activity on
decision-making from performing daily, operational activity, from the point
data used, is the need for comprehensive vision of processes for all diversity
of parameters on which they depend, for various temporary intervals. It
is possible to say that performer work with the data on happening processes
whereas the information is necessary for decision making for managers.
For creation of decision support systems it is necessary complete, noncontradictory,
the information for various temporary intervals which may be as is generalized
(sum or aggregated in another way), and detailed. These requests are defining
at creation of data warehouses as bases for creation of decision support
systems (DSS).
For transition from the process of automation business of processes
to automation of decision-making process we use software products which
can be divided into three categories:
OLAP tools are aimed at a test of hypotheses -- they allow one to find the
data which confirm or refute the formulated administrative hypotheses.
Hypotheses may be formulated as very much definitely (whether there is
a dependence between falling arrived for the last half-year and rises of
product cost), and is more indistinct (whether there are parameters which
distinguish the division which has brought the greatest profit on all remaining
company divisions). This information allows managers to change processes
of firm for reaching definite purposes.
Data mining tools are intended for checking, finding and confirming hypotheses
based on the existing data. These tools hardly depend on
data domain and structure of input data. However the availability of similar
tools are necessary in case of the big data volumes and a plenty of parameters,
on which these data depend, as they allow to detect (or other words, to
make visible, mapped) the facts and tendencies which are completely invisible
at usual review of huge data arrays.
Specific data usage forms (against transaction processing "inside" OLTP
databases) generate appropriate requirements to storage and data presentation
models. Data warehouses and OLTP-databases run under different query types
(or load templates). Corporate OLTP databases are as much as possible optimized
for short transactions series execution with maximum performance. Target
transactions combine not only data reading, but also data modification
and new data insertion. In case of data warehouses we deal, at first, with
data acquisition, a great bulk of queries to data warehouse are reading
/ sampling data. The most widespread model for data warehouses is the three-dimensional
model or model of multidimensional cubes.
Attempts to construct analysis / decision making systems which would
access directly to OLTP-databases usually appear unsuccessful:
Most measures (parameters on which measures depend) are inhomogeneous.
For example, if we speak about the description of the goods, it is also
possible to speak about some commodity section, the type of the goods.
Thus, measure "product" may have a name hierarchy, for example "product
type" - > "product kind" - > "product unit name". At stage of data analysis
and decision making often it is necessary to work not with the numerical
values specific to concrete production; there is enough detailing a level
of sort of production or even its(her) type. For support of fast access
to this sum the information of repository, as a rule, contain this information
already calculated. The process of calculation of the similar sums are
we named - aggregation, the calculated values - aggregates.
Data warehouse structures noticeably differ from those used in OLTP-systems.
The typical data structure used in warehouses are :
In a basis of OLAP concept lays the principle of a multidimensional
data presentation. "Father of the relational theory" doctor E.F. Codd has
considered disadvantages of relational model, first having showed impossibility
"to unite, view and parse the data viewed from multiplicity of measurements,
that is the most understandable for corporate analysts a method", and has
defined common requirements to OLAP tools which are expanding functionality
of relational DBMS and switching on the multidimensional analysis as one
performance.
These 12 rules (according to Codd) which should satisfy the OLAP software are:
It is necessray to consider this set of requests, being the actual definition of OLAP (by E. F. Kodd), as recommendations, and evaluate concrete OLAP-products on degree of approximation to correspondence to all above 12 rules. This completes the first part (from a series) of the articles concerning OLAP technology -- technologies of data analysis, the "platform" which allows modern organizations in the modern world to build the business-tactics to reveal market evolution tendencies and to find new solutions stipulating successful development in conditions of competition. Later, in my future articles you can find a description of utilities / tools, related to OLAP technology:
|