The Problem
All over the world organizations store or have already accumulated during
their activity large volumes of data in corporate databases. These data "sets"
store major potential possibilities for extracting new, analytical information to help
build the business tactics to reveal market evolution tendencies
and to find new solutions stipulating successful development in conditions of competition.
It is obvious that for some companies such analysis is an integral part of their daily activity,
but most of them, at least here in Russia, are only starting to begin taking it seriously.
Historically, in Russia the solution for daily operating
activity automation was extremely advanced (I mine online transaction processing
systems, OLTP). These systems ensure registration and storage of the facts
describing companies' business processes. The base for such kinds of systems are
relational database management systems (RDBMS).
The traditional approach is to use created OLTP-systems for
support of accepted administrative solutions, i.e., attempts to construct
report sets "under" the corporate database and usage obtained (after interpreting)
reports directly for acceptance of strategic business solutions. Among
the disadvantages of such a method of decision-making is that it's necessary to
mark the following (as a minimum):
-
Small number of data used for support of crucial business-solutions;
-
Similar process is very slow, as the process of query writing and interpreting
them are durable. It takes many days when the chief solution
may be necessary to make immediately;
-
The problem of various spheres of activity. An expert in information technologies
(person who prepares the report) and the chief think in different categories
and as a corollary tend not to understand each other;
-
Complexity of reports for understanding (an absence of visualization);
-
Lack of time (for chief) to find "significant" numbers.
Actually, these problems are not a corollary of poor quality of the corporate
OLTP-systems. The problem is in the difference between activity which is automated
by the OLTP-systems and decision-making activity. Data stored in corporate
database systems are simply the records about some events having a place --
the facts, but not information in a general sense of this word.
E.F. Codd, the expert of technologies
of management relational systems, said: "... although RDBMS systems are
accessible for users, they never were considered as a resource giving power
functions on synthesis, the analysis and consolidation (so called, multidimensional
data analysis)". The question is synthesis of the information, on converting
the data of operating systems into the information and even in quality
standards.
The Solution
Automation or computerization, as a rule, happens historically from
"below to upwards". The defined divisions, which occur in mass operations
on data processing (product registration, the warehouse registration, dealing
with suppliers and/or clients), are exposed to the process of target
automation. This means that mass tasks will be made with the help of computers,
which will replace operation with paper media (documents, forms) on the
mode, in which operations are mirrored in databases, or (that more often
and happens) the information will be duplicated both on paper and in databases.
By virtue of that (or to reflection in the database) is exposed to a
computerization each time a small fragment of business processes, and it
happens not always in coordination. Most databases are "badly coordinated"
with one another. The phrase "badly coordinated" means that in various databases
identical "real world" objects may be reflected differently (for example,
Smith D. - at charge of wages and Don Smith - at obtaining material assets).
"Inside" corporate databases target transactions that are executed (for example,
filing the table of credited invoice). Such systems usually are called online
transaction processing systems (or OLTP). Target transactions change a
state of databases and bring them into accord to a current state of that
fragment of the real world which is simulated with the database. Thus,
the main assignment of target databases are transaction processing.
Another big layer of the operations fulfilled with corporate databases
is reports creation for the defined period of existence of the database.
The form of these reports may vary and, as a rule, is determined by external
rules (bookkeeping rules and laws) and internal rules (duty regulations,
orders and instructions) within the framework of which company works.
The databases created during such automation, as a rule, are optimized for
fast execution of necessary transactions and store the data for a small temporary
period. Now it is possible to ask ourselves: "What processes have undergone
automation? What operation of staff was automated and simplified?" The
answer for this question is obvious: operation of performers which fulfill
standard procedures are automated. Automation almost has not affected top-managers
- the staff responsible for decision making.
Creation of data warehouses - "deckhouses" above corporate databases
allow one to proceed to the next stage of business activity automation - creation
the platform for decision making support. The main difference of activity on
decision-making from performing daily, operational activity, from the point
data used, is the need for comprehensive vision of processes for all diversity
of parameters on which they depend, for various temporary intervals. It
is possible to say that performer work with the data on happening processes
whereas the information is necessary for decision making for managers.
For creation of decision support systems it is necessary complete, noncontradictory,
the information for various temporary intervals which may be as is generalized
(sum or aggregated in another way), and detailed. These requests are defining
at creation of data warehouses as bases for creation of decision support
systems (DSS).
For transition from the process of automation business of processes
to automation of decision-making process we use software products which
can be divided into three categories:
-
Reporting tools;
-
OLAP tools;
-
Data Mining tools.
Report generation tools are intended for data acquisition in tabular form,
diagrams and (geographical) maps. This class of tools allows managers to
inspect happening processes, having some of the fixed sights on metrics
of these processes.
OLAP tools are aimed at a test of hypotheses -- they allow one to find the
data which confirm or refute the formulated administrative hypotheses.
Hypotheses may be formulated as very much definitely (whether there is
a dependence between falling arrived for the last half-year and rises of
product cost), and is more indistinct (whether there are parameters which
distinguish the division which has brought the greatest profit on all remaining
company divisions). This information allows managers to change processes
of firm for reaching definite purposes.
Data mining tools are intended for checking, finding and confirming hypotheses
based on the existing data. These tools hardly depend on
data domain and structure of input data. However the availability of similar
tools are necessary in case of the big data volumes and a plenty of parameters,
on which these data depend, as they allow to detect (or other words, to
make visible, mapped) the facts and tendencies which are completely invisible
at usual review of huge data arrays.
Specific data usage forms (against transaction processing "inside" OLTP
databases) generate appropriate requirements to storage and data presentation
models. Data warehouses and OLTP-databases run under different query types
(or load templates). Corporate OLTP databases are as much as possible optimized
for short transactions series execution with maximum performance. Target
transactions combine not only data reading, but also data modification
and new data insertion. In case of data warehouses we deal, at first, with
data acquisition, a great bulk of queries to data warehouse are reading
/ sampling data. The most widespread model for data warehouses is the three-dimensional
model or model of multidimensional cubes.
Attempts to construct analysis / decision making systems which would
access directly to OLTP-databases usually appear unsuccessful:
-
Analytical queries "compete" with online transactions, locking the data
and causing shortage of database server resources;
-
Operating data structure / scheme intended for effective support of short
and often transactions (it is normalized as a minimum up to the third normal
form) and by virtue of it:
-
Too complicated for understanding by end-users;
-
Does not ensure a necessary execution performance of analytical queries.
-
As a rule in a company there are several operating databases. This database
has various data structures, units of measurements, methods of encoding.
For the end user (business - analyst) the task of construction of any summary
query on several similar databases are practically insoluble.
Multidimensional data models allow to store measure values (for example,
the sum of sale, an amount of the goods, an amount of shipped products)
in the uniform object of the data with defining parameters or arguments;
further we shall name these objects as measures (for example, number of
the score, date of posting, number of the waybill, the name of the payer,
the name and the type of the goods, a title of service). It is necessary
to mark, that the data for storage are received from target online corporate
databases. Data warehouses contain the information on the various sides
of processes which might be mirrored in various target databases in the
uniform object, the process of data multiplexing from various sources,
or the process of consolidation is necessary.
Most measures (parameters on which measures depend) are inhomogeneous.
For example, if we speak about the description of the goods, it is also
possible to speak about some commodity section, the type of the goods.
Thus, measure "product" may have a name hierarchy, for example "product
type" - > "product kind" - > "product unit name". At stage of data analysis
and decision making often it is necessary to work not with the numerical
values specific to concrete production; there is enough detailing a level
of sort of production or even its(her) type. For support of fast access
to this sum the information of repository, as a rule, contain this information
already calculated. The process of calculation of the similar sums are
we named - aggregation, the calculated values - aggregates.
Data warehouse structures noticeably differ from those used in OLTP-systems.
The typical data structure used in warehouses are :
-
Star schema;
-
Snowflake schema.
These schemes for data warehouses are the same canonical, as third normal
form for online transaction processing systems (OLTP).
In a basis of OLAP concept lays the principle of a multidimensional
data presentation. "Father of the relational theory" doctor E.F. Codd has
considered disadvantages of relational model, first having showed impossibility
"to unite, view and parse the data viewed from multiplicity of measurements,
that is the most understandable for corporate analysts a method", and has
defined common requirements to OLAP tools which are expanding functionality
of relational DBMS and switching on the multidimensional analysis as one
performance.
These 12 rules (according to Codd) which should satisfy the OLAP software are:
| Multi-Dimensional Conceptual View |
Business-analyst "sees the company world" multivariate and multi-dimensional,
accordingly and conceptual data model representation in OLAP product should
be multivariate and multi-dimensional on a nature, that allow analysts
to fulfil intuitive operations: "slice and dice", rotate and pivot directions
of consolidation. |
| Transparency |
The user should not know what concrete resources are used for storage
and data processing and how the data are organized.
Without dependence from that, the OLAP-product a part of resources
of the user is whether or not, this fact should be transparent for the
user. If OLAP it is granted by client - server calculations this fact also,
whenever possible, should be imperceptible for the user. OLAP should be
granted in a context of open architecture, allowing the user where he was
to communicate through the analytical tool with the server. In addition
transparency should be achieved in interaction of the analytical tool with
homogeneous / heterogeneous databases. |
| Accessibility |
Business analyst should have a possibility to analyze within the framework
of the common conceptual scheme, thus the data may remain under the control
of old, "inherited" DBMS, being thus pegged to common analytical model.
So OLAP tool kit should superimpose its own logic scheme on physical
data arrays, fulfilling all conversions required for support of an uniform,
agreed and complete "user sight" on the information. |
| Consistent Reporting Performance |
With increasing of numbers of measures and database size analysts should
not face with any decrease of productivity. Stable productivity is necessary
for maintaining an usage simplicity which is required for finishing OLAP
up to the end user.
If the user - analyst will test essential distinctions in productivity
according to number of measures then he will try to compensate these distinctions
the strategy of development that will call data representation other ways,
but not with what it is really necessary to present the data. Costs of
time to bypass the system for compensation of its inadequacy is not what
analytical products are intended for. |
| Client-Server Architecture |
Large data volumes, required operating analytical processing stored
on mainframes, but extracted from PC. Therefore one of requests - ability
of OLAP products to operate in client - server environment. Main idea here
is that OLAP tool server component should be intelligent enough and can
build the common conceptual scheme based on generalization and consolidations
of various logical and physical schemes of corporate databases. |
| Generic Dimensionality |
All measures should be equivalent. Additional performances may be given
to separate measures, but as all of them are symmetric, the given additional
functionality may be given to any measure. Base data structure, formulas
and report formats should not base on any one measurement and should not
be displaced aside to any measure. Each measure should be applied irrespectively
to its structure and operational abilities. Additional operational abilities
may be granted to any selected measure, and as measures are symmetric,
any function may be given to any measure. |
| Dynamic Sparse Matrix Handling |
OLAP tool should guarantee optimal processing of the sparse matrixes.
Access speed should be saved without dependence from data cells layout
and to be a constant for the models having different number of measures
and different data sparse. |
| Multi-User Support |
Frequently some analysts have the necessity to work simultaneously
with one analytical model or to create various models based on the same
data. OLAP tool should grant them competitive access, guarantee integrity
and data protection. |
| Unrestricted Cross-dimensional Operations |
Data calculation and manipulation on any number of measures should
not prohibit or limit any ratios among data cells. The conversions requiring
arbitrary definition, should be set in functionally complete formula language. |
| Intuitive Data Manipulation |
Directions consolidation, detailing data in columns and rows, aggregation
and other data manipulations inherent to hierarchy structure , should be
executed in maximum convenient, natural and comfortable user interface. |
| Flexible Reporting |
Various methods of data visualization should be supported, other word
reports should be presented in any possible orientation. |
| Unlimited Dimensions and Aggregation Levels |
Strongly recommended, that each serious OLAP tool should have a minimum
of 15 (better more than 20 measures in analytical model. Moreover, each
of these measures should admit practically unlimited amount of aggregation
levels, defined by user, on any direction of consolidation. |
It is necessray to consider this set of requests, being the actual definition of OLAP (by E. F. Kodd),
as recommendations, and evaluate concrete
OLAP-products on degree of approximation to correspondence to all above
12 rules.
This completes the first part (from a series) of the articles
concerning OLAP technology -- technologies of data analysis, the "platform"
which allows modern organizations in the modern world to build the business-tactics
to reveal market evolution tendencies and to find new solutions stipulating
successful development in conditions of competition.
Later, in my future articles you can find a description of utilities
/ tools, related to OLAP technology:
-
Data transformation tools;
-
Data analysis tools;
-
OLAP-administration tools.
Ideally, in my future articles, I would like to proceed from the theoretical
description of OLAP to a more practical presentation and the technical
application to a solution of those practical tasks which are put before us in our
our businesses. I hope to make it with your help and your advice.
See All Articles by Columnist Alexzander Nepomnjashiy