The Problem
All over the world organizations store or have already accumulated during their activity large volumes of data in corporate databases. These data “sets” store major potential possibilities for extracting new, analytical information to help build the business tactics to reveal market evolution tendencies and to find new solutions stipulating successful development in conditions of competition. It is obvious that for some companies such analysis is an integral part of their daily activity, but most of them, at least here in Russia, are only starting to begin taking it seriously.
Historically, in Russia the solution for daily operating activity automation was extremely advanced (I mine online transaction processing systems, OLTP). These systems ensure registration and storage of the facts describing companies’ business processes. The base for such kinds of systems are relational database management systems (RDBMS).
The traditional approach is to use created OLTP-systems for support of accepted administrative solutions, i.e., attempts to construct report sets “under” the corporate database and usage obtained (after interpreting) reports directly for acceptance of strategic business solutions. Among the disadvantages of such a method of decision-making is that it’s necessary to mark the following (as a minimum):
- Small number of data used for support of crucial business-solutions;
- Similar process is very slow, as the process of query writing and interpreting them are durable. It takes many days when the chief solution may be necessary to make immediately;
- The problem of various spheres of activity. An expert in information technologies (person who prepares the report) and the chief think in different categories and as a corollary tend not to understand each other;
- Complexity of reports for understanding (an absence of visualization);
- Lack of time (for chief) to find “significant” numbers.
Actually, these problems are not a corollary of poor quality of the corporate OLTP-systems. The problem is in the difference between activity which is automated by the OLTP-systems and decision-making activity. Data stored in corporate database systems are simply the records about some events having a place — the facts, but not information in a general sense of this word.
E.F. Codd, the expert of technologies of management relational systems, said: “… although RDBMS systems are accessible for users, they never were considered as a resource giving power functions on synthesis, the analysis and consolidation (so called, multidimensional data analysis)”. The question is synthesis of the information, on converting the data of operating systems into the information and even in quality standards.
The Solution
Automation or computerization, as a rule, happens historically from “below to upwards”. The defined divisions, which occur in mass operations on data processing (product registration, the warehouse registration, dealing with suppliers and/or clients), are exposed to the process of target automation. This means that mass tasks will be made with the help of computers, which will replace operation with paper media (documents, forms) on the mode, in which operations are mirrored in databases, or (that more often and happens) the information will be duplicated both on paper and in databases.
By virtue of that (or to reflection in the database) is exposed to a computerization each time a small fragment of business processes, and it happens not always in coordination. Most databases are “badly coordinated” with one another. The phrase “badly coordinated” means that in various databases identical “real world” objects may be reflected differently (for example, Smith D. – at charge of wages and Don Smith – at obtaining material assets).
“Inside” corporate databases target transactions that are executed (for example, filing the table of credited invoice). Such systems usually are called online transaction processing systems (or OLTP). Target transactions change a state of databases and bring them into accord to a current state of that fragment of the real world which is simulated with the database. Thus, the main assignment of target databases are transaction processing.
Another big layer of the operations fulfilled with corporate databases is reports creation for the defined period of existence of the database. The form of these reports may vary and, as a rule, is determined by external rules (bookkeeping rules and laws) and internal rules (duty regulations, orders and instructions) within the framework of which company works.
The databases created during such automation, as a rule, are optimized for fast execution of necessary transactions and store the data for a small temporary period. Now it is possible to ask ourselves: “What processes have undergone automation? What operation of staff was automated and simplified?” The answer for this question is obvious: operation of performers which fulfill standard procedures are automated. Automation almost has not affected top-managers – the staff responsible for decision making.
Creation of data warehouses – “deckhouses” above corporate databases allow one to proceed to the next stage of business activity automation – creation the platform for decision making support. The main difference of activity on decision-making from performing daily, operational activity, from the point data used, is the need for comprehensive vision of processes for all diversity of parameters on which they depend, for various temporary intervals. It is possible to say that performer work with the data on happening processes whereas the information is necessary for decision making for managers. For creation of decision support systems it is necessary complete, noncontradictory, the information for various temporary intervals which may be as is generalized (sum or aggregated in another way), and detailed. These requests are defining at creation of data warehouses as bases for creation of decision support systems (DSS).
For transition from the process of automation business of processes to automation of decision-making process we use software products which can be divided into three categories:
- Reporting tools;
- OLAP tools;
- Data Mining tools.
Report generation tools are intended for data acquisition in tabular form, diagrams and (geographical) maps. This class of tools allows managers to inspect happening processes, having some of the fixed sights on metrics of these processes.
OLAP tools are aimed at a test of hypotheses — they allow one to find the data which confirm or refute the formulated administrative hypotheses. Hypotheses may be formulated as very much definitely (whether there is a dependence between falling arrived for the last half-year and rises of product cost), and is more indistinct (whether there are parameters which distinguish the division which has brought the greatest profit on all remaining company divisions). This information allows managers to change processes of firm for reaching definite purposes.
Data mining tools are intended for checking, finding and confirming hypotheses based on the existing data. These tools hardly depend on data domain and structure of input data. However the availability of similar tools are necessary in case of the big data volumes and a plenty of parameters, on which these data depend, as they allow to detect (or other words, to make visible, mapped) the facts and tendencies which are completely invisible at usual review of huge data arrays.
Specific data usage forms (against transaction processing “inside” OLTP databases) generate appropriate requirements to storage and data presentation models. Data warehouses and OLTP-databases run under different query types (or load templates). Corporate OLTP databases are as much as possible optimized for short transactions series execution with maximum performance. Target transactions combine not only data reading, but also data modification and new data insertion. In case of data warehouses we deal, at first, with data acquisition, a great bulk of queries to data warehouse are reading / sampling data. The most widespread model for data warehouses is the three-dimensional model or model of multidimensional cubes.
Attempts to construct analysis / decision making systems which would access directly to OLTP-databases usually appear unsuccessful:
- Analytical queries “compete” with online transactions, locking the data and causing shortage of database server resources;
- Operating data structure / scheme intended for effective support of short and often transactions (it is normalized as a minimum up to the third normal form) and by virtue of it:
- Too complicated for understanding by end-users;
- Does not ensure a necessary execution performance of analytical queries.
- As a rule in a company there are several operating databases. This database has various data structures, units of measurements, methods of encoding. For the end user (business – analyst) the task of construction of any summary query on several similar databases are practically insoluble.
Multidimensional data models allow to store measure values (for example, the sum of sale, an amount of the goods, an amount of shipped products) in the uniform object of the data with defining parameters or arguments; further we shall name these objects as measures (for example, number of the score, date of posting, number of the waybill, the name of the payer, the name and the type of the goods, a title of service). It is necessary to mark, that the data for storage are received from target online corporate databases. Data warehouses contain the information on the various sides of processes which might be mirrored in various target databases in the uniform object, the process of data multiplexing from various sources, or the process of consolidation is necessary.
Most measures (parameters on which measures depend) are inhomogeneous. For example, if we speak about the description of the goods, it is also possible to speak about some commodity section, the type of the goods. Thus, measure “product” may have a name hierarchy, for example “product type” – > “product kind” – > “product unit name”. At stage of data analysis and decision making often it is necessary to work not with the numerical values specific to concrete production; there is enough detailing a level of sort of production or even its(her) type. For support of fast access to this sum the information of repository, as a rule, contain this information already calculated. The process of calculation of the similar sums are we named – aggregation, the calculated values – aggregates.
Data warehouse structures noticeably differ from those used in OLTP-systems. The typical data structure used in warehouses are :
- Star schema;
- Snowflake schema.
These schemes for data warehouses are the same canonical, as third normal form for online transaction processing systems (OLTP).
In a basis of OLAP concept lays the principle of a multidimensional data presentation. “Father of the relational theory” doctor E.F. Codd has considered disadvantages of relational model, first having showed impossibility “to unite, view and parse the data viewed from multiplicity of measurements, that is the most understandable for corporate analysts a method”, and has defined common requirements to OLAP tools which are expanding functionality of relational DBMS and switching on the multidimensional analysis as one performance.
These 12 rules (according to Codd) which should satisfy the OLAP software are:
Multi-Dimensional Conceptual View | Business-analyst “sees the company world” multivariate and multi-dimensional, accordingly and conceptual data model representation in OLAP product should be multivariate and multi-dimensional on a nature, that allow analysts to fulfil intuitive operations: “slice and dice”, rotate and pivot directions of consolidation. |
Transparency | The user should not know what concrete resources are used for storage and data processing and how the data are organized. Without dependence from that, the OLAP-product a part of resources of the user is whether or not, this fact should be transparent for the user. If OLAP it is granted by client – server calculations this fact also, whenever possible, should be imperceptible for the user. OLAP should be granted in a context of open architecture, allowing the user where he was to communicate through the analytical tool with the server. In addition transparency should be achieved in interaction of the analytical tool with homogeneous / heterogeneous databases. |
Accessibility | Business analyst should have a possibility to analyze within the framework of the common conceptual scheme, thus the data may remain under the control of old, “inherited” DBMS, being thus pegged to common analytical model. So OLAP tool kit should superimpose its own logic scheme on physical data arrays, fulfilling all conversions required for support of an uniform, agreed and complete “user sight” on the information. |
Consistent Reporting Performance | With increasing of numbers of measures and database size analysts should not face with any decrease of productivity. Stable productivity is necessary for maintaining an usage simplicity which is required for finishing OLAP up to the end user. If the user – analyst will test essential distinctions in productivity according to number of measures then he will try to compensate these distinctions the strategy of development that will call data representation other ways, but not with what it is really necessary to present the data. Costs of time to bypass the system for compensation of its inadequacy is not what analytical products are intended for. |
Client-Server Architecture | Large data volumes, required operating analytical processing stored on mainframes, but extracted from PC. Therefore one of requests – ability of OLAP products to operate in client – server environment. Main idea here is that OLAP tool server component should be intelligent enough and can build the common conceptual scheme based on generalization and consolidations of various logical and physical schemes of corporate databases. |
Generic Dimensionality | All measures should be equivalent. Additional performances may be given to separate measures, but as all of them are symmetric, the given additional functionality may be given to any measure. Base data structure, formulas and report formats should not base on any one measurement and should not be displaced aside to any measure. Each measure should be applied irrespectively to its structure and operational abilities. Additional operational abilities may be granted to any selected measure, and as measures are symmetric, any function may be given to any measure. |
Dynamic Sparse Matrix Handling | OLAP tool should guarantee optimal processing of the sparse matrixes. Access speed should be saved without dependence from data cells layout and to be a constant for the models having different number of measures and different data sparse. |
Multi-User Support | Frequently some analysts have the necessity to work simultaneously with one analytical model or to create various models based on the same data. OLAP tool should grant them competitive access, guarantee integrity and data protection. |
Unrestricted Cross-dimensional Operations | Data calculation and manipulation on any number of measures should not prohibit or limit any ratios among data cells. The conversions requiring arbitrary definition, should be set in functionally complete formula language. |
Intuitive Data Manipulation | Directions consolidation, detailing data in columns and rows, aggregation and other data manipulations inherent to hierarchy structure , should be executed in maximum convenient, natural and comfortable user interface. |
Flexible Reporting | Various methods of data visualization should be supported, other word reports should be presented in any possible orientation. |
Unlimited Dimensions and Aggregation Levels | Strongly recommended, that each serious OLAP tool should have a minimum of 15 (better more than 20 measures in analytical model. Moreover, each of these measures should admit practically unlimited amount of aggregation levels, defined by user, on any direction of consolidation. |
It is necessray to consider this set of requests, being the actual definition of OLAP (by E. F. Kodd), as recommendations, and evaluate concrete OLAP-products on degree of approximation to correspondence to all above 12 rules.
This completes the first part (from a series) of the articles concerning OLAP technology — technologies of data analysis, the “platform” which allows modern organizations in the modern world to build the business-tactics to reveal market evolution tendencies and to find new solutions stipulating successful development in conditions of competition.
Later, in my future articles you can find a description of utilities / tools, related to OLAP technology:
- Data transformation tools;
- Data analysis tools;
- OLAP-administration tools.
Ideally, in my future articles, I would like to proceed from the theoretical description of OLAP to a more practical presentation and the technical application to a solution of those practical tasks which are put before us in our our businesses. I hope to make it with your help and your advice.