There has always been a need for some IT users to access data from multiple operational systems across the enterprise. Central accounting required data from multiple financial applications, just-in-time shipping systems needed access to data from receiving sites as well as warehouse stock and re-stock databases, and audit systems touched almost everything. In addition, as companies grew and merged, some user departments created their own IT solutions for a variety of reasons.
In a process called data integration, companies developed centralized IT solutions that presented data from across the enterprise in a single application. Users of the application perceived the data elements to be in a single location, accessible through a common interface. IT usually accomplished this by creating processes that extracted data from the separate sites, performed the required cleansing and transformations, and loaded the data to a single physical location. However, with the advent of big data applications, there is now too much data across the enterprise to transport and store in a single place. Data integration had to be redefined so that separate data stores could be accessed in place and in real time.
Initial Attempts at Data Integration
In large companies it became common for separate departments to develop their own information technology (IT) solutions. These ranged from simple, stand-alone solutions to data marts with business analytics capabilities to entire operational systems. Sometimes they resulted from mergers and acquisitions, as sites from the merged companies were allowed to run on their own rather than consolidate into single sites. Sometimes they arose because departments perceived the central IT service as too slow or unreceptive to their needs.
The concept of data integration was born from the need to speed development of applications that required access to data elements across multiple platforms. Central IT departments developed a single interface to the disparate data, providing a common mechanism for accessing data elements on different hardware platforms and effectively hiding data location from the user. Users and analysts could then work with data from multiple sites without needing to know where it lived or how it was formatted.
The idea of accessing data that resides in several locations from a central platform or application has several obvious advantages. Users perceive the data as being in a single location accessible by a common interface. The interface hides the complexities of accessing data stored in different files or databases from different vendors, in differing formats and with potentially different encoding schemes (e.g. ASCII and EBCDIC).
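To make the idea concrete, here is a minimal Python sketch of such an interface. The IntegrationLayer and DictSource classes are hypothetical stand-ins for a real integration product and its remote data sources; the sketch only illustrates how a catalog can hide data location from the caller.

    class IntegrationLayer:
        def __init__(self):
            self._catalog = {}  # logical table name -> (source, physical name)

        def register(self, logical_name, source, physical_name):
            self._catalog[logical_name] = (source, physical_name)

        def fetch(self, logical_name, columns):
            # The caller names a logical table; the layer resolves its location.
            source, physical_name = self._catalog[logical_name]
            return source.read(physical_name, columns)

    class DictSource:
        """Stand-in for a remote file or database on another platform."""
        def __init__(self, tables):
            self._tables = tables

        def read(self, name, columns):
            return [{c: row[c] for c in columns} for row in self._tables[name]]

    # The analyst queries "customers" without knowing the data lives at site B.
    site_a = DictSource({"orders": [{"order_id": 1, "cust_id": 7, "amount": 19.95}]})
    site_b = DictSource({"cust_master": [{"cust_id": 7, "name": "Acme Ltd."}]})

    layer = IntegrationLayer()
    layer.register("orders", site_a, "orders")
    layer.register("customers", site_b, "cust_master")

    print(layer.fetch("customers", ["cust_id", "name"]))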
However, there are several disadvantages and costs inherent in constructing and managing a data integration application. Performance is the biggest concern. Consider an SQL query that joins two tables on different hardware platforms. Although the join itself may produce a small result set, it may be necessary to transport each of the joined tables to a central location, transform all the data appropriately, and then perform the join. The next SQL query may cause the same operations to occur again.
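Reusing the hypothetical IntegrationLayer and sources from the sketch above, the following illustrates that naive behavior: both tables are transported in full to the central site before the join runs, even though the final result set is small.

    def naive_distributed_join(layer, left, right, key, left_cols, right_cols):
        left_rows = layer.fetch(left, left_cols)      # full transport of table 1
        right_rows = layer.fetch(right, right_cols)   # full transport of table 2
        by_key = {r[key]: r for r in right_rows}      # join performed centrally
        return [{**l, **by_key[l[key]]} for l in left_rows if l[key] in by_key]

    # Every query that repeats this join pays the full transport cost again.
    print(naive_distributed_join(layer, "orders", "customers", "cust_id",
                                 ["order_id", "cust_id", "amount"], ["cust_id", "name"]))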
Another consideration is security. Data on different hardware platforms or managed by different operating systems may be controlled by multiple security packages with differing data permissions. Some security permissions may exist at multiple levels (i.e., permissions may be granted to an entire table, only specified rows, or only specific columns). Users querying data in multiple locations may require a complete security profile in each location.
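A toy model of those permission levels might look like the following. The profile structure with table-, row-, and column-level grants is an illustrative assumption; real security packages are far more involved.

    PROFILE = {
        "user": "analyst07",
        "grants": {
            "site_a.transactions": {
                "columns": {"acct_no", "txn_date", "amount"},         # column-level grant
                "row_filter": lambda row: row.get("region") == "EU",  # row-level grant
            },
            # A table with no entry here is not accessible at all (table-level control).
        },
    }

    def allowed(profile, table, row, column):
        grant = profile["grants"].get(table)
        return bool(grant) and column in grant["columns"] and grant["row_filter"](row)

    print(allowed(PROFILE, "site_a.transactions", {"region": "EU"}, "amount"))  # True
    print(allowed(PROFILE, "site_a.transactions", {"region": "US"}, "amount"))  # False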
One final issue is storage costs for redundant data. In the case of the multi-location table join mentioned above, performance concerns may dictate that the joined tables be sorted at their source location prior to transport to the central location. This means that each data location must provide storage for sorted files and workspace used during sorting, rather than maintaining a single pool of sort-related storage at the central location. Similarly, for locations that encode data differently (ASCII versus EBCDIC), at least one location must provide storage for converting data to the common encoding scheme to be used by the integration application.
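As a small illustration of the encoding work involved, the following uses Python's built-in EBCDIC code page cp037 on a made-up record; a real conversion step would stage many such records on disk.

    # A made-up fixed-width record as it might arrive from an EBCDIC site.
    record = "ACCT0000012345 2024-05-01 000019995"
    ebcdic_bytes = record.encode("cp037")     # EBCDIC form received from the source
    converted = ebcdic_bytes.decode("cp037")  # staged copy in the common encoding
    assert converted == record
    print(len(ebcdic_bytes), "bytes of staging storage used for this one record")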
Enter the Data Warehouse
The enterprise data warehouse was one method of dealing with these concerns. On a regular basis, operational data was extracted, transformed, and loaded into the warehouse, a process called ETL. The ETL jobs obtained operational data under the appropriate security permissions, cleansed it, performed any required data or encoding conversions, then sorted and loaded it into the warehouse.
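A minimal sketch of such an ETL job follows, using in-memory SQLite databases to stand in for the operational source and the warehouse. The table names and cleaning rules are illustrative assumptions, not any particular product's design.

    import sqlite3
    from datetime import date

    def extract(conn):
        # Extract the day's operational rows under the appropriate read permissions.
        return conn.execute(
            "SELECT acct_no, txn_date, amount FROM transactions WHERE txn_date = ?",
            (date.today().isoformat(),),
        ).fetchall()

    def transform(rows):
        # Clean, convert, and sort the extracted data before the load.
        cleaned = []
        for acct_no, txn_date, amount in rows:
            if amount is None:                        # cleaning: drop incomplete records
                continue
            cleaned.append((str(acct_no).zfill(10),   # conversion: fixed-width account key
                            txn_date,
                            round(float(amount), 2)))
        return sorted(cleaned)

    def load(warehouse, rows):
        warehouse.executemany(
            "INSERT INTO fact_transactions (acct_no, txn_date, amount) VALUES (?, ?, ?)",
            rows,
        )
        warehouse.commit()

    # In-memory databases stand in for the operational source and the warehouse.
    src = sqlite3.connect(":memory:")
    src.execute("CREATE TABLE transactions (acct_no, txn_date, amount)")
    src.execute("INSERT INTO transactions VALUES (42, ?, 19.995)", (date.today().isoformat(),))
    wh = sqlite3.connect(":memory:")
    wh.execute("CREATE TABLE fact_transactions (acct_no, txn_date, amount)")
    load(wh, transform(extract(src)))
    print(wh.execute("SELECT * FROM fact_transactions").fetchall())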
The ETL process typically ran outside the business day so that it would not interfere with the performance of operational systems. In addition, warehouse loads could sometimes take hours. Warehouse data was therefore historical, not current. While historical data opened the door to business analytics queries that look for trends and correlations over time, the lack of access to current operational data meant that the data warehouse was not an optimal data integration solution.
Next Came Big Data
The first big data applications were essentially standalone. Data was stored in a proprietary format in a hybrid hardware/software configuration called an appliance, and business analytics queries accessed only the data in the appliance. As big data applications matured, options for expansion, performance tuning, and more complex queries became available, depending upon the appliance vendor and the database management system. This opened the big data solution to possibilities beyond analytics. Could one store operational data (or copies of it) in the appliance? Indeed, some companies have greatly expanded their big data applications to include relatively current versions of operational systems data, either loaded from extracts or replicated in real time from the source. This was possible because of the huge storage capacity of appliances, sometimes measured in thousands of terabytes or more.
Regrettably, big data applications do not provide a complete data integration solution.
First, the proprietary storage mechanism typically stores data column-wise rather than row-wise. That is, all of the table's values for column A occur at the beginning of storage, followed by all the values for column B, and so forth. Combined with massive I/O parallelism, this layout makes business analytics queries perform extremely well; however, other types of queries pay a performance penalty.
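The following toy example contrasts the two layouts. Appliance internals are of course far more elaborate, but the trade-off is the same: aggregates touch only the columns they need, while reassembling a single record must visit every column.

    rows = [
        {"id": 1, "region": "EU", "amount": 10.0},
        {"id": 2, "region": "US", "amount": 25.0},
        {"id": 3, "region": "EU", "amount": 40.0},
    ]

    # Row-wise: each record's values are stored together.
    row_store = [(r["id"], r["region"], r["amount"]) for r in rows]

    # Column-wise: all values of column A first, then column B, and so on.
    column_store = {
        "id":     [r["id"] for r in rows],
        "region": [r["region"] for r in rows],
        "amount": [r["amount"] for r in rows],
    }

    # An analytics-style aggregate reads only the two columns it needs...
    eu_total = sum(amt for rgn, amt in zip(column_store["region"], column_store["amount"])
                   if rgn == "EU")

    # ...while fetching one complete record must reassemble it from every column.
    record_2 = {col: values[1] for col, values in column_store.items()}
    print(eu_total, record_2)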
Second, performance tuning options for appliances are limited. DBAs have many methods of tuning queries against tables in a database management system, including indexing, data partitioning, data clustering and query transformation. Few of these are possible in the appliance.
Third, storing data in a central appliance does not solve some of the problems mentioned previously about centralizing data, such as the additional costs of transporting and transforming operational data, security concerns and storage costs for redundant data.
The Business Needs for True Data Integration
Internal users now demand access to operational data in real time in order to make business decisions or to create value-added applications. Some new applications have critical functions or features that require this. For example, an online store may wish to show customers suggested products, commonly purchased items, or even the customer's own purchasing history. Financial applications might present options to a customer based upon their credit history or account mix. Medical applications could give practitioners up-to-date statistics on the effectiveness or risks of certain treatments. Auditors may need tools that review financial transactions for possible fraud.
This requirement for real-time access to operational data occurs in an IT environment where data volumes are growing ever larger and departments are implementing application solutions for which speed of delivery matters more than data integration. Any long-term data integration effort must therefore assume that corporate data files and databases will continue to exist across multiple hardware platforms in several locations. The need is to access all of these platforms, and the large volumes of data they hold, in real time and with acceptable performance.
Finally, all of these disparate applications, data files and elements must conform to the enterprise’s disaster recovery plan. If data is to be truly integrated, recovery from a disaster must incorporate recovery plans for each site and for the data integration application as well.
The Next Phase of Integration: Data Virtualization
We now recognize that corporate data will remain in place across diverse hardware and software environments, and that integration must be managed in a central application. In order to mitigate the costs and issues inherent in this architecture, a data virtualization function must utilize the following tools and techniques.
Enterprise data dictionary. This is the central requirement for a robust data virtualization system. As data are drawn from multiple applications and platforms, there will almost always be format and content issues. For example: an account number element may be stored as numeric or character data in different lengths; dates may include embedded slashes or dashes, with year, month and day in differing order; some data values for an element may have special meaning, such as a date of “00/00/0000”. The ability to query or join tables with these elements correctly requires a data dictionary.
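Here is a sketch of what one dictionary entry might standardize for the account number and date examples above. The field names, formats, and normalization rules are illustrative assumptions rather than any standard layout.

    from datetime import datetime

    DICTIONARY = {
        "account_number": {
            "canonical_type": "str",
            "length": 10,
            # Numeric or character sources normalize to one fixed-width string form.
            "normalize": lambda v: str(v).strip().zfill(10),
        },
        "open_date": {
            "canonical_type": "date",
            "source_formats": ["%m/%d/%Y", "%Y-%m-%d", "%d-%m-%Y"],
            # "00/00/0000" is a special value meaning "date unknown".
            "normalize": lambda v: None if v == "00/00/0000" else _parse_date(v),
        },
    }

    def _parse_date(value):
        for fmt in DICTIONARY["open_date"]["source_formats"]:
            try:
                return datetime.strptime(value, fmt).date()
            except ValueError:
                continue
        raise ValueError(f"unrecognized date format: {value!r}")

    def normalize(element, value):
        return DICTIONARY[element]["normalize"](value)

    # Two sites can now be joined on account number despite differing source formats.
    print(normalize("account_number", 12345), normalize("open_date", "05/01/2024"))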
Query optimization. As noted earlier, one of the performance issues associated with multi-site data is that table joins may require pre-processing such as sorting and transporting entire tables to a central site. A data virtualization system should couple sophisticated query optimization with high-performance direct data access in order to minimize these operations. For example, the query optimizer, noting that only certain columns of a table are referenced by the query, could request transport of only those columns to the central site rather than entire rows. Another option is to ship a subset of the join keys to one site to pre-select which rows of the target table qualify for the join, limiting the rows returned.
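The following sketch simulates both techniques against a hypothetical RemoteSite stand-in, counting how many rows actually leave the remote location.

    class RemoteSite:
        """Hypothetical stand-in for a table at another location."""
        def __init__(self, rows):
            self.rows = rows
            self.rows_shipped = 0

        def select(self, columns, key=None, keys=None):
            # Any key filter is applied before transport, like a pushed-down predicate.
            matches = [r for r in self.rows if keys is None or r[key] in keys]
            self.rows_shipped += len(matches)
            return [{c: r[c] for c in columns} for r in matches]

    orders = RemoteSite([{"order_id": i, "cust_id": i % 5, "amount": 10.0 * i}
                         for i in range(100)])
    customers = RemoteSite([{"cust_id": i, "name": f"cust-{i}", "address": "..."}
                            for i in range(1000)])

    # Column pruning: only the columns the query references leave the orders site.
    o = orders.select(["order_id", "cust_id", "amount"])

    # Semi-join: ship the distinct join keys; only matching customer rows come back.
    c = customers.select(["cust_id", "name"], key="cust_id",
                         keys={row["cust_id"] for row in o})

    joined = [{**l, **r} for l in o for r in c if l["cust_id"] == r["cust_id"]]
    print(len(joined), "joined rows;", customers.rows_shipped,
          "customer rows shipped instead of 1000")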
Integration with data warehouse and big data applications. It might seem that the requirement for real-time access to operational data would exclude the warehouse and big data. After all, these repositories hold primarily historical data used for business analytics. However, both environments have morphed into mission-critical systems. Part of this is due to the way business analytics has changed over the years. Originally, analytical queries were ad hoc, “what if?” queries used to find trends or compare subsets of historical data. Over time, as queries proved more and more useful, some were implemented as regular reports, while others were embedded in applications used by internal users. The result was the integration of analytics into mission-critical business applications, which forced the warehouse and big data systems to become mission-critical as well, complete with their own disaster recovery solutions. Therefore, any data virtualization solution must include these two platforms as data sources.
The Future
Advanced data virtualization provides many opportunities for the organization. The ability to access and compare information across all or most of your operational systems gives you an advantage when determining how new laws and regulations will affect your operations. Auditing access and use of personal information is now easier. Finally, developers creating new applications have a better idea of what corporate data already exists to support their endeavor.
This future will not be free. Data virtualization requires sophisticated application program interfaces to disparate hardware and platforms, and the surge in data volume means you must plan for better and faster networks and data storage. There is also the need to know your data; building an enterprise data dictionary from scratch can be a daunting challenge! However, the payoff is enormous, as you will then have visibility into corporate data from a central platform, allowing you to better manage both data and applications.