In today’s IT world, big data applications are common, and we have evolved some standard responses to their needs. One common reaction is that big data requires “scaling up”: more CPUs, more memory, more resources. Another is to reach for special-purpose hardware and software for storing and analyzing big data.
However, big data itself has evolved into a more complex combination of data variations, processing needs, and analytical requirements. We are faced with new and complex data types such as large objects (LOBs), self-describing data such as extensible markup language (XML), and multi-structured data types like images, audio, and web click streams. This is in addition to the expected high volumes and speeds.
The conclusion: big data today has scale-up and scale-out issues. Further, it often involves integration of dissimilar architectures. When we insist that we can deal with big data by simply scaling up to faster, special-purpose hardware, we are neglecting more fundamental issues.
Operational Environment
The typical IT enterprise evolved to its current state by using standards and best practices. These range from simple things like data naming conventions to more complex ones such as a well-maintained enterprise data model. New data-based implementations require best practices in organization, documentation, and governance. With new data and processes in the works, you must update documentation, standards, and best practices, and continue to improve quality.
Costs and benefits of new mainframe components typically involve software license charges. The IT organization will need to re-budget and perhaps even re-negotiate current licenses and lease agreements. As always, new hardware comes with its own requirements of power, footprint, and maintenance needs.
A big data implementation brings additional staff into the mix: experts on new analytics software, experts on special-purpose hardware, and others. Such experts are rare, so your organization must hire, rent, or outsource this work. How will they fit into your current organization? How will you train current staff to grow into these positions?
Start with the Source System
This is your core data from operational systems. Interestingly, many beginning big data implementations attempt to access this data directly (or at least to store it for analysis), bypassing the succeeding steps, because the new data sources have not yet been integrated into the IT architecture. Indeed, these data sources may be brand new or never before accessed.
Those who support the source data systems may not have the expertise to assist in analytics, while analytics experts may not understand the source data. And when analytics accesses production data directly, any testing or experimentation is done in a production environment.
Address these needs by involving source system experts from the beginning. Discuss and design how the new data will be compared, joined, and merged with current data. Insist that the new big data be defined and documented in an enterprise data dictionary, or at least in a data model. Discussions and meetings that span such dissimilar structures and needs cry out for visual aids, and good data models of the current and new data will bridge this gap.
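To make this concrete, here is a minimal sketch of what a dictionary entry for a new source might capture, written as a plain Python structure only because the format is easy to read; the feed name, fields, and rules are hypothetical, not drawn from any particular system.

    # Hypothetical data dictionary entry for a new big data source.
    # Every name and value below is an illustrative assumption.
    clickstream_feed = {
        "source_name": "web_clickstream",
        "owner": "E-commerce operations",            # the source system experts to involve
        "refresh_frequency": "hourly",
        "fields": {
            "session_id":  {"type": "string",    "joins_to": "orders.session_id"},
            "customer_id": {"type": "string",    "joins_to": "customer.customer_id"},
            "event_time":  {"type": "timestamp", "cleansing_rule": "treat all zeroes as missing"},
            "page_url":    {"type": "string"},
        },
    }

Even a lightweight entry like this records how the new data joins to current data, which is exactly the discussion the source system experts need to be part of.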
Analyze Data Movement
The data warehouse subsystems and processes that move data first access it from the source systems. Some data may require transformation or ‘cleansing’. Examples include missing data, or invalid data such as all zeroes in a field defined as a date. Some data must be gathered from multiple systems and merged, such as accounting data. Other data requires validation against other systems (for example, is an order for a valid customer?).
Data from external sources can be extremely problematic. Consider data from an external vendor that was gathered using web pages where numbers and dates were entered in free-form text fields. This opens the possibility of non-numeric characters in numeric data fields. How can you maximize the amount of data you process, while minimizing the issues with invalid fields? The usual answer is ‘cleansing’ logic that handles the majority of invalid fields using either calculation logic or assignment of default values.
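As a minimal sketch of such cleansing logic, assuming Python is used in the acquisition layer, the functions below handle a free-form numeric field and an all-zero date; the field formats and the default value are illustrative assumptions, not rules from any real system.

    from datetime import datetime

    DEFAULT_AMOUNT = 0.0  # assumed business default for an unparseable amount

    def cleanse_amount(raw):
        """Strip common free-form noise from a numeric field; fall back to a default."""
        cleaned = raw.strip().replace(",", "").replace("$", "")
        try:
            return float(cleaned)
        except ValueError:
            return DEFAULT_AMOUNT

    def cleanse_order_date(raw):
        """Treat an all-zero or malformed date as missing instead of rejecting the record."""
        digits = (raw or "").replace("-", "").replace("/", "")
        if not digits or set(digits) == {"0"}:
            return None
        for fmt in ("%Y-%m-%d", "%m/%d/%Y"):
            try:
                return datetime.strptime(raw, fmt).date()
            except ValueError:
                continue
        return None  # unrecognized format: leave for manual review downstream

With rules like these, cleanse_amount("$1,234.50") returns 1234.5 while cleanse_amount("n/a") falls back to the default, so the load keeps as many records as possible without letting invalid fields through.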
Even though it is sometimes semi-structured or multi-structured, data acquired in a big data implementation will still require transformation logic. The key to addressing this is good documentation of current logic. If rules exist for cleansing data in the current systems, that logic may be reused (with modifications) in data acquisition for the big data implementation. In the orders example given above, rules already exist for cleansing the order data; these may apply to fields in the new data as well.
Review Data Storage for Analytics
This is the final point, the destination where all data is delivered. From here, we get direct access to data for analysis, perhaps by approved query tools. Some subsets of data may be loaded into data marts, while others may be extracted and sent to internal users for local analysis. Some implementations include publish-and-subscribe features or even replication of data to external sources.
Coordination between current processes and the big data processes is required. IT support staff will have to investigate whether options exist to make the new data available for early use. It may also be possible to load current data and the corresponding big data tables in parallel. Delays in loading data will affect the accuracy and availability of analytics; how much delay is acceptable is a business decision, and it will differ from implementation to implementation.
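As a hedged illustration of parallel loading, the sketch below simply runs two placeholder load routines concurrently; load_warehouse_tables and load_appliance_tables stand in for your own jobs, and a real implementation would use the scheduler and load utilities of the platforms involved.

    from concurrent.futures import ThreadPoolExecutor

    def load_warehouse_tables(batch_id):
        # Placeholder for the existing data warehouse load job.
        print(f"warehouse load complete for batch {batch_id}")

    def load_appliance_tables(batch_id):
        # Placeholder for the corresponding big data appliance load.
        print(f"appliance load complete for batch {batch_id}")

    def load_in_parallel(batch_id):
        """Run both loads concurrently so analysts see consistent data sooner."""
        with ThreadPoolExecutor(max_workers=2) as pool:
            futures = [pool.submit(load_warehouse_tables, batch_id),
                       pool.submit(load_appliance_tables, batch_id)]
            for future in futures:
                future.result()  # surface any load failure before publishing the batch

    load_in_parallel("2024-06-30")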
Some Additional Issues
Left out of our discussion above are issues that pervade all environments.
Test environments. You will typically have both production and test environments, and perhaps even more (variously called development, user acceptance, training, and so on). A big data implementation will require multiple environments as well, in order to load test data, develop and tune analytics queries, and measure performance.
Staffing requirements. New hardware and software require new staff types and expertise. This has happened several times during implementation of ERP packages (SAP, PeopleSoft), database management systems (DB2, DB2 LUW, Oracle), and specialized hardware (Teradata, Netezza). Special staffing usually means that expertise is rare or not available, at least immediately. For the big data implementation consider the need for ‘data scientists’ who will have experience in modeling techniques, knowledge of relevant programming languages, and expertise in the business subject area. Count on additional time and budget to acquire the expertise you need to succeed.
Business Data Consumer Communities
Most business data consumers fall into one of three categories:
Technical users running direct queries. These users create their own queries against data tables using structured query language (SQL). They then use an online SQL execution tool to run the query and produce the results in raw form, which they can observe directly or download to a spreadsheet program for further analysis. These users know the data tables, have SQL expertise, and use simple tools to refine the results. (A sketch of this workflow follows this list.)
Sophisticated report analysts. These consumers typically use a sophisticated reporting tool that displays a pictorial data model of the data. They then manipulate the model by dragging and dropping tables and columns to a report window. The tool then creates the appropriate SQL statements based on the model and other parameters, executes the query, and displays the results. These users know the data, usually do not have SQL expertise, and require some advanced query and statistical reporting techniques.
Data mart consumers. These users have their own highly-specialized business data analytical software. They directly extract business data from the source and store it on a local server. They then use special-purpose software to analyze the data.
Any big data solution must take these communities into account.
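For the first category, the sketch below shows the direct-query workflow end to end, using Python’s built-in sqlite3 module only as a stand-in for whatever SQL execution tool the warehouse actually offers; the orders table, its columns, and the sample rows are invented for illustration.

    import csv
    import sqlite3

    # Stand-in database with sample data so the sketch runs on its own;
    # a real technical user would connect to the warehouse instead.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (region TEXT, order_total REAL)")
    conn.executemany("INSERT INTO orders VALUES (?, ?)",
                     [("East", 120.0), ("West", 250.0), ("East", 80.0)])

    query = """
        SELECT region, COUNT(*) AS order_count, SUM(order_total) AS revenue
        FROM orders
        GROUP BY region
        ORDER BY revenue DESC
    """
    rows = conn.execute(query).fetchall()

    # Download the raw results to a CSV file for further analysis in a spreadsheet.
    with open("orders_by_region.csv", "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["region", "order_count", "revenue"])
        writer.writerows(rows)

    conn.close()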
The Disaster Recovery Problem
Most data warehouses are used for analysis and reporting, not for processing business data such as customer transactions. A big data appliance is usually attached to the data warehouse, and so is not normally considered something required at the disaster recovery (DR) site. However, it may become so! Consider the following scenario:
1. Your company implements a big data appliance;
2. Business analysts and users begin querying the data;
3. Many queries produce results that lead to lower costs or better pricing;
4. Queries run quickly, so many analysts begin executing many queries;
5. As more queries produce actionable results, management sees that they have value;
6. One-time queries begin running weekly; some queries become daily reports;
7. The number of valuable, daily reports results in management designating the big data solution and analysis “mission-critical”.
Surprise! IT is suddenly informed that this large data store must be available if a disaster occurs.
To prepare for this happening in your enterprise, review the storage needs, network capacity, hardware capabilities and capacity, and software license requirements at the beginning of your implementation. Have this data published and available to management before the solution becomes critical. This allows your enterprise to budget and plan for its needs in advance.
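As a back-of-envelope illustration of that planning data, the short calculation below projects the data volume and estimates how long an initial copy to a DR site would take; the current volume, growth rate, and bandwidth are assumptions to be replaced with your own figures.

    # Hypothetical capacity-planning inputs; replace with your own measurements.
    current_data_tb = 80            # data on the appliance today, in terabytes
    annual_growth_rate = 0.40       # assumed 40% yearly growth
    planning_horizon_years = 3
    usable_bandwidth_gbps = 2.0     # effective throughput to the DR site

    projected_tb = current_data_tb * (1 + annual_growth_rate) ** planning_horizon_years

    seconds_per_tb = (8 * 1024) / usable_bandwidth_gbps   # approx. 8 * 1024 gigabits per TB
    initial_copy_hours = projected_tb * seconds_per_tb / 3600

    print(f"Projected volume in {planning_horizon_years} years: {projected_tb:.0f} TB")
    print(f"Initial DR copy at {usable_bandwidth_gbps} Gbps: about {initial_copy_hours:.0f} hours")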
Summary
Most advanced analytics solutions access large data stores, commonly called big data. High-speed appliances will assist business users by significantly shortening query times; however, the best architectural solution requires the appliance to be a part of your data warehouse.
Integrating big data appliance solutions into a data warehouse requires preparation and forethought. DBAs and business data consumers must work together both to address the implementation issues above and to meet the needs of multiple business data consumers.