Most IT enterprises have installed big data applications and have realized at least some actionable intelligence. Analysis of customer transactions led to better product marketing, and deep analysis of product sales across geographies led to better shipping, marketing, and sales decisions, all with a measurable increase in profits.
What is next for big data? Some experts claim that data “volumes, velocity, variety and veracity” will only increase over time, requiring more data storage, faster machines and more sophisticated analysis tools. However, this view is short-sighted: it does not take into account how data degrades over time. Analysis of historical data will always be with us, but the most useful analyses will be generated from data we already have. To adapt, most organizations must grow and mature their analytical environments. Here are the steps they must take to prepare for the transition.
Big Data Comes of Age
Big data applications have matured in the last several years. Early implementations focused on customer and product analytics, with the multiple goals of providing customers with desired products, shortening product delivery times, and increasing profits. Analysts used rudimentary statistical techniques to derive relationships and trends, and passed the results on to marketing, production and shipping as actionable intelligence.
The next phase of big data is now here. Applications and analysts must now create value in two ways: by automating current analytical results to feed back into operational systems and by turning to analysis of unstructured and unmodeled data.
Another aspect of the current phase is the purging of old, stale data. As data ages it tends to become less relevant for the following reasons:
- Newly implemented operational applications will not have a data history;
- Older products are removed and replaced by new products;
- Older customers may no longer exist;
- As you apply maintenance to current operational systems, some analyses of ‘old’ behavior become irrelevant;
- Older data tends to be less accurate and sometimes is missing altogether; as operational systems are adjusted to fix these problems, inaccurate or missing historical data will skew analyses.
More Data? Less Data?
Even as more and more data becomes available, more and more data is being purged. (While archiving data is also a possibility, archives of operational data tend to be created for legal retention or compliance reasons. Archiving of the data in the big data applications, which is ostensibly a series of copies of operational data, is rarely required.)
The result is that, after initial implementations of big data applications have reached the critical mass of data volume that allows for meaningful analyses, additional increases in data size have less impact on results. For example, it is common to retain a three to five year sample of customer purchase transactions. However, a ten year collection will have less value, due to market changes, changes in customer buying habits and discontinued products.
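As a minimal sketch of the retention idea described above, the following Python filters a transaction list down to a fixed window. The five-year default, the `purchased_on` field name and the sample records are illustrative assumptions, not details of any particular system.

```python
from datetime import date

# Hypothetical retention window; three to five years is described above as common.
RETENTION_YEARS = 5


def purge_stale(transactions, today, retention_years=RETENTION_YEARS):
    """Return only the transactions that fall inside the retention window.

    Each transaction is a dict with a 'purchased_on' date (an assumed field name).
    """
    # Cutoff is the same month and day, retention_years earlier.
    # Feb 29 maps to Feb 28 when the cutoff year is not a leap year.
    try:
        cutoff = today.replace(year=today.year - retention_years)
    except ValueError:
        cutoff = today.replace(year=today.year - retention_years, day=28)
    return [t for t in transactions if t["purchased_on"] >= cutoff]


# Illustrative sample data only.
transactions = [
    {"id": 1, "purchased_on": date(2012, 3, 14)},
    {"id": 2, "purchased_on": date(2019, 7, 1)},
    {"id": 3, "purchased_on": date(2021, 11, 30)},
]

kept = purge_stale(transactions, today=date(2022, 6, 1))
print([t["id"] for t in kept])
```

In a real environment the same rule would more likely be expressed as a scheduled delete or partition drop in the database itself; the point here is only that the retention window is an explicit, tunable parameter.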
Meanwhile, other forces are at work that profoundly change the landscape of big data applications and business intelligence. Sources such as XML documents, web click streams, machine sensor logs and telephone response unit recordings have ushered in a new set of data types. Each of these data types may require a separate hardware platform for storage, and special software for access and analysis.
In the remainder of this article we review the differing data types available to the enterprise, and how IT enterprise resource management strategy must change to fit the new landscape.
Classic Big Data
Many of the first big data applications were installed and integrated with the enterprise data warehouse (EDW). The main reason is that the EDW already contains dimension tables such as time, sales region, product type, customer profile and organizational hierarchies. These tables are joined with other fact tables in the warehouse as well as the big data tables in order to aggregate based on the dimensions. For example, a big data database of customer transactions could be analyzed by sales region and product type to determine which products are selling the best in which regions.
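The join-and-aggregate pattern described above can be sketched with Python's built-in `sqlite3` module. The table names, column names and sample rows below are hypothetical stand-ins for EDW dimension tables and a big data transaction table; a real warehouse would use its own schema and engine.

```python
import sqlite3

# In-memory database standing in for the warehouse (illustration only).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dim_region  (region_id INTEGER PRIMARY KEY, region_name TEXT);
    CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, product_type TEXT);
    CREATE TABLE fact_sales  (region_id INTEGER, product_id INTEGER, amount REAL);

    INSERT INTO dim_region  VALUES (1, 'East'), (2, 'West');
    INSERT INTO dim_product VALUES (10, 'Tools'), (20, 'Garden');
    INSERT INTO fact_sales  VALUES
        (1, 10, 100.0), (1, 10, 50.0), (1, 20, 30.0),
        (2, 10, 20.0),  (2, 20, 200.0);
""")

# Join the transaction table to both dimensions, then aggregate by
# sales region and product type to see what sells best where.
rows = conn.execute("""
    SELECT r.region_name, p.product_type, SUM(f.amount) AS total
    FROM fact_sales AS f
    JOIN dim_region  AS r ON r.region_id  = f.region_id
    JOIN dim_product AS p ON p.product_id = f.product_id
    GROUP BY r.region_name, p.product_type
    ORDER BY r.region_name, total DESC
""").fetchall()

for region, product_type, total in rows:
    print(region, product_type, total)
```

Because the dimension tables are small and shared, the same two joins can be reused across many different transaction tables, which is one reason early big data applications were placed alongside the EDW.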
These big data applications usually held structured data that the enterprise already understood. Data elements already existed in the enterprise data model, since they originated from current operational systems. Analysts already understood the data and data types, and already had experience querying data in the EDW. Consequently, simple analytical tools and queries generated usable results and reports.
Now, this environment has become commonplace. What were once ad hoc queries to investigate possible trends and relationships are now regular reports. Data types have remained relatively constant, while data volume has expanded. At the same time, purging stale data has slowed the growth of the big data application. While at one time the big data database was growing exponentially (perhaps doubling in size every year), now the growth is stabilizing and has become linear.
Analysis of these data is also stabilizing. Most of the big value trends and relationships have already been found. Additional queries and investigations provide less and less actionable intelligence.
How should IT strategy change? For this environment, plan for linear increases in data storage media, memory and CPU power. Forgo upgrades or improvements in analytical tools. Concentrate on ruthlessly purging stale or unused data. Save resources for the new phase of big data.
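To illustrate why the shift from doubling to linear growth matters for capacity planning, here is a small sketch that projects storage under both assumptions. The starting size and yearly increment are made-up figures chosen only for illustration.

```python
def project_linear(current_tb, yearly_increment_tb, years):
    """Projected size per year if a fixed amount is added each year."""
    return [current_tb + yearly_increment_tb * y for y in range(1, years + 1)]


def project_doubling(current_tb, years):
    """Projected size per year if the database doubles every year."""
    return [current_tb * 2 ** y for y in range(1, years + 1)]


# Hypothetical figures: 100 TB today, 20 TB added per year once growth is linear.
linear = project_linear(100, 20, 3)
doubling = project_doubling(100, 3)

for year, (lin, dbl) in enumerate(zip(linear, doubling), start=1):
    print(f"year {year}: linear {lin} TB, doubling {dbl} TB")
```

Under these assumed numbers the two projections diverge sharply within a few years, which is why a stabilized environment can be budgeted with far more modest increases in storage, memory and CPU.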
Analyzing Unstructured or Unmodeled Data
Many new operational systems exist not in an IT center but adjacent to the customer. These include store kiosks for price checking, on-line product ordering and shipping, smart telephone customer interfaces and other internet-based customer applications. The result is a steady stream of data: voice responses, mouse and keyboard clicks, kiosk usage logs and other new data types.
These data are the latest places for analysts to mine for actionable intelligence. On-line orders indicate which products are most profitable by customer type, timing, and geography. Telephone voice responses indicate customers’ most frequent questions. Even knowing what product catalog pages a customer visits on-line may signify customer product interest. New ways of analysis are required because this explosion of data creates information about customer preferences, product reputation, and even intelligence about competitive products.
The IT challenge of these data is that they may require multiple hardware and software platforms to decode, store and analyze. Part of the new IT infrastructure must check the data for errors, apply data transformations, and then store the data in a form compatible with the analysis software. Each new instance of this form of big data may require implementing multiple decision steps for processing.
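The parse, validate, transform and store steps described above might look roughly like the following for a web click stream. The record layout (timestamp, customer id, page path separated by commas) and the validation rules are assumptions made for this sketch; real click logs vary widely.

```python
from datetime import datetime


def parse_click(line):
    """Parse one raw click record; return a dict, or None if it is malformed.

    Assumed layout: ISO-style timestamp, customer id, page path.
    """
    parts = line.strip().split(",")
    if len(parts) != 3:
        return None
    raw_ts, customer_id, page = (p.strip() for p in parts)
    try:
        ts = datetime.strptime(raw_ts, "%Y-%m-%dT%H:%M:%S")
    except ValueError:
        return None  # unreadable timestamp
    if not customer_id or not page.startswith("/"):
        return None  # missing customer or invalid page path
    # Transformation step: normalize the page path to lower case.
    return {"ts": ts, "customer_id": customer_id, "page": page.lower()}


def load(lines):
    """Split raw lines into clean records and rejected lines."""
    clean, rejected = [], []
    for line in lines:
        record = parse_click(line)
        if record is None:
            rejected.append(line)
        else:
            clean.append(record)
    return clean, rejected


# Illustrative input only.
raw = [
    "2023-05-01T10:15:00,C001,/Catalog/Tools",
    "not-a-timestamp,C002,/catalog/garden",
    "2023-05-01T10:16:30,,/catalog/garden",
    "2023-05-01T10:17:45,C003,/catalog/garden",
]
clean, rejected = load(raw)
print(len(clean), "clean,", len(rejected), "rejected")
```

Keeping the rejected lines rather than discarding them lets staff review why records failed, which helps when the decision steps need to be adjusted for a new data source.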
IT strategy must first concentrate on implementation of each big data case so that the initial analyses can produce value. This is necessary not only to justify the hardware and software expense, but also to ensure that IT staff and analysts get experience in installation, configuration, maintenance and analytical support. This will be necessary for the next phase of big data, federation.
Options for Unstructured Data
IT alternatives for implementing these new big data applications include the following:
- A big data appliance. These are the bread and butter of current big data applications. They consist of a hybrid hardware and software system that stores data across hundreds or thousands of disk drives. Analytical queries against the data are split into multiple queries, one for each disk. The queries are then run in parallel, with the software combining the many results into a single one. This massive parallelism is what makes these appliances execute queries so quickly. This solution is best for structured data that does not change quickly.
- Apache Hadoop. The Hadoop solution involves multiple computer nodes (potentially thousands) in a network. All the nodes participate in data extraction, storage, and retrieval, and also combine their efforts for query processing. Hadoop includes a distributed file system (HDFS) that allows the nodes to be handled from a central point. This solution is best used when analysts must execute queries in waves, with initial queries used to explore possibilities and the results saved for further query and analysis.
- NoSQL databases. In this setting, these systems (graph databases in particular) are used primarily for an analytical technique known as graph analysis. Graph analysis is a method of linking data elements and relationships to allow analysis of networked data. Examples include social networks or comparisons of multiple customer transactions across multiple time periods and geographies. This solution requires an experienced IT staff and highly trained business analysts.
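The split-and-combine query processing described for the big data appliance can be sketched as a scatter-gather over data partitions. In the sketch below each in-memory list stands in for the data on one disk and a thread pool stands in for the appliance's parallel hardware; the function names and sample data are hypothetical, and Python threads will not reproduce an appliance's actual speedup.

```python
from concurrent.futures import ThreadPoolExecutor


def partial_total(partition, product_type):
    """Run the query against one partition (one 'disk') only."""
    return sum(amount for ptype, amount in partition if ptype == product_type)


def parallel_total(partitions, product_type):
    """Scatter the query to every partition, then gather and combine the results."""
    workers = max(1, len(partitions))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = pool.map(
            lambda part: partial_total(part, product_type), partitions
        )
        # Combining step: the partial sums are merged into a single answer.
        return sum(partials)


# Illustrative data: (product type, sale amount) rows spread across three partitions.
partitions = [
    [("tools", 100.0), ("garden", 30.0)],
    [("tools", 50.0)],
    [("garden", 200.0), ("tools", 20.0)],
]

print(parallel_total(partitions, "tools"))
```

The approach works because a sum can be computed per partition and then merged; queries whose results cannot be combined this simply need a more elaborate gather step.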
As noted above, your IT infrastructure will probably contain multiple instances of these alternatives. This is in addition to current big data applications and your enterprise data warehouse.
IT Strategy for the Future
Each of your big data solutions will require a specific hardware and software solution that is initially customized for certain analysts. Each environment will require a separate set of IT staff, from database administrators to capacity planners to data scientists.
As each environment grows, the data stores will begin to overlap. This begins with the data warehouse dimension tables, as these provide the most common aggregation and summarization points. For performance reasons, these tables will be copied or replicated across your big data environments.
The next set of data to overlap will be the fact and transaction tables containing your products and customers. Each analytical environment handles queries against its big data tables, but the results are then usually joined back to the EDW fact tables in order to generate actionable intelligence. For example, your Hadoop environment may predict what product mix will be most profitable in certain geographies, but these results need to be joined to customer fact tables in order to market to certain customer types.
As overlap progresses, one thing is clear: IT strategy must take into account such current and future overlap when providing options to business analysts. Which platform or solution is best for analyzing customer preferences? Which solutions have relevant data for which analyses?
The answer for the future is data federation. This involves combining data and processing solutions in both a virtual and physical way. You need to provide your business analysts a single portal through which all solutions are available. The portal should allow the analyst to see a picture of all the data available for analysis without regard for physical platform. It should also assist the analyst in constructing ad hoc queries against the data, route the queries to the appropriate platform, and return the results to the user.
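A federation portal of the kind described above needs, at minimum, a catalog of which data lives on which platform and a way to route each query accordingly. The following is a minimal sketch under that assumption; the `Portal` class, its method names and the stub back ends are all hypothetical and do not correspond to any vendor product.

```python
class Portal:
    """Single entry point that hides which platform holds which data set."""

    def __init__(self):
        self._platforms = {}  # platform name -> callable(dataset, query)
        self._catalog = {}    # data set name -> platform name

    def register_platform(self, name, run_query):
        self._platforms[name] = run_query

    def register_dataset(self, dataset, platform):
        if platform not in self._platforms:
            raise ValueError(f"unknown platform: {platform}")
        self._catalog[dataset] = platform

    def datasets(self):
        """Everything available for analysis, without regard to platform."""
        return sorted(self._catalog)

    def query(self, dataset, query):
        """Route the query to whichever platform holds the data set."""
        platform = self._catalog.get(dataset)
        if platform is None:
            raise KeyError(f"unknown data set: {dataset}")
        return self._platforms[platform](dataset, query)


def make_stub(platform_name):
    """Stand-in back end; a real one would call the platform's own interface."""
    def run(dataset, query):
        return f"{platform_name} ran {query!r} on {dataset}"
    return run


portal = Portal()
portal.register_platform("edw", make_stub("edw"))
portal.register_platform("hadoop", make_stub("hadoop"))
portal.register_dataset("customer_facts", "edw")
portal.register_dataset("clickstream", "hadoop")

print(portal.datasets())
print(portal.query("clickstream", "top pages"))
```

The analyst works only with data set names; moving a data set to a different platform then requires a catalog change rather than retraining analysts on platform-specific tools.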
To make this work, IT strategy must include capacity planning across the various big data platforms, coupled with selection of business intelligence analytical software that allows for intelligent query routing. These portals tend to be home-grown, as few vendors currently provide integrated solutions.
Summary
As big data implementations mature, companies are expanding beyond their current store of structured, well-understood data into more exotic and unstructured data types. This, coupled with the multiple analytical platforms that are available, leads to a proliferation of hardware and software solutions. While each of these solutions can be justified in part by its ability to generate useful analytical results, the future will bring about data overlaps across these solutions that will require an integrated answer for the analyst.
IT strategy should plan for this by reviewing current and potential big data applications and architecting an initial portal solution that can present an integrated solution set to the analyst. Implementing this solution may be years in the planning; however, as data from operational systems begins to spread and replicate across multiple big data platforms, analysts will need to know what platforms are available and how to use them without requiring highly technical, platform-specific knowledge.