Big data as an application (or as a service) is being supplanted by artificial intelligence (AI) and machine learning. Few new requirements for a big data solution have arisen in the past few years. All the low-hanging fruit (fraud detection, customer preferences, just-in-time re-stocking and delivery, etc.) have already been big data-ized. Is this the end of big data?
Big Data Evolution
The first big data applications were for forecasting based on historical data. IT extracted operational data on a regular basis (usually nightly) to store within a stand-alone big data solution. Product and service sales transactions could then be parsed and aggregated by time, geography and other categories looking for correlations and trends. Analysts then used these trends to make price, sales, marketing and shipping changes that increased profits.
Data tended to be well-structured and well-understood. Most data items were either text or numeric, such as product names, quantities and prices. These data types were extracted from operational databases and used in various parts of the enterprise such as data marts and the data warehouse. Analysts understood these data types and how to access and analyze them using SQL.
The next step was the transformation of big data solutions into services that could be invoked in real time by operational applications. One common example is on-line product sales, where applications can predict and detect possible financial fraud, suggest customer preferences and issue warehouse re-stocking commands based on what customers order. A major part of this transformation was allowing business analytics queries to access current data, sometimes directly against operational databases. This meant that analysts could make real-time decisions about what was trending today; in particular, they could fix issues such as incorrect prices or poor product placement almost immediately.
Typically, the business perceived a need for understanding data relationships or correlations that may not be obvious. Third-party vendors made available a variety of plug-and-play solutions that usually required a specific hardware platform and special software for data management and analytics. Some examples of these include the following.
Apache Hadoop solutions use a network of computer nodes. Each node participates in data storage, and the nodes combine their efforts during query processing. Hadoop provides a specialized file management system that gives one or more central nodes the ability to control the others as a group and to coordinate their efforts. This works best for business problems where analytics is done iteratively; that is, business analysts run one or more base queries, store the results, and then run or re-run additional queries against those results.
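The iterative pattern can be sketched in miniature: materialize a base query's results once, then let follow-up queries work against that stored result set instead of re-scanning the raw data. This is a minimal illustration using SQLite; the table and column names are hypothetical, not part of any Hadoop interface.

```python
import sqlite3

# Sketch of iterative analytics: run a base query once, store its
# results, then issue follow-up queries against the stored results.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sales (region TEXT, product TEXT, amount REAL);
    INSERT INTO sales VALUES
        ('East', 'Widget', 100.0),
        ('East', 'Gadget', 250.0),
        ('West', 'Widget', 75.0);
""")

# Base query: aggregate once and materialize the result set.
conn.execute("""
    CREATE TABLE region_totals AS
    SELECT region, SUM(amount) AS total
    FROM sales
    GROUP BY region
""")

# Follow-up queries run against the stored results, not the raw data.
top = conn.execute(
    "SELECT region, total FROM region_totals ORDER BY total DESC LIMIT 1"
).fetchone()
print(top)  # ('East', 350.0)
```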
NoSQL database solutions include graph databases, which depend upon a technique called graph analysis, a method of finding relationships among data elements. Using NoSQL requires an experienced data scientist to support the underlying database, as well as business analysts familiar with this analytic technique.
Appliances, initially offered as stand-alone hybrid hardware and software solutions, support many big data applications. Hundreds of disk drives are mounted in a single disk array, and each data entity (customer, product, etc.) has its table data split evenly across all drives. When the analyst submits a query, the software splits the query into hundreds of subqueries, one for each disk drive, executes the subqueries in parallel and combines the results. It is this massively parallel processing that allows the appliance to return results so quickly. IBM’s Db2 Analytics Accelerator (IDAA) is the best-known example of this approach.
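The split/execute/combine pattern can be illustrated with a small sketch: rows are spread across partitions (standing in for disk drives), the same subquery runs against every partition in parallel, and the partial results are combined. All names and data here are illustrative assumptions, not IDAA's actual interface.

```python
from concurrent.futures import ThreadPoolExecutor

# Each inner list stands in for the slice of one entity's table data
# stored on a single disk drive.
partitions = [
    [("Widget", 100.0), ("Gadget", 250.0)],   # "drive" 1
    [("Widget", 75.0)],                        # "drive" 2
    [("Gadget", 40.0), ("Widget", 10.0)],      # "drive" 3
]

def subquery(rows, product):
    # Each partition answers the same question for its own slice.
    return sum(amount for name, amount in rows if name == product)

# Execute the subqueries in parallel, then combine the partial results.
with ThreadPoolExecutor() as pool:
    partials = list(pool.map(lambda p: subquery(p, "Widget"), partitions))

total = sum(partials)
print(total)  # 185.0
```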
Most IT organizations tended to choose one of these as their enterprise big data solution. However, in the past few years disk array storage has become larger, faster and cheaper. Combined with the ability to implement data storage in memory rather than on disk, this has provided an enormous speed increase. The net result is that IT can now afford either to purchase multiple solutions and install them in-house or to outsource data storage and processing to an external provider, sometimes called database as a service (DBaaS).
Regrettably, accumulating more data hasn’t made prediction and trending analysis more useful. For example, expanding product sales history from five years to ten years isn’t very useful, since over time data changes, products change, databases change, and applications change. (One exception is customer purchase history, since it can be used to predict how customers’ preferences will change over time.) Another issue is stale data. As products reach the end of their useful life and some services are no longer offered, analysts find less and less need to issue queries about them. A final concern is new applications and the new data accompanying them. Since there is little or no history for these new data items, how can one look for correlations and trends?
In short, big data has reached the point where IT has extracted most of the value from its historical data. So what’s next?
The current state of business analytics has shifted from simple big data applications to suites of machine learning and AI solutions. These solutions tend to be specific to either a small number of applications (such as on-line order entry) or a small set of related data (such as product sales transactions).
One example of these new systems is IBM’s Watson Machine Learning (WML) for z/OS. WML is implemented as a service that interfaces with several varieties of data on z/OS (Db2, VSAM, SMF, etc.), creates and trains machine learning models, scores them and compares the models with live metrics. The operations analytics software then classifies and analyzes the results to detect trends and relationships. (For more on this offering, see the article, IBM Improves IT Operations with Artificial Intelligence.)
There are several important requirements for this new analytics environment to yield value to the organization. They are:
- An up-to-date enterprise data model;
- An emphasis on data quality, particularly in operational systems;
- The regular purging or archiving of stale or old data.
The need for a data model is obvious. How can AI systems develop correct and meaningful relationships among data items if they are not well-defined? The same is true for business analysts, who will assist application developers in interfacing operational applications with the data. For example, consider the development of an on-line product ordering application. The business wishes to detect (and perhaps prevent) fraud, so they want to interface their application with an AI system that can analyze current transactions against historical ones. If the data elements are not well-defined this effort will fail.
Data quality encompasses a large number of overlapping topics. Some data element values can be unequal and yet identical. For example, the street addresses of “123 Jones Road” and “123 Jones Rd. East” may indicate the exact same house in a particular city, but a basic text comparison of the data values results in inequality. Some text fields may contain internal formatting, such as “2019-01-01” and “01/01/2019”; again, the same meaning but unequal.
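One common remedy is to normalize values before comparing them, so that superficially different representations compare equal. Here is a minimal sketch; the abbreviation table and the accepted date formats are illustrative assumptions, not a complete solution.

```python
from datetime import datetime

# Hypothetical abbreviation table: expand common street-suffix
# abbreviations to one canonical form before comparison.
SUFFIXES = {"rd": "road", "rd.": "road", "st": "street", "st.": "street"}

def normalize_address(addr):
    words = addr.lower().replace(",", " ").split()
    return " ".join(SUFFIXES.get(w, w) for w in words)

def normalize_date(text):
    # Try each known embedded format; emit one canonical ISO form.
    for fmt in ("%Y-%m-%d", "%m/%d/%Y"):
        try:
            return datetime.strptime(text, fmt).date().isoformat()
        except ValueError:
            pass
    return None  # unrecognized format

print(normalize_address("123 Jones Rd.") == normalize_address("123 Jones Road"))  # True
print(normalize_date("2019-01-01") == normalize_date("01/01/2019"))               # True
```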
Dates have a penchant for being invalid or missing. Consider the date “01/01/1900”. If this value is stored in a column labelled BirthDate, is it truly correct or was a default value applied? Similar questions arise for values like “11/31/2019”, “24/01/2019”, “01/01/9999” and even “01/01/0001”.
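A screening routine can separate the two failure modes mentioned above: dates that cannot exist at all, and technically valid dates that are likely defaults. This sketch assumes an MM/DD/YYYY layout and a hypothetical list of sentinel values.

```python
from datetime import date

# Assumed sentinel values that usually mean "a default was applied".
SENTINELS = {date(1900, 1, 1), date(9999, 1, 1), date(1, 1, 1)}

def check_date(text):
    month, day, year = (int(p) for p in text.split("/"))
    try:
        d = date(year, month, day)
    except ValueError:
        return "invalid"   # e.g. 11/31/2019, or 24/01/2019 read as month 24
    return "suspect" if d in SENTINELS else "ok"

print(check_date("11/31/2019"))  # invalid
print(check_date("01/01/1900"))  # suspect
print(check_date("07/04/2019"))  # ok
```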
Data quality issues extend to parent-child relationships, or referential integrity. If the Order table contains an order for customer ABC, then the Customer table should have a row for that customer.
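An orphan check makes this concrete: find Order rows whose customer has no matching Customer row. The sketch below uses SQLite with illustrative table and column names.

```python
import sqlite3

# Sketch of a referential-integrity (orphan) check.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customer (cust_id TEXT PRIMARY KEY);
    CREATE TABLE orders (order_id INTEGER, cust_id TEXT);
    INSERT INTO customer VALUES ('ABC');
    INSERT INTO orders VALUES (1, 'ABC'), (2, 'XYZ');
""")

# Orders whose customer is missing from the parent table.
orphans = conn.execute("""
    SELECT o.order_id, o.cust_id
    FROM orders o
    LEFT JOIN customer c ON c.cust_id = o.cust_id
    WHERE c.cust_id IS NULL
""").fetchall()
print(orphans)  # [(2, 'XYZ')]
```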
Stale Data Purge
Stale data purges are perhaps the most important data-related issue that must be addressed before AI analytics systems can be implemented successfully. As noted previously, IT's business analytics environment will grow to include multiple solutions across diverse hardware platforms, both inside and outside the company. For all of these platforms to provide consistent results, IT must coordinate the purge of stale data across multiple data stores.
This is more difficult than it might appear. Consider a set of products that are no longer sold by your company. Data on these products (prices, transaction details, geographic data, etc.) may be stored across several databases. This specific data may be of little use in predicting product sales; however, it might be essential in analyzing customer credit or detecting fraud. If you don’t purge any of this data, you are paying for storage that is wasted; if you purge it from some applications but not from others, you run the risk of losing the relationships across data items. For example, if you purge all “old” product information, what happens to your customers who purchased only those products? Should they be included in customer profile analyses?
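The coordination problem can be sketched as a set computation: a product may be purged only if it is discontinued and no analytics application still needs it, and the same purge list must then be applied to every data store. Store names and contents below are hypothetical.

```python
# Products no longer sold by the company.
discontinued = {"P1", "P2", "P3"}

# Products each analytics application still needs (assumed examples).
still_needed = {
    "fraud_detection": {"P2"},    # P2 appears in open fraud cases
    "credit_analysis": set(),
}

# Purge only what is discontinued AND not needed anywhere.
safe_to_purge = discontinued - set().union(*still_needed.values())

stores = {
    "sales_mart": {"P1", "P2", "P4"},
    "warehouse":  {"P1", "P3"},
}

# Apply the same purge list to every store so they stay consistent.
for name, products in stores.items():
    stores[name] = products - safe_to_purge

print(sorted(safe_to_purge))         # ['P1', 'P3']
print(sorted(stores["sales_mart"]))  # ['P2', 'P4']
```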
Where Will Big Data Go?
In the 1980s there was an IT concept called the very large database (VLDB). These were databases that, for their time, were considered so large and unwieldy that they needed special management. As the size of disk storage grew and access speeds dramatically shrank, such databases were eventually considered normal, and the term VLDB was no longer used.
Such a fate awaits big data. Already we see big data becoming only a single tool in the IT support toolbox. As one example, IBM has taken its IDAA appliance, which was once a stand-alone hardware product, and physically incorporated it within its z14 mainframe server. As AI and machine learning software comes of age, it may well depend upon an internal big data storage solution.
Still, just as VLDB went from implementation within only a few companies to almost everywhere, big data solutions will become commodities. In fact, many vendors now offer big data solutions that “scale”. In other words, you can try out a relatively small big data application and scale it up in size later if it provides value. The same holds true for many AI solutions.
The Future of AI and Analytics
Businesses will review AI and big data offerings and choose one or more that fit their needs. After performing a proof-of-concept test, they will implement the ones that provide the most value. Then comes the scary part. IT must find a way to coordinate multiple support processes across these multiple hardware and software platforms, including updating and maintaining the enterprise data model, performing data quality maintenance, and coordinating stale data purge. It will require staff with different skill sets and experience, and interfaces to many vendors.
Finally, IT must work towards federating all of these solutions. Even though they span different hardware platforms and come bundled with different software, IT must find a way to give all of the solutions access to all of the data. This has already happened with the classic data warehouse. Warehouses contain dimension tables that provide the most common aggregation and subsetting definitions. These tables have already migrated into most big data applications, since business analysts will most commonly use them for analytics queries. In fact, it is also likely that big data queries will join tables within the big data application to warehouse base tables. The result is that many data warehouses have been moved into big data applications.
Clearly, your IT strategy must take into account this federation in the near future. Part of the original problem was identifying which platforms or solutions are best for analyzing what data. Federation addresses this by identifying a central interface for all of your analytics solutions. To make this work you must do capacity planning across all platforms and include all parties that will use analytical software in your federation solution.
And big data? It will still be with us, but it will take a back seat as the future highlights data federation and artificial intelligence solutions.
# # #