Big Data is all the rage these days. We are constantly being bombarded and overwhelmed with news, articles, white papers and vendor products related to Big Data. Authors, consultants and pundits assert that their advice is essential, vendors market solutions to our problems, and there are so-called success stories everywhere.
What has been forgotten in all the celebration and noise? The data that drives our companies, manages sales and interacts with customers. The data to which we applied best practices, standards and quality improvement. The data that supports our mission-critical systems. The “other” data … the “little” data.
If we want to incorporate big data into our enterprise, the crucial step is integrating it with our existing data.
Reacting to Big Data
In today’s IT world big data applications are common. We have evolved some standard responses to the needs for these applications. One common reaction is that big data requires a “scaling up”: more CPUs, more memory, more resources. Another theme is the availability of special-purpose hardware and software for storing and analyzing big data.
However, big data itself has evolved into a more complex combination of data variations, processing needs, and analytical requirements. We are faced with new and complex data types such as large objects (LOBs), self-describing data such as extensible markup language (XML), and multi-structured data types like images, audio, and web click streams. This is in addition to the expected high volumes and speeds.
The conclusion: big data today is not only a scale-up issue; it is also a re-architecture issue and a data integration issue. Further, it often involves integration of dissimilar architectures. When we insist that we can deal with big data by simply scaling up to faster, special-purpose hardware, we are not only neglecting the real issues: we are leaving our current processes and data — the little data — behind.
How We Currently Implement
It is typical for the IT enterprise to test-drive a new process or idea. This usually takes the form of some initial analyses, perhaps including feasibility studies, proof of concept testing, and a pilot project. In most cases the first project implements BI Analytics using some new hardware or software that accesses current data-at-rest in the production environment.
Forgotten in this new project are current systems and processes. Not that we forget to support current production systems; rather, we bypass our own best practices for new IT system implementation. The result: a shiny new system that produces measurable results from Big Data in production! Regrettably, integrating it into the current DW architecture is now much more difficult.
What part of our little data have we left behind?
How We Should Implement
The typical IT enterprise evolved to its current state by utilizing standards and best practices. These range from simple things like data naming conventions to more complex ones such as a well-maintained enterprise data model. Any new data-based implementation only adds to the need for organization, documentation and governance. With new data and processes in the works you must update documentation, standards and best practices and continue to improve quality.
Costs and benefits of new mainframe components typically involve software license charges. The IT organization will need to re-budget and perhaps even re-negotiate current licenses and lease agreements. As always, new hardware comes with its own requirements of power, footprint, and maintenance needs.
In order to properly service internal IT customers most IT infrastructure managers depend upon both generalists and specialists. Generalists have a broad knowledge of standard processes and do not require a great deal of experience. They are primarily used for tactical work such as process execution, especially in an organized environment where standard processes are well-documented. Specialists have extensive experience and can be subject matter experts in multiple areas. Specialists are essential for strategic work such as project resource estimates, and are especially valuable for highly technical efforts such as software installation, configuration and tuning.
A Big Data implementation brings additional staff into the mix: experts on new analytics software, experts on special-purpose hardware, and others. Such experts are rare, so your organization must hire, rent, or outsource this work. How will they fit into your current organization? How will you train current staff to grow into these positions?
To ensure that the little data isn’t forgotten, here are some prescriptions.
Start with the Source System
This is your core data from operational systems. Interestingly, many beginning Big Data implementations will attempt to access this data directly (or at least to store it for analysis), thereby bypassing succeeding steps. This happens because Big Data data sources have not yet been integrated into your IT architecture. Indeed, these data sources may be brand new or never accessed.
Little data is easily forgotten here. Those who support the source data systems may not have the expertise to assist in analytics, while analytics experts may not understand the source data. Analytics accesses production data directly, so any testing or experimenting is done in a production environment.
Address these needs by involving source system experts from the beginning. Discuss and design how the new data will be compared, joined, and merged with current data. Insist that the new Big Data be defined and documented in an enterprise data dictionary, or at least in a data model. Discussions and meetings about new concepts between dissimilar structures and needs cry out for visual aids, and good data models of current and new data will bridge this gap.
Analyze Data Movement
The data movement subsystems and processes of the data warehouse first access data from the source systems. Some data may require transformations or ‘cleaning’. Examples include missing data or invalid data such as all zeroes for a field defined as a date. Some data must be gathered from multiple systems and merged, such as accounting data. Other data requires validation against other systems (is an order for a valid customer?).
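The two checks just mentioned, an all-zero date and an order that must belong to a valid customer, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the field names (`order_id`, `customer_id`, `order_date`), the `YYYYMMDD` date layout, and the function names are hypothetical and do not come from any particular system.

```python
from datetime import date, datetime
from typing import Optional

ZERO_DATE = "00000000"  # assumed placeholder for "no date" in a YYYYMMDD field


def parse_date(raw: str) -> Optional[date]:
    """Return a date, or None when the field is empty, all zeroes, or invalid."""
    text = raw.strip()
    if not text or text == ZERO_DATE:
        return None
    try:
        return datetime.strptime(text, "%Y%m%d").date()
    except ValueError:
        return None


def split_orders(orders, valid_customer_ids):
    """Separate orders into accepted and rejected lists by customer validity."""
    accepted, rejected = [], []
    for order in orders:
        if order["customer_id"] in valid_customer_ids:
            # Keep the order, replacing the raw date text with a parsed value.
            accepted.append({**order, "order_date": parse_date(order["order_date"])})
        else:
            # Unknown customer: set aside for review rather than silently dropping.
            rejected.append(order)
    return accepted, rejected
```

Rejected orders are returned rather than discarded so that source system experts can review them; whether that is the right handling is a design decision for each implementation.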
Data from external sources can be extremely problematic. Consider data from an external vendor that was gathered using web pages where numbers and dates were entered in free-form text fields. This opens the possibility of non-numeric characters in numeric data fields. How can you maximize the amount of data you process, while minimizing the issues with invalid fields? The usual answer is ‘cleansing’ logic that handles the majority of invalid fields using either calculation logic or assignment of default values.
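As a rough sketch of such cleansing logic, the Python function below strips stray characters from a free-form numeric field and falls back to a default when nothing usable remains. The function name, the default of zero, and the assumption that commas are thousands separators are all illustrative choices, not rules from any real feed.

```python
import re
from decimal import Decimal, InvalidOperation
from typing import Tuple

DEFAULT_AMOUNT = Decimal("0")  # assumed default; the right value is a business decision


def clean_amount(raw: str, default: Decimal = DEFAULT_AMOUNT) -> Tuple[Decimal, bool]:
    """Return (value, was_defaulted) for a free-form numeric field."""
    # Drop everything except digits, minus sign, and decimal point,
    # so that text such as "$1,234.50 " becomes "1234.50".
    stripped = re.sub(r"[^0-9.\-]", "", raw)
    try:
        return Decimal(stripped), False
    except InvalidOperation:
        # Nothing parseable was left (for example "N/A"): assign the default
        # and flag the record so the substitution can be counted and audited.
        return default, True
```

Returning the flag alongside the value lets the load keep every record while still reporting how many fields were defaulted, which is one way to balance volume processed against data quality.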
Despite being sometimes semi-structured or multi-structured, data acquired in a big data implementation will still require transformation logic. The key to addressing this will be good documentation of current logic. If rules exist for cleansing data in the current systems, that logic may be used (with modifications) in data acquisition for the big data implementation. In the order example above, rules that already exist for cleaning the order data may apply to fields in the new data as well.
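One way to picture this reuse, assuming the existing rules are documented per field, is a single rule table applied to both the current feed and the new one, with a small mapping for fields that are named differently. Every name below (`CURRENT_RULES`, `NEW_FEED_FIELD_MAP`, `custId`, `ts`, `clicks`) is hypothetical and serves only to illustrate the idea.

```python
from typing import Callable, Dict, Optional

# Hypothetical cleansing rules already documented for the current order system.
CURRENT_RULES: Dict[str, Callable[[str], str]] = {
    # Normalize customer identifiers to trimmed upper case.
    "customer_id": lambda v: v.strip().upper(),
    # Treat an all-zero or blank date as missing; otherwise just trim it.
    "order_date": lambda v: "" if v.strip("0 ") == "" else v.strip(),
}

# Assumed mapping from field names in the new feed to the documented names.
NEW_FEED_FIELD_MAP = {"custId": "customer_id", "ts": "order_date"}


def clean_record(
    record: Dict[str, str],
    rules: Dict[str, Callable[[str], str]],
    field_map: Optional[Dict[str, str]] = None,
) -> Dict[str, str]:
    """Apply documented rules to a record, renaming mapped fields on the way."""
    field_map = field_map or {}
    cleaned = {}
    for name, value in record.items():
        canonical = field_map.get(name, name)
        rule = rules.get(canonical)
        # Fields with no documented rule pass through unchanged.
        cleaned[canonical] = rule(value) if rule else value
    return cleaned
```

The same `clean_record` call then serves both feeds: the current feed passes no mapping, while the new feed passes `NEW_FEED_FIELD_MAP`, and any modifications the new data needs are made in one documented place.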
Review Data Storage for Analytics
This is the final point, the destination where all data is delivered. From here, we get direct access to data for analysis, perhaps by approved query tools. Some subsets of data may be loaded into data marts, while others may be extracted and sent to internal users for local analysis. Some implementations include publish-and-subscribe features or even replication of data to external sources.
Coordination between current processes and the big data process is required. IT support staff will have to investigate whether options to get early use of the data are available. It may also be possible to load current data and the corresponding big data tables in parallel. Delays in loading data will impact the accuracy and availability of analytics; this is a business decision that must be made, and will differ from implementation to implementation.
Some Additional Issues
Left out of our discussion above are issues that pervade all environments.
Test environments. You will typically have both production and test environments, and perhaps even more (called variously development, user acceptance, training, etc.). A big data implementation will require multiple environments as well in order to load test data, develop and tweak analytics queries, and measure performance.
Staffing requirements. New hardware and software require new staff types and expertise. This has happened several times during implementation of ERP packages (SAP, PeopleSoft), database management systems (DB2, DB2 LUW, Oracle), and specialized hardware (Teradata, Netezza). Special staffing usually means that expertise is rare or not available, at least immediately. For the big data implementation consider the need for ‘data scientists’ who will have experience in modeling techniques, knowledge of relevant programming languages, and expertise in the business subject area. Count on additional time and budget to acquire the expertise you need to succeed.
Summary
Big Data is here with a vengeance. Staff in your IT organization are going to conferences and webinars, reading white papers or relevant articles. Some have even taken the first step and brought in new hardware and software for their first implementation.
Beware of the dangers. Big Data is here, but you must not let it distract you from your primary task: take care of the little data.