Big data is everywhere, and most large IT enterprises have installed one or more big data applications. These applications provide fast access to large stores of data, usually customer or sales data. The technical roles that support these applications, and the systems that analyze and consume the data, did not exist ten years ago.
Who are these new IT professionals, and how should you manage them?
Data Acquisition Support
The data for your big data application usually comes from snapshots of transactional data that flows through your operational systems. Any data stream containing information on customers, prices, costs, products, accounts, and the like is fair game. Generally speaking, the more streams you can acquire, store, and analyze, the more relevant and detailed your analytical results will be.
Consider a file of daily customer transactions. For each transaction it probably contains product and sales data for purchases, including date, products purchased, prices, quantity sold, location purchased, and so forth. This is a good first step for analyzing what products are popular, which products are generally purchased together, and what time of year or geographic location specific products are bought most often.
However, what about the customers? Where do they reside, and why did they come to that location? Adding customer address information will assist you in analyzing this.
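As a minimal sketch of this enrichment step, the Python below attaches address fields to each transaction by customer ID. The record layouts, field names, and sample values are all hypothetical, chosen only for illustration; a real feed would come from your operational systems.

```python
# Hypothetical sample data: field names and values are illustrative only.
transactions = [
    {"customer_id": 101, "product": "coffee", "quantity": 2, "store": "Downtown"},
    {"customer_id": 102, "product": "tea", "quantity": 1, "store": "Airport"},
    {"customer_id": 999, "product": "coffee", "quantity": 1, "store": "Downtown"},
]
customer_addresses = {
    101: {"city": "Springfield", "postal_code": "11111"},
    102: {"city": "Riverton", "postal_code": "22222"},
}


def enrich_with_address(transactions, addresses):
    """Return copies of the transactions with address fields merged in.

    A transaction whose customer has no known address keeps its original
    fields and gets None for the address fields, so no rows are dropped.
    """
    enriched = []
    for txn in transactions:
        address = addresses.get(txn["customer_id"], {})
        enriched.append({
            **txn,
            "city": address.get("city"),
            "postal_code": address.get("postal_code"),
        })
    return enriched


enriched = enrich_with_address(transactions, customer_addresses)
```

Keeping unmatched transactions, rather than discarding them, lets analysts see how much of the data lacks address coverage.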
Data acquisition specialists are in charge of knowing the various sources of data available across the IT enterprise. They must also be aware of data sources outside the company that can be purchased. Any and all of this data may be required for analysis.
These specialists must also coordinate and communicate with the business analysts, who will want to know what data is available, or may request specific data. The result is an interesting job description that requires knowledge of multiple enterprise applications, customer data, and IT best practices such as data modeling techniques and the enterprise data model.
These specialists need to increase their knowledge of current operational systems. What data do they provide, and who uses the results? The specialists will become subject matter experts in several areas. This makes them valuable as internal consultants and advisors on matters such as data availability and query possibilities.
Another new area will be that of semi-structured and unstructured data. There are new and complex data types such as large objects (LOBs), extensible markup language (XML), audio and video files, and the like. Some of these new data types may cause confusion. How will they be stored and searched? Options exist for this in many business analytics software packages.
Management of these professionals will require allocating time and resources for training. While some of this will be focused on new hardware and software, the majority will be cross-training on internal applications and systems. The DA specialist needs to understand data across the enterprise.
Big Data Storage
Initial big data applications tend to be simple, though big! As noted above, the first application will usually consist of a customer-related data stream. This provides business analysts with a familiar data store to query, and such analysis usually generates indicators of customer preferences such as favorite products. This in turn produces actionable analytics for the business, such as tiered product pricing, sales, discounts, or geography-related variations in products or prices.
Initially, the data is stored either in a high-performance database management system (DBMS) such as DB2, or in a big data hybrid hardware/software appliance such as the IBM DB2 Analytics Accelerator (IDAA). Both the DB2 database and IDAA solutions permit complex queries, though they do have different performance profiles that depend upon data volume, query complexity, and so forth.
The result is a store of large data that must be managed to support high-intensity querying. This is in contrast to normal operational systems, such as order entry, where data changes with high frequency throughout the day. In big data applications, snapshots of large data stores are acquired and loaded into the DBMS. After that, the database administrators (DBAs) are responsible for fast data access. This can be accomplished in a variety of ways, including memory management, adding data indexes, using high-speed disk arrays, or using a hybrid solution (as with IDAA above).
DBAs must therefore expand their job descriptions to include management of this big data store. Management tasks include:
- Configuring disk storage
- Software and hardware installation and tuning
- System, network, and DBMS tuning
- Assisting with data architecture changes
- Analytical query performance tuning
Another new task for the DBA is coordinating with management to prioritize their work. IT strategic planning encompasses multiple existing and new applications across the enterprise. Big data is still new, and in addition to having installation, performance and tuning expertise, the DBA must look forward to maintenance of the new environment. Will the big data require backups for safety or recovery? If there is a disaster, will the data, hardware and software be automatically set up at the disaster recovery site? How will old or stale data be removed when it is no longer required?
DBAs must now use their knowledge to develop support plans for the big data applications. These include:
- Upgrades to the big data hardware and software
- Capacity planning; that is, memory, disk storage, and other resources specific to the DBMS or an IDAA-like device
- The query and analytics tools used to access the data
- Data volume estimates
- How data will be grouped most often (sometimes called dimensions)
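Capacity planning and data volume estimates often start as simple arithmetic. The sketch below is one hedged way to frame it; the function name, its parameters, and the default overhead percentages are assumptions made for illustration, not figures from any vendor or from this article.

```python
def estimate_storage_bytes(rows_per_day, avg_row_bytes, retention_days,
                           index_overhead_pct=30, growth_headroom_pct=20):
    """Rough storage estimate for a snapshot-loaded table.

    All inputs are assumptions the DBA supplies. The percentages are
    whole numbers so the arithmetic stays in exact integers.
    """
    raw_bytes = rows_per_day * avg_row_bytes * retention_days
    # Add space for indexes, then add headroom for growth on top of that.
    with_indexes = raw_bytes * (100 + index_overhead_pct)
    with_headroom = with_indexes * (100 + growth_headroom_pct)
    return with_headroom // 10_000


# Hypothetical workload: one million 200-byte rows per day, kept one year.
estimate = estimate_storage_bytes(1_000_000, 200, 365)
estimate_gb = estimate / 10**9
```

The value of such an estimate lies less in the exact number than in making the assumptions (row size, retention, overhead) explicit so they can be revisited as volumes grow.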
DBAs will also assist analysts and application owners. They will consult on data models and business use cases, assist in query writing and tuning, and gather information on the dimensions of the data. This last task is extremely important. Data dimensions refer to how data can be grouped for summarization. Examples include time (daily, monthly), geography (state, metropolitan area, store), product type, sales organization, and more. Since data for a single dimension value (say, the current month) will very often all be accessed in a single query, the DBA can increase performance by storing that data in physically adjacent locations. One common method is to logically or physically partition the data by the time dimension: Month 1 is stored in File 1, Month 2 in File 2, and so on. This partitioning method has several advantages, including:
- Parallel access, as multiple queries against different partitions can execute simultaneously
- Faster backup and recovery, as current data is more critical than historical data and its partition can be backed up and restored on its own
- Easier purge or archive, as the oldest data will most likely reside in a single partition
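The idea of partitioning by the time dimension can be sketched in a few lines of Python. This is an in-memory illustration only, with hypothetical field names and sample rows; a DBMS implements partitioning at the storage level, but the grouping logic is the same.

```python
from collections import defaultdict
from datetime import date


def partition_by_month(rows):
    """Group rows into partitions keyed by (year, month) of the sale date."""
    partitions = defaultdict(list)
    for row in rows:
        key = (row["sale_date"].year, row["sale_date"].month)
        partitions[key].append(row)
    return dict(partitions)


def purge_oldest(partitions):
    """Drop the oldest partition as a unit; return its key, or None if empty."""
    if not partitions:
        return None
    oldest = min(partitions)
    del partitions[oldest]
    return oldest


# Hypothetical sample rows.
rows = [
    {"sale_date": date(2024, 1, 15), "amount": 20},
    {"sale_date": date(2024, 1, 30), "amount": 35},
    {"sale_date": date(2024, 2, 3), "amount": 50},
    {"sale_date": date(2024, 3, 9), "amount": 10},
]
partitions = partition_by_month(rows)

# A query for one month reads only that month's partition.
february_total = sum(r["amount"] for r in partitions[(2024, 2)])

# Purging stale data removes one whole partition, not scattered rows.
purged_key = purge_oldest(partitions)
```

Because each month lives in its own partition, the purge touches one unit of storage and leaves the remaining months untouched.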
Management of DBAs will most likely require a split of responsibilities. Depending upon your hardware, software, and DBMS installations, it is best to assign generalists to standard support tasks such as query performance tuning, gathering metrics, and managing backup and recovery. One or more specialists can then concentrate on becoming experts on the new technology. They will then share this expertise internally across the team.
Big Data Analytics
A big data application is a seeming treasure trove of information. How do you access it? Commonly, vendors supply software packages specifically designed to automate the construction and execution of queries against big data. At the same time, many IT shops have already had experience with very large databases (VLDB), and have created in-house business intelligence systems.
The business analyst is usually familiar with querying data, typically data stored in an enterprise data warehouse (EDW). The EDW commonly consists of one or more fact tables containing transactional data, with additional dimension tables (see above) listing dimension values for grouping. This querying expertise will transfer very well to the big data application, with one caveat: big data will eventually (if not already) contain data types not familiar to the analyst. In addition, as data volumes increase it becomes essential that queries are written to perform well. This means that analysts and DBAs must collaborate.
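The fact-table and dimension-table layout can be illustrated with a small query. The sketch below uses Python's built-in sqlite3 module purely as a stand-in (it is not DB2 or IDAA), and the table names, column names, and sample rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Dimension table: one row per store, with the state used for grouping.
    CREATE TABLE dim_store (store_id INTEGER PRIMARY KEY, state TEXT);
    -- Fact table: one row per sale, pointing at the store dimension.
    CREATE TABLE fact_sales (store_id INTEGER, product TEXT, amount INTEGER);
    INSERT INTO dim_store VALUES (1, 'TX'), (2, 'TX'), (3, 'CA');
    INSERT INTO fact_sales VALUES
        (1, 'coffee', 10), (2, 'coffee', 15), (3, 'tea', 7), (3, 'coffee', 5);
""")

# Join facts to the dimension, then summarize by the dimension value.
totals_by_state = conn.execute("""
    SELECT d.state, SUM(f.amount) AS total
    FROM fact_sales AS f
    JOIN dim_store AS d ON d.store_id = f.store_id
    GROUP BY d.state
    ORDER BY d.state
""").fetchall()
conn.close()
```

The same join-then-group pattern carries over to a big data store; what changes is the data volume, which is why query performance becomes a shared concern.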
Typically, DBAs can analyze queries, gather performance metrics, and recommend alternative query syntax or other performance-enhancing tactics. In the big data environment, this becomes even more critical. While business analytics software packages are designed to generate efficient queries, and big data application solutions promise high-performance storage, sometimes there is just too much data. Another thing to consider is that the number and complexity of queries against the big data application will increase rapidly as analysts get familiar with the new software and the data.
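One way a DBA analyzes a query is to inspect its access plan before and after a tuning change. The sketch below again uses SQLite as a stand-in, through its EXPLAIN QUERY PLAN statement; DB2 and appliance products have their own explain facilities, which are not shown here. The table, index, and helper function names are hypothetical.

```python
import sqlite3


def query_plan(conn, sql):
    """Return the plan detail strings SQLite reports for a query."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    # The human-readable detail is the last column of each plan row.
    return [row[-1] for row in rows]


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (sale_month TEXT, amount INTEGER)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("2024-01", 10), ("2024-02", 20), ("2024-02", 5)],
)

sql = "SELECT SUM(amount) FROM sales WHERE sale_month = '2024-02'"

# Without an index, the only available access path is a table scan.
plan_before = query_plan(conn, sql)

# After adding an index on the filter column, the plan can use it.
conn.execute("CREATE INDEX idx_sales_month ON sales (sale_month)")
plan_after = query_plan(conn, sql)
conn.close()
```

Comparing the two plans shows whether the tuning change actually altered how the data is accessed, which is the kind of evidence a DBA can share with the analyst who wrote the query.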
More data and more frequent queries translate into very heavy use of your big data solution. This is good! It also means that performance may well become a problem. Plan for this by having analysts share queries with the DBAs, and collaborate on timing and performance options.
Managing the technical staff usually consists of prioritizing tasks and assigning them to available resources. However, the advent of big data applications and the expanded job responsibilities mentioned in this article will lead inevitably to expanded job descriptions and required specializations. A natural reaction is for management to divide teams into generalists and specialists.
Generalists will shift their responsibilities toward maintenance tasks. Generalists' tasks and procedures should be clearly and thoroughly documented, because most generalists will (eventually) want to become specialists. Thorough documentation allows management to take generalist tasks and either automate them or outsource them to an external firm.
Specialists will begin with new responsibilities. They will need training in hardware / software solutions, and must obtain knowledge and experience of data and applications across the enterprise. This may take quite some time; however, it is essential if you are to succeed with big data applications.