Early adopters of Big Data technology were sold special hybrid hardware / software solutions that included descriptions like, “black box”, “plug and play”, and my favorite, “no performance tuning required”.
Today, as solutions have matured and customers have become better educated in the capabilities of these vendor solutions, everyone now realizes that, yes, in order to reach their full performance potential, Big Data applications require some tuning. The tuning isn’t easy, it’s not free, and responsibility for understanding requirements and implementing the appropriate tuning methodology falls squarely on the shoulders of the database administrator.
Don’t Big Data Applications Run Really Fast?!
Big data applications are advertised as having extremely fast access times. The promise of the technology is the ability to quickly and easily analyze large amounts of data and derive from that analysis changes to customer-facing systems. Management believes that this analysis and subsequent changes will drive up customer satisfaction, market share, and profits.
The key to big data application performance lies in the architecture of the solution, commonly called an appliance. Within the appliance are a large number of high-capacity disk drives, many I/O channels, and software capable of a high degree of parallelism. To take a simple example, consider a customer table with 100 million customers. A standard SQL query of such a table for “customers in region A who purchased product X” may take quite some time, perhaps even requiring scanning the complete table from beginning to end.
Now consider the following appliance configuration of 100 high-capacity disk drives, and a high-speed I/O channel to each drive. Prior to our query, we stored the 100 million customers by spreading them across the disk drives, 1 million customers to each drive. The appliance software reads our query, executes it simultaneously against each of the 100 disk drives, and combines the results.
As we can see, this parallelism contributes significantly to the performance of the appliance. Simply by dividing data across disk drives, query elapsed time can be decreased by a factor of almost one hundred!
Early appliances implemented this by automatically deciding how to partition data across the internal disk drives. The methods and algorithms used were not necessarily disclosed to customers; in many cases they could not be changed.
The Current State of Big Data Appliance Performance Tuning
Most appliances have evolved considerably from the above state. There are now multiple, documented methods of data partitioning on the appliance, and many of these are under the customer’s control. This was inevitable; as the technology of big data solutions began to standardize across vendors, the major differentiating features now include query performance and data transfer rates.
This major evolution comes with a cost. It is now the responsibility of the DBA to understand how the appliance works, develop alternative data partitioning schemes, implement them, and continuously measure performance to determine if additional changes are required.
Apart from basic configuration parameters such as memory allocation, the most important new tuning features for the DBA involve how data is physically distributed across disks within the appliance.
Generally speaking, there are two data distribution categories available: random / hashed, and keyed. These correspond roughly to the horizontal partitioning options available for relational tables in the database management system. Consider a table of sales transactions. In a normal database the DBA considers physically partitioning this table for some combination of maintenance, data availability or performance reasons.
For example, one option would be to partition the table by transaction date, perhaps by month. Beginning partitions would include rows with the earliest transaction dates, later partitions with later dates. This allows for easy purging of old or stale data by simply emptying and re-using the first few partitions. It can also provide a performance boost for many queries against the data by date or date range, as transaction rows with similar dates should reside in the same partition.
Regrettably, in a big data environment, these considerations actually lower query performance. Since the appliance provides its best performance by massively parallelizing data access, clustering the data by transaction date prevents the appliance from utilizing the entirety of its large disk array to break queries into hundreds of smaller queries, each against data on a single disk.
In a nutshell, tuning big data appliances requires knowledge of the types of queries that will access the data, and how various data distribution techniques will affect the performance of these queries.
Additional DBA Tasks
Along with the appliance, DBAs must take responsibility for the performance the entire data acquisition life cycle. Data is loaded to the appliance on a regular basis from staging areas, so high rates of data transfer to and from these areas is necessary; the staging areas are fed from data sourced from originating systems, and much of this data must be validated and cleansed in what is usually called an extract, transform, load (ETL) process. Finally, the data from originating systems must be made available in a timely manner.
All of these things require a systematic approach to data transfer performance tuning. Data acquisition from operational systems consists of data copies, files, and various database extracts. Some data items may be invalid (e.g., a date of 00-00-0000), some may be missing entirely. Each system has its own data cleanliness issues, as well as specific times when data can conveniently be made available for extract. All of these processes require performance tuning as data volumes increase.
Another issue is bulk loading of data into your application. As the amount of input data acquired daily increases, load times also increase, and data loading is extremely I/O-intensive. You may need to look for specific vendor solutions for high-performance data loading.
Last, how can old or stale data be easily purged from the appliance? While query execution may be fast, removing old or stale data reduces the total amount of data stored, which can only help reduce query execution times.
DBAs as Big Data Specialists
Big data provides ample opportunities for gaining and using specialized knowledge. Skills in the following areas provide the DBA team with a significantly greater ability to positively affect the business:
- Appliance management. We now have tuneable appliances, and the DBAs must keep up-to-date on industry trends. You can be sure that your business need for the big data solution is not going away any time soon. Data will continue to flood in to your big data appliance. User queries will increase; indeed, some queries will be deemed so useful that they are upgraded to regular reports. This all translates to an increased business need for the appliance. It may even be designated as mission-critical, at which point the need for constant performance and tuning will be an essential DBA task.
- Business use cases. For a data modeler the first thing to understand is how data will be used. The same is true for the big data DBA. Will old or stale data be archived or purged? Will the heaviest data access be confined to a particular time period? Are certain subsets of the data elements destined to be analyzed the most? Answers to these questions will help the DBA decide on data partitioning schemes, database backup frequency, table and index designs, purge criteria and schedules, and more.
- Analytical query tuning tools. If a business cannot get usable data in a timely fashion, the costs of the solution will outweigh the benefits. Most vendor solutions include the promise (or prediction) that queries will run extremely quickly. While true at this time, will this be so in the future? What if you implement multiple additional very large data stores? What if you begin accumulating years of historical data for analysis? What if several hundred (or thousand!) additional users begin running new, long and complex queries? The DBA is an essential technician for query tuning in this environment. It may not be enough to become an expert in configuring and tuning the appliance itself; the DBA must approach overall performance holistically and include query performance tuning strategies and tactics in their daily actions.
Summary
While there are many types and teams and gradations of expertise across the DBA team, the advent of
supporting big data applications and appliances will change their tasks, priorities, and the way they are managed. In the beginning only a few specialists will have the requisite knowledge and skills; however, as the business implements more applications and adds more users into the mix, the entire team must be involved in big data support.