Many big data application implementations seem to begin with an existing data warehouse, one or more new high-volume data streams, and some specialized hardware and software to support data storage and business analytics. The data storage issue is often accommodated by installing a proprietary hardware appliance that can store huge amounts of data while providing extremely fast data access.
In these cases, do we really need to worry about database design?
Data Modeling in a Big Data Environment
Most database administrators agree: good database design is part of system and application design. There are many business requirements, such as data availability, purge processing, and application performance that are addressed using specific database design options.
What about big data? Interestingly, vendors that supply hardware and software solutions for business analytics against big data claim that database design is less important. They argue that since the data is stored in their proprietary format, most database design considerations do not apply.
Confusion over this issue usually originates from misperceptions regarding how special solutions execute big data queries. Simply put, in most cases your data will be physically stored in two places: your current production database management system (DBMS) and the new special-purpose appliance. Your current production processes that extract, transform, and load data into the current DBMS continue to operate as-is, with an additional step: at every point that you load data into a table you will also ensure that the new data is loaded into the appliance as well.
Loading data into the appliance can be done immediately after your DBMS loads, or can be batched for later execution. The important thing is that data must be loaded into the appliance before any big data queries can utilize it for the advertised performance gains.
Database Design for Quality Assurance
What does a quality database design mean? Generally speaking, a database design starts with a model of the data and the business rules that define their relationships. For example, an order is always associated with a customer, and a customer may have zero, one, or many orders. Along with these things and the data element definitions and attributes, the database design will address, cope with, or mitigate risks in the following areas:
- Assist with defect prevention by automating data element valid values checking;
- Permits defect detection and remediation during application construction and testing;
- Moves data verification as close to the source as possible;
- Provides stability, reliability, data accessibility and system scalability.
What Will the Database Designer do Differently?
A poor quality database design affects technical support the most. They are the ones that must deal with system problems in real-time. This drives up the cost of problem determination and problem resolution. It also manifests itself in product behaviors that may annoy … or drive away customers. The most common problems associated with bad designs are poor application performance or data contention.
Typical fixes include database reorganizations or re-design, adding table indexes and changing table partitioning or clustering. However, in a big data environment, these options are typically not available in the appliance. They will only exist in the base tables in the database proper. This is the crux of the matter: despite the vendor’s claims that all your data can be moved into the appliance, this is seldom the best solution.
Having your data co-exist between the main database management system and the appliance is a best practice for several reasons.
Avoid single point of failure. The appliance is a single point of failure. Despite the best efforts of the vendor and your support staff the hardware, software, network connections or processes within the appliance may fail. If they do, how will queries be satisfied? With data co-located in the database management system, query results can be satisfied by accessing the base tables. Granted, performance may suffer; however, the alternative is that your big data application will be unavailable until someone fixes the problem.
Provide for data offloads. Queries are not the only consumers of your data. One common use is offloading production data to a test environment. In addition, some third-party vendor software tools access the data in the database natively and directly, which is not available in the appliance because it stores the data in a proprietary format.
Backup and recovery. Most common backup and recovery utilities are based on data residing in the database. Again, third-party vendor tools are commonly used for high-performance backups and recovery, including index recovery. These backups are executed against the base tables and tablespaces, not against the appliance.
Certain performance situations. There are some situations in which SQL queries are not executable in the appliance. Such limitations are defined in the manuals, and vary across vendor appliances and versions. In these cases, you have no choice; you must access the base tables and accept the performance degradation. Some of these limitations involve specific SQL syntax such as scrollable cursors, dynamic SQL, use of multiple character encoding schemes, certain correlated table expressions, and the use of certain built-in functions.
Database Design Decisions for Big Data
Since you will be keeping your data in both the DBMS and in the appliance, your standard database design rules still apply. Interestingly, some of the rules are now expanded or more complex due to the existence of the appliance. Some of the considerations are as follows:
The need for indexes. Indexes serve multiple needs: they can enforce data element uniqueness, they can enforce referential integrity relationships, they define primary keys, and they define additional access paths. This last item is important. In a big data environment the idea is to push long-running queries into the appliance for high-speed processing. If certain indexes exist simply to provide alternative access paths, they may no longer be needed. Part of database design or re-design should involve a review of so-called performance indexes. If the index is no longer being used by queries it can be dropped, thereby saving disk space, processing time, and recovery time if the table data needs to be recovered.
Removing appliance SQL limitations. Usually the business rules for the data determine portions of the database design. These include physical partitioning to permit faster queries and easier data purging, data element domain checking such as column constraints, and definition of primary and foreign keys to support referential integrity rules. Application developers then code SQL queries to access the data. In addition, users may have reporting tools that automatically generate SQL for queries and reports. Since the SQL query syntax and options will depend upon the database design, the designer needs to keep appliance limitations in mind.
Designing for high-speed appliance data load. Normal database load processes now contain an extra step: loading data into the appliance as well. How is this best accomplished? Depending upon your application and on your data volatility, you may wish to consider variations on the following:
- Regular bulk load (daily, hourly) of the appliance, with the understanding that data there will not be completely current.
- Trickle load, where row updates to the base tables are then propagated synchronously to the appliance. This keeps appliance data current, but row processing is much slower than bulk loading.
Big data and appliances have not removed the need for good database design. Indeed, the designer has more things to consider: backup and recovery, index management, multiple methods of data access, and SQL limitations. The good news is that advances in database hardware and software can speed up data queries to a remarkable degree.
Carnegie Mellon Capability Maturity Model
IBM TechDocs library: Information on autonomics — “A First Look at Solution Installation for Autonomic Computing”, IBM document SG24-7099, available at the
IBM Quality management solutions
American Productivity and Quality Center
American Society for Quality