Today’s IT developers use a combination of rapid design and cloud technologies to ensure that new applications can be implemented quickly. Speedy production implementation is essential for two reasons: competitive advantage, and minimizing costs. Some markets such as retail sales and medical support are highly competitive, and the first to bring a new application to market has an advantage. Keeping development costs low by using database cloud services also speeds development, while allowing developers’ IT departments to forgo expensive server farms and their accompanying database administration and system support personnel.
The advent of big data applications has changed this equation. While quick implementation of a new or upgraded operational system may lower short-term costs, the eventual need to leverage this new operational data by analyzing it in the big data environment means that quality data modeling and data integration are essential. Regrettably, these two areas are often overlooked in the rush to get to production. In this article we will explore the aspects of data modelling that are essential during application development in order to make the application data valuable to the organization.
Rapid Application Design
Some developers use rapid application development techniques to get new operational applications to production quickly. As part of this process they may delegate database design and administration to a third-party firm, which in some cases will store data on a set of networked servers separate from their enterprise. This relieves the developers from the costs of database-related tasks, and shortens development time considerably.
One disadvantage of this technique is that the physical data model (and sometimes the logical data model) are controlled by someone else. This makes changes to the data model more complex and prone to error; in addition, some application logic can be implemented in the data model including enforcement of unique keys, data dependencies, and automatic processes such as triggers. This means that the application logic is split between the local application code and the remote database implementation, leading to increased maintenance and upgrade costs and complexity.
What does this mean for your big data application? Generally speaking, new operational systems have a predictable set of data elements that will be propagated to your big data system. These include entity keys, data elements used for aggregation and transactional information. However, in the case of rapidly developed applications there is a tendency for the data model to change frequently after implementation either to fix errors or add or expand functionality. Typical examples are changing the meaning or usage of certain data elements. For example, OrderStatus initially meant whether the order had been completely entered, and is changed to mean whether products have been delivered. Another example would be OrderDate, originally meaning when any portion of the order was first entered, then changed to mean the data on which the order was completely entered and verified. Finally, an AccountNumber may be expanded from 10 characters to 16 characters.
Big data analytic queries function best with static data models. Whenever the data model changes, not only must analysts code future queries differently but past queries change as well. It is common for an analyst to develop a useful query that is implemented as a regular report. Many reports may require changes or even become invalid when a data model changes.
To avoid these issues during rapid application development, ensure that your data model remains in your control, and that you document any and all business rules and logic that are implemented as a part of the database design. These would include the following:
- Entity integrity rules (e.g. data element uniqueness and primary key definitions, and whether certain data elements are required);
- Data integrity rules such as column constraints (e.g. Salary must be non-negative, DeliveryDate must be greater than OrderDate);
- Referential integrity rules (e.g. any Order must be for a valid Customer);
- Business logic such as triggers, user-defined functions and stored procedures.
Owning the data model should include a sign-off by developers prior to implementation.
Operational Systems Data Access
The first big data applications emphasized storing all data in a hybrid hardware / software environment commonly called an appliance. This required copying operational data to the appliance on a regular basis similar to the processes used to load a data warehouse. Many new model appliances come with high-speed data loading processes that can execute in parallel, shortening the elapsed time required for execution.
Modern big data applications must accommodate several new requirements. One of these, mentioned above, is that of rapid application development. New operational data now exists off-premises in a cloud. Part of the development process must include considerations for extracting data from the cloud for loading into your appliance. Alternatively, it may be possible for analytical software to access the operational data directly in the cloud. This is because some analytical queries are used for real-time decision making. Consider a product ordering application that makes recommendations based upon prior purchases. Historical data may exist in the appliance while the current order data is in the cloud. Another example is fraud prevention, where current purchases are checked against historical data to determine if the customer is legitimate.
Application developers and DBAs must consider data access options during development, including what data will be extracted to the appliance and what data may be accessed directly by analytical queries. These options may affect the database design, especially any performance-related elements such as data clustering, table partitioning, addition of secondary indexes and other choices.
Integrating with Test Environments
New applications must be tested, and a best practice is to create a testing environment with sufficient data to execute the application and confirm it is working correctly. How do you do this in a big data environment? It seems prohibitively expensive to buy and install a separate big data appliance that is only used for testing. In reality, this is what many companies do, and they are able to justify the costs as part of disaster recovery planning.
Beginning big data applications were considered as merely large data warehouses. Since they contain historical data on transactions but little entity-specific data, there was little consideration for disaster planning. After all, on any major disaster the first tasks would be to restore computer hardware, followed by recovering any data. One common method is to have a secondary site with a sufficient amount of hardware and data storage, coupled with methods of copying production data regularly to that site.
As noted previously, many big data queries tend to grow in size and scope and eventually be programmed to execute as regular reports. As more reports produce more usable data, departments begin to use these data for business decisions. Regular reports on customer credit may assist loan officers, aggregated patient histories may affect doctors’ treatments and product sales in certain geographic areas may drive shipping schedules.
The tendency will be for the big data application to be used extensively for real-time decision making and, hence, to be judged as mission-critical. When this happens, a big data appliance is required at the disaster recovery site.
The secondary site now is transformed into a test environment for applications that integrate with big data. New operational applications will use this environment to test extract processes as well as any analytics that may access the new system’s data.
Input to Capacity Planning
New operational systems generate new data, and some of this new data will be stored in the appliance. As data accumulates over time, data storage requirements increase. It is incumbent on the application developer to share the data model with database administrators in order to plan for future storage purchases.
In the case of rapid application development this is complicated by having application storage in an off-premises facility or the cloud. In these cases, you may not control the way the data is stored. On what hardware media is the data stored? Is the data compressed? Does the cloud database have embedded free space? Is the data partitioned, and in what way? What indexes are used and required for data access? These and other questions affect data storage estimates.
There are other considerations as well. Any data extracted for storage in the appliance may require intermediate storage for sorting, data transformations or data conversions. For example, there are methods for storing dates in a proprietary format to save space. When you retrieve an extract of your operational systems data, what format is the data in? Another consideration may be that the data stored using ASCII encoding but your production mainframe processors use EBCDIC encoding.
Still other considerations include whether data in your appliance is further extracted for external data mart users.
As an example, consider one gigabyte (1 GB) of customer transaction data. It may be stored in a compressed format in the cloud, resulting in 0.5 GB of storage used. There may be multiple indexes on the data, adding another 0.5 GB. Your extract process must do data transformations and these require uncompressed data, so your daily extract file must be sized at 1 GB. You require work files for sorting the data, and an output file for loading to the appliance, all of which result in 3 GB of storage. The data may then be loaded to your data warehouse, and extracted by two departmental data marts. Finally, the DBA must plan for database backups and transfer of the data to development and test environments as well as the disaster recovery site.
When all of this is summarized it may be common for 1 GB of operational data to require 10 GB or more of data storage space across the enterprise.
This is why input from new operational systems during development is essential to capacity planning.
Summary
Rapid application design for new operational systems is becoming popular as a cost- and time-saving opportunity. Sometimes forgotten in the process is the need for the development team to coordinate closely with in-house database administrators and capacity planners.
Implement procedures to co-own and review the changes to your data model, especially elements such as triggers that contain business logic. Review options for accessing operational data for extract to your appliance as well as analytical query access. Ensure that you integrate the new application into existing testing environments, especially those with big data components. Finally, realize that new operational data requires much more data storage across the enterprise than is used for local application purposes, and that you must communicate these storage needs to capacity planning specialists.