by Amit Satoor
We are living in an age of exponential data growth, in which trillions of transactions, events and activities are generated through online trading, online shopping and the use of mobile devices. The global economy is creating data of ever greater volume and variety at an increasingly faster pace. In addition to the transactional data generated through interactions with customers, suppliers and operations, a tremendous amount of data is being generated by web logs, RFID sensors, unstructured text from social networks, mobile phones, smart energy meters and other industrial machines. Beyond this organic data growth, compliance requirements for on-demand accessibility and longer retention periods are making high-volume data management a necessity. And compliance isn’t the only reason to retain large data sets: better historical business reference and real-time analysis both require access to large volumes of data to gain better business insights.
The Requirements Have Changed but Existing Systems Can Often Keep Pace
The question is whether the requirements for handling large and growing data sets, as well as the need to manage unstructured data, call for a new technology platform, or whether existing platforms are adequate. In other words, should enterprises look for new products and tools, or extend the use of what they already have?
Managing large data volumes has been a key requirement for some enterprises for decades. The financial services industry, for example, has been using high-volume, mission-critical systems to manage transactional data for years. The globalization of business, combined with growth in mobile collaboration and commerce, is now making this a requirement for other industries such as retail, healthcare and online commerce. The data management products that have long supported these enterprises are now adding new features to address “big data” requirements. Indeed, the platforms enterprises currently use are often adequate to manage the growing volume and variety of data.
Compression Technology
Storage optimization techniques such as data compression provide an effective solution to store and retrieve petabytes of raw data. Data compression lets you use less storage space for the same amount of data, helping to keep your operational costs from escalating as your data grows.
The challenge for these systems is ensuring that data can still be accessed on demand without a heavy performance penalty. Commercial products such as Sybase Adaptive Server Enterprise (ASE) with compression technology can help customers manage higher data volumes while maintaining the performance and reliability of the infrastructure. Previously, ASE could compress backups, reducing offline storage costs. Now ASE allows in-database compression for active data sets. Both regular relational data and unstructured data (large objects, or LOBs) can be compressed. Early internal benchmarks indicate that compression ratios can readily reach the 40-80% range.
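As a rough illustration of how this might look in practice, the hypothetical SQL below enables the compression feature and creates a table whose pages are compressed as rows arrive. The table and column names are invented, and the syntax follows the general ASE 15.7 pattern, so it should be verified against your version.

    -- Hypothetical sketch; names are illustrative and syntax may vary by ASE version.
    -- Enable the compression feature at the server level (a separate license option may be required).
    sp_configure 'enable compression', 1
    go

    -- Create a table whose data is compressed at the page level as rows are inserted.
    create table trade_history (
        trade_id   int           not null,
        symbol     char(8)       not null,
        quantity   int           not null,
        price      numeric(12,4) not null,
        trade_date datetime      not null
    ) with compression = page
    go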
Sybase ASE uses a number of compression strategies to achieve high compression ratios. Within a single row, it compresses away empty space and zeros in fixed-length columns. At the page (block) level, it applies both page-dictionary and page-index compression strategies: repeated data items, as well as repeated sets of data items, are replaced by a single reference, resulting in dramatic savings for duplicated data. Data can be compressed at the database, table and partition level, which lets customers make compression choices separately for active and inactive data.
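For example, a range-partitioned table could keep its current partition uncompressed while compressing closed historical partitions. The sketch below is illustrative only: the table, partition names and date boundaries are invented, and the partition-level compression clause should be checked against your ASE version.

    -- Illustrative sketch: compress historical partitions, leave the active one uncompressed.
    create table orders (
        order_id    int      not null,
        customer_id int      not null,
        amount      money    not null,
        order_date  datetime not null
    )
    partition by range (order_date) (
        p_2010    values <= ('12/31/2010') with compression = page,
        p_2011    values <= ('12/31/2011') with compression = page,
        p_current values <= (MAX)          with compression = none
    )
    go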
After you create a compressed table or partition, Adaptive Server compresses any subsequently inserted or updated data. If Adaptive Server cannot efficiently compress the inserted data, the original row is retained. If newly inserted or updated LOB data occupies space smaller than or equal to a single data page, Adaptive Server does not compress the data. Tables can contain a mixture of compressed and uncompressed data: if you create a compressed table, load data, and then disable data compression for the table, previously inserted data remains compressed, but rows added after you disable compression are not. You do not need to uncompress data to run queries against it, and you can insert, update, and delete compressed data. Because there is less data for Adaptive Server to search, there are fewer I/Os, improving the efficiency of data storage.
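To make this concrete, the hedged sketch below (reusing the hypothetical trade_history table from the earlier example) turns compression off for new rows while leaving existing compressed rows in place; queries run unchanged against both.

    -- Rows inserted while compression was enabled remain compressed;
    -- rows inserted from now on are stored uncompressed.
    alter table trade_history set compression = none
    go

    -- Queries, inserts, updates and deletes work the same against compressed
    -- and uncompressed rows; no explicit decompression step is needed.
    select count(*) from trade_history where trade_date >= '01/01/2012'
    go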
ASE provides different levels of compression for regular and LOB data. Generally, higher compression ratios use more CPU when you decompress the data. ASE selects compression levels based on how the data is accessed. Data that you access frequently (“hot data”) may be best suited for compression levels that have smaller CPU overhead.
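One way to apply this guidance, sketched below with invented table names, is to use lighter row-level compression for a frequently accessed table and denser page-level compression for a rarely touched archive table.

    -- Hot table: lighter compression keeps CPU overhead low on frequent reads and writes.
    create table current_positions (
        account_id int     not null,
        symbol     char(8) not null,
        quantity   int     not null
    ) with compression = row
    go

    -- Cold archive table: denser page-level compression for maximum space savings.
    create table positions_archive (
        account_id int      not null,
        symbol     char(8)  not null,
        quantity   int      not null,
        as_of_date datetime not null
    ) with compression = page
    go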
Database deduplication is a specialized data compression technique for eliminating coarse-grained redundant data sets. In the deduplication process, data patterns are identified and stored during a process of analysis. As the analysis continues, other patterns are compared to the stored copy, and whenever a match occurs, the redundant pattern is replaced with a small reference that points to the stored pattern. Deduplication provides higher compression rates, but it presupposes that you have many copies of the same object that exhibit the same characteristics. Certain types of structured and semi-structured data do have such characteristics: transactions, call data records, log entries and events contain repeated data values and patterns that are common across individual records and groups of records. These can be deduplicated so that only unique entries are retained. This type of deduplication is different from storage-based deduplication, in which entire files and large chunks of data are inspected for duplicates.
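The record-level idea can be pictured with a simple relational sketch (all names are invented; this is a conceptual illustration, not ASE-specific syntax): each repeated pattern is stored once, and individual records carry only a small reference to it.

    -- Store each distinct payload pattern exactly once...
    create table event_patterns (
        pattern_id   int          not null,
        pattern_text varchar(255) not null
    )
    go

    -- ...and keep only a small reference in each event record.
    create table events (
        event_id   int      not null,
        event_time datetime not null,
        pattern_id int      not null  -- points to the single stored copy of the pattern
    )
    go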
ASE data compression lets you use less storage space for the same amount of data, reduce cache memory consumption, and improve performance because of lower I/O demands.
LOB Management
Modern applications need to manage and manipulate unstructured data in addition to transactional data. Commercial databases normally store this big data in LOB columns (e.g., text, image, unitext, XML). LOBs can vary in size from a few bytes to gigabytes.
New enhancements in ASE have redesigned LOB support for cases in which LOBs are fairly small and can readily fit within the parent row. This has two advantages: storage savings, by optimizing the space allocated for small LOBs, and performance benefits when retrieving small LOBs, by reducing I/O overhead. Sybase ASE seamlessly handles the migration of LOBs from in-row to off-row storage (or vice versa) to optimize the storage required as LOBs change in size due to updates.
Sybase ASE supports the storage of in-row LOB columns for text, image, and unitext data types when they are small, and subject to available space in the page. When a LOB expands in size or its space is used for other in-row columns, Adaptive Server seamlessly migrates the in-row LOB data to off-row storage, automatically replacing the data with an in-row text pointer.
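A hedged example of how an in-row LOB column might be declared follows; the table name and the 500-byte limit are invented, and the "in row" clause follows the ASE 15.7 pattern, so it should be confirmed against your version.

    -- Small summaries are stored inside the data row, up to 500 bytes;
    -- if a value grows beyond the limit, ASE migrates it off-row automatically.
    create table support_cases (
        case_id int               not null,
        summary text in row (500) null
    )
    go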
In the past, ASE stored LOB columns (such as text, image, unitext, and XML) off-row, using a text pointer (txtptr) to identify the starting page ID and storing that pointer in the data row. ASE now also allows you to declare a local variable of a LOB datatype (text, image, or unitext) and pass that variable as an input parameter to a stored procedure. Once LOB variables are declared, Adaptive Server can store SQL statements containing LOBs, providing more intelligent handling of large objects.
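A minimal sketch of this capability, with invented procedure, table and column names (reusing the hypothetical support_cases table above), declares a text variable, loads a LOB value into it, and passes it to a stored procedure as an input parameter.

    -- Hypothetical procedure that accepts a LOB input parameter.
    create procedure show_note_length
        @note text
    as
        select datalength(@note) as note_bytes
    go

    -- Declare a LOB local variable, populate it from a LOB column,
    -- and pass it to the procedure.
    declare @big_note text
    select @big_note = summary from support_cases where case_id = 42
    exec show_note_length @big_note
    go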
New enhancements in ASE also support in-database LOB compression. Given that LOBs can be very large (up to 2 GB), compression can yield very significant space savings. Both FastLZ and ZLib are lossless compression algorithms: FastLZ provides lower CPU usage and faster execution, while ZLib provides higher compression ratios.
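As a rough illustration (the names and the level value are invented, and the mapping of numeric levels to the FastLZ and ZLib algorithms is version-specific, so consult the ASE documentation), LOB compression might be enabled when the table is created:

    -- Off-row text data in this table is compressed at the specified level.
    create table contract_documents (
        doc_id   int  not null,
        doc_body text null
    ) with lob_compression = 5
    go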
Summary
Proven commercial data management products such as Sybase ASE have added a number of enhancements to handle large and growing data sets. New technology in the areas of compression and LOB management lets you meet big data challenges while ensuring the security and reliability of your mission-critical systems. These enhancements allow you to reduce storage costs while maintaining high performance on even the largest data sets. For more information, please visit www.Sybase.com/ASE.
About the author
As Senior Group Manager of Marketing, Amit Satoor leads the strategic direction and positioning of data management products. He brings more than 15 years of experience developing, marketing and delivering new technology solutions in both domestic and international markets. Prior to Sybase (an SAP Company), Amit held senior product management positions at Netscape, Novell and Firedrop, where he was responsible for bringing next-generation internet solutions to market. He holds a Master’s in Business Administration and Computer Science.