By Mike Lamble
Today, organizations are awash in big data. By “big data” we mean traditional business data such as orders, transactions, and customer profiles, as well as new data flowing from machines, sensors, and social networks. This is data that is measured in exabytes and beyond. If studies are correct, we may be in for an even greater onslaught: according to IBM, 90% of the world’s digital data has been created in just the last two years.
The waves of big data are massive and arriving continuously, demanding a storage platform that is expansive, economical, and accessible. The massive landing zone requirement is so great that it would have been beyond comprehension only a few years ago.
Enter Hadoop: an affordable, elastic, and truly scalable open-source software stack that supports data-intensive distributed applications. The combination of the Hadoop Distributed File System (HDFS) and the MapReduce framework has greatly improved how data is stored and manipulated and has helped organizations keep up with explosive data growth. Businesses of all sizes have contributed to Hadoop’s stunning growth rate: IDC predicts that Hadoop will grow 60 percent through 2016 and that the market itself will increase from $77 million in 2010 to more than $2.2 billion by 2018. Whether data is unstructured, semi-structured, or structured, the Hadoop environment enables big data to be captured in its native format and parsed later, as necessary.
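To make "captured in its native format and parsed later" concrete, here is a minimal sketch in plain Python rather than actual Hadoop code. The raw lines, the field names, and the functions `map_orders`, `reduce_totals`, and `order_totals` are all hypothetical and exist only for illustration. Nothing is validated when the data is stored; structure is imposed only when a question is asked, in a map step followed by a reduce step.

```python
import json
from collections import defaultdict

# Hypothetical raw events, kept exactly as they arrived (no schema enforced on write).
RAW_LINES = [
    '{"type": "order", "customer": "c1", "amount": 40.0}',
    '{"type": "sensor", "device": "d7", "temp_c": 21.5}',
    '{"type": "order", "customer": "c2", "amount": 15.5}',
    'not json at all',
    '{"type": "order", "customer": "c1", "amount": 4.5}',
]


def map_orders(line):
    """Map step: parse one raw line at read time and emit (customer, amount) pairs."""
    try:
        record = json.loads(line)
    except json.JSONDecodeError:
        return  # unparseable lines are skipped when read, not rejected when stored
    if isinstance(record, dict) and record.get("type") == "order":
        yield record["customer"], record["amount"]


def reduce_totals(pairs):
    """Reduce step: sum the emitted amounts per customer."""
    totals = defaultdict(float)
    for customer, amount in pairs:
        totals[customer] += amount
    return dict(totals)


def order_totals(lines):
    """Run the map step over every line, then reduce the emitted pairs."""
    pairs = (pair for line in lines for pair in map_orders(line))
    return reduce_totals(pairs)


if __name__ == "__main__":
    print(order_totals(RAW_LINES))
```

In a real cluster the map and reduce steps would run in parallel across many machines; the sketch only shows the division of labor between them.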
The business case for storing all of this big data is that it will yield game-changing insights. A recent Harvard Business Review poll found 85 percent of the executives surveyed expect to gain substantial business and IT advantage from big data. The reality is that to deliver on these executive expectations the organization must integrate and analyze big data, as well as collect it.
Hadoop has been a powerful first step in harnessing big data and is ideal for a variety of tasks that are essential to big data analytics. It provides affordable storage for large data volumes, enables fast data ingest, and, thanks to its schema-free orientation, is suited to storing vast quantities of data in their most granular form. Good for batch processing, Hadoop supports massive, simple ETL (extract-transform-load) jobs while providing the ability to readily scale up to handle more data or to shorten processing time.
Despite these benefits, and despite Hadoop being the de facto tool of choice for big data collection, a growing number of advanced big data practitioners are finding that Hadoop tools are marginal at best for big data integration and for analytics on structured data. Relative to the expectations of the business intelligence community, Hadoop query times are slow, its access methods are arcane, it isolates data subject areas from each other, and it lacks a rich third-party ecosystem of tools for analysis, reporting, and presentation.
DBMS Analytics at Hadoop Scale
In comparison to Hadoop and its NoSQL offshoots such as Hive, HBase, Cassandra, and Pig, big data DBMS (database management system) solutions (that is, SQL engines) are enticing for a variety of reasons. Compared to Hadoop solutions, big data DBMS solutions have fast query response times and make it easy to join disparate data. SQL skills are ubiquitous, and the third-party tool ecosystem is robust.
In order to derive maximum business value from big data, it needs to be integrated with existing legacy data from corporate systems and external data providers. These data sets, in overwhelming proportion, reside in SQL data stores, such as data warehouses, data hubs, and operational data stores.
But Hadoop has raised expectations about cost, elasticity, and scalability. These are areas where Hadoop shines. By providing a platform for ingesting and storing big data, Hadoop has defined the features of the next-generation analytics DBMS. Therefore, to perform at Hadoop scale, today’s DBMS for big data analytics must meet these requirements:
- Deploy on standard hardware running Linux: IT departments are standardizing on flavors of relatively inexpensive “commodity” x86 hardware. These environments will continuously grow and evolve in sync with the evolution of the x86 processors from Intel and AMD.
- Deploy at unlimited scale on cloud and virtualized environments: Whether through public clouds, such as Amazon Web Services (AWS), or virtualization achieved through private clouds, the advantages of sharing resources (CPU, storage, network) are overwhelming. Cloud and virtualized environments reduce the time-to-market for new solutions, improve economic efficiency through resource sharing, and enable pay-for-use models.
- Scale linearly to petabytes: Easier said than done. Software that achieves linear scalability from terabytes to petabytes, on read-write workloads, must be deeply multi-threaded, use peer-to-peer communication to avoid head-node bottlenecks, and rely on a shared-nothing architecture. The emerging standard is that resources can be added or removed at arbitrary scale, from 1 to n processor nodes (n = 100, for example).
- Be quickly deployable: Spinning up new environments must be achievable in hours or days, rather than weeks or months. Adding compute resources needs to be achievable on the fly, without re-partitioning or re-loading data. Similarly, resources need to be removable and re-deployable with the same ease.
- Provide a streaming ingest interface to new data sources, such as Hadoop NoSQL data stores: Data volumes are so large that continuous ingest is a requirement, simultaneous with queries and without the need for table locking.
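The elasticity requirements above (a shared-nothing layout with no head node, and nodes that can be added or removed without re-partitioning everything) can be illustrated with consistent hashing, one widely used technique for placing data across nodes. This is a minimal sketch, not a description of how any particular DBMS works; the class `HashRing`, the helper `fraction_moved`, and the node and key names are hypothetical.

```python
import bisect
import hashlib


def _hash(key: str) -> int:
    """Map a string to a large integer position on the ring."""
    return int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)


class HashRing:
    """Consistent-hash ring: each node owns many small arcs of the key space."""

    def __init__(self, nodes=(), vnodes=64):
        self.vnodes = vnodes   # virtual points per node, to even out the load
        self._points = []      # sorted ring positions
        self._owner = {}       # ring position -> node name
        for node in nodes:
            self.add_node(node)

    def add_node(self, node: str) -> None:
        for i in range(self.vnodes):
            point = _hash(f"{node}#{i}")
            self._owner[point] = node
            bisect.insort(self._points, point)

    def remove_node(self, node: str) -> None:
        self._points = [p for p in self._points if self._owner[p] != node]
        self._owner = {p: n for p, n in self._owner.items() if n != node}

    def node_for(self, key: str) -> str:
        """Return the node owning the first ring position after the key's hash."""
        if not self._points:
            raise ValueError("ring has no nodes")
        idx = bisect.bisect(self._points, _hash(key)) % len(self._points)
        return self._owner[self._points[idx]]


def fraction_moved(ring: HashRing, keys, new_node: str) -> float:
    """Add new_node to the ring and report the share of keys that changed owner."""
    before = {k: ring.node_for(k) for k in keys}
    ring.add_node(new_node)
    moved = sum(1 for k in keys if ring.node_for(k) != before[k])
    return moved / len(keys)


if __name__ == "__main__":
    ring = HashRing(["node-1", "node-2", "node-3", "node-4"])
    keys = [f"row-{i}" for i in range(10_000)]
    print(f"share of rows relocated: {fraction_moved(ring, keys, 'node-5'):.1%}")
```

Any node can compute a row's owner locally, so no head node sits in the lookup path. When a node joins, only the rows that fall to the new node relocate, while the rest stay where they are. A naive `hash(key) % node_count` placement would instead reshuffle most rows whenever the node count changes.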
Challenge and Opportunity
For established vendors in the big data DBMS analytics niche, the new requirements cause heartburn because their offerings are so tightly coupled with proprietary hardware. But a new class of DBMS providers is emerging to meet these requirements. This new breed of SQL engines is radically scalable, cloud-enabled, and always on, allowing organizations to ride the wave of today’s big data deluge.
About the Author
Mike Lamble is President of XtremeData, Inc., providers of the only high-performance DBMS for big data analytics deployable on premises and in the cloud. For more information, contact the author at [email protected]