Business intelligence applications are moving from the traditional connection to an OLAP Data source based on relational database systems (RDBMS) to the ability to link to and consume data from a variety of disparate sources including social networks. The ability for a modern BI application to be able to use mashups of data to provide agility when dealing with integrations of multiple types of data sources has led to NoSql being promoted by many as the next big thing within BI. Does this mean that we have seen the end of the SQL style RDBMS system within the BI area – there are many pros and cons for both systems but I believe that there is still a place for both within the BI arena.
NoSQL
In computing, NoSQL is a term used to designate database management systems that differ from classic relational database management systems in some way. These data stores may not require fixed table schemas, and usually avoid join operations and typically scale horizontally. Academics and papers typically refer to these databases as structured storage, a term that would include classic relational databases as a subset. Notable production implementations include Google’s Big Table, Amazon’s Dynamo and Apache Cassandra.
NoSQL implementations like Cassandra and Dynamo can scale out past the terabyte and on to the petabyte size by utilizing horizontal scaling and multiple nodes and in particular, the costs differences associated between SQL and NoSQL implementations are significant. However, each type of NoSQL system uses its own proprietary code for its connections and the system is usually set up for a particular model, which enables super fast performance but does hinder the ability to run any adhoc queries on the data.
Implementing NoSQL in Business Intelligence Applications
I have found recently that many of my clients are looking to connect to social networking data to enable them to trend sales and customer selections. This data is very unstructured and most is in the form of NoSQL (Twitter, Facebook, etc). In a normal business environment the scalability and ACID (atomicity, consistency, isolation, and durability) properties of a traditional RDBMS system is perfect for supporting an online EPOS system or transactional data which is updated once a day. Even if this type of data starts to scale up to the terabyte size the data can be denormalized via ETL processes and flattened into a star schema, which can then be accessed quickly. However, when the data size grows above the terabyte and towards the petabyte, a more efficient methodology (instead of scaling up with all the increased purchase costs of new hardware, etc.) is to scale out to multiple cheap data nodes. This is very difficult to achieve utilizing a traditional RDBMS but is how NoSQL is designed to work.
The problem as I see it for most major business clients is who within their organizations to use to implement a NoSQL BI solution. Most of the requirements of a BI system – large data sets, speedy recovery of data, and display of results to all business users – can be implemented utilizing a NoSQL data set; however, the technology does require a different type of technical resource. I have a background in Application Development and have skills in Java, which is the tool of choice for many NoSQL database implementations – this is required to access the tools API to deliver results and to build the data model. However most current technical staff involved in BI projects in the corporate world will be skilled in SQL style RDBMS applications (Oracle, MS SQL, Microstrategy). One possible solution to this problem could be the Toad for Cloud database application by Quest Software, which I am just starting to look at in more detail – this shows great promise in its ability to interrogate cloud style NoSQL databases like Cassandra, HBase and Azure with SQL terminology and to allow transfers of data between NoSQL databases and SQL databases.
Benefits and Drawbacks
I have found benefits in using both Azure and Cassandra when it comes to the more simple tasks normally associated with database management and this, coupled with the ability to scale out instead of up with all the associated costs, does show that NoSQL has a place within the BI world. Working with Azure has shown me that it is possible to provide the client with a very scalable solution, which is not prohibitively expensive and can provide the results that they are seeking. Problems with NoSQL applications lie in the fact that they are not all of the same language and because most of them are written in interpretive language they do lack speed. However, they are easy to use by non programming personnel as their concepts of using the ‘row’ operator to select rows of data and the ‘column’ operator to select columns is extremely straight forward.
The trade off for NoSQL database is their lack of ACID and their ability to support adhoc querying. Utilizing SQL RDBMS allows us to use standard connections between servers and clients especially those stalwarts of BI Reporting, crystal reports or business objects. It also allows for clean easy connections when utilizing the most popular of object frameworks like dot Net or xml. NoSQL does not have a standard of language amongst the variations and this is a problem that I am constantly aware of when pitching the idea of utilizing a NoSQL database to the client. Normal IT departments normally have at least one SQL data access language expert in their ranks – this allows them to at least understand a BI implementation based on a SQL RDBMS. Those customers of mine that are primarily looking for a BI implementation based on EPOS data and do not require large amounts of storage are, I believe, not suitable candidates for an implementation, which includes a NoSQL database. However I have clients, whose businesses are based on either social or web 2.0 based technologies and here I have advised that they look at delivery utilizing cloud based NoSQL database solutions. I identify these companies by their requirements, which usually include a single use case statement and a requirement of access to large amounts of data. This is not to say that RDBMS cannot scale, but it does have its caveats. When reporting a particular item and having designed a few expensive queries, which incur a large overhead, you can easily distribute these between your system’s nodes ensuring lesser writes to the database however; I have yet to see a good model for scaling horizontally within a RDBMS.
Conclusions
In respect to BI, my experience has led me to believe that for the majority of EPOS based customers utilizing a RDBMS SQL based application with the possibility of a normalized star based data warehouse will suffice and provide both transactional integrity and the ability to scale as required. There will of course be exceptions to this model including both the requirement to scale out past the 1 Petabyte mark and a requirement for superfast results and it is at this point that I believe the NoSQL solutions can and should be investigated. I believe that both SQL and NoSQL applications will be implemented side by side in many organizations in the future especially as the drive to include social networking data in our results is realized. Many BI specialists including myself already utilize a plethora of specialized tools to deliver results to the customer – I cannot see any reason for not adding NoSQL into the toolbox.