Choosing a Database Platform
March 15, 2010
There is no silver bullet in the database world. Which database to choose depends on who you are, what youre trying to achieve, how much data you want to store, which OS/Language platform you use for applications, your budget, whether you need a data warehouse, a BI or decision supporting system on top of it, etc.
Many friends, developers, application and system architects and sometimes even system administrators often ask me the same types of questions: Bo, should I use MS SQL or Oracle database for my application?, Bo, should I migrate my Sybase databases to MS SQL?, Which database (platform) is the best?, Why dont we use open source databases such as MySQL or PostgreSQL? etc. All of these questions are essentially the same--which database platform fits the best in the application, or which database platform makes the most sense for the IT and business?
To answer any of those questions, Id ask them a dozen more questions before I could potentially recommend anything. I've been a database professional for one and a half decades, and often times people dont understand why a simple question from them would trigger so many questions from me. Hence, I've developed a reputation of over engineering, or over analyzing.
Many times, the interested person didnt even know why s/he even needed a database, what a relational database is, and what the differences (the pros and cons) of a spreadsheet, an Access database, or a MS SQL database are, or what the application can benefit from housing the database on a RDBMS platform.
While Im not here to preach about what a database is, as I already know it means different things to different IT/non-IT people. I dont want to be called a database-centric nerd, even though sometimes I dont think I can avoid that complement. However, I do think in the 21st century, people have been talking about BI (business intelligence) and I also think the most flexible, most complicated, and most maintainable systems are data driven. Therefore, Ill just give a brief introduction of the most popular database platforms that are available, and what people use them for, based on my experience and interaction with other database pros in this messy IT world.
Below are the databases I personally have supported/developed since I was in college about nineteen years ago:
In addition, I've participated in the evaluation of time series databases, such as Vhayu and OneTick. I also have a little experience with in-memory databases used in the trading industry, such as Oracle TimesTen and Sybase ASE in Memory version (one component of Sybase RAP trading edition, which has become a strong competitor in the time series DB arena, especially since they integrated with CEP earlier last year).
I've worked as either as a full-time employee or consultant for different companies on different projects, and really think there has been no silver bullet in the database world. Which database to choose really depends on the following criteria:
Comparing different database systems is just like comparing different development platforms and OS platforms:
If we do not need the database to be in the critical path (due to concerns such as locking/blocking, latency, slow I/O, etc.) of any front office application systems such as order entry, banking or telecommunication systems, and we just want the database to store the after fact data or configuration/reference information as a data repository, or if we want it as either a backup/alternative way of saving data in database rather than in files, then virtually any RDBMS that support a distributed system will do:
Of course, they all provide different features and have their own flavor of SQL language, even though they all claim to comply with the ANSI SQL , SQL-92 standards, etc. Again, this is just like Websphere/WebLogic/Oracle App Server/Sybase EServer/JBoss--all different implementations of J2EE, and they all have to be compliant and compatible with J2EE specs, yet with their own flavor and expansion.
Mission-critical (OLTP) and High Performance System
However, if we need the database to play a more critical role in a mission/business-critical system, and we want everything, such as no down time/High Availability/clustering, reliability/point-in-time recovery, scalability (for both performance and data volume), replication for DR(disaster recovery) and reporting purposes, with extreme high speed and low latency (HPC), then we will probably need an RDBMS with enterprise features, which will incur high coste.g., for some of the examples in the TPCC results, the total system cost could be up to multi-million. Below is a comparison of the database platforms based on those critical, high performance enterprise features.
HA Clustering Features
The clustering capabilities of various systems are as follows:
High performance/low latency/In memory DB
Sybase's replication is one of the main reasons why Sybase ruled Wall Street in the 1990s and is still used by many large shops (e.g., Citadel, Goldman Sachs, Bank of America, UBS, Chase, Barclays, etc.) for both heterogeneous replication and homogeneous replication.
Without wasting too many words about Sybase replication, Sybase replication can be implemented at the database level (active/active replication, better than SQL's mirroring), and transaction level (like SQL's transaction replication) with RPC, table level, stored proc level, function string level support, and MSA (multiple-site availability), and of course, log shipping (poor man's solution, in the replication world). Its mirror activator, in conjunction with EMC/SRDF, literally ensures true ZDL (zero data loss) for both planned and unplanned system outages.
Typically, middle office and back office databases house a great deal of large volume data, and performance issues usually appear during data retrieval (data sorting, grouping, summarizing, and calculating), which when running slowly will in turn affect the data writing process.
Therefore, for those types of applications, we want a database that handles large volume data very well and can also handle the reader/writer blocking issue very well.
Oracle is the best for this type of system, because in Oracle, reader and writer don't block each other; yet there's no dirty read for the reader and the data during the whole reading process/session/transaction is consistent. The overhead happens at the memory level by playing with latch and redo logs.
MS SQL 2005's new snapshot isolation using row versioning closely achieved what Oracle has, and the overhead is that there is more I/O incurred in tempdb.
Sybase ASE still has the reader and writer block each other. However, Sybase IQ has resolved the blocking issue by using snapshot versioning as well.
There are other databases that don't have the blocking issue, but some of them are not even ACID (atomicity, consistency, isolation, durability) compliant.
Companies build data warehouses to store/archive all historical data, and mine whatever data they feel needs to be put in database. Typically, data is denormalized into the data warehouse, and multi-dimensioned to provide fast retrieval for OLAP and business intelligence (BI) solutions.
Data warehousing is also useful for back testing to come up with new or enhanced business strategies. In addition, data grows exponentially in a data warehouse (cube) instead of linearly.
Advantages of Column-based RDBMS
Column-based RDBMS has an advantage now because column-based databases are multi-dimensioned by nature, and each column is self-indexed (B tree).
Examples of typical column-based databases are Google's BigTable, Sybase IQ (data store in Sybase RAP), Vertica, and KickFire, which is built on MySQL.
On average, these column-based databases are 60 to 100 times faster than traditional row-based RDBMS (e.g., Oracle, MS SQL, Sybase ASE, TeraData, DB2, MySQL) for data retrieval. However, they are slower for row level based data updating (much faster for large amount of data load, or batch load though). So far, Sybase IQ seems to be the leader in this world, which claims to house the largest data warehouse in the world with a size of over 1 PetaByte.
Additional information about these systems is available on their respective web sites.
More about Column-based RDBMS
Both Oracle and MS SQL need to build cubes to handle multi-dimensioned data for data warehousing, which incurs a lot more additional storage and design work as well. Personally, I think row-based RDBMS data warehouses will soon be replaced by column-based data warehouses.
Overall, row-based RDBMS (MS SQL, Oracle, Informix, DB2/UDB, MySQL, Sybase ASE, etc.) is better for OLTP, while column-based RDBMS (Sybase IQ, KickFire, Vertica, etc.) is better for OLAP/DSS and data warehouse. Note: those time series databases such as OneTick, kdb+ and Vhayu are not quite relational databases, but their storage structure are based on files on the file systems with each column as a separate file, which essentially is column-based. Column-based database is also useful for back testing to come up with new and enhanced trading strategies. In addition, data grows exponentially in data warehouse (cube) based on row-based RDBMS, as opposed data only growing linearly in data warehouses based on column-based RDBMS.
Traditionally, Oracle has had an advantage in the data warehouse world, but now it is losing edge to column-based RDBMS, such as Sybase IQ ,Vertica and KickFire built on MySQL. Google is using BigTable, another column-based database/storage system.
Supportability and existing investment in database technology is another thought.
I've seen quite a few successful migrations from Sybase to SQL and vice versa, but haven't seen many from Sybase/MS SQL to Oracle or vice versa. Depending on the database size and application scope and coupling with the databases, it could take a few consultants and full-time employees years to migrate Sybase/MS SQL stored procedures, rewrite applications and test them, even with the help of Oracle Migration Workbench (now we can use SQL Developer), as each and every stored procedure will have to be literally rewritten. The migration is easier if the applications only use ODBC/JDBC/ADO.NET and no stored procedures/temporary tables.
Native support of programming Languages
Having said the above, Oracle does have the best supportability/features for Java. We can store Java code in Oracle and treat it as a native Oracle function, which makes Oracle expandable (kind of like open source). Moreover, Sybase supports native Java code, too. (Sybase also has OpenServer, which is used to build event-driven applications for trading firms.) Both MS SQL/Sybase support extended stored procedures, which essentially is an interface within the DB calling external dlls written in different programming languages.
Cost of Ownership
When were looking at all of the features of those databases, one thing to keep in mind is the cost and time frame. Even if you have an experienced database team, it doesnt make much sense to spread across more than three database platforms. Additionally, it doesnt make much sense to consolidate them all and migrate to one database platform, as most of the heterogeneous databases can interact with one another through MS SQLs linked server, Oracle Gateway, or with some database replication and ETL process in place. When you consider the cost of maintaining a certain database, never isolate the database cost. Rather, consider it as an essential part of your application, your infrastructure.
I hope this article is a help for those people who will or are already in the process of picking a database for their applications.