There is no silver bullet in the database world. Which database to choose depends on who you are, what you’re trying to achieve, how much data you want to store, which OS/Language platform you use for applications, your budget, whether you need a data warehouse, a BI or decision supporting system on top of it, etc.
Background
Many friends, developers, application and system architects
and sometimes even system administrators often ask me the same types of questions:
“Bo, should I use MS SQL or Oracle database for my application?”, “Bo, should I
migrate my Sybase databases to MS SQL?”, “Which database (platform) is the
best?”, “Why don’t we use open source databases such as MySQL or PostgreSQL?”
etc. All of these questions are essentially the same–which database platform
fits the best in the application, or which database platform makes the most
sense for the IT and business?
To answer any of those questions, I’d ask them a
dozen more questions before I could potentially recommend anything. I’ve been
a database professional for one and a half decades, and often times people
don’t understand why a simple question from them would trigger so many
questions from me. Hence, I’ve developed a reputation of over engineering, or
over analyzing.
Many times, the interested person didn’t even
know why s/he even needed a database, what a relational database is, and what
the differences (the pros and cons) of a spreadsheet, an Access database, or a
MS SQL database are, or what the application can benefit from housing the
database on a RDBMS platform.
While I’m not here to preach about what a
database is, as I already know it means different things to different IT/non-IT
people. I don’t want to be called a database-centric nerd, even though
sometimes I don’t think I can avoid that complement. However, I do think in the
21st century, people have been talking about BI (business
intelligence) and I also think the most flexible, most complicated, and most
maintainable systems are data driven. Therefore, I’ll just give a brief
introduction of the most popular database platforms that are available, and
what people use them for, based on my experience and interaction with other
database pro’s in this messy IT world.
Below are the databases I personally have
supported/developed since I was in college about nineteen years ago:
- dBase/FoxBase/FoxPro in the DOS/Novell/Windows 3.x world
- MS SQL Server (6.5,7.0,2000,2005 and 2008)
- Oracle (7, 8/8i,9i,10g)
- Sybase SQL/ASE Servers (4.9.2, 10, 11, 11.9.2, 12/12.5, 15.x)
- Sybase IQ 12.6/12.7
- Informix/Dynamic (6, 7)
- MySQL (4, 5/5.1)
- PostGreSQL (8.3.5)/EnterpriseDB.
In addition, I’ve participated in the evaluation of time
series databases, such as Vhayu and OneTick. I also have a little experience with
in-memory databases used in the trading industry, such as Oracle TimesTen and
Sybase ASE in Memory version (one component of Sybase RAP trading edition,
which has become a strong competitor in the time series DB arena, especially
since they integrated with CEP earlier last year).
I’ve worked as either as a full-time employee or consultant
for different companies on different projects, and really think there has been
no silver bullet in the database world. Which database to choose really depends
on the following criteria:
- Who you are
-
What you’re trying to achieve (business/functionality
requirements, performance/reliability/scalability/availability requirements) -
How much data you want to store in the current databases before
archiving them - Which OS/Language platform you want to choose for the application
- How much money you can budget for it
-
Whether you want/need to build a data warehouse, a BI or decision
supporting system on top of it eventually, etc.
Comparing different database systems is just like comparing
different development platforms and OS platforms:
- Java/J2EE to (C#, VB)/.NET
- Windows to Linux/Unix
-
Linux to proprietary Unix systems, such as
AIX/HP-UX/Solaris/Tru64/Irix - Unix to OpenMVS/UniSys/Tandem/IBM Mainframe
If we do not need the database to be in the critical path
(due to concerns such as locking/blocking, latency, slow I/O, etc.) of any
front office application systems such as order entry, banking or
telecommunication systems, and we just want the database to store the after
fact data or configuration/reference information as a data repository, or if we
want it as either a backup/alternative way of saving data in database rather
than in files, then virtually any RDBMS that support a distributed system will
do:
- MS SQL
- Oracle
- Informix
- Sybase
- TeraData
- DB2
- MySQL
- PostGreSQL, etc.
Of course, they all provide different features and have
their own flavor of SQL language, even though they all claim to comply with the
ANSI SQL , SQL-92 standards, etc. Again, this is just like
Websphere/WebLogic/Oracle App Server/Sybase EServer/JBoss–all different
implementations of J2EE, and they all have to be compliant and compatible with
J2EE specs, yet with their own flavor and expansion.
Mission-critical (OLTP) and High Performance System
However, if we need the database to play a more critical
role in a mission/business-critical system, and we want everything, such as no
down time/High Availability/clustering, reliability/point-in-time recovery,
scalability (for both performance and data volume), replication for DR(disaster
recovery) and reporting purposes, with extreme high speed and low latency (HPC),
then we will probably need an RDBMS with enterprise features, which will incur
high cost—e.g., for some of the examples in the TPCC results, the total system cost
could be up to multi-million. Below is a comparison of the database platforms
based on those critical, high performance enterprise features.
HA Clustering Features
The clustering capabilities of
various systems are as follows:
-
Oracle RAC is the best for HA clustering and load balancing and
scalability. - MS SQL Server Cluster provides HA failover but no load balancing.
- Sybase ASE 15 Cluster achieves the same thing as Oracle RAC.
- TeraData is a networked/distributed DB platform since birth.
-
MySQL Cluster claimed higher TPC-C performance benchmarks than
Oracle RAC. -
IBM DB2/UDB works very well with IBM’s pSeries/Power server
running AIX HACMP(High Availability Cluster Multiprocessing), and Veritas
VCS/Sun Cluster etc.
High performance/low latency/In memory DB
- Oracle provides TimesTen in-memory database.
- MySQL’s cluster uses an in-memory storage engine.
- Sybase provides ASE in memory database.
-
Alternatively, we can use RAMSAN/SSD (Solid State Disk) to house
MS SQL databases, then all the database related disk I/O will happen to the SSD
rather than to traditional physical disks, hence we can dramatically reduces
disk I/O latency and increase I/O throughput, and improve overall database
performance. -
There are other in memory DBs, such as SQLite or eXtremeDB, but
most of them either only support embedded systems or just support a single
user/connection at a time.
Scalability/Performance
-
With Oracle RAC and Sybase ASE 15 Cluster, if the existing
hardware cannot keep up with the performance requirement, we can simply add a
beefier box to the cluster and gradually replace all other nodes, which
protect our existing investment better. -
With MS SQL Cluster, since it only supports Active/Passive mode
(instance wise), there is no scalability at the instance level. We have to
replace the whole cluster (or pair by pair), if the existing hardware has been
throttled. -
MySQL cluster also supports adding nodes to the cluster with all
nodes being active, but it is a share nothing cluster. More nodes mean more
copies of databases, more storage and more network traffic for data replication
among all nodes in the cluster.
Replication
-
Sybase. (OK, not to sell Sybase, because people think Bo
is just a Sybase guy). Sybase has the best replication solution, which out
performs Oracle’s Data guard, or MS SQL Server’s replication (even with its
mirroring/log shipping/transactional replication, MS SQL’s replication is still
far from Sybase’s replication server). -
MS SQL. MS SQL replication has log shipping and database
mirroring for warm/standby DB solutions (active/passive, r/w and read-only),
and it supports transactional replication with the granularity at the article/object
level (per table). -
Oracle. Oracle’s Data Guard: One of the best ways for
database cloning and one of the best DR solutions (using RMAN). -
MySQL. MySQL provides shared nothing cluster, which
basically is synchronous replication among all active nodes. And its asynchronous
replication also supports at the transaction and object level. -
Storage level solution. With EMC’s BCV, NetApp’s Snap
Manager, Hitachi’s snapshots, Veritas’ volume replicator, data replication can
happen at the disk block/volume level. However, the replicant DB/volume is
offline all the time until mounted to OS and presented to DB servers.
Sybase’s replication is one of the main reasons why
Sybase ruled Wall Street in the 1990s and is still used by many large shops (e.g.,
Citadel, Goldman Sachs, Bank of America, UBS, Chase, Barclays, etc.) for both
heterogeneous replication and homogeneous replication.
Without wasting too many words about Sybase
replication, Sybase replication can be implemented at the database level
(active/active replication, better than SQL’s mirroring), and transaction level
(like SQL’s transaction replication) with RPC, table level, stored proc level,
function string level support, and MSA (multiple-site availability), and of
course, log shipping (poor man’s solution, in the replication world). Its
mirror activator, in conjunction with EMC/SRDF, literally ensures true ZDL
(zero data loss) for both planned and unplanned system outages.
Hybrid/Mixed Systems
Typically, middle office and back office databases house a
great deal of large volume data, and performance issues usually appear during
data retrieval (data sorting, grouping, summarizing, and calculating), which
when running slowly will in turn affect the data writing process.
Therefore, for those types of applications, we want a
database that handles large volume data very well and can also handle the reader/writer
blocking issue very well.
Oracle
Oracle is the best for this type of system, because
in Oracle, reader and writer don’t block each other; yet there’s no dirty read
for the reader and the data during the whole reading
process/session/transaction is consistent. The overhead happens at the memory
level by playing with latch and redo logs.
MS SQL
MS SQL 2005’s new snapshot isolation using row
versioning closely achieved what Oracle has, and the overhead is that there is
more I/O incurred in tempdb.
Sybase ASE
Sybase ASE still has the reader and writer block
each other. However, Sybase IQ has resolved the blocking issue by using
snapshot versioning as well.
Other Databases
There are other databases that don’t have the
blocking issue, but some of them are not even ACID (atomicity, consistency, isolation,
durability) compliant.
Data Warehouse
Companies build data warehouses to store/archive all
historical data, and mine whatever data they feel needs to be put in database.
Typically, data is denormalized into the data warehouse, and multi-dimensioned
to provide fast retrieval for OLAP and business intelligence (BI) solutions.
Data warehousing is also useful for back testing to come up
with new or enhanced business strategies. In addition, data grows exponentially
in a data warehouse (cube) instead of linearly.
Advantages of Column-based RDBMS
Column-based RDBMS has an advantage now because
column-based databases are multi-dimensioned by nature, and each column is
self-indexed (B tree).
Examples of typical column-based databases are
Google’s BigTable, Sybase IQ (data store in Sybase RAP), Vertica, and KickFire,
which is built on MySQL.
On average, these column-based databases are 60 to
100 times faster than traditional row-based RDBMS (e.g., Oracle, MS SQL, Sybase
ASE, TeraData, DB2, MySQL) for data retrieval. However, they are slower for row
level based data updating (much faster for large amount of data load, or batch
load though). So far, Sybase IQ seems to be the leader in this world, which
claims to house the largest data warehouse in the world with a size of over 1
PetaByte.
Additional information about these systems is
available on their respective web sites.
- http://www.kickfire.com/Solutions/Resource-Center
- http://www.sybase.com/products/datawarehousing/sybaseiq
More about Column-based RDBMS
Both Oracle and MS SQL need to build cubes to handle
multi-dimensioned data for data warehousing, which incurs a lot more additional
storage and design work as well. Personally, I think row-based RDBMS data
warehouses will soon be replaced by column-based data warehouses.
Overall, row-based RDBMS (MS SQL, Oracle, Informix,
DB2/UDB, MySQL, Sybase ASE, etc.) is better for OLTP, while column-based RDBMS (Sybase
IQ, KickFire, Vertica, etc.) is better for OLAP/DSS and data warehouse. Note:
those time series databases such as OneTick, kdb+ and Vhayu are not quite relational
databases, but their storage structure are based on files on the file systems
with each column as a separate file, which essentially is column-based. Column-based
database is also useful for back testing to come up with new and enhanced
trading strategies. In addition, data grows exponentially in data warehouse
(cube) based on row-based RDBMS, as opposed data only growing linearly in data
warehouses based on column-based RDBMS.
Traditionally, Oracle has had an advantage in the
data warehouse world, but now it is losing edge to column-based RDBMS, such as
Sybase IQ ,Vertica and KickFire built on MySQL. Google is using BigTable,
another column-based database/storage system.
Migration Thoughts
Supportability and existing investment in database
technology is another thought.
I’ve seen quite a few successful migrations from Sybase to
SQL and vice versa, but haven’t seen many from Sybase/MS SQL to Oracle or vice
versa. Depending on the database size and application scope and coupling with
the databases, it could take a few consultants and full-time employees years to
migrate Sybase/MS SQL stored procedures, rewrite applications and test them,
even with the help of Oracle Migration Workbench (now we can use SQL
Developer), as each and every stored procedure will have to be literally
rewritten. The migration is easier if the applications only use
ODBC/JDBC/ADO.NET and no stored procedures/temporary tables.
Native support of programming Languages
Having said the above, Oracle does have the best
supportability/features for Java. We can store Java code in Oracle and treat it
as a native Oracle function, which makes Oracle expandable (kind of like open
source). Moreover, Sybase supports native Java code, too. (Sybase also has
OpenServer, which is used to build event-driven applications for trading firms.)
Both MS SQL/Sybase support extended stored procedures, which essentially is an
interface within the DB calling external dlls written in different programming
languages.
Cost of Ownership
When we’re looking at all of the features of those databases,
one thing to keep in mind is the cost and time frame. Even if you have an
experienced database team, it doesn’t make much sense to spread across more
than three database platforms. Additionally, it doesn’t make much sense to
consolidate them all and migrate to one database platform, as most of the
heterogeneous databases can interact with one another through MS SQL’s linked
server, Oracle Gateway, or with some database replication and ETL process in
place. When you consider the cost of maintaining a certain database, never
isolate the database cost. Rather, consider it as an essential part of your
application, your infrastructure.
I hope this article is a help for those people who will or
are already in the process of picking a database for their applications.