Friday, September 20, 2013

Relational or non relational databases? That is the question.

We’ve all been there. Every system designer comes across one question…and every database (DB) admin is faced with consequences from bad decisions. Namely, how do we decide on using databases, which types should we go with, and how can we make sure we make the best of what we have by optimization? Here, we’ll review and answer these questions, and invite your feedback at the end. Let us know what works FOR YOU in terms of database programming and share your tips or experience in choosing DB structures in the comments section.

WHAT IS A DATABASE?

Put in simple terms, databases are nothing more than data containers. They organize data in datasets of rows and columns. They also give the user the ability to search for particular data based on different properties of data. Databases come in two major denominations: relational and non-relational. Going into definition details for each of these is way beyond the scope of this blog post, so we will only mention the primary differences.

1. Relational databases. A relational database manipulates only tables and the result of all operations are also tables. The tables are sets, which are themselves sets of rows and columns. You can view the database itself as a set of tables.

2. Non-relational databases. The non-relational flavor of databases, on the other hand, are based on a “navigational” model: a hierarchy, a linked list, a B-Tree, etc. It’s common to refer to these as ISAM (Indexed Sequential Access Method) Databases.

YOU NEED A DATABASE. SO, WHAT NEXT?

So, having concluded a definite need for a database how do we chose the best one for our applications? Well, first of all it mostly depends on the nature of the information you will store, as well as how it will be accessed.

Store inventories work great as relational databases. For example, managing a store inventory list would be a perfect (and a very simple) example of when to use a relational model DB. You would have a table of items offered by the store, a second table of stores available in the area and a final table which would keep info on the (in-store) quantity of items in each store. Essentially, this is a normalized database design, meaning that each piece of unique data is stored only once and then referenced as needed via an index. The important thing to keep in mind here (in this type of setup) is that the data is limited to how much it can grow, meaning that there is a limited number of stores as well as a limited number of items these stores can offer their customers. Hence, the choice of a relational database is quite a sound choice.

Websites do better with non-relational databases. Now when it comes to the internet, the web introduces a new scale for applications in terms of:


  • Concurrent users (millions of reqs/second)
  • Data (petabytes generated daily ( think google, facebook, etc.)
  • Processing (all this data needs processing)
  • Exponential growth (surging unpredictable demands)


Namely, web sites with very large traffic have no way to deal with these issues using existing relational database systems, even using those high end query crunchers like Oracle, MS SQL, Sybase, etc. The reason for this is simple. Related data is distributed amongst different tables which need to be joined in order to fetch that data. These joins perform more slowly the larger the datasets grow. Furthermore, distributing this data to multiple servers makes the joining of related data even slower to fetch. Sure, you can throw all your data to a single table and repeat if need be, but that kinda defeats the point of a relational database system. Popular web sites that have faced these sorts of issues include Google, Yahoo, Facebook, Twitter, Amazon, etc. Basically, web sites dealing with high traffic, massive data, large user base and user generated content.

Compared to other large scale systems such as telco applications, the above mentioned are FREE apps and can therefore compromise on data integrity and consistency. To be blunt, they wont be sued if someone hasn’t received:

his friends status update on time (FB, Twitter) or
the desired result of a search (google)
The solution for handling this type of web scaling lies in non relational database systems. Although there are a number of systems available freely for download (Redis, CouchDB, MongoDB) the top tier companies like Google, Facebook and Amazon went ahead and built their own custom systems (optimized for their own services) from scratch namely, BigTable, Cassandra and Dynamo respectively. The benefits of these systems include the following:


  • They are massively scalable
  • Highly available, decentralized and don’t have a single point of failure
  • They employ transparent sharding
  • Parallel processing
  • Built in mechanisms for automatic conflict resolution

However, these systems do come with a price, which was proven by scientists at MIT in 2002. To wit, non relational systems must compromise on at least one of the following optimizations: Consistency, Availability and/or Partition tolerance. Under no circumstances can non-relational databases be optimized for all three of the above benefits.

On one of my next "programming tips of the week", I will go deeper into real world examples of specific optimization methods given various types of problems requireg databases are their foundations.

No comments: