Making the Right Database Choice
If someone were to ask, should I use SQL or NoSQL database, the obvious answer is "it depends". Depends on what? It depends on various aspects and that is what I would like to share my views on, today
There are so many theories and so many databases out there that this is always a big decision point especially in green-field projects. Enterprises that have already invested in one or the other do not leave much of a choice many times.
But if you are starting fresh and need to make a choice, I would propose a simplified 4 step process which would work for most cases, though complex and nuanced cases may require greater thought and deeper considerations.
Step 1: Should you use an RDBMS or a NoSQL Database?
First consideration should be whether an RDBMS itself would serve the purpose. If yes, do not think twice about going for a NoSQL DB. There are a whole host of features taken for granted by using an RDBMS. They come in very handy and make the ease of use immense.
What is the fundamental difference between the two? One is an ACID Database and the other is a BASE data base
You can see a detailed definition here
Scale, is one of the most important features brought in by NoSQL DBs. if and only if you need this to the extent that RDBMS cannot provide, think of a NoSQL DB as it comes with its own backpack of nuances that need to be taken care of.
Go for an RDBMS, if you need:
Consistent data with ACID (Atomicity, Consistency, Isolation and Durability) support.
Complex dynamic queries requiring stored procedures or views
Option to migrate to another database without significant change to existing application
Data Warehouse, Analytics or BI use cases
Also be aware that now a days there are RDBMS like PostgreSQL which provide horizontal scalability too
Think of NoSQL DB in these cases:
If you are dealing with semi-structured or unstructured data
You need flexible schema
You have very limited ways of accessing data/querying data
You have to deal with very high velocity transactions
You have very large volume of data requiring quick and cheap scalability
You do not have complex queries, stored procedures or views
You do not have data warehouse needs, analytics or BI use cases
Step 2: Which NoSQL?
If you have decided on an RDBMS, I do not have to say much. But if you have a use case for NoSQL DB, then the question is which one should I choose?
For this, let's understand the 4 broad types of noSQL DBs that exist.
Key-Value Stores - Data is stored as key-value pairs
Document Stores - Data is stored as JSON, XML or BSON documents
Column-Oriented Stores - Data is organised by columns
Graph Databases - Data storage focuses on relationships between data elements
Here is a ready-reckoner check list for when to use what - a good place to start with:-
Key Value Datastore Checklist:
Simple Schema
High velocity read/write with no frequent updates
High performance and scalability
No complex queries involving multiple keys or join
Document DB Checklist:
Flexible Schema with complex querying
JSON/BSON or XML data formats
Leverage complex indexes (multikey, geospatial, full text search etc.)
High performance and balanced read-write ratio
Column-Oriented Database:
High volume of data
Slower writes but fast reads (with compressed storage)
Data extractions by columns using row keys
No ad-hoc query patters, complex indices or high level aggregations
Graph DB:
Need to maintain connections between data points
The links and relationships between data is the most important aspect
Typical for social networks and knowledge graphs
So, let's have a look at some of the example databases of each type:
Now, suppose I have narrowed down to one type, there are so many of the same type. Which within that should I choose? Say, Column-oriented DB is what I need. Should I go with Cassandra or HBase? If I need a document DB, should I go with Couchbase or MongoDB.
This is answered in Step 3:
Step 3: Apply the CAP theorem
What does the cap theorem say? A distributed database can only provide 2 out of 3 guarantees between:
Consistency - All nodes in the cluster have consistent data and a read request returns the most recent write from any node.
Availability - A non-failing node must always respond to requests in a reasonable time
Partition Tolerance - System continues to operate during network or node failures.
As you can see, within Document DB, if you want CP, you will go for MongoDB and if you want AP, you will go for Couchbase.
In Column oriented, if you want CP, you will go for HBase and if you want AP, you will go for Cassandra.
This is a very simplified view. Each of them are further tunable but I am not getting there. Would leave it for further research to you once you come up to this point.
Step 4: Other Considerations
Even if you have reached this far, you are not done. There are a few other considerations that look trivial to begin with but will matter a lot in the long run. So do not over look the support provided by these databases on these aspects:
Backup And Recovery
Replication of data: Is it Synchronous, Asynchronous or Quorum-based?
What are the Read/Write and indexing strategies?
Do they provide a Concurrency control?
Support for Security, access controls and encryption at rest
Query methods supported
DR Configuration and Uptime capabilities
Conclusion
If you have been able to apply all of the above 4 steps and its checklists and answer yourself based on the use case you are looking for, you would have nearly made the right decision 80 to 85% of the times.
Every use case will have certain very custom needs that you may have to look at beyond this generalised way.
Amazing!
Good one !
brilliant. really enjoyed reading this.