Relational Databases and Big Data workloads.
This intro to Cassandra is taken from the DataStax course. I don’t necessarily agree with everything, particularly their take on what a traditional RDBMS can and can’t do, but I have included their view here for completeness.
Cassandra is designed for ‘Big Data’ workloads. In order to understand the characteristics of Big Data, let’s first define ‘Small Data’:
This would typically be a volume of storage that fits on one machine, where an RDBMS is able to handle both the number of operations and the quantity of data. The system will support concurrent users numbering in the hundreds, and it fully supports ACID.
When you want to scale such a system, you are going to do it vertically first – with a bigger host, more RAM or processors.
Can Relational databases support big data?
The first thing you find is that there is no clear point at which a system becomes ‘Big Data’ and you can no longer use an RDBMS. In fact, RDBMSs are useful for a lot of applications that you might consider Big Data. However, as you scale, you will start running into difficulties and will have to consider tradeoffs in order to make your system perform acceptably.
Firstly, as you get bigger (in terms of data, but more so in terms of throughput), you will find that ACID is harder to maintain. The course claims that it is a ‘lie’, which I think is a bit strong, but you do have to start thinking about where it will or won’t work for you.
For example, synchronous replication will generally make performance too slow for a production system, so you are left with asynchronous replication. By definition, there will be a gap between the time data is written to the master and when it is applied to the slave (to use PostgreSQL as an example). This replication lag can be monitored and dealt with, but if your requirement is for almost instantaneous consistent reads then you have a difficult problem to solve. You either have to wait for changes to be replicated or accept that the data you get back may not represent the most recent changes.
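The stale-read problem above can be sketched in a few lines. This is a minimal, self-contained simulation (all class and method names are illustrative, not a real replication API): writes go to the master and are only shipped to the replica later, so a read issued in between misses them.

```python
# Sketch of asynchronous replication and stale reads.
# All names here are hypothetical, for illustration only.

class Master:
    def __init__(self):
        self.data = {}
        self.log = []          # replication log shipped to replicas

    def write(self, key, value):
        self.data[key] = value
        self.log.append((key, value))

class AsyncReplica:
    def __init__(self, master):
        self.master = master
        self.data = {}
        self.applied = 0       # position reached in the master's log

    def apply_pending(self):
        # In a real system this runs continuously but lags the master.
        while self.applied < len(self.master.log):
            key, value = self.master.log[self.applied]
            self.data[key] = value
            self.applied += 1

    def read(self, key):
        return self.data.get(key)

master = Master()
replica = AsyncReplica(master)

master.write("balance", 100)
print(replica.read("balance"))   # None: the write has not replicated yet

replica.apply_pending()          # replication catches up
print(replica.read("balance"))   # 100
```

The gap between the two reads is exactly the replication lag: either you block until `apply_pending` has run (synchronous behaviour, slow) or you tolerate the stale read.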
Other issues raised are:
- Third normal form does not scale and queries are unpredictable
- users are impatient
- data must be denormalized
- Disk seeks are slow, and there is a big slowdown when data is not in memory, which will be an issue on a large data set.
- The solution is to build tables that serve specific slow queries – this leads to data duplication
The typical solution is to distribute your data across nodes through some sharding solution.
There are a number of difficulties and problems that are normally faced:
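The simplest sharding scheme is hash-mod-N: hash the key and take the remainder by the number of shards. A quick sketch (hypothetical key names) also shows why re-sharding is painful – adding one shard changes the placement of most keys:

```python
import hashlib

# Naive hash sharding: shard = hash(key) mod N.
def shard_for(key: str, num_shards: int) -> int:
    digest = hashlib.md5(key.encode()).hexdigest()
    return int(digest, 16) % num_shards

keys = [f"user:{i}" for i in range(1000)]

before = {k: shard_for(k, 4) for k in keys}   # 4 shards
after = {k: shard_for(k, 5) for k in keys}    # add a fifth shard

moved = sum(1 for k in keys if before[k] != after[k])
print(f"{moved} of {len(keys)} keys change shard")  # roughly 4 in 5
```

With mod-N placement, going from 4 to 5 shards relocates about 80% of the keys, which is exactly the manual data-moving pain described below; consistent hashing (which Cassandra uses) is designed to shrink that fraction.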
- Data is distributed
- no more joins
- no more aggregations
- You need to denormalize all the things
- Querying secondary indexes means hitting every shard (because the planner can’t work out where the relevant data is ahead of time)
- adding shards requires manually moving data
- Schema Changes can be difficult to propagate over the nodes.
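The scatter/gather point above is worth making concrete. In this sketch (table and column names are made up), a lookup by the shard key is routed to exactly one shard, but a lookup by any other column has to query every shard, because the router cannot know in advance where the matching row lives:

```python
# Sketch: primary-key lookups hit one shard; secondary-index
# lookups must scatter to all shards and gather the results.

NUM_SHARDS = 4
shards = [dict() for _ in range(NUM_SHARDS)]   # shard -> {user_id: row}

def shard_for(user_id: int) -> int:
    return user_id % NUM_SHARDS                # sharded by primary key

def insert(user_id, email):
    shards[shard_for(user_id)][user_id] = {"id": user_id, "email": email}

def get_by_id(user_id):
    # One shard: the router computes exactly where the row lives.
    return shards[shard_for(user_id)].get(user_id)

def get_by_email(email):
    # Every shard: the router cannot predict placement from the email.
    for shard in shards:
        for row in shard.values():
            if row["email"] == email:
                return row
    return None

insert(42, "alice@example.com")
print(get_by_id(42))                       # touches 1 shard
print(get_by_email("alice@example.com"))   # touches all 4 shards
```

This is why the course's advice is to denormalize: build a second table sharded by email, trading duplicated data for single-shard lookups.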
In fairness, these points are being addressed in Postgres, and tools like pg_shard go a long way towards making this easier and more feature rich.
The next issue is high availability.
Most applications at some level require high availability but big data apps perhaps more so. The problem with most RDBMS systems is that they require manual intervention at some level.
It can of course be automated with scripts or other systems, but this is a bolted-on hack: it is another moving part that is not part of the core system, and someone must be responsible for failing over the master at some level. The biggest problem, though, is that before failover can happen the failure must first be detected, so there must be a window of time in which the system does not work!
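The unavoidable detection window can be put into numbers. This back-of-the-envelope sketch (all figures are illustrative, not from any particular tool) assumes a heartbeat-based failure detector: the standby only declares the master dead after several heartbeats go missing, and only then does promotion start:

```python
# Illustrative failover arithmetic for a single-master setup.
# All numbers are assumptions, not measurements.

HEARTBEAT_INTERVAL = 1.0    # seconds between heartbeats
MISSED_BEFORE_DEAD = 3      # heartbeats missed before declaring failure
FAILOVER_TIME = 5.0         # promote standby, repoint clients

detection_window = HEARTBEAT_INTERVAL * MISSED_BEFORE_DEAD
minimum_downtime = detection_window + FAILOVER_TIME
print(f"minimum downtime on master failure: {minimum_downtime:.1f}s")
```

Shrinking the detection window means tolerating fewer missed heartbeats, which raises the risk of false failovers on a busy network, so in practice the downtime floor cannot be driven to zero.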
Another issue is that because there is only one master, many things can cause downtime – hardware problems for example but also patching and upgrades.
Again, this issue may well be addressed soon in Postgres with the advent of Bi-Directional Replication and multi-master setups, but it is not there yet.
When looking for the right database system for your application, the trend is to go for the new and cool system, and sometimes that is the best option – these systems were built for a reason. RDBMS systems have been working well for a long time now, though, and continue to improve. It pays to take a hard look at an RDBMS and work out whether any of its limitations will really be an issue, because most NoSQL systems come with some fairly major limitations of their own, and RDBMSs like Postgres are still far more flexible and faster for most use cases.
In summary, some of the difficulties that you may face when using a RDBMS are:
- Scaling is a pain
- ACID is naive at best
- you aren’t truly consistent
- Re-sharding is a nightmare
- You may well need to denormalize for performance
- HA is complicated and requires operational overhead
Some lessons learnt by the designers of Cassandra and factored into the design:
- Consistency is not practical
- So give the users the choice (consistency vs performance)
- Manual sharding and rebalancing is hard
- So let’s build it in and let the core engine deal with it.
- Every moving part makes the system more complex
- So let’s simplify our architecture
- Scaling up is expensive
- We want commodity hardware
- Scatter/gather is no good
- We denormalize for real time query performance
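The first lesson, giving users the choice between consistency and performance, rests on a simple rule for quorum-style replication: with N replicas, reads that contact R replicas and writes acknowledged by W replicas are guaranteed to overlap whenever R + W > N, so the read sees the latest write. A small sketch (function name is my own, not Cassandra API) of how the common consistency levels play out:

```python
# Cassandra-style tunable consistency: with N replicas, a read of R
# replicas and a write to W replicas always overlap when R + W > N.

def is_strongly_consistent(n: int, r: int, w: int) -> bool:
    return r + w > n

N = 3  # replication factor

# QUORUM reads and writes (R = W = 2): overlapping, consistent.
print(is_strongly_consistent(N, r=2, w=2))   # True

# ONE reads and writes (R = W = 1): fast, but reads may be stale.
print(is_strongly_consistent(N, r=1, w=1))   # False
```

Choosing R and W per query is exactly the consistency-versus-performance knob the course describes: QUORUM/QUORUM buys consistency at the cost of waiting for more replicas, while ONE/ONE minimizes latency and accepts stale reads.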