What is Apache Cassandra?

Cassandra is a fast, distributed database.
It has several defining features:

  • Built-in high availability. – Any node can handle read and write requests, and your data is replicated to a configurable number of nodes, so regardless of which node (or even a whole data centre) goes down, you will still be able to read and write your data.
  • Linear scalability. – Doubling the number of (identical) nodes should double the write performance. It's basically as simple as that, as all nodes can handle all operations and there is no central control.
  • Predictable performance. – Doubling the number of identical nodes should double the write throughput.
  • No single point of failure. – Nodes can go down and come back up without the front-end application becoming aware of it.
  • Multiple data centres are catered for and taken advantage of as standard.
  • Built to run on commodity hardware. – You can run it on lots of $1,000 servers rather than one or two $100,000 servers.
  • Easy to manage operationally. – The system is designed to need very little ops input.
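
To make the replication point concrete, here is a minimal Python sketch of the idea: keys are placed on a hash ring, each key is copied to a fixed number of nodes, and reads still succeed when one of those nodes is down. This is a toy model only, not how Cassandra is implemented. The ToyCluster class, its method names and the use of MD5 as the hash are all assumptions made for illustration; real Cassandra uses its own partitioner and configurable consistency levels, which this sketch ignores.

```python
import hashlib
from bisect import bisect_right


class ToyCluster:
    """Hypothetical toy model: a hash ring with a fixed replication factor."""

    def __init__(self, node_names, replication_factor=3):
        # Assumes replication_factor <= number of nodes.
        self.rf = replication_factor
        # Place each node on the ring at a position derived from its name.
        self.ring = sorted((self._token(name), name) for name in node_names)
        self.tokens = [token for token, _ in self.ring]
        self.storage = {name: {} for name in node_names}
        self.down = set()  # names of nodes that are currently offline

    @staticmethod
    def _token(key):
        # MD5 is used here only as a stable hash for the demo (an assumption).
        return int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)

    def replicas(self, key):
        # Walk clockwise from the key's position and take the next rf nodes.
        start = bisect_right(self.tokens, self._token(key)) % len(self.ring)
        return [self.ring[(start + i) % len(self.ring)][1] for i in range(self.rf)]

    def write(self, key, value):
        # Store the value on every replica that is currently up.
        live = [node for node in self.replicas(key) if node not in self.down]
        if not live:
            raise RuntimeError("no live replica for key")
        for node in live:
            self.storage[node][key] = value
        return len(live)

    def read(self, key):
        # Return the value from the first live replica that holds it.
        for node in self.replicas(key):
            if node not in self.down and key in self.storage[node]:
                return self.storage[node][key]
        raise KeyError(key)


cluster = ToyCluster(["n1", "n2", "n3", "n4", "n5"], replication_factor=3)
cluster.write("user:42", "alice")
cluster.down.add(cluster.replicas("user:42")[0])  # take one replica offline
print(cluster.read("user:42"))  # still readable from a surviving replica
```

Because every node sits on the same ring and there is no master, adding nodes spreads the keys over more machines, which is the intuition behind the linear scalability claim above.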

Relational Databases and Big Data workloads.

This intro to Cassandra is taken from the DataStax course. I don’t necessarily agree with everything – particularly their take on what a traditional RDBMS can and can’t do – but I have included their view here for completeness.

Cassandra is designed for ‘Big Data’ workloads. In order to understand the characteristics of Big Data, let’s first define ‘Small Data’:

This is typically a volume of data that fits on one machine, where an RDBMS is perfectly able to handle the number of operations and the quantity of data. Such a system supports concurrent users in the hundreds, and it fully supports ACID.

When you want to scale such a system, you are going to do it vertically first – with a bigger host, more RAM or more processors.

Can relational databases support Big Data?