What is Apache Cassandra?


Cassandra is a fast distributed database.
It has several defining features:

  • Built-in high availability – any node can handle read and write requests, and your data is replicated to N nodes, so regardless of which node (or even which data center) goes down, you will still be able to read and write your data.
  • Linear scalability – doubling the number of (identical) nodes should double the write performance. It's basically as simple as that, as all nodes can handle all operations and there is no central control.
  • Predictable performance – throughput scales predictably with the number of identical nodes.
  • No single point of failure – nodes can go down and come back up without the front-end application being aware of it.
  • Multiple data centers are catered for, and taken advantage of, as standard.
  • Built to run on commodity hardware – so you can run it on lots of $1,000 servers rather than one or two $100,000 servers.
  • Easy to manage operationally – the system is designed to need very little ops input.

Cassandra is not a drop-in replacement for an RDBMS – it has a different data model and requires thinking about your data store in a different way.

The basic topology of a Cassandra cluster is a hash ring. There are no masters, slaves or dedicated replicas; each node is essentially equal and capable of performing any operation that another node can perform. There are no config servers either – that function, too, is shared by all nodes.

The stored data is partitioned around the ring. Each node owns a portion of the 'token' range, which is a range of hash values. The primary key (or partition key) for a given table is hashed, and each possible hash value falls within some node's portion of the token range. This means that you can easily tell which node owns a given piece of data by hashing the required value and comparing the result to the token ownership.
While any piece of data is owned by one node, the data is replicated to RF=n servers (where RF is the 'replication factor' set at keyspace creation), so multiple nodes will hold the data and could answer queries on it.
All nodes hold data and answer queries.
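
As a rough illustration of token ownership, here is a simplified Python sketch. It uses MD5 and made-up node names purely for illustration – Cassandra's real partitioner is Murmur3 and its ranges come from the cluster configuration – but the principle is the same: hashing the partition key alone tells you which node owns the data.

    import hashlib

    # Hypothetical 4-node cluster; each node owns a contiguous slice of the hash space.
    RING_SIZE = 2**64
    NODES = ["node-a", "node-b", "node-c", "node-d"]

    def token_for(partition_key: str) -> int:
        # Cassandra really uses Murmur3; MD5 stands in here purely for illustration.
        digest = hashlib.md5(partition_key.encode()).digest()
        return int.from_bytes(digest[:8], "big") % RING_SIZE

    def owner_of(partition_key: str) -> str:
        # Node i owns the range [i * RING_SIZE/4, (i+1) * RING_SIZE/4).
        token = token_for(partition_key)
        index = token // (RING_SIZE // len(NODES))
        return NODES[index]

    print(owner_of("user:42"))  # prints one of the node names - hashing alone identifies the owner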

The CAP tradeoff refers to it being impossible for a system to be both consistent and highly available during periods of network partition. If you insist on consistency, then your service will be unable to respond if it can't reach all the required nodes, and if you insist on high availability, then you must risk inconsistent data if you can't reach a particular node.
Latency between data centers also makes strict consistency impractical. Cassandra chooses availability and partition tolerance over consistency.

However – you can tune availability vs consistency in Cassandra to produce a fully consistent system, a fully HA-focused cluster, or (more commonly) a tradeoff that sits somewhere in the middle.
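
The usual rule of thumb for that tuning: with replication factor RF, a read that waits for R replicas and a write that waits for W replicas are strongly consistent whenever R + W > RF, because the read set and write set must overlap on at least one replica. A tiny sketch (plain Python, not driver code) of that arithmetic:

    RF = 3  # replicas per row

    def is_strongly_consistent(read_replicas: int, write_replicas: int) -> bool:
        # The read set and write set share at least one replica,
        # so a read always sees the latest acknowledged write.
        return read_replicas + write_replicas > RF

    print(is_strongly_consistent(2, 2))  # QUORUM write + QUORUM read with RF=3 -> True
    print(is_strongly_consistent(1, 1))  # ONE write + ONE read with RF=3 -> False (eventual)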

Replication

Data is replicated automatically and without user intervention. The only thing that you need to do is pick the number of servers that you would like your data replicated to; this is called the replication factor. Data is always copied to each replica automatically. If a machine is down, the missing data is replayed via a 'hinted handoff' when that machine becomes available again.
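
Picking the replication factor is just part of the keyspace definition. A minimal sketch using the DataStax Python driver (cassandra-driver) against a hypothetical local node, with a made-up keyspace name:

    from cassandra.cluster import Cluster

    # Connect to a (hypothetical) local node; any node in the ring can accept the request.
    cluster = Cluster(["127.0.0.1"])
    session = cluster.connect()

    # RF=3 means every row in this keyspace is stored on 3 replicas.
    session.execute("""
        CREATE KEYSPACE IF NOT EXISTS demo
        WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3}
    """)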

Consistency level.

In Cassandra, consistency can be set on a per-query basis. That means that if one query needs to return data very fast, you might set its consistency level low ('ONE' being the lowest level), but if you need to make sure that a write is replicated and will be consistent, then you could set that query to a high consistency level ('ALL' being the highest). Common options are 'ALL', 'ONE' and 'QUORUM' (a majority of the replicas). The consistency level determines how many replicas must respond with OK for a given query.

Reads and writes are faster at lower consistency levels.
Availability is negatively affected by raising the consistency level.
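
A sketch of setting consistency per query with the Python driver, assuming the 'demo' keyspace from above and a hypothetical users table already exist:

    from cassandra import ConsistencyLevel
    from cassandra.cluster import Cluster
    from cassandra.query import SimpleStatement

    session = Cluster(["127.0.0.1"]).connect("demo")

    # Fast, availability-friendly read: only one replica needs to respond.
    fast_read = SimpleStatement(
        "SELECT * FROM users WHERE user_id = %s",
        consistency_level=ConsistencyLevel.ONE)

    # Stronger write: a majority of the replicas must acknowledge it.
    safe_write = SimpleStatement(
        "INSERT INTO users (user_id, name) VALUES (%s, %s)",
        consistency_level=ConsistencyLevel.QUORUM)

    session.execute(safe_write, (42, "alice"))
    row = session.execute(fast_read, (42,)).one()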

Multi DC

Cassandra has strong support for multiple data centers. A DC is a logical grouping in Cassandra which should map to physically separate machines; along with 'racks', which group machines together, it determines which nodes the data owned by a particular node is replicated to.
A typical usage example would be to write to the local DC and replicate asynchronously to other DCs for high availability.
The replication factor is set per keyspace, per data center.
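
A sketch of what that looks like via the Python driver, with hypothetical data-center names 'dc_london' and 'dc_newyork'. Writes can then use a DC-local consistency level such as LOCAL_QUORUM so they are not slowed down by cross-DC latency, while replication to the remote DC continues asynchronously:

    from cassandra import ConsistencyLevel
    from cassandra.cluster import Cluster
    from cassandra.query import SimpleStatement

    session = Cluster(["127.0.0.1"]).connect()

    # Replication factor is declared per data center for this keyspace:
    # 3 replicas in the local DC, 2 in the remote DC.
    session.execute("""
        CREATE KEYSPACE IF NOT EXISTS demo_multi_dc
        WITH replication = {'class': 'NetworkTopologyStrategy',
                            'dc_london': 3, 'dc_newyork': 2}
    """)

    # A write acknowledged by a quorum of replicas in the local DC only
    # (the events table here is hypothetical, for illustration).
    local_write = SimpleStatement(
        "INSERT INTO demo_multi_dc.events (id, payload) VALUES (%s, %s)",
        consistency_level=ConsistencyLevel.LOCAL_QUORUM)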


Relational Databases and Big Data workloads.


This intro to Cassandra is taken from the DataStax course. I don't necessarily agree with everything – particularly their take on what a traditional RDBMS can and can't do – but I have included their view here for completeness.

Cassandra is designed for 'Big Data' workloads. In order to understand the characteristics of Big Data, let's first define 'Small Data':

This would typically be a volume of data that fits on one machine, where an RDBMS is typically fine and able to handle both the number of operations and the quantity of data. The system will support concurrent users numbering in the hundreds. It fully supports ACID.

When you want to scale such a system, you are going to do it vertically first – with a bigger host, more RAM or more processors.

Can relational databases support Big Data?



What is Hadoop?


If you have ever wondered what Hadoop, Memcached, NoSQL, MapReduce or many of the other data-based technologies that are springing up actually are, this is a superb whistle-stop tour of the market.

Hadoop is basically a system for distributing your data across multiple storage servers that all have processing capacity. Then, when you want to find out something about your data, you formulate your question and it is passed to each server, which interrogates the data that it stores.

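As a toy illustration of that idea – a pure-Python sketch, not Hadoop's actual API – each 'server' holds its own shard of the data and answers the question locally, and the partial answers are then combined:

    from collections import Counter

    # Pretend each server holds its own shard of a large collection of log lines.
    servers = [
        ["error: disk full", "ok", "error: timeout"],
        ["ok", "ok", "error: timeout"],
        ["error: disk full", "ok"],
    ]

    def ask_server(shard):
        # The "question" (count error lines by type) runs where the data lives.
        return Counter(line for line in shard if line.startswith("error"))

    # Gather and combine the partial results from every server.
    total = Counter()
    for shard in servers:
        total += ask_server(shard)

    print(total)  # Counter({'error: disk full': 2, 'error: timeout': 2})
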
One key point from the video is that Big Data and RDBMS technologies address different problems. RDBMS systems are very well suited to problems requiring OLTP and data-warehouse-type solutions, and Big Data systems are never going to encroach on this area. Big Data addresses a new problem area of unstructured data and the questions that it poses. They are two complementary rather than competing technologies.

Highly recommended viewing.