AWS Redshift is a very powerful database, but it is quite different from traditional a RDBMS in terms of architecture and also the sorts of decisions that you need to make and planning that you need to perform before building and operating a Amazon Redshift cluster.
This Redshift AWS tutorial aims to give you everything that you need to get started with AWS Redshift in as little time as possible, by covering the basics of architecture and the structure of the database, optimizations that you can make and some advice on performance tuning and maintaining a AWS Redshift cluster.
What is Redshift AWS
Redshift is a fast scalable data warehouse, it focuses on fFast performance and is optimized for online analytical processing olap opposed to oltp.
Redshift is designed to deal with huge amounts of data and is billed a having a petabyte scale architecture.
It is based on postgres so postgres tools should mainly work with it. From an architectural point of view it is very different though.
It can scale up or out on aws infrastructure to accommodate a range of data sizes and performance requirements.
Redshift has a number of optimizations that it can perform, which will help in both speeding up queries and saving space.
- Data compression
- Tools to determine the most effective compression algorithm
- Columnar storage
- Can just interact with one column much faster because columns are all stored separately.
The qualities of a Columnar data stores
Efficient where a large portion of the data will be returned for most queries as well as on large datasets.
If a large part of a table is going to be returned, normal SQL dbs will end up doing full table scans rather than using indexes. This is very slow. In normal dbs, rows are stored together so you have to read all data when a full table scan is done.
Columular stores store just each column together so if only 2 columns are needed then only 2 columns will be read.
All columns making up a table have corresponding rows at the same offset.
Important facts about AWS Redshift.
Zone maps are Chunks of data with associated metadata that are contained with the data so that queries can be analysed to see if it actually needs to run and also to reduce the number of reads needed for queries..
For example min and max values
AWS Redshift is deployed as a Zonal deployment – It is not multi-az. This is so that data transfer can be optimized and lag reduced as much as possible by keeping data close together.
Redshift is Optimized for large writes – Smaller writes have similar costs to much larger writes so it pays off to load large chunks of data at once.
You can access Redshift with Sql clients – Redshift is based on PostgreSQL so clients that will connect to postgtres should work.
Sql drivers – ODBC and JDBC should work. You will need to connect to a specific url for the driver that uor want to connect to.
The SQL that Redshift uses is based on postgres sql 8.7.2 but has some differences
- Misses some postgres features like triggers and indexes
- Gains some extra features like copy or upload, comular datastore features etc
Redshift cluster configuration
Basic Redshift Configuration options:
- Use dense storage or dense compute node ds or dc
- Each node had different amounts of storage and cpu
- Can start with a single node
Leader node –
All multi node clusters have a leader node. For single node clusters the node is both the leader and a compute node.
Leader node is entrypoint for communication with cluster
Typically connect with a sql client and or a jdbc driver
AWS provides an endpoint to connect to.
When a request comes in, the request gets converted into query plans by the leader node. Instructions get sent to compute nodes as C++ binaries.
The leader node:
- Stores cluster metadata so that a query planner can work out where to send instructions and plan the best ways to query data.
- Coordinates parallel processing on compute nodes
- Aggregates data when it is returned by the compute nodes.
- Maintains catalog tables – like the pg_ tables in postgres – useful data for DBA’s
In single node clusters leader is also compute node – otherwise in multi node clusters, they are separate.
- Execute the queries that they get (C++ binaries)
- Return collected data to the leader
- Used for the backup and restore of data
- Actually store the data when it is stored locally.
- Replicate data between nodes.
Most processing happens on the compute nodes.
Components on compute nodes
- Virtual compute nodes
- 2,16 or 32 slices based on size
- Unit of data partitioning
- Slices act independently and process their own data. Files are split for better performance and distributed between slices.
- Split files into multiples of the number of slices so that they can be efficiently distributed.
- Locally attached storage
- Actually have 2.5 – 3x the advertised storage capacity for mirrored data
- 1,3,8 or 24 disks per node all split into partitions. This is for mirrored data. One partition of the compute nodes storage and others for mirried data from other nodes..
- Inside the disks – 1mb immutable chunks containing zone maps – basically metadata of the data in the block ( min max and location of next block)
- Blocks can have different compression encodings and settings
Compute node types
- Dense compute node – less storage but greater performance – SSD’s, higher IO
- Dense Storage node – more storage but less performance
- Variation between different node types
- Max cluster is 128 nodes – need larger node types for this
There is a max to storage that can be held locally. It is based on the maximum number of nodes of the type you are using and their associated storage.
Used to decide where data will go ie which compute node and then where it will go on that node.
Distribution style is set per table –
Ideally, you want to:
- distribute data uniformly over your nodes and slices.
- Minimize data movement during query execution
- Unform (EVEN)- round robin over slices within the cluster
- Useful when table is not going to be joined and also when ther is np clear choice based on something else.
- This is the default distribution style
- Key – distributed based on the values of one column
- Matching values go in the same place
- If based on joining values the data for those values will be located on the same slice
- ALL – Copy of every table is distributed to every node.
- Takes longer to load update or insert as there are more copies
- Small dimensional tables do not benefit much because cost of redistribution is already low
EVEN and KEY place a portion of data on each node.
ALL distribution ensures that every row is colocated for every join it participates in.
Use DISTSTYLE option on create table command to set this.
Create Table test_table_dist_all (
If you want to set a key distribution key, you can set it in the create table command:
Create Table test_table_dist_key (
_2 date distkey,
Get data about that table:
SELECT “column”, type, encoding, distkey, sortkey, “notnull”
WHERE tablename = “test_table_dist_key”;
Load data with a command like:
DateFormat as ‘DD/MM/YYYY’
Column compression is a column level operation – It reduces size of stored data
You can manually set the encoding to compress a column when creating a table or set it in the copy command when loading data.
You cannot change column encoding once a table is created. Your only option in order to change the encoding type is to either remake the table or add a new column.
Redshift will automatically assign an encoding when copying data in. You can analyze the table to see if you can use a better encoding to get better compression.
Analyze compression [TABLENAME];
You can add a new column to an existing table and specify the compression type to use
Redshift Sort Keys
- Sortkey is set on table creation
- Data on the disk will be stored in sort key order in blocks on disk .
- The blocks zone map will contain min and max values of this column.
- Similar to index in other systems
- Can eliminate whole blocks from needing to be queried this way.
- The sort key will need to be chosen based on use case.
- If you want to sort data on a column it might be a good choice.
- Range filtering can be helped by sort keys
- Frequent joins can benefit from using the join conditions as sort keys.
Types of sort key
- Single sort key
- Single value to sort data
- Best when using a single filter or sort column
- Compound sort key
- Multiple values in order of sort priority
- Most performance on first key
- Useful when filtered columns are always the same
- Performance benefits when filtering on any of listed keys
- Interleaved sort key
- Weights all columns evenly for sorts
- Useful for sorting on many different columns
- Needs frequent vacuums to clean up data
- The blocks need to be cleared out often
Use the sortkey option on the column in the create table statements for a single sort key.
For compound sortkey or interleaved sortkey you can specify
- Compound sortkey(columnlist);
- Interleaved sortkey (columnlist);
At the end of the create table command.
Loading and Unloading Data in Redshift
Loading – Copy command
Copy command lets you load data
Can have a source of S3, Amazon EMR, DynamoDB and from remote host via SSH.
From S3, you can load a specific Key, all objects in a specific folder or multiple files in different buckets using a manifest file.
Using S3, you specify a bucket / file sources, iam role for access permissions etc.
A manifest file is a JSON file at a specified location containing the files you want to load
If using a manifest file, you need to specify manifest in the copy command.
Unload allows you to take the results of a query and store them in s3.
Both operations require authorization. This can be
- An iam_role
- The access_key and secret access_key parameters
- Optional session_token with access keys or temporary credentials
- Credentials parameter string
Amazon Redshift Deep Dive
If you prefer video, this is a really good and detailed deep dive into Amazon Redshift, covering mostly concepts, components and design.
Some notes on Redshift resilience:
Redshift is not really built for high availability. It is build and designed to quickly load data in and query that data in a fast and scalable way. The data is expected to be elsewhere and once it is in it should be backed up after a load to a snapshot in order to protect it.
Because Redshift is a data warehouse solution for analytics real high availability is not really a priority.
That being said, there is no reason why it would go down under normal operation and it can be rebuilt from snapshots quickly.
The one exception to this is the weekly scheduled maintenance window.
A given instance must have a maintenance window and it is not possible to say do not take this instance down for maintenance.
It is possible to defer the maintenance window for up to 45 days:
If you defer your cluster’s maintenance, the maintenance window following your period of deferment is mandatory. It can’t be deferred.
It is also possible to manually change the maintenance window to be ‘yesterday’ which would skip a maintenance window until the next week. This must be done manually each week so it is not a long term solution. It is also unclear how many times you can do this.
Essentially, Redshift is designed to have a short downtime each week and it is not really avoidable (although, the instance will not be taken down if there is no maintenance item required). Amazon’s advice is to design the application to accommodate the fact that there will be downtime.
One alternative would be to use Redshift Spectrum. This allows you to keep your data in an S3 bucket instead of loading it into your Redshift instance. This would allow us to have 2 redshift clusters and both could query the same s3 bucket.
So long as the maintenance windows were managed and kept apart and some sort of load balance was put in place then we could ensure 24/7 access.
I can’t find any data on speed comparisons between loaded and unloaded data and it is likely to be quite expensive to run a second cluster just to fill a 30 minute window though.
I did find a way of manually building a second az cluster recommended by amazon but it is a fully manual process using Kinesis to stream changes to the second cluster. This is likely to be even more expensive:
Redshift performance issues
This is a big topic and has already been covered by a lot of people. I am going to include some great resources here that will save you time and walk you through diagnosing issues.
This is an excellent article detailing how to diagnose performance issues in Redshift and talking about how to fix them once you find them.
The author of that article has published some helpful queries for digging in to your cluster and seeing what’s going on:
Redshift Performance Tuning
This article is really in depth covering most things that you can look at to squeeze extra performance out of Redshift:
Cutting costs in Redshift.
The biggest problem when your data starts to grow is that you need more nodes and these each cost a lot. Therefore, if you can reduce the amount of space that you use, you can save quite a bit on the cluster.
Here are some practical ways to reduce the storage requirements in Redshift and reduce costs.
Always encode your columns
This query will analyse the specified table and return what redshift thinks is the best compression type to use for that column. YOu can make huge space savings by using the best method and as storage contributes in a large way to your costs then it can make a big impact.
analyze compression schema.table;
In order to implement this, you will want to:
- Create a new table using the recommended encoding.
- Don’t encode your sort key as you need this untouched to get the maximum performance benefit.
- Copy the data to the new table
- Drop the old table
- Optionally rename the table if it was being referenced anywhere.
VAcuuming Redshift performs a number of tasks. It:
- Make sure that your data is properly distributed across nodes which improves query performance.
- Sorts the data inside each node – again to improve your query performance.
- Cleans out unneeded/deleted data.
Removing the deleted data can free up a lot of space potentially – so vacuum frequently to keep space manageable.
Choosing the minimum number of nodes for your cluster.
In most systems, increasing the number of nodes in a cluster will increase availability because things are distributed and one node going down can be tolerated, or there are more options for failover.
Because of the way redshift works, if one node goes down, the whole cluster goes down. So the more nodes that you have, the less resilient the cluster will be.
Still, the jumps in node types cn be quite large. This means that you need ot find a balance between the number of nodes and cost.
The figures that you need to determine how much storage you will need are the daily data upload times the number of days that you need to see data for.
Don’t forget that data that isn’t often needed can be archived back down to s3.
What data are you actually querying?
Often, in all types of databases, most of the queries that you run will be on recent data. You need the old data but it is rarely used.
In this situation, it might benefit you to split your data out into separate clusters. The new data that is being queried a lot can be on more powerful SSD backed nodes but you could split your old data off of another cluster with cheaper hardware and HDD disks. Queried on that cluster will be slower but they make up a far smaller percentage of your overall workload and it will probably be a large percentage of your data.
Redshift Table size.
There are lots of ways that you can find your redshift table size as well as the size of other objects and your whole database.
This article is a good tutorial on getting that data back:
Views in redshift are slightly controversial. They =can pose performance problems but can also be useful for certain cases.
Here is a selection of useful articles on the topic: