Failover testing of a Postgres cluster

Testing failover to one of two slaves and reattaching the remaining nodes to the new master.

Starting config: master is the leader, with slave1 and slave2 replicating from it.

Post failover config: slave1 is the leader, with slave2 and the old master replicating from it.

Prerequisites:

You will need Vagrant and VirtualBox to run this tutorial, but that should be all.

Process

Turn off the master.
Promote slave1 to master.
Attach slave2 as a follower to slave1.
Attach the old master as a follower to slave1 (without rebuilding).

Steps:

Check the cluster state and replication:
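Something like the following will do the job, assuming a Debian-style install with a postgres OS user (adjust users and paths to your own setup):

    # on the master (the current leader)
    vagrant ssh master
    sudo -u postgres psql -c "SELECT client_addr, state, sync_state FROM pg_stat_replication;"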

Stop postgres on the master:
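Assuming the service is managed by systemd (adjust to your init system):

    # on the master
    sudo systemctl stop postgresql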

Promote slave1

Connect to slave1:
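For example, connect and then promote it. The data directory and version below are assumptions; on Debian-style installs, sudo pg_ctlcluster 9.6 main promote does the same thing:

    vagrant ssh slave1
    # promote this standby to be the new leader (data directory is an assumption)
    sudo -u postgres pg_ctl promote -D /var/lib/postgresql/9.6/main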

From your local machine – check the status of slave1:
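One way to do this from the host machine is a one-off command over vagrant ssh; pg_is_in_recovery() should now return false, showing the promotion worked:

    vagrant ssh slave1 -c "sudo -u postgres psql -c 'SELECT pg_is_in_recovery();'"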

Check replication slots:
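On slave1, the new leader:

    sudo -u postgres psql -c "SELECT * FROM pg_replication_slots;"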

There are none there. Let's create some for our other hosts so that we don't lose any WAL files (you may not want to do this on a production system, as it will force the new master to hold on to WAL files until the new slaves can catch up, but it is necessary if we want to attach the old master straight back in):
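Still on slave1, something like this. The 'slave2' slot name matches the primary_slot_name used further down; 'master' for the old master is an assumption, and any name will do as long as it matches that node's recovery.conf:

    sudo -u postgres psql -c "SELECT pg_create_physical_replication_slot('slave2');"
    sudo -u postgres psql -c "SELECT pg_create_physical_replication_slot('master');"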

(The old master will still be called master so that we can see what has changed, but it won't be the leader in the new config.)

OK, so now we have a new master and the cluster is back up and running. Now it's time to plug in the first follower (slave2) so that we have a standby for failover.

Connect to slave2 and stop postgres:
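Again assuming a systemd-managed service:

    vagrant ssh slave2
    sudo systemctl stop postgresql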

Edit the recovery.conf file to point to the new leader (slave1) and set recovery_target_timeline = 'latest'.
Note that primary_slot_name = 'slave2' controls which replication slot this node will use.
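A sketch of what slave2's recovery.conf (in its data directory) might look like; the connection details in primary_conninfo are assumptions for this setup:

    standby_mode = 'on'
    primary_conninfo = 'host=slave1 port=5432 user=replication_user'   # assumed connection details
    primary_slot_name = 'slave2'
    recovery_target_timeline = 'latest'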

Start postgres back up again:
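Assuming systemd again:

    # on slave2
    sudo systemctl start postgresql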

Now check that the new follower is pointing at the leader and successfully getting data:

(on slave1 – the leader)
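For example, the slot view should now show the 'slave2' slot as active, and pg_stat_replication should show the streaming connection:

    sudo -u postgres psql -c "SELECT slot_name, active FROM pg_replication_slots;"
    sudo -u postgres psql -c "SELECT client_addr, state FROM pg_stat_replication;"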

As you can see, the 'slave2' replication slot is now active and connected.
You can further check this by creating something on slave1 and seeing it appear on slave2.
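For example (the table name here is just an illustration):

    # on slave1 (the leader)
    sudo -u postgres psql -c "CREATE TABLE failover_test (id int); INSERT INTO failover_test VALUES (1);"
    # on slave2 (the follower)
    sudo -u postgres psql -c "SELECT * FROM failover_test;"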

Now it's time to put the old master back into use. We don't want to use it as the master again, because we already have a leader and switching would mean more downtime, so we will put it in as another follower:

Doing that can be tricky because failing over could have put the cluster on a different timeline. This is easy to get around though by telling the old master to use the latest timeline.

Connect to the master node and create a recovery.conf file as above:
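Connect with vagrant ssh master and create the file in the data directory. As before, the connection details are assumptions; the slot name matches the 'master' slot created earlier:

    standby_mode = 'on'
    primary_conninfo = 'host=slave1 port=5432 user=replication_user'   # assumed connection details
    primary_slot_name = 'master'
    recovery_target_timeline = 'latest'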

Start the instance back up and check that replication is connected and receiving changes:
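Assuming systemd on the old master, then checking from the leader:

    # on the old master
    sudo systemctl start postgresql
    # then on slave1 (the leader) - both slots should now be active
    sudo -u postgres psql -c "SELECT slot_name, active FROM pg_replication_slots;"
    sudo -u postgres psql -c "SELECT client_addr, state FROM pg_stat_replication;"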

As you can see, everything is connected and changes are being passed on.

All that's left now is a bit of tidying up. The 'master' node still has replication slots configured and these need to be removed.
It's simple enough to do:
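On the old master, list what is there and drop anything left over; the slot names below are an assumption about the original setup:

    sudo -u postgres psql -c "SELECT slot_name FROM pg_replication_slots;"
    sudo -u postgres psql -c "SELECT pg_drop_replication_slot('slave1');"
    sudo -u postgres psql -c "SELECT pg_drop_replication_slot('slave2');"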
