Two Node Planned Manual Failover for the MySQL Database Administrator


A database administrator needs to plan for both planned and unplanned failovers, based on business requirements and service level agreements with customers. This article covers the basic starting points of a MySQL failover strategy, then walks through a two-node, planned, manual failover.

In the wonderful world of MySQL, there are many failover techniques available, ranging from easy to more involved. Of course, "easy" and "difficult" are relative terms that depend on the skill set and personal preferences of the database administrator.

In this article I will run through the very basic starting points of a MySQL failover strategy, then walk you through a two-node, planned, manual failover. Just know that there are many ways to skin this cat, but the idea is pretty much the same wherever you go: make sure the data is right at the end!

To put it simply, there are two kinds of failover a database administrator performs: planned and unplanned. As database administrators, we need to prepare for both, based on business requirements and service level agreements with customers. Here are two questions you should ask while building your failover strategy:

1. What is our uptime requirement?

2. Is it acceptable to lose a transaction or two?

The answers to the questions above should lead you closer to what the business wants in terms of acceptable downtime and data integrity. In practice these questions usually open up many options and possible scenarios; for now, I am keeping it simple.

There are many manual, semi-manual and automatic failover mechanisms that can be implemented to fit almost any business requirement or service level agreement. I have seen and worked with requirements that range from "as fast as you can" to "get to it when you want." For this article, let's say we are shooting for somewhere in between the two.

Answer to question #1: Our acceptable amount of downtime per month will be 20 minutes.

Answer to question #2: We are dealing with non-financial data and it can be considered acceptable to lose a transaction or two during an unplanned outage.
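For context, a 20-minute monthly budget works out to roughly "three and a half nines" of availability. A quick back-of-the-envelope check (assuming a 30-day month for simplicity):

```python
# Translate an allowed-downtime budget into an availability percentage.
# Assumes a 30-day month for simplicity.
minutes_per_month = 30 * 24 * 60                # 43,200 minutes
allowed_downtime_minutes = 20                   # from the SLA answer above
availability = 100 * (1 - allowed_downtime_minutes / minutes_per_month)
print(f"{availability:.3f}%")                   # prints 99.954%
```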

The classic place to start your failover strategy is with a two-node cluster: one master and one slave server. All write and read activity goes to the master server, while optional read activity can run against the slave. For this example I have my master server running on 192.168.1.101 and my slave server running on 192.168.1.102. The slave server is running with --log-bin enabled and without --log-slave-updates, because we only have one slave. This way the slave server is ready to become a master when you issue stop slave, reset slave, reset master and change master to. The data below will show that the slave server is caught up to the master server, using "show master status;" on the master and "show slave status\G" on the slave.

Master Server (192.168.1.101):

mysql> show master status;
+---------------+-----------+--------------+------------------+
| File          | Position  | Binlog_Do_DB | Binlog_Ignore_DB |
+---------------+-----------+--------------+------------------+
| binary.000072 | 119385271 |              |                  |
+---------------+-----------+--------------+------------------+
1 row in set (0.00 sec)
 
Slave Server (192.168.1.102):
 
mysql> show slave status\G
*************************** 1. row ***************************
               Slave_IO_State: Waiting for master to send event
                  Master_Host: 192.168.1.101
                  Master_User: repl
                  Master_Port: 3306
                Connect_Retry: 10
              Master_Log_File: binary.000072
          Read_Master_Log_Pos: 119385271
               Relay_Log_File: mysqld-relay-bin.000245
                Relay_Log_Pos: 250
        Relay_Master_Log_File: binary.000072
             Slave_IO_Running: Yes
            Slave_SQL_Running: Yes
              Replicate_Do_DB: 
          Replicate_Ignore_DB: 
           Replicate_Do_Table: 
       Replicate_Ignore_Table: 
      Replicate_Wild_Do_Table: 
  Replicate_Wild_Ignore_Table: 
                   Last_Errno: 0
                   Last_Error: 
                 Skip_Counter: 0
          Exec_Master_Log_Pos: 119385271
              Relay_Log_Space: 550
              Until_Condition: None
               Until_Log_File: 
                Until_Log_Pos: 0
           Master_SSL_Allowed: No
           Master_SSL_CA_File: 
           Master_SSL_CA_Path: 
              Master_SSL_Cert: 
            Master_SSL_Cipher: 
               Master_SSL_Key: 
        Seconds_Behind_Master: 0
Master_SSL_Verify_Server_Cert: No
                Last_IO_Errno: 0
                Last_IO_Error: 
               Last_SQL_Errno: 0
               Last_SQL_Error: 
  Replicate_Ignore_Server_Ids: 
             Master_Server_Id: 1
1 row in set (0.00 sec)
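A minimal configuration matching the setup above might look like this in my.cnf. The server IDs and binlog base name are illustrative (the binlog name matches the `binary.NNNNNN` files shown in the output); the slave omits log-slave-updates, per the single-slave setup described earlier:

```ini
# Master (192.168.1.101)
[mysqld]
server-id = 1
log-bin   = binary

# Slave (192.168.1.102)
[mysqld]
server-id = 2
log-bin   = binary
# note: no log-slave-updates, since this is the only slave
```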

In a planned outage you will need to promote the slave to a master. The first step is to stop all write activity to the current master server. During a planned deployment, either you or the owner of the application using the MySQL server can accomplish this.

As a DBA you can revoke access for the user connecting to MySQL, or remove the user account altogether. This is NOT recommended, but I threw it in here because I have had to do it before. Some problems you might encounter:

1. The application might freak out.

2. It is easy to forget what you changed about the user account, or whether you deleted it altogether.

In a clean deployment you should have the owner of the application stop reads/writes on their side, then check the processlist on the MySQL server to make sure there are no more connections on the master.
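One way to script the "no more writes" check is to sample the master's binlog coordinates twice and confirm they have not moved. A sketch of that logic; the `fetch_master_status` callable is hypothetical and would run SHOW MASTER STATUS through your client library of choice:

```python
import time

def master_is_quiesced(fetch_master_status, wait_seconds=5):
    """Sample the master's binlog coordinates twice; if they have not
    moved over the sampling window, no writes are arriving."""
    first = fetch_master_status()     # e.g. ("binary.000072", 119385271)
    time.sleep(wait_seconds)
    second = fetch_master_status()
    return first == second
```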

A good way to check this is to look at the master's binlog position and see whether it has changed (i.e. "show master status;"). When the coast is clear, look at the output of "show slave status\G" and make sure the slave is caught up: Master_Log_File must match Relay_Master_Log_File, and Read_Master_Log_Pos must match Exec_Master_Log_Pos. You can see above that all four points of interest in the "show slave status\G" output line up. This cluster is ready to be failed over.
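That four-field comparison is easy to script. A minimal sketch, taking the SHOW SLAVE STATUS output as a dict of field names to values:

```python
def slave_caught_up(status):
    """True when the slave has both read and executed everything the
    master has logged: log file names match and positions match."""
    return (status["Master_Log_File"] == status["Relay_Master_Log_File"]
            and status["Read_Master_Log_Pos"] == status["Exec_Master_Log_Pos"])
```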

On the slave server you need to run stop slave and reset slave, then record the output of "show master status". The output from "show master status" on the slave server is as follows:

mysql> show master status\G
*************************** 1. row ***************************
            File: binary.000003
        Position: 107
    Binlog_Do_DB: 
Binlog_Ignore_DB: 
1 row in set (0.00 sec)

On the original master server you will need to run a "change master to …" statement so replication can occur from the 192.168.1.102 (original slave) server to the 192.168.1.101 (original master) server. The "change master to …" statement is described below:

CHANGE MASTER TO
  MASTER_HOST='192.168.1.102',
  MASTER_USER='repl',
  MASTER_PASSWORD='somebigsecret',
  MASTER_PORT=3306,
  MASTER_LOG_FILE='binary.000003',
  MASTER_LOG_POS=107,
  MASTER_CONNECT_RETRY=10;

You need to run the "change master to …" statement on the 192.168.1.101 (original master) server so it knows where to start replicating from on the new master. After the "change master to …" is run, start replication with "start slave;". Make sure you check that everything is OK with "show slave status\G" on 192.168.1.101 and "show processlist" on 192.168.1.102. It is always a good idea to create a test database on the new master and verify that it replicates over.

Show slave status on the new slave (192.168.1.101):

mysql> show slave status\G
*************************** 1. row ***************************
               Slave_IO_State: Waiting for master to send event
                  Master_Host: 192.168.1.102
                  Master_User: repl
                  Master_Port: 3306
                Connect_Retry: 10
              Master_Log_File: binary.000003
          Read_Master_Log_Pos: 107
               Relay_Log_File: mysqld-relay-bin.000002
                Relay_Log_Pos: 250
        Relay_Master_Log_File: binary.000003
             Slave_IO_Running: Yes
            Slave_SQL_Running: Yes
              Replicate_Do_DB: 
          Replicate_Ignore_DB: 
           Replicate_Do_Table: 
       Replicate_Ignore_Table: 
      Replicate_Wild_Do_Table: 
  Replicate_Wild_Ignore_Table: 
                   Last_Errno: 0
                   Last_Error: 
                 Skip_Counter: 0
          Exec_Master_Log_Pos: 107
              Relay_Log_Space: 407
              Until_Condition: None
               Until_Log_File: 
                Until_Log_Pos: 0
           Master_SSL_Allowed: No
           Master_SSL_CA_File: 
           Master_SSL_CA_Path: 
              Master_SSL_Cert: 
            Master_SSL_Cipher: 
               Master_SSL_Key: 
        Seconds_Behind_Master: 0
Master_SSL_Verify_Server_Cert: No
                Last_IO_Errno: 0
                Last_IO_Error: 
               Last_SQL_Errno: 0
               Last_SQL_Error: 
  Replicate_Ignore_Server_Ids: 
             Master_Server_Id: 2
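Programmatically, the health checks worth running against that output boil down to a handful of fields. A sketch, again taking SHOW SLAVE STATUS as a dict (field names as in MySQL 5.x):

```python
def replication_healthy(status):
    """Both replication threads running, no errors recorded, no lag."""
    return (status["Slave_IO_Running"] == "Yes"
            and status["Slave_SQL_Running"] == "Yes"
            and status["Last_IO_Errno"] == 0
            and status["Last_SQL_Errno"] == 0
            and status["Seconds_Behind_Master"] == 0)
```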

Show processlist on the new master (192.168.1.102). Notice the repl user has a connection with thread id 158 from 192.168.1.101:

+-----+-------------+---------------------+------+-------------+------+-----------------------------------------------------------------------------+------------------+
| Id  | User        | Host                | db   | Command     | Time | State                                                                       | Info             |
+-----+-------------+---------------------+------+-------------+------+-----------------------------------------------------------------------------+------------------+
| 158 | repl        | 192.168.1.101:46643 | NULL | Binlog Dump |  103 | Master has sent all binlog to slave; waiting for binlog to be updated       | NULL             |
| 160 | root        | localhost           | NULL | Query       |    0 | NULL                                                                        | show processlist |
+-----+-------------+---------------------+------+-------------+------+-----------------------------------------------------------------------------+------------------+
 

Your developer or the owner of the application is now able to point read and write activity to the new master server (192.168.1.102).

Thoughts about this process:

This really is the simplest and most manual way to run a failover. The method above is fine and dandy in a perfect world, but during an unplanned outage at 3 a.m. on a Saturday morning, after a night of partying while you're on call… well, you get it!

Future articles will cover better failover methodologies, but I had to start somewhere!

Additional Resources

MySQL Failover Strategy using State Management, introducing MPP – Part 1

MySQL Reference Manual, 16.3.6: Switching Masters During Failover

Chris Schneider
Certified MySQL professional, actively involved in the MySQL community for the better part of a decade. I have designed, implemented and maintained small to large scale MySQL installations while training and mentoring teams of DBAs. This includes building architectures from the ground up and improving on those that are currently in place, while emphasizing scalability, performance and ease of use. Most of the work I am involved with now is in the Web 2.0 arena. I also run a community site, EverythingMySQL.com, that promotes collaboration within the open source community, specifically around MySQL.
