Two-Node Planned Manual Failover for the MySQL Database Administrator

July 6, 2010

A database administrator needs to plan for both planned and unplanned failovers, based on business requirements and service level agreements with customers. This article runs through the very basic starting points of a MySQL failover strategy, then walks through a two-node, planned, manual failover.

In the wonderful world of MySQL, there are many failover techniques to choose from, ranging from easy to slightly more difficult. Of course, “easy” and “difficult” are relative terms that depend on the skill set and personal preferences of the database administrator.

In this article I will run through the very basic starting points of a MySQL failover strategy, then walk you through a two-node, planned, manual failover. Just know that there are many ways to skin this cat, but the goal is the same wherever you go: make sure the data is right at the end!

To put it simply, there are two kinds of failover a database administrator has to deal with: planned and unplanned. As database administrators, we need to plan for both, based on business requirements and service level agreements with customers. Here are two questions that you should ask while building your failover strategy:

1.   What is our uptime requirement?

2.   Is it acceptable to lose a transaction or two?

The answers to the questions above should lead you closer to what the business wants in terms of acceptable downtime and data integrity. Most of the time these questions open up many options and possible scenarios; for now, I am keeping it simple.

There are many manual, semi-manual and automatic failover mechanisms that can be implemented to fit almost any business requirement or service level agreement. I have seen and worked with requirements that range from “as fast as you can” to “get to it when you want.” For this article, let’s just say we are shooting for somewhere in between those two.

Answer to question #1: Our acceptable amount of downtime per month will be 20 minutes.

Answer to question #2: We are dealing with non-financial data and it can be considered acceptable to lose a transaction or two during an unplanned outage.
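For perspective, 20 minutes out of the roughly 43,200 minutes in a 30-day month works out to about 99.95% availability.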

The classic place to start your failover strategy is with a two-node cluster: one master server and one slave server. All write and read activity runs against the master server, while optional read activity can also be directed to the slave server. For this example I have my master server running on 192.168.1.101 and my slave server running on 192.168.1.102. The slave server is running with --log-bin enabled and without --log-slave-updates because we only have one slave. This way the slave server is ready to become a master when you issue stop slave, reset slave, reset master and change master to. The data below will show that the slave server is caught up to the master server, using “show master status;” on the master and “show slave status\G” on the slave.
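For reference, the relevant my.cnf settings for this kind of topology might look roughly like the following. The server-id values match the Master_Server_Id fields in the output below, and the log-bin name matches the binary.NNNNNN files shown; everything else about your configuration is an assumption here.

# master (192.168.1.101)
[mysqld]
server-id = 1
log-bin   = binary

# slave (192.168.1.102)
[mysqld]
server-id = 2
log-bin   = binary
# log-slave-updates intentionally left out; there is only one slave in this topology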

Master Server (192.168.1.101):

(root@localhost) [(none)]> show master status;
+---------------+-----------+--------------+------------------+
| File          | Position  | Binlog_Do_DB | Binlog_Ignore_DB |
+---------------+-----------+--------------+------------------+
| binary.000072 | 119385271 |              |                  |
+---------------+-----------+--------------+------------------+
1 row in set (0.00 sec)
 
Slave Server (192.168.1.102):
 
(root@localhost) [(none)]> show slave status\G
*************************** 1. row ***************************
               Slave_IO_State: Waiting for master to send event
                  Master_Host: 192.168.1.101
                  Master_User: repl
                  Master_Port: 3306
                Connect_Retry: 10
              Master_Log_File: binary.000072
          Read_Master_Log_Pos: 119385271
               Relay_Log_File: mysqld-relay-bin.000245
                Relay_Log_Pos: 250
        Relay_Master_Log_File: binary.000072
             Slave_IO_Running: Yes
            Slave_SQL_Running: Yes
              Replicate_Do_DB: 
          Replicate_Ignore_DB: 
           Replicate_Do_Table: 
       Replicate_Ignore_Table: 
      Replicate_Wild_Do_Table: 
  Replicate_Wild_Ignore_Table: 
                   Last_Errno: 0
                   Last_Error: 
                 Skip_Counter: 0
          Exec_Master_Log_Pos: 119385271
              Relay_Log_Space: 550
              Until_Condition: None
               Until_Log_File: 
                Until_Log_Pos: 0
           Master_SSL_Allowed: No
           Master_SSL_CA_File: 
           Master_SSL_CA_Path: 
              Master_SSL_Cert: 
            Master_SSL_Cipher: 
               Master_SSL_Key: 
        Seconds_Behind_Master: 0
Master_SSL_Verify_Server_Cert: No
                Last_IO_Errno: 0
                Last_IO_Error: 
               Last_SQL_Errno: 0
               Last_SQL_Error: 
  Replicate_Ignore_Server_Ids: 
             Master_Server_Id: 1
1 row in set (0.00 sec)

In a planned outage you will need to promote the slave to a master. The first step is to stop all write activity to the current master server. During a planned failover, either you or the owner of the application using the MySQL server can accomplish this.
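One common way to help enforce that on the database side, beyond what is described below, is to flip the old master to read-only. Keep in mind that read_only does not block users with the SUPER privilege (or the replication threads), so treat it as a safety net rather than a hard guarantee:

(root@localhost) [(none)]> set global read_only = ON;
(root@localhost) [(none)]> show variables like 'read_only';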

As a DBA you can revoke access from the user connecting to MySQL, or remove the user account altogether. This is NOT recommended, but I threw it in here because I’ve had to do it before (a rough sketch of what that might look like follows the list below). Some problems you might encounter:

1.   The application might freak out.

2.   It is easy to forget what you changed about the user account, or whether you deleted it altogether.
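If you do go that route, a minimal sketch might look like the following. The application account and schema names here ('app_user'@'%' and appdb) are made up for illustration; substitute your own, and record the existing grants first so you can restore them exactly:

-- Note the current grants so they can be put back as they were
show grants for 'app_user'@'%';

-- Temporarily revoke write access for the (hypothetical) application account
revoke insert, update, delete on appdb.* from 'app_user'@'%';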

In a clean failover you should have the owner of the application stop reads and writes on their side, then check the processlist on the MySQL server to make sure there are no more application connections on the master.

A good way to check this is to look at the position of the master log to see if it has changed (i.e., “show master status;”). When the coast is clear, you need to look at the output from “show slave status\G” and make sure that the slave is caught up. The Master_Log_File and Relay_Master_Log_File need to be the same, along with the Read_Master_Log_Pos and Exec_Master_Log_Pos. You can see above that all four points of interest in the “show slave status\G” output are correct. This cluster is ready to be failed over.
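Concretely, those checks might look like this:

-- On the master (192.168.1.101): make sure there are no more application connections
show processlist;

-- Still on the master: run this twice, a few seconds apart; the Position should not change once writes have stopped
show master status;

-- On the slave (192.168.1.102): confirm the four fields discussed above line up
show slave status\G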

On the slave server you need to run “stop slave;” and “reset slave;”, then record the output from “show master status”.
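That sequence on 192.168.1.102, ending with the “show master status\G” whose output I recorded, looks like this:

(root@localhost) [(none)]> stop slave;
(root@localhost) [(none)]> reset slave;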

(root@localhost) [(none)]> show master status\G
*************************** 1. row ***************************
            File: binary.000003
        Position: 107
    Binlog_Do_DB: 
Binlog_Ignore_DB: 
1 row in set (0.00 sec)

On the original master server you will need to run a “change master to …” statement so replication can flow from the 192.168.1.102 server (the original slave, now the new master) to the 192.168.1.101 server (the original master, now the new slave). The “change master to …” statement is shown below:

CHANGE MASTER TO
  MASTER_HOST='192.168.1.102',
  MASTER_USER='repl',
  MASTER_PASSWORD='somebigsecret',
  MASTER_PORT=3306,
  MASTER_LOG_FILE='binary.000003',
  MASTER_LOG_POS=107,
  MASTER_CONNECT_RETRY=10;

You need to run the “change master to …” statement on the 192.168.1.101 (original master) server so it knows where to start replicating from on the new master. After the “change master to …” statement is run, start replication by running “start slave;”. Make sure you check that everything is OK with “show slave status\G” on 192.168.1.101 and “show processlist” on 192.168.1.102. It’s always a good idea to create a database on the new master and see that it replicated over.
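For example, a quick smoke test might look like this; the database name is made up for illustration:

-- On the new master (192.168.1.102)
create database failover_smoke_test;

-- On the new slave (192.168.1.101), confirm it replicated over
show databases like 'failover_smoke_test';

-- Back on the new master, clean up when you are done
drop database failover_smoke_test;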

Show slave status on the new slave (192.168.1.101):

(root@localhost) [(none)]> show slave status\G
*************************** 1. row ***************************
               Slave_IO_State: Waiting for master to send event
                  Master_Host: 192.168.1.102
                  Master_User: repl
                  Master_Port: 3306
                Connect_Retry: 10
              Master_Log_File: binary.000003
          Read_Master_Log_Pos: 107
               Relay_Log_File: mysqld-relay-bin.000002
                Relay_Log_Pos: 250
        Relay_Master_Log_File: binary.000003
             Slave_IO_Running: Yes
            Slave_SQL_Running: Yes
              Replicate_Do_DB: 
          Replicate_Ignore_DB: 
           Replicate_Do_Table: 
       Replicate_Ignore_Table: 
      Replicate_Wild_Do_Table: 
  Replicate_Wild_Ignore_Table: 
                   Last_Errno: 0
                   Last_Error: 
                 Skip_Counter: 0
          Exec_Master_Log_Pos: 107
              Relay_Log_Space: 407
              Until_Condition: None
               Until_Log_File: 
                Until_Log_Pos: 0
           Master_SSL_Allowed: No
           Master_SSL_CA_File: 
           Master_SSL_CA_Path: 
              Master_SSL_Cert: 
            Master_SSL_Cipher: 
               Master_SSL_Key: 
        Seconds_Behind_Master: 0
Master_SSL_Verify_Server_Cert: No
                Last_IO_Errno: 0
                Last_IO_Error: 
               Last_SQL_Errno: 0
               Last_SQL_Error: 
  Replicate_Ignore_Server_Ids: 
             Master_Server_Id: 2

Show processlist on the new master (192.168.1.102):

Notice the repl user has a connection with the thread id of 158 from 192.168.1.101.

+-----+-------------+---------------------+------+-------------+------+-----------------------------------------------------------------------------+------------------+
| Id  | User        | Host                | db   | Command     | Time | State                                                                       | Info             |
+-----+-------------+---------------------+------+-------------+------+-----------------------------------------------------------------------------+------------------+
| 158 | repl        | 192.168.1.101:46643 | NULL | Binlog Dump |  103 | Master has sent all binlog to slave; waiting for binlog to be updated       | NULL             |
| 160 | root        | localhost           | NULL | Query       |    0 | NULL                                                                        | show processlist |
+-----+-------------+---------------------+------+-------------+------+-----------------------------------------------------------------------------+------------------+
 

The developer or owner of the application can now point read and write activity to the new master server (192.168.1.102).

Thoughts about this process:

This really is the simplest and most manual way to run a failover. The method above is fine and dandy in a perfect world, but during an unplanned outage at 3 a.m. on a Saturday morning, after a night of partying while you’re on call… well, you get it!

Future articles will be on better failover methodologies, but I had to start somewhere!

Additional Resources

MySQL Failover Strategy using State Management, introducing MPP - Part 1
MySQL Reference Manual, 16.3.6: Switching Masters During Failover








