Advanced Load Balancing: The Load Balancing Advisor (LBA)
In a RAC environment, its not unlikely that one node may
become overwhelmed by application requests for resources. For example, lets
assume that a two-node clustered database currently has 100 user sessions
connected across both nodes in the cluster, and that the database is using
standard server-side load balancing. If there is a sudden storm of 200
additional connections, the listeners on the two nodes will simply distribute
them evenly across both nodes, 100 to each node, resulting in 150 connections
per node. However, its possible that node RACDB2 is actually much busier than the other
node. As a result, node RACDB2 will
most likely be completely overwhelmed by those new connections resource demands,
while node RACDB1
remains relatively underutilized with plenty of additional resources.
The good news is that Oracle 10gR2 now provides an advanced
method to overcome this imbalance: the Load Balancing Advisor (LBA). The
LBA will calculate how much work each instance in the clustered database has
been asked to do. The LBA will then make sure that any new incoming work is
routed to the instance(s) that are least busy. This determination takes place on
a service-by-service basis across all instances that are providing
resources to each service.
The LBA considers the following factors when determining how
to perform this balancing act:
-
Are there any differences in processing power between
nodes?
-
Are there any sessions that are currently blocked from execution
because of waits?
-
Have any failures occurred on a node that might block
processing from continuing?
-
Are there any services that are competing for resources,
and have those services been granted a different priority to complete
their tasks at hand?
Not only does this help to balance out the total workload
across all instances in the cluster, it also insures that one node wont be
overwhelmed by requests for service by its primary application(s).
As part of its Fast Application Notification (FAN) event
transmission features, Oracle 10g RAC uses Oracle Notification Services
(ONS) to communicate the status of cluster resources to all participating node
applications in the cluster. In Oracle 10gR2, FAN added a new event type, SERVICE_METRIC,
whose event payload contains information about the relative workload of each
node in the RAC cluster, and its this information that the Load Balancing
Advisory uses to determine how to route new connections throughout the
clustered databases instances.
Oracle 10gR2 supplies an upgraded DBMS_SERVICE.MODIFY_SERVICE procedure that
modifies an existing RAC service so that it can utilize the Load Balance
Advisory for service-by-service load balancing. This procedure provides two
crucial parameters, GOAL and CLB_GOAL, that
interact to determine what methods (if any) Oracle 10gR2 will utilize to
perform load balancing. For example, if I set the GOAL parameter to either the GOAL_SERVICE_TIME
or the GOAL_THROUGHPUT
enumerated constants for a specific RAC service, Oracle 10gR2 will activate the
Load Balancing Advisory for load balancing of that services incoming
connections. Table 1 explains the difference between these two load
balancing targets.
Table 1. Oracle 10gR2 Load Balancing Advisory (LBA) Service Goals
|
|
LBA Goal
|
Explanation
|
GOAL_NONE
|
Setting DBMS_SERVICE.GOAL to
this value disables the LBA.
|
GOAL_SERVICE_TIME
|
The LBA calculates a
weighted moving average of the total elapsed time for completed work
plus the bandwidth thats available to the service to calculate the service
goodness. This goal is ideal for services whose workload may change
dramatically over a short period of time, e.g. an application that services a
clicks and mortar store that provides customer self-service through an
internet-based shopping web site.
|
GOAL_THROUGHPUT
|
The LBA calculates a
weighted moving average of throughput (i.e. the rate at which work
is completed) in addition to the bandwidth available to the service to
calculate the service goodness. This goal is best suited for long-duration
tasks that are typically queued to run serially, e.g. scheduled
jobs that handle large batches of transactions.
|
In addition, the CLB_GOAL parameter provides yet
another load balancing method: It tells the LBA how to perform connection load
balancing. See Table 2 for a list of its acceptable enumerated constant
values and the impact of setting either.
Table 2. Oracle 10gR2 Connection Load Balancing (CLB) Goals
|
|
CLB Goal
|
Explanation
|
CLB_GOAL_SHORT
|
The Load Balancing Advisory will be used for connection load
balancing only if it is enabled (i.e. set to other than GOAL_NONE).
If the LBA has been disabled, connection load balancing will utilize abridged
advice determined by CPU utilization.
|
CLB_GOAL_LONG
|
Connection load balancing
will be determined by first tallying the total number of connections per
instance, and then by counting the number of sessions per each service.
Oracle recommends using this setting for services whose applications tend to
connect for long periods of time (e.g. Oracle Forms). The Load Balancing
Advisory can be used in conjunction with this setting as long as the
connection pool has been sized to accommodate gravitation within the pool
without adding or subtracting connections. Oracle recommends this option as
the most efficient design.
|
Depending on the settings for these two parameters, the LBA
uses the metrics shown in Table 3 to make its determination of which
instance should receive the next set of incoming connections:
Table 3. Oracle 10gR2 Load Balancing Advisory (LBA) Metrics
|
|
LBA Goal
|
LBA Metric Used
|
Explanation
|
LONG
|
Session Count By Instance
|
The LBA uses this metric to
evenly distribute sessions across the clusters nodes. This works best for
services that are distributed uniformly across the clusters instances,
and the nodes have similar capacities.
|
LONG
|
Node Run Queue Length
|
If the service uses only a subset
of the RAC instances, and the nodes have dissimilar capacities, then
the LBA will attempt to place more sessions on the node with the least load at
the time that the connection was created.
|
SHORT
|
Goodness By Service
|
This method incorporates a
ranking for the quality of service that the service is experiencing at an
instance level, including whether access has been restricted from an
instance. It attempts to prevent a listener from routing all connections to
the same instance in between updates to the goodness value because each
listener will adjust its local goodness rating computed as a delta value -
as connections are distributed across the cluster. This delta represents the
average of resource time connections are consuming when they use that
service. Also, to prevent a storm surge of logins to the least busy
instance, Oracle 10gR2 will compute a threshold delta if the goodness
rating delta value is too low.
|
Proof of Concept: LBA in Oracle 10gR2 RAC
To enable my RAC database environment for LBA, Ill first
activate three different levels of load balancing using DBMS_SERVICE.MODIFY_SERVICE against the three new
services I just created:
-
The ADHOC service
has been set up with no LBA load balancing (GOAL=GOAL_NONE) and no connection load balancing,
either (CLB_GOAL=CLB_GOAL_LONG).
-
The DSS service
will use the LBA for load balancing with optimized service time as its
goal (GOAL=GOAL_SERVICE_TIME
and CLB_GOAL=CLB_GOAL_SHORT).
-
Finally, the OLTP
service will use the LBA for load balancing with a goal of throughput (GOAL=GOAL_THROUGHPUT
and CLB_GOAL=CLB_GOAL_SHORT).
Listing
5 shows the code Ive used to activate LBA management for these
services, as well as a query against the DBA_SERVICES view plus the resulting output that
verifies the proper settings for GOAL and CLB_GOAL for these services.
Now that these three services are configured for LBA, Ill
use a combination of shell scripts to place a load on one of the two nodes for
this RAC clustered database and then attempt to connect to both nodes in the
cluster using the three RAC services. Ill first execute the ConstantLoadGenerator.sh
shell script to create 100 sessions that execute a SQL script that places a
moderately heavy load on the databases CPU, and then Ill execute shell script
RandomLoadGenerator.sh to simulate
40 additional user connections for one of the three services.
How can I determine what the impact will be of running these
tests? Oracle 10gR2s GV$SERVICEMETRIC
global view shows each services goodness on each instance in the RAC
clustered database, the services predicted goodness (or delta), and summary
statistics on how much resources the service consumed during the time period. Listing 6
shows the query Ive constructed against that view; heres the result of that
query before I started my evaluations:
Current Service-Level Metrics
(From GV$SERVICEMETRIC)
Pred- CPU Elpsd # 0f
icted Time Time User
Good- Per Per Calls DBTime
Service Inst Good ness Call Call Per Per
Name Start Time End Time ID ness Incr Flags (mus) (mus) Second Second
-------- ---------- ---------- ----- ------- ------- ------- --------- --------- --------- ---------
ADHOC 15:06:41 15:06:48 1 0 1 0 0 0 0 0
15:06:23 15:06:29 2 0 1 0 0 0 0 0
15:05:02 15:06:02 1 0 1 0 0 0 0 0
15:04:37 15:05:37 2 0 1 0 0 0 0 0
DSS 15:06:41 15:06:48 1 100 100 4 0 0 0 0
15:06:23 15:06:29 2 100 100 4 0 0 0 0
15:05:02 15:06:02 1 100 100 4 0 0 0 0
15:04:37 15:05:37 2 100 100 4 0 0 0 0
OLTP 15:06:41 15:06:48 1 0 100 4 0 0 0 0
15:06:23 15:06:29 2 0 100 4 0 0 0 0
15:05:02 15:06:02 1 0 100 4 0 0 0 0
15:04:37 15:05:37 2 0 100 4 0 0 0 0
Heres the results of the first test against the ADHOC RAC
service. While my tests executed, I noticed that Oracle 10gR2 continued to
route connections to each node in the cluster regardless of the load on each
instance, and the number of connections remained relatively constant on
both nodes as new connections arrived. Here are the results of a query against GVSERVICE_METRIC
for that test:
>>> Output after ADHOC unit testing:
Current Service-Level Metrics
(From GV$SERVICEMETRIC)
Pred- CPU Elpsd # 0f
icted Time Time User
Good- Per Per Calls DBTime
Service Inst Good ness Call Call Per Per
Name Start Time End Time ID ness Incr Flags (mus) (mus) Second Second
-------- ---------- ---------- ----- ------- ------- ------- --------- --------- --------- ---------
ADHOC 15:10:48 15:10:52 1 34 1 0 6481 69808 65 454
ADHOC 15:10:36 15:10:42 2 26 1 0 1106 3419 92 31
ADHOC 15:09:02 15:10:02 1 34 1 0 0 0 0 0
ADHOC 15:08:38 15:09:38 2 26 1 0 0 0 0 0
DSS 15:10:48 15:10:52 1 100 100 4 0 0 0 0
DSS 15:10:36 15:10:42 2 100 100 4 0 0 0 0
DSS 15:09:02 15:10:02 1 100 100 4 0 0 0 0
DSS 15:08:38 15:09:38 2 100 100 4 0 0 0 0
OLTP 15:10:48 15:10:52 1 0 100 4 0 0 0 0
OLTP 15:10:36 15:10:42 2 0 100 4 0 0 0 0
OLTP 15:09:02 15:10:02 1 0 100 4 0 0 0 0
OLTP 15:08:38 15:09:38 2 0 100 4 0 0 0 0
For the tests I ran against the DSS and OLTP RAC services, however, I noticed that Oracle
10gR2 definitely tended to route connections to the node that was least loaded,
and this was reflected in the query results against the GVSERVICE_METRIC global view for these tests:
>>> Output after DSS unit testing:
Current Service-Level Metrics
(From GV$SERVICEMETRIC)
Pred- CPU Elpsd # 0f
icted Time Time User
Good- Per Per Calls DBTime
Service Inst Good ness Call Call Per Per
Name Start Time End Time ID ness Incr Flags (mus) (mus) Second Second
-------- ---------- ---------- ----- ------- ------- ------- --------- --------- --------- ---------
ADHOC 15:12:48 15:12:53 1 0 1 0 0 0 0 0
ADHOC 15:12:37 15:12:42 2 0 1 0 0 0 0 0
ADHOC 15:11:06 15:12:06 2 0 1 0 0 0 0 0
ADHOC 15:11:02 15:12:02 1 0 1 0 49585 322550 2 54
DSS 15:12:48 15:12:53 1 900 9947 0 26051 200992 14 273
DSS 15:12:37 15:12:42 2 2050 9901 0 4301 11800 56 66
DSS 15:11:06 15:12:06 2 2050 9901 0 885 916 2 0
DSS 15:11:02 15:12:02 1 900 9947 0 0 0 0 0
OLTP 15:12:48 15:12:53 1 0 100 4 0 0 0 0
OLTP 15:12:37 15:12:42 2 0 100 4 0 0 0 0
OLTP 15:11:06 15:12:06 2 0 100 4 0 0 0 0
OLTP 15:11:02 15:12:02 1 0 100 4 0 0 0 0
>>> Output after OLTP unit testing:
Current Service-Level Metrics
(From GV$SERVICEMETRIC)
Pred- CPU Elpsd # 0f
icted Time Time User
Good- Per Per Calls DBTime
Service Inst Good ness Call Call Per Per
Name Start Time End Time ID ness Incr Flags (mus) (mus) Second Second
-------- ---------- ---------- ----- ------- ------- ------- --------- --------- --------- ---------
ADHOC 15:14:48 15:14:53 1 0 1 0 0 0 0 0
ADHOC 15:14:36 15:14:42 2 0 1 0 0 0 0 0
ADHOC 15:13:06 15:14:06 2 0 1 0 0 0 0 0
ADHOC 15:13:02 15:14:02 1 0 1 0 0 0 0 0
DSS 15:14:48 15:14:53 1 100 100 4 0 0 0 0
DSS 15:14:36 15:14:42 2 100 100 4 0 0 0 0
DSS 15:13:06 15:14:06 2 100 100 4 0 0 0 0
DSS 15:13:02 15:14:02 1 100 100 4 59223 173148 0 3
OLTP 15:14:48 15:14:53 1 5200 645 0 7799 53529 46 246
OLTP 15:14:36 15:14:42 2 3800 524 0 5354 16001 55 88
OLTP 15:13:06 15:14:06 2 3800 524 0 0 0 0 0
OLTP 15:13:02 15:14:02 1 5200 645 0 0 0 0 0
Conclusion
With the advent of the Load Balancing Advisory in
Oracle 10gR2, its now possible to insure that an unexpectedly dramatic
increase in a RAC services workload does not overwhelm any single node of an
Oracle 10gR2 RAC clustered database. Oracle 10gR2 leverages proven, existing
RAC components like FAN and ONS event publishing to implement this
enhancement. Finally, several new and enhanced data dictionary views provide
excellent feedback to measure the efficiency of the Load Balancing Advisory.
Download the SQL scripts and shell scripts for this article.
References and Additional Reading
Even though Ive hopefully provided enough technical
information in this article to encourage you to explore with these features, I
also strongly suggest that you first review the corresponding detailed Oracle
documentation before proceeding with any experiments. Actual implementation of
these features should commence only after a crystal-clear understanding exists.
Please note that Ive drawn upon the following Oracle 10gR2 documentation for
the deeper technical details of this article:
B14197-03 Oracle 10gR2
Clusterware and Real Application Clusters Administration and Deployment Guide
B14203-08 Oracle 10gR2
Clusterware and Real Application Clusters Installation Guide for Linux
B14212-02 Oracle 10gR2 Net
Services Administrators Guide
B14213-01 Oracle 10gR2 Net
Services Reference
B14237-02 Oracle 10gR2
Reference
B14258-01 Oracle 10gR2
PL/SQL Packages and Types Reference
B14210-02 Oracle 10gR2
High Availability Overview
B25159-01 Oracle 10gR2
High Availability Best Practices
And these MetaLink documents are also invaluable to
obtaining a deeper understanding of how the Load Balancing Advisory has been
implemented in Oracle 10gR2:
226880.1 Configuration of Load
Balancing and Transparent Application Failover
»
See All Articles by Columnist Jim Czuprynski