Monitoring Exadata Storage Servers with the CellCLI

Oracle Exadata Database Machine provides an all-inclusive solution for highly available databases and extremely fast I/O access to data.  It combines Oracle’s Grid Infrastructure and Real Application Clusters database solution with Exadata Storage Server technology in a pre-configured package.

Exadata is a platform for systems ranging from data warehouses running large, scan-intensive operations to online transaction systems that need high concurrency.

The Exadata Storage Servers are based on the 64-bit Intel-based Sun Fire Servers and they are shipped preloaded with Oracle Enterprise Linux x86_64 operating systems, the Exadata Storage Server software and InfiniBand protocol drivers.

While Exadata consistently provides amazing performance, like anything Oracle, it is important that DBAs are able to monitor the Exadata Storage Servers for both potential performance issues and errors.

Many aspects of the Exadata Storage Servers can be monitored, including current active requests, hardware sensors, disk I/O errors, network errors, free space and the metrics that the cell collects.  In this article, we’ll focus on viewing metrics and working with metric alerts using the CellCLI command line tool.

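To set up and configure the cells for alerts and notifications you should be logged into the Exadata Storage Server(s) using the celladmin account.  The cellmonitor account is limited to viewing information (the LIST commands), so it is suitable for checking metrics and alerts but not for changing the configuration.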

Overview of Metrics Monitoring

First, let’s take a look at how the Exadata storage server monitoring works.  The primary process that manages Exadata storage servers is CELLSRV.  It periodically records important metrics on components such as the CPUs, cell disks, grid disks, flash cache and IORM (I/O Resource Management).  These metrics are initially stored in memory.  The MS (Management Server) retrieves these metrics from CELLSRV, keeps a subset of the values in memory and once an hour writes a history to an internal disk repository.  The retention period for these metrics and alert information defaults to seven days and is controlled by a storage server setting called metricHistoryDays.  It is changed using an ALTER CELL command in CellCLI on each storage server.

Viewing Metric Information

Metrics are at the center of the monitoring solution, and each metric has the following significant attributes:

  • name
  • metricObjectName – the specific object being measured such as the specific cell disk
  • objectType – the type of object being measured, one of:
    • IORM_CONSUMER_GROUP
    • IORM_DATABASE
    • IORM_CATEGORY
    • CELL
    • CELLDISK
    • CELL_FILESYSTEM
    • GRIDDISK
    • HOST_INTERCONNECT
    • FLASHCACHE
  • unit
    • number
    • percentage
    • F (fahrenheit)
    • C (celsius)
  • metricValue
  • metricType
    • cumulative (since it was created)
    • instantaneous (at the time the metric was collected)
    • rate (change over time)
    • transition (collected when the metric value changed)

The metric names follow several naming conventions that are worth knowing, because they help us understand what we are looking at (or for) when managing the Exadata Storage Server metrics.

Metric names are prefixed as follows:

CL_ (cell)

CD_ (cell disk)

GD_ (grid disk)

FC_ (flash cache)

DB_ (database)

CG_ (consumer group)

CT_ (category)

N_ (interconnect network)

I/O related metrics are further identified by codes that indicate the operation(s) being measured:

IO_RQ (requests)

IO_BY (number of MB)

IO_TM (latency)

IO_WT (wait time)

They might also include a code to indicate reads (_R) or writes (_W), followed by an indicator of large (_LG, > 128 KB) or small (_SM, <= 128 KB) I/O and a trailing code such as _RQ (requests) or _SEC (seconds).  While this may sound complicated, the names do start to make sense after you have worked with them for a while.

For example:

CD_IO_RQ_W_LG would be the number of large write requests on a cell disk. 

GD_IO_BY_R_SM_SEC is the number of MB of small block I/O reads per second on a grid disk.
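The naming convention above can be sketched as a small lookup-based decoder.  This is only an illustration, not an Oracle tool: the function name describe_metric is hypothetical, the tables contain only the codes listed in this article, and reading a trailing RQ or SEC as "per request" or "per second" is an assumption based on the examples above.

```python
# Lookup tables transcribed from the naming conventions described in the article.
PREFIXES = {
    "CL": "cell",
    "CD": "cell disk",
    "GD": "grid disk",
    "FC": "flash cache",
    "DB": "database",
    "CG": "consumer group",
    "CT": "category",
    "N": "interconnect network",
}
IO_CODES = {"RQ": "requests", "BY": "number of MB", "TM": "latency", "WT": "wait time"}
DIRECTIONS = {"R": "read", "W": "write"}
SIZES = {"LG": "large", "SM": "small"}
# Assumption: a trailing RQ/SEC is interpreted as a "per ..." qualifier.
SUFFIXES = {"RQ": "per request", "SEC": "per second"}


def describe_metric(name):
    """Describe the recognised parts of a metric name (hypothetical helper).

    Only names shaped like <prefix>_IO_<code>[_R|_W][_LG|_SM][_RQ|_SEC] are
    fully decoded; any other token is returned under "unrecognised".
    """
    tokens = name.upper().split("_")
    info = {"object": PREFIXES.get(tokens[0], tokens[0]), "unrecognised": []}
    rest = tokens[1:]
    # The token after IO names what is being measured.
    if len(rest) >= 2 and rest[0] == "IO" and rest[1] in IO_CODES:
        info["measure"] = IO_CODES[rest[1]]
        rest = rest[2:]
    for token in rest:
        if token in DIRECTIONS and "direction" not in info:
            info["direction"] = DIRECTIONS[token]
        elif token in SIZES and "size" not in info:
            info["size"] = SIZES[token]
        elif token in SUFFIXES and "rate" not in info:
            info["rate"] = SUFFIXES[token]
        else:
            info["unrecognised"].append(token)
    return info


if __name__ == "__main__":
    for metric in ("CD_IO_RQ_W_LG", "GD_IO_BY_R_SM_SEC"):
        print(metric, describe_metric(metric))
```

Names that do not follow the I/O pattern are not rejected; their extra tokens simply end up in the "unrecognised" list, which makes it easy to spot conventions this sketch does not cover.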

 

To see the specific details about any of the metrics, use the LIST METRICDEFINITION command.  For example, to see the detailed information for all cell disk metrics, enter the following at the CellCLI> prompt:

	LIST METRICDEFINITION WHERE objectType='CELLDISK' DETAIL

To view the history of any given metric, use the LIST METRICHISTORY command in CellCLI.  To see the current value of a metric, use LIST METRICCURRENT.  The following command shows the history of the flash cache metrics collected after a specific date and time:

	LIST METRICHISTORY WHERE name like 'FC_.*' and collectionTime > '2013-01-31T13:15:30-08:00'

Or, to see the current value of metrics for all grid disks:

	LIST METRICCURRENT WHERE objectType='GRIDDISK'
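If the output of a command like this is captured as text, for example for trending in another tool, it can be turned into records with a few lines of Python.  The sketch below assumes the default listing is laid out as whitespace-separated columns (metric name, object name, value, optional unit); that layout, the sample lines and the function name parse_metric_lines are assumptions for illustration, so check them against the output of your own cells.

```python
def parse_metric_lines(text):
    """Parse lines assumed to look like: <name> <metricObjectName> <value> [unit]."""
    rows = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) < 3:
            continue  # blank line or something that does not fit the layout
        name, object_name, raw_value = parts[0], parts[1], parts[2]
        try:
            # Values may contain thousands separators, e.g. 1,234
            value = float(raw_value.replace(",", ""))
        except ValueError:
            continue  # third column was not numeric; skip the line
        rows.append(
            {
                "name": name,
                "metricObjectName": object_name,
                "metricValue": value,
                "unit": " ".join(parts[3:]),
            }
        )
    return rows


# Hypothetical sample text, not captured from a real storage server.
SAMPLE = """
    GD_IO_BY_R_LG    DATA_CD_00_cell01    1,234 MB
    GD_IO_RQ_W_SM    DATA_CD_00_cell01    56,789 IO requests
"""

if __name__ == "__main__":
    for row in parse_metric_lines(SAMPLE):
        print(row)
```

Lines that do not match the assumed layout are skipped rather than raising an error, so headers or prompts in the captured text do not break the parse.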

Working with Metrics Alerts

As administrators, not only can we view the metrics and the metric history, we are also able to define alert thresholds (both warning and critical) on many of these metrics along with I/O error counts, memory utilization and IORM metrics. Additionally, once an alert has been generated, actions taken to evaluate and resolve the alert can be tracked through the CELLCLI.

Alerts generated by the Exadata Storage Servers have the following attributes:

  • alertSource
    • BMC (baseboard management controller)
    • Metric
    • ADR (automatic diagnostic repository)
  • severity
    • critical
    • warning
    • info
    • clear
  • alertType
    • stateful
    • stateless
  • metricObjectName
  • examinedBy
  • metricName
  • name
  • description
  • alertAction (recommended action to perform)
  • alertMessage (brief information)
  • failedMail (intended recipient of a failed notification)
  • failedSNMP (intended SNMP subscriber of a failed notification)
  • beginTime
  • endTime
  • notificationState
    • 0 (never tried)
    • 1 (sent successfully)
    • 2 (retrying – up to 5 times)
    • 3 (five failed retries)

To learn more about the details of the alert definitions, use the LIST ALERTDEFINITION command in CellCLI and indicate which attributes you would like to see:

	LIST ALERTDEFINITION ATTRIBUTES name, metricName, description

To see warning level alerts that have been generated, and not yet examined by an administrator:

	LIST ALERTHISTORY where examinedBy = '' and severity = 'warning' DETAIL

To mark an alert as examined (where nnnn is the alert ID):

	ALTER ALERTHISTORY nnnn examinedBy="Karen"

To create thresholds on metrics, use the CREATE THRESHOLD command and specify the name of the metric, the warning and critical levels, the comparison operator, the number of occurrences and the observation period.  The occurrences attribute is the number of consecutive measurements that must violate the threshold before the alert is raised.  The observation attribute indicates the number of measurements over which the metric values are averaged.

For example, to create a threshold on waits for small I/O requests for an IORM category called online that would give you a warning above 2500 milliseconds and a critical alert above 4000 milliseconds, you would enter something like:

	CREATE THRESHOLD ct_io_wt_sm_rq.online warning=2500, critical=4000, comparison='>', occurrences=2, observation=5
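To build some intuition for how occurrences and observation interact, here is a deliberately simplified model in Python.  It is not Oracle's implementation: it assumes the metric is averaged over the last `observation` measurements, that the '>' comparison is applied to that average, and that a level is reported once it has been violated on `occurrences` consecutive measurements.  The function name and the sample values are hypothetical.

```python
from collections import deque


def evaluate(samples, warning, critical, occurrences, observation):
    """Return one state ('normal', 'warning' or 'critical') per sample.

    Simplified model only: averages the most recent `observation` values
    (fewer while the window is still filling) and compares with '>'.
    """
    window = deque(maxlen=observation)  # most recent measurements
    warn_run = 0  # consecutive averages above the warning level
    crit_run = 0  # consecutive averages above the critical level
    states = []
    for value in samples:
        window.append(value)
        average = sum(window) / len(window)
        crit_run = crit_run + 1 if average > critical else 0
        warn_run = warn_run + 1 if average > warning else 0
        if crit_run >= occurrences:
            states.append("critical")
        elif warn_run >= occurrences:
            states.append("warning")
        else:
            states.append("normal")
    return states


if __name__ == "__main__":
    # Hypothetical wait times in milliseconds, one per collection interval.
    waits = [1000, 5000, 5000, 5000]
    print(evaluate(waits, warning=2500, critical=4000, occurrences=2, observation=2))
```

In this model a single spike does not raise an alert: the averaging window dampens it and the occurrences counter requires the violation to persist, which is the behaviour the two attributes are meant to provide.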

About Alert Email Notifications

To have the Exadata Storage Servers send notifications via email (or alternatively SNMP), each of the servers has to be configured with the appropriate settings.  This is done using the ALTER CELL command in CellCLI:

	ALTER CELL smtpServer='mailserver.somewhere.com', -
	           smtpFromAddr='exadata.cell01@somewhere.com', -
	           smtpPwd='email_password', -
	           smtpToAddr='someone@somewhere.com', -
	           notificationPolicy='critical,warning,clear', -
	           notificationMethod='mail'

There is also a verification command that can be run to test that the storage server can actually reach the mail server:

	ALTER CELL VALIDATE MAIL

Watching for Undelivered Alerts

Once the alerts and notifications have been set up, it is still important to periodically check the storage servers to make sure that any alerts that have been generated have actually been delivered (via email and/or to Grid or Cloud Control).  The following command lists alerts that were not sent successfully and have not yet been examined:

	LIST ALERTHISTORY where notificationState != 1 and examinedBy=''

If there are undelivered alerts, double check the cell configuration, agent status and network connectivity.
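When alert history from several cells has been collected into a script, the same check can be expressed in a few lines of Python.  The sketch below is illustrative only: the alert records are hypothetical dictionaries keyed by the attribute names listed earlier, and the state descriptions are the notificationState values given in this article.

```python
# notificationState codes as listed in the alert attributes above.
NOTIFICATION_STATES = {
    0: "never tried",
    1: "sent successfully",
    2: "retrying (up to 5 times)",
    3: "five failed retries",
}


def undelivered(alerts):
    """Return (name, state description) for unexamined alerts not sent successfully."""
    return [
        (alert["name"], NOTIFICATION_STATES.get(alert["notificationState"], "unknown"))
        for alert in alerts
        if alert["notificationState"] != 1
        and not alert.get("examinedBy", "").strip()
    ]


if __name__ == "__main__":
    # Hypothetical records, not taken from a real storage server.
    sample_alerts = [
        {"name": "1_1", "notificationState": 1, "examinedBy": ""},
        {"name": "2_1", "notificationState": 3, "examinedBy": ""},
        {"name": "3_1", "notificationState": 0, "examinedBy": "Karen"},
    ]
    for name, state in undelivered(sample_alerts):
        print(name, state)
```

Translating the numeric state into its description makes it easier to tell an alert that is still being retried from one whose notification has failed for good.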

Conclusion

The Oracle Exadata Database Machine is easily one of Oracle’s fastest growing product lines, with proven performance and availability.  While there is a learning curve to fully manage and monitor these systems, it’s easy to see that the Exadata Storage Server software provides a powerful set of options for configuring, managing and monitoring the performance and status of the storage servers in an Oracle Exadata Database Machine.

Karen Reliford
Karen Reliford is an IT professional who has been in the industry for over 25 years. Karen's experience ranges from programming, to database administration, to Information Systems Auditing, to consulting and now primarily to sharing her knowledge as an Oracle Certified Instructor in the Oracle University Partner Network. Karen currently works for TransAmerica Training Management, one of the foremost Oracle Authorized Education Centers (OAEC) in the Oracle University North America region. TransAmerica Training Management offers official Oracle and Peoplesoft Training in Coral Gables FL, Fayetteville AR, Albuquerque NM, Providence RI and San Juan PR. Karen has now been teaching Oracle for Oracle University for more than 15 years. Karen has attained her Certified Technical Trainer designation along with several Oracle certifications including OCP-DBA, OCP-Internet Developer, Oracle Expert - Oracle 10g RAC and Oracle Expert - Oracle Application Express (3.2). Additionally, Karen achieved her Oracle 10g Oracle Certified Master (OCM) in 2008. Karen was raised in Canada, and in November 2009 became a US Citizen. Karen resides in Columbus OH with her husband, Ron along with their 20 pets, affectionately referred to as the "Reliford Zoo".