14 January 2008

IPMI Monitoring for x4500 & Dell PowerEdge with Nagios

Posted by brs under: Uncategorized .

So, you’ve got some sweet new hardware in your server room and you have it up and running with your latest and greatest production software stack. How are you to monitor all the ins and outs of the hardware (fan speeds, chassis and CPU temperatures, power supply status, etc.)? And even if you can get readouts of this information, what are good thresholds for each metric?

Lucky for us, most new server hardware comes with on-board hardware that provides an IPMI service. IPMI stands for Intelligent Platform Management Interface, and it provides mechanisms for chassis power control, system event logging, hardware monitoring, and even serial console access. On Dell systems, an integrated BMC (Baseboard Management Controller) provides the hardware interface for the IPMI service and piggy-backs on one of the system’s NICs for remote access. On the Sun x4500, the ILOM (Integrated Lights Out Manager) provides the IPMI services through an attached service processor with its own NIC and on-board operating system (which happens to be Linux).

The really cool thing is that vendors who implement IPMI like Sun and Dell tend to do a pretty complete job populating all of the potential data points for the IPMI SDR (Sensor Data Record) and they also build-in factory specified thresholds for determining the failure of any particular component. All temperatures, fan speeds, status indicators, etc. have established nominal operating ranges and IPMI has a very easy way to determine if a device is operating within those established parameters.

The first thing we’ll have to do to set up IPMI monitoring through Nagios is, of course, to make sure that you have a working copy of Nagios running somewhere and that you are familiar with adding plugins to Nagios’ command list. If you aren’t, get started by reading here: http://www.nagios.org/docs

The next thing you’ll need is a current installation of OpenIPMI and the OpenIPMI tools (only necessary for Linux, in this case). It is useful to have these installed on each Linux host with an IPMI adapter, such as the Dell PowerEdge servers. With CentOS, you need only issue the following yum command to get this software:

[root@host ~]# yum install OpenIPMI OpenIPMI-tools

This will provide the necessary command line tools such as ‘ipmitool’ and the init script we’ll use to load the appropriate kernel modules. Once you’ve installed these packages, you can edit and run the following script to establish a base configuration for your IPMI adapter:

#!/bin/bash
# Amin Astaneh, Research Computing, USF
# ipmiscript: configures ipmi for network access on a node.
CHANNEL=1
#
# First, ensure that all the modules are loaded
service ipmi restart
#
# Use NIS to determine what IP address to use. This merely adds 200 to the third octet.
HOST=`hostname`
#IPADDR=`ypcat hosts.byaddr | awk "/$HOST/ { str=\\$1; split(str,ip,\".\"); printf \"%s.%s.%s.%s\", ip[1], ip[2], ip[3] + 200, ip[4] }"`
#
# In this case, just hard-code whichever address you want for the adapter
#
IPADDR=x.x.x.x
NETMASK=y.y.y.y
GATEWAY=z.z.z.z
#
ROOT_PW='XXXXXXXXXX'
NAGIOS_PW='XXXXXXXXXX'
#
# The package name is OpenIPMI-tools
ipmitool lan set $CHANNEL ipaddr $IPADDR
ipmitool lan set $CHANNEL ipsrc static
ipmitool lan set $CHANNEL netmask $NETMASK
ipmitool lan set $CHANNEL defgw ipaddr $GATEWAY
ipmitool user set password 2 "$ROOT_PW"
ipmitool lan set $CHANNEL access on
ipmitool user set name 3 nagios
ipmitool user set password 3 "$NAGIOS_PW"
# Grant user 3 (nagios) OPERATOR privilege (level 3) on the same channel configured above
ipmitool channel setaccess $CHANNEL 3 privilege=3 ipmi=on
ipmitool mc reset cold
#

I’ve seen that in many cases, you’ll need to completely power-cycle the box for the IPMI adapter to actually work. This can mean either a full reboot or a complete unplugging of the system power.
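Before resorting to a power cycle, it can help to confirm that the adapter actually stored the address you gave it. The following is a minimal sketch, not part of the original script: `lan_ip` is a hypothetical helper, and it assumes that `ipmitool lan print` emits a line of the form `IP Address : a.b.c.d`, which may vary between ipmitool versions.

```shell
#!/bin/bash
# Sketch: extract the configured address from `ipmitool lan print` output.
# Assumption: the output contains a line shaped like "IP Address   : a.b.c.d".
# The pattern deliberately skips the "IP Address Source" line.
lan_ip() {
    awk -F': *' '/^IP Address[[:space:]]*:/ { print $2; exit }'
}

# Hypothetical usage on the host itself (channel 1, as in the script above):
#   ipmitool lan print 1 | lan_ip
```

If the printed address does not match the IPADDR you configured, the settings were not applied and a reset or power cycle is probably still needed.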

Before we get down to playing with this device, what about configuring the ILOM on the x4500? Well, you’ll first want to read through the Sun Integrated Lights Out Manager (ILOM) Administration Guide provided here: http://docs.sun.com/app/docs/coll/x4500-rels-ilom. You’ll want to set up a basic configuration where you can log in to the device via ssh as administrator. After that, all we need to do is set up a user account. Go ahead and log into the device and issue the following commands to create the ‘nagios’ user:

-> cd /SP/users
-> create nagios
-> cd nagios
-> set password='XXXXXXXXX'
-> set role=Operator
-> exit

You’ll now be able to use the ‘nagios’ user on both the Dell BMCs and the Sun ILOM to access IPMI SDR information for monitoring the devices. Let’s have a look at some sample output from ‘ipmitool’ (see ipmitool(1) in the man pages for information on command line syntax).

Here is an example using ‘ipmitool’ on a Dell PowerEdge server:

[user@host ~]$ ipmitool -H x.x.x.x -U nagios -P 'XXXXXXXXX' -L OPERATOR -I lan sdr list all
Temp | disabled | ns
Temp | disabled | ns
Ambient Temp | 25 degrees C | ok
CMOS Battery | 0x00 | ok
VCORE | 0x01 | ok
VDDIO | 0x01 | ok
VDDA | 0x01 | ok
VTT | 0x01 | ok
VCORE | 0x01 | ok
VDDIO | 0x01 | ok
...

The nice thing about ‘sdr list all’ is that all available metrics are read and processed against the built-in threshold values. This makes for very easy parsing. The sensors with ‘ns’ in the 3rd column have no reading available, so we can grep them out pretty easily. The x4500 follows the same convention. The only difference is the LAN interface used (see option ‘-I lan’ in the above command line). For the x4500, we’ll be using the LANplus interface. Here’s an example:

[user@host ~]$ ipmitool -H x.x.x.x -U nagios -P 'XXXXXXXXX' -L OPERATOR -I lanplus sdr list all
proc.p0.t_core | 51 degrees C | ok
proc.p1.t_core | 49 degrees C | ok
dbp.t_amb | 25 degrees C | ok
io.front.t_amb | 39 degrees C | ok
io.rear.t_amb | 40 degrees C | ok
proc.front.t_amb | 29 degrees C | ok
proc.rear.t_amb | 34 degrees C | ok
ft0.prsnt | 0x02 | ok
ft0.f0.speed | 7700 RPM | ok
ft0.f1.speed | 7800 RPM | ok
...
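The filtering idea can be tried against canned output before pointing it at real hardware. This is only a sketch: `sdr_problems` is a hypothetical name, and instead of the regular expression used in the scripts further down it compares the trimmed last column directly, which is slightly stricter.

```shell
#!/bin/bash
# Sketch: keep only SDR rows whose status column is neither "ok" nor "ns".
# Reads `ipmitool ... sdr list all` style output on stdin.
sdr_problems() {
    awk -F'|' '
        NF >= 3 {
            status = $NF
            gsub(/^[ \t]+|[ \t]+$/, "", status)   # trim padding around the status
            if (status != "ok" && status != "ns") print
        }'
}

# Hypothetical usage:
#   ipmitool -I lanplus -H x.x.x.x -U nagios -P 'XXXXXXXXX' -L OPERATOR sdr list all | sdr_problems
```

A healthy machine produces no output at all, which is what makes the result so easy to test for in a plugin.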

Well anyway, Nagios plugins in bash are incredibly easy to write, and since we have all of the output we’ll ever need for monitoring purposes, let’s just make the ‘ipmitool’ commands above the basis for our monitoring scripts. Here’s one for the Dells:

#!/bin/bash
###################################
# Dell PowerEdge variant of check_x4500
#
# Checks status of machine via ipmi
#
# Amin Astaneh, aastaneh@rc.usf.edu
# 10-01-2008
###################################
# We can add 200 to the third octet for our Dell hosts so that we don't have to create a separate nagios host for the ipmi adapter
HOST=`echo $1 | awk -F'.' '{ printf "%s.%s.%s.%s", $1, $2, $3 + 200, $4 }'`
LOGS=/usr/share/nagios/logs
IPMI="ipmitool -I lan -H $HOST -U nagios -P rc_ipmi_info -L OPERATOR"
# List every sensor, then drop the rows whose status is 'ok' or 'ns' (no reading)
RESULTS=$($IPMI sdr list all | grep -Ev '^.*\|.*\|.*(ok|ns)$' | awk -F'|' '{ print $1":"$2":"$NF}')
#
if [ -n "$RESULTS" ]; then
echo "######### Status Display ###########" > "$LOGS/$HOST.log"
# Quote $RESULTS so each failing sensor stays on its own line
echo "$RESULTS" >> "$LOGS/$HOST.log"
echo "######### System Event Log ###########" >> "$LOGS/$HOST.log"
$IPMI sel list >> "$LOGS/$HOST.log"
echo "WARNING: <a href=\"https://nagios.rc.usf.edu/logs/$HOST.log\">See Logfile</a>";
exit 1;
else
echo "OK: All Components Online."
exit 0;
fi

In this script, we simply look for lines in the output of ‘sdr list all’ that contain something other than ‘ok’ or ‘ns’ in the 3rd column. If any such line is returned, we know something is awry, so we print a line containing a URL to a log file, prefixed with WARNING or CRITICAL, and exit with a non-zero status (the exit status is what Nagios uses to determine if there is a problem). The log file contains the offending lines from the command as well as a listing of the system event log.
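Nagios decides the service state from the plugin’s exit code: 0 is OK, 1 is WARNING, 2 is CRITICAL and 3 is UNKNOWN. A finer-grained mapping than "anything not ok is a problem" could look like the sketch below. The function name is hypothetical, and it assumes that ipmitool reports ‘nc’ (non-critical), ‘cr’ (critical) and ‘nr’ (non-recoverable) in the status column; check the output on your own hardware before relying on it.

```shell
#!/bin/bash
# Sketch: turn SDR output on stdin into a Nagios exit code.
# Nagios plugin exit codes: 0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN.
# Assumption: the status column uses "nc", "cr" and "nr" for crossed thresholds.
sdr_to_nagios_state() {
    awk -F'|' '
        BEGIN { state = 0 }
        NF >= 3 {
            rows++
            s = $NF
            gsub(/[ \t]/, "", s)
            if (s == "cr" || s == "nr") state = 2
            else if (s != "ok" && s != "ns" && state < 2) state = 1   # "nc" or unrecognised
        }
        END { print (rows ? state : 3) }'   # no sensor rows at all: UNKNOWN
}

# Hypothetical usage inside a plugin:
#   exit "$($IPMI sdr list all | sdr_to_nagios_state)"
```

Returning UNKNOWN when no sensor rows arrive also covers the case where ipmitool cannot reach the adapter, which would otherwise look like an empty (and therefore healthy) result.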

With a couple small modifications, this same script can be used with the x4500 ILOM as we see here:

#!/bin/bash
###################################
# check_x4500
#
# Checks status of machine via ipmi
#
# Amin Astaneh, aastaneh@rc.usf.edu
# 10-01-2008
###################################
#
HOST=$1
LOGS=/usr/share/nagios/logs
IPMI="ipmitool -I lanplus -H $HOST -U nagios -P rc_ipmi_info -L OPERATOR"
RESULTS=$($IPMI sdr list all | grep -Ev '^.*\|.*\|.*(ok|ns)$' | \
awk -F'|' '{ print $1":"$2":"$NF}')
#
if [ -n "$RESULTS" ]; then
echo "######### Status Display ###########" > "$LOGS/$HOST.log"
# Quote $RESULTS so each failing sensor stays on its own line
echo "$RESULTS" >> "$LOGS/$HOST.log"
echo "######### System Event Log ###########" >> "$LOGS/$HOST.log"
$IPMI sel list >> "$LOGS/$HOST.log"
echo "CRITICAL: <a href=\"https://nagios.rc.usf.edu/logs/$HOST.log\">See Logfile</a>"
# Nagios reads the exit code, not the text: 2 means CRITICAL (1 would be WARNING)
exit 2
else
echo "OK: All Components Online."
exit 0;
fi

This is the same thing, except that the ILOMs are defined as their own hosts in Nagios (so we don’t need to do any address translation) and we use the LANplus interface to communicate with the device.
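For reference, the address translation that the Dell version performs (and the ILOM version skips) can be isolated into a small function. This is a sketch under the same site-specific convention of adding 200 to the third octet; `bmc_addr` is a hypothetical name, and the range check is an addition that the original script does not have.

```shell
#!/bin/bash
# Sketch: derive the BMC address from a host address by adding 200 to the
# third octet. Fails (non-zero status) if the input is not a dotted quad or
# the resulting octet would exceed 255.
bmc_addr() {
    awk -v ip="$1" 'BEGIN {
        n = split(ip, o, ".")
        if (n != 4 || o[3] + 200 > 255) exit 1
        printf "%s.%s.%s.%s\n", o[1], o[2], o[3] + 200, o[4]
    }'
}

# Hypothetical usage:
#   HOST=$(bmc_addr "$1") || { echo "UNKNOWN: cannot derive BMC address from $1"; exit 3; }
```

Note that this only works when Nagios passes the plugin an IP address rather than a hostname, and only for subnets whose third octet is 55 or lower.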

Now, all that’s left to do is to add this plugin to Nagios and assign it as a service to the appropriate hosts. You then have complete monitoring of all the hardware’s vital statistics, using the factory-specified ranges as thresholds, with a very minimal amount of scripting. A big thanks to Amin for getting these scripts written.
