13 March 2008

Ultimate Cluster Monitoring, Part I

Posted by aastaneh under: Uncategorized.

Brian and I have been throwing around the idea of implementing a webapp that would do the following:

  • Monitor Gigabit and Infiniband switch interfaces for throughput, types of traffic, errors, etc.
  • Monitor UPSes for voltage, battery capacity, and temperature.
  • Tie it together with SGE to see network performance on a job-by-job basis.
  • Link it to our current Ganglia installation to see status of individual nodes on a job-by-job basis.
  • Have all that information accessible in the same place and somehow have it look pretty.

Well, Brian didn’t have the time to implement it himself, so he bequeathed the daunting task to me. Let’s see how that turned out, after the jump.

The first step was to figure out how to harvest all of this information. Lucky for us, our Force10 Gigabit and Cisco Infiniband switches talk SNMP. I put two and two together and realized that MRTG was perfect for this task. The only problem was that most MRTG installations (and the config file generator) produce one giant config file, so when Apache serves the data to your browser, you have a ton of graphs to sift through. What a pain. So, my solution to this problem was this:

  • Create a single MRTG config file for each service per interface per host. We have quite a few switches and UPS devices, which means lots of files. Oh, and configure them to use rrdtool.
  • Write templates for each service you want to monitor.
  • Hand-hack a perl script per type of device to generate all the config files using the templates and regular expression replacement. Each perl script reads a simple newline-delimited flat file containing all the hosts you want to monitor for a particular type of device. Each output file should be named using the hostname, interface, and service.
  • Generate a global MRTG config file that reads in all the small config files and generates .rrd files. Configure crontab to run MRTG against the global file every 5 minutes.
  • Tie it all together using the 14all.cgi script by making one symlink to that script per config file, each named after its config file.

Don’t worry. I’ll give you a simple example.

1. Generate the config files

You want to monitor all the incoming and outgoing bits on each active interface on a switch. Here’s a template that will do this: bitstemplate. I would take a look at this, if I were you ;-)

See HOST and INTERFACE in that file? We are going to replace those for each host we specify and for each active interface the host has. Here’s a perl script that will generate the files for you: genswitchconf. It’s pretty short, and you can add more templates to load for a more comprehensive solution.

One more thing: make a file called switchhosts and add one hostname per line. When you run it, the perl script determines which interfaces are active on each host and generates a config file for each one.
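If you don't have genswitchconf handy, the substitution step can be roughed out in shell. This is not the perl script, just a sketch under a few assumptions of mine: the template contains literal HOST and INTERFACE tokens, the host/interface pairs arrive on stdin instead of being discovered over SNMP, and the output naming (HOST_INTERFACE_bits.cfg) and the function name gen_configs are made up for illustration.

```shell
#!/bin/sh
# Sketch only: fill a template once per (host, interface) pair.
# Assumptions (not from the original scripts): pairs arrive on stdin as
# "host interface", and output files are named HOST_INTERFACE_bits.cfg.
# Hostnames and interface names must not contain "|" or "&", because
# they are pasted straight into the sed replacement.

gen_configs() {
    template=$1   # e.g. bitstemplate, containing HOST and INTERFACE tokens
    outdir=$2     # directory to write the generated .cfg files into

    while read -r host iface; do
        # Skip blank or incomplete lines in the input
        [ -n "$host" ] && [ -n "$iface" ] || continue
        sed -e "s|HOST|$host|g" -e "s|INTERFACE|$iface|g" \
            "$template" > "$outdir/${host}_${iface}_bits.cfg"
    done
}
```

Something like `printf 'switch1 1\nswitch1 2\n' | gen_configs bitstemplate /etc/mrtg` would then write one config file per listed interface.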

2. Tell MRTG to use the config files

Copy all the config files to a location, perhaps /etc/mrtg. Next, create a file called /etc/mrtg.cfg using this wicked cool one-liner:
for i in /etc/mrtg/*.cfg; do echo "Include: $i" >> /etc/mrtg.cfg; done
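One caveat with appending: running that loop a second time adds every Include line again. A possible variant that rebuilds the file from scratch on each run might look like the following. The function name build_global_cfg and the staging-file approach are my own, not part of the original setup.

```shell
#!/bin/sh
# Sketch: rebuild the global MRTG config from scratch on each run.
# build_global_cfg is a hypothetical helper name.

build_global_cfg() {
    cfgdir=$1    # directory holding the per-interface .cfg files
    global=$2    # global file to (re)write, e.g. /etc/mrtg.cfg

    staging="$global.tmp.$$"
    : > "$staging"
    for f in "$cfgdir"/*.cfg; do
        # If the glob matched nothing, the pattern is left unexpanded
        [ -e "$f" ] || continue
        echo "Include: $f" >> "$staging"
    done
    # Swap the new file into place so cron never reads a half-written one
    mv "$staging" "$global"
}
```

Calling `build_global_cfg /etc/mrtg /etc/mrtg.cfg` after adding or removing config files should leave exactly one Include line per file.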

Then put this in crontab using crontab -e:
*/5 * * * * env LANG=C /path/to/mrtg /etc/mrtg.cfg --logging /var/log/mrtg.log

Your .rrd files should be automagically generated. Check the logfile for errors.

3. Use 14all.cgi to make it easily web-accessible

This perl script is cool. Stick it in the /cgi-bin/ of your webspace, and configure it to use symlinks to load the desired metric. This ain’t an Apache or MRTG/14all.cgi tutorial, so the logistics are outside the scope of this article. Next, use this even more awesome one-liner to create one symlink per config file:
for i in /etc/mrtg/*.cfg; do ln -s /usr/lib/mrtg/cgi-bin/14all.cgi "/usr/lib/mrtg/cgi-bin/$(basename "$i" .cfg).cgi" 2> /dev/null; done

4. Write a perl script to generate html to browse the device

I have a sample for the current example: genswitchhtml. When you run it, you should get a single html file per switch, containing a table which links to the 14all.cgi with the desired host and interface. Copy these to your webspace. I’d write an index for all these, if I were you.
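For a rough idea of what such a generator does, here is a shell sketch rather than the perl script itself. It assumes the config files are named HOST_INTERFACE_SERVICE.cfg and that the symlinks are reachable under /cgi-bin/. Both are my assumptions, as is the function name gen_switch_html.

```shell
#!/bin/sh
# Sketch: print an HTML table linking to the per-interface CGI symlinks
# for one switch. Assumes config files are named HOST_INTERFACE_SERVICE.cfg
# and the symlinks live under /cgi-bin/ (both are assumptions).

gen_switch_html() {
    host=$1      # switch hostname
    cfgdir=$2    # directory holding the .cfg files

    echo "<html><head><title>$host</title></head><body>"
    echo "<h1>$host</h1>"
    echo "<table border=\"1\">"
    echo "<tr><th>Interface</th><th>Service</th></tr>"
    for f in "$cfgdir/${host}"_*_*.cfg; do
        [ -e "$f" ] || continue
        name=$(basename "$f" .cfg)      # e.g. switch1_3_bits
        rest=${name#"${host}_"}         # e.g. 3_bits
        iface=${rest%_*}                # e.g. 3
        service=${rest##*_}             # e.g. bits
        echo "<tr><td>$iface</td><td><a href=\"/cgi-bin/$name.cgi\">$service</a></td></tr>"
    done
    echo "</table></body></html>"
}
```

Running `gen_switch_html switch1 /etc/mrtg > switch1.html` once per line of switchhosts would give one page per switch.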

5. Enjoy.

It’s a basic solution, but it gets the job done. I generated config files for 8 switches, a ton of UPSes, and a few Infiniband switches, making a one-stop shop for looking at performance metrics for all these devices. In the next article, I’ll show you how to use SGE to your advantage so you can see metrics on a per-job basis. Have fun!

-Amin
