15May2008

Perl-Fu: Automating ILOM SSH Sessions

Posted by aastaneh under: Uncategorized.

We at USF Research Computing take pride in automating tasks as much as possible. We write script after script, tool after tool, to make our lives easier so we have more time to conquer the real challenges. One of the tasks we would like to automate with the new cluster on the way is talking to the administrative interfaces on the nodes. ILOM, or “Integrated Lights-Out Manager”, allows us to tell the machine to do real neat things remotely, like reboot and prepare to PXE. ILOMs have their own network interface and are accessible via SSH.

Brian and I thought: cool. Let’s script these operations through SSH, then, and make our lives easier. If we can automate these tasks, maintaining 120 new machines will be a lot easier.

More, after the jump.

13March2008

Ultimate Cluster Monitoring, Part I

Posted by aastaneh under: Uncategorized.

Brian and I have been throwing around the idea of implementing a webapp that would do the following:

  • Monitor Gigabit and InfiniBand switch interfaces for throughput, types of traffic, errors, etc.
  • Monitor UPSes for voltage, battery capacity and temperature.
  • Tie it together with SGE to see network performance on a job-by-job basis.
  • Link it to our current Ganglia installation to see status of individual nodes on a job-by-job basis.
  • Have all that information accessible in the same place and somehow have it look pretty.

Well, Brian didn’t have the time to implement it himself, so he bequeathed the daunting task to me. Let’s see how that turned out, after the jump.

24January2008

Librarian-NG: Solve your unresolved linkage dependencies FAST.

Posted by aastaneh under: Uncategorized.

We at Research Computing build and manage many kinds of software written in all kinds of languages (C, Fortran, R) and built with many types of compilers (gcc, Intel, PGI). Consequently, we have a ton of libraries. Building software to install on our systems usually takes a long time. Not all developers use automake, which makes the process harder on us. Typically, during the linking stage, the build system (usually make) aborts, kicking and screaming about some function call whose library is not present.
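To make the problem concrete, here is a minimal Python sketch of the manual lookup: given the output of `nm` for a handful of libraries, find which ones define the missing symbol. This is not Librarian-NG itself. The helper names, the library names and the sample `nm` output are all made up for illustration, and `run_nm` assumes GNU `nm` is on the PATH.

```python
import subprocess


def defined_symbols(nm_output):
    """Return the set of symbol names that an nm listing marks as defined."""
    symbols = set()
    for line in nm_output.splitlines():
        parts = line.split()
        if len(parts) < 2:
            continue
        # Defined symbols have three columns (address, type, name);
        # undefined ones have two (type, name). Read from the right.
        sym_type, name = parts[-2], parts[-1]
        # Uppercase type letters are global symbols; "U" means undefined.
        if len(sym_type) == 1 and sym_type.isupper() and sym_type != "U":
            # Drop a symbol-version suffix such as "@@GLIBC_2.2.5".
            symbols.add(name.split("@")[0])
    return symbols


def libraries_defining(symbol, nm_outputs):
    """Return the sorted names of libraries whose nm listing defines symbol."""
    return sorted(
        lib for lib, output in nm_outputs.items()
        if symbol in defined_symbols(output)
    )


def run_nm(path):
    """Collect the dynamic, defined symbols of one shared library (GNU nm)."""
    result = subprocess.run(
        ["nm", "-D", "--defined-only", path],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


# Hypothetical sample listings, standing in for real run_nm() output.
SAMPLE_NM_OUTPUTS = {
    "libz.so": (
        "0000000000003a10 T deflate\n"
        "0000000000004b20 T inflate\n"
        "                 U malloc\n"
    ),
    "libm.so": (
        "0000000000001f00 T cos\n"
        "0000000000002000 W sin\n"
    ),
}

if __name__ == "__main__":
    print(libraries_defining("deflate", SAMPLE_NM_OUTPUTS))
```

Doing this by hand for every library directory on the system is exactly the slow part.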

If only there were a way to speed up the process...

16January2008

zfs_restore: the cool way to restore from backups

Posted by aastaneh under: Uncategorized.

Most sysadmins know that ZFS natively supports snapshots, which makes it perfect for an incremental-backup solution. Unfortunately, restoring a lost file from ZFS (especially remotely, as in this case) can be rather complex (EDIT: not really complex or difficult at all… it’s just that it could be even easier), which is a shame considering its usefulness.

Until today.

Using the powers of ZFS, rsync, and Bash, I have devised a solution which makes accessing and restoring backups so simple, even normal users can do it!

14January2008

IPMI Monitoring for x4500 & Dell PowerEdge with Nagios

Posted by brs under: Uncategorized.

So, you’ve got some sweet new hardware in your server room and you have it up and running with your latest and greatest production software stack. How are you to monitor all the ins and outs of the hardware (fan speeds, chassis and CPU temperatures, power supply status, etc.)? And even if you can get readouts of this information, what are good thresholds for each metric?

Lucky for us, most new server hardware comes with on-board hardware that provides an IPMI service. IPMI stands for Intelligent Platform Management Interface, and it provides mechanisms for chassis power control, system event logging, hardware monitoring, and even serial console access. On Dell systems, an integrated BMC, or Baseboard Management Controller, supplies the hardware interface for IPMI and piggy-backs on one of the system’s NICs for remote access. On the Sun x4500, the ILOM, or Integrated Lights-Out Manager, provides the IPMI services through an attached service processor with its own NIC and on-board operating system (which happens to be Linux).
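As a rough illustration of how the threshold question can be handled, the sketch below parses text in the pipe-delimited layout that `ipmitool sensor` prints and maps each reading to a Nagios exit code, using the thresholds the sensor itself reports. It is not the plugin described in this post. The function names and the sample readings are invented, and the column order (name, reading, unit, status, then the lower non-recoverable, lower critical, lower non-critical, upper non-critical, upper critical and upper non-recoverable thresholds) is an assumption to check against your own hardware.

```python
# Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def _to_float(field):
    """Convert a column to float; 'na' and discrete values give None."""
    try:
        return float(field)
    except ValueError:
        return None


def check_sensor_line(line):
    """Map one sensor line to a Nagios status using its own thresholds."""
    fields = [f.strip() for f in line.split("|")]
    if len(fields) != 10:
        return UNKNOWN
    reading = _to_float(fields[1])
    if reading is None:
        return UNKNOWN
    lnr, lcr, lnc, unc, ucr, unr = (_to_float(f) for f in fields[4:10])
    # Critical and non-recoverable limits both count as CRITICAL here.
    if any(low is not None and reading <= low for low in (lnr, lcr)):
        return CRITICAL
    if any(high is not None and reading >= high for high in (ucr, unr)):
        return CRITICAL
    # Non-critical limits count as WARNING.
    if lnc is not None and reading <= lnc:
        return WARNING
    if unc is not None and reading >= unc:
        return WARNING
    return OK


def check_sensors(output):
    """Return the worst status across all lines that could be evaluated."""
    statuses = [check_sensor_line(l) for l in output.splitlines() if l.strip()]
    known = [s for s in statuses if s != UNKNOWN]
    return max(known) if known else UNKNOWN


# Made-up sample in the assumed layout; not taken from a real machine.
SAMPLE_OUTPUT = (
    "CPU Temp | 82.000   | degrees C | ok | na | na       | na       | 80.000 | 85.000 | 90.000\n"
    "FAN 1    | 5400.000 | RPM       | ok | na | 1000.000 | 2000.000 | na     | na     | na\n"
)

if __name__ == "__main__":
    raise SystemExit(check_sensors(SAMPLE_OUTPUT))
```

Relying on the thresholds reported by the hardware avoids hand-picking limits for every metric, though whether a given vendor populates them sensibly is something to verify per platform.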

7December2007

Kickstart and ssh

Posted by brs under: Uncategorized.

Have you ever set up a remote, automated kickstart system only to have it fail on some esoteric piece of newfangled hardware? Were you then disappointed to find that while kickstart supports VNC connections, it does not allow ssh connections, so you could not get a list of all the bleeding-edge, barely-supported hardware off the box and run some diagnostics in a shell? Well, be disappointed no more!
