Csadmin
Emergency Contact Information
Emergencies
Network Outage - 9:45AM - 8/26/08
The network experienced an "outage" this morning at approximately 9:45. This is the result of a 'pathological' user - some host on Switch 16 is jamming the network and blocking almost all other traffic. There are 24 ports on Switch 16, so we are tracking them down right now. In the meantime, we've restored service by disconnecting the uplink for that switch; this means other users on Switch 16 have no network at all. As soon as we can isolate the problem host, we'll reconnect everyone else on that switch.
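For anyone curious how 24 ports can be narrowed down quickly, one common approach is bisection. The sketch below is illustrative only and is not necessarily the procedure being followed: it assumes exactly one offending port, and `is_jammed` is a hypothetical probe that reports whether the network is still jammed when only the given ports are connected.

```python
from typing import Callable, Sequence


def find_jamming_port(ports: Sequence[int],
                      is_jammed: Callable[[Sequence[int]], bool]) -> int:
    """Bisect `ports` to find the single port whose host is flooding the network.

    `is_jammed(enabled)` is a hypothetical probe: it should return True if the
    jam is still present when only the ports in `enabled` are connected.
    """
    candidates = list(ports)
    while len(candidates) > 1:
        half = candidates[: len(candidates) // 2]
        # If the first half alone reproduces the jam, the culprit is in it;
        # otherwise it must be in the remaining ports.
        candidates = half if is_jammed(half) else candidates[len(half):]
    return candidates[0]


# Simulated usage: pretend the host on port 17 is the culprit.
culprit = 17
probes = []


def fake_probe(enabled):
    probes.append(list(enabled))
    return culprit in enabled


found = find_jamming_port(range(1, 25), fake_probe)
print(f"culprit is on port {found}, found in {len(probes)} probes")
```

With 24 ports this takes at most five probes, compared with up to 23 when testing ports one at a time.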
Athena Reboot - 9:15AM - 8/22/08
Olsson Hall experienced a power 'blip' this AM - the UPS which Athena was on had a failed battery, and the blip was enough to cause the server to reboot. We took the opportunity to move Athena to a new UPS. Athena booted back up normally and all systems are now normal.
Scheduled Downtimes
Radio and Sunfire Cluster Rebuild - 8/26-8/27
The Radio and Sunfire Clusters are down today for an upgrade from Fedora Core 6 (deprecated and not getting updates) to Ubuntu 8.04 LTS. The department is now using Ubuntu as our standard distribution, and these machines need to have a matching software environment.
The PBS queues for both systems have been stopped and disabled, and logins are blocked. We will re-enable both as soon as the clusters are back up.
Power Nodes Downtime - 8/13-8/15 - UPDATE (8/15)
Power[1..3,5,6] have been rebuilt and are available; power4 was rebuilt, but is having hardware issues - we are verifying the hard disk now and will run an extensive memtest86 scan to check for errors in the bus, CPU or memory.
Power Nodes Downtime - 8/13-8/15 - UPDATE (8/14)
Power[1..4] have been updated to the new Ubuntu distro, and memory has been increased to 4GB/machine. Please log on as soon as possible to test your code and be sure tools (e.g., SVN) work as you expect them to. Please notify us immediately if this breaks your environment badly.
Power Nodes Downtime - 8/13-8/15
The "Power" nodes - a 'public' group of interactive Linux nodes - need to be upgraded to Ubuntu 8.04. Several nodes have hardware issues as well, so we will be doing some hardware maintenance and consolidating.
We will be rebuilding two nodes at a time, starting next Wednesday:
Power 1,2 - Wednesday 8/13
Power 3,4 - Thursday 8/14
Power 5,6 - Friday 8/15
We are "rolling" through these nodes to give regular users an opportunity to test their work in the new environment before we upgrade all the nodes - if the new environment causes major breakage for you, please let us know immediately.
The rebuilds will start at approximately 9AM and take roughly an hour to complete. We will set a block on additional logins 24 hours before, and we will halt the systems with a 30-minute delay; if you are logged on when it goes down, you will get a warning.
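To make the timing concrete, the sketch below works out when the login block and the halt warning fall for each pair of nodes. It assumes the year is 2008 (matching the other dates on this page) and that the 30-minute halt delay is counted back from the approximate 9AM start; the names `REBUILDS` and `timeline` are illustrative only.

```python
from datetime import datetime, timedelta

# Rebuild schedule from the announcement (year 2008 assumed from the page dates).
REBUILDS = {
    "Power 1,2": datetime(2008, 8, 13, 9, 0),
    "Power 3,4": datetime(2008, 8, 14, 9, 0),
    "Power 5,6": datetime(2008, 8, 15, 9, 0),
}


def timeline(rebuild_start):
    """Return (login_block, halt_warning) times for one rebuild.

    Assumption: the halt is issued 30 minutes before the 9AM start, so the
    warning appears then and the machine goes down at the start time.
    """
    login_block = rebuild_start - timedelta(hours=24)
    halt_warning = rebuild_start - timedelta(minutes=30)
    return login_block, halt_warning


for nodes, start in REBUILDS.items():
    block, warn = timeline(start)
    print(f"{nodes}: logins blocked {block:%m/%d %H:%M}, "
          f"halt warning {warn:%m/%d %H:%M}, rebuild starts {start:%m/%d %H:%M}")
```

Under these assumptions, new logins to Power 1 and 2 are blocked from 9AM on Tuesday 8/12, and anyone still logged on sees the halt warning at about 8:30AM on Wednesday 8/13.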
The disks will be wiped completely - so if you have any data stored locally on those machines (e.g., a crontab, /localtmp/), it will be lost (this is very unlikely to apply to most users, but just in case, you have been warned).
Power3 and Power8 currently have significant hardware issues - they will "go away" - after the build there will be only Power[1..6]. We will be doubling the memory in these machines to make them somewhat more responsive as interactive servers.
PBS/Centurion Cluster Downtime - 7/1-7/3
The Centurion Cluster is starting to experience a relatively high number of disk failures; approximately 1/3 of the nodes are currently offline because of this. We need to take the cluster down to replace the existing disk drives.
While we are replacing drives, we'll be updating the Linux distro to Ubuntu LTS (8.04 - Hardy) - the new dept. standard distribution. Aside from being headless, this distribution is the same one we are deploying to desktops and on the Power nodes.
While the cluster is down, the PBS server & scheduler will be unavailable. Not all of the non-Centurion cluster nodes will go down, but PBS will be unavailable on those nodes as well, because we will be upgrading our PBS distribution at the same time.
The full downtime is expected to take two or three days, but the first nodes, including the PBS server & scheduler, will be back on 7/2. We will begin at 9AM on 7/1, so please be sure your jobs are likely to complete before then!
Announcements
None.
Contact Us
Emergency hotline: 982-2271 (What constitutes an emergency?)
Who we are
