Computer Room Emergency: Only a Matter of Time
From: "Gideon T. Rasmussen, CISSP, CISA, CISM, CFSO, SCSA" <lists () infostruct net>
Date: Thu, 02 Dec 2004 16:48:37 -0500
Computer Room Emergency – Only a Matter of Time
Gideon T. Rasmussen - CISSP, CISM, CFSO, SCSA
It's an infrastructure manager's worst nightmare: The computer room is
down. There are several events that can make this scenario a reality. A
hurricane knocks out power for several days. Building management
disrupts power for scheduled maintenance. Construction workers sever an
underground power line.
Small and medium-sized organizations may not have adequate UPS and
generator systems. In that case, it is only a matter of time before
power is disrupted and the computer room must be shut down. Enterprise-class
computer rooms with an absolute requirement for 24/7 uptime may
still be disrupted by an emergency. For example, a liquid spill such as
sprinkler discharge or a glycol spill from a broken AC pipe may
necessitate an emergency shutdown.
Every six months, completely power down the computer room to prepare for
the inevitable. Hold a meeting to plan the exercise. The goal is to stop
and start mission-critical systems as quickly as possible. Assign tasks
to each team member. The meeting is an ideal time
Start by documenting the order in which systems will be shut down.
Address all critical systems. In addition to servers, this includes
networking gear, telecommunications equipment, UPS and AC units. Halt
systems in order of criticality. This helps minimize damage in the event
that UPS or generator systems fail. Early on, shut down systems that are
prone to data loss or painful to restore. Also consider dependencies
between systems. For example, the infrastructure supporting a
three-tiered application must be shut down in order. In most
organizations the development systems will be last on the list.
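The ordering rules above (criticality first, but never violating a
dependency) can be sketched as a small script. This is only an
illustration: the system names, the priority numbers and the direction
of the three-tier dependency (web before application before database)
are assumptions, not part of any particular environment.

```python
import heapq


def shutdown_order(priority, stop_after):
    """Return system names in the order they should be halted.

    priority   -- dict of name -> int; lower numbers are halted earlier
    stop_after -- dict of name -> set of names that must be halted first
    Every name used in stop_after must also appear in priority.
    """
    # Systems each entry is still waiting on before it may be halted.
    waiting_on = {name: set(stop_after.get(name, ())) for name in priority}

    # Reverse lookup: which systems are unblocked when a given one stops.
    unblocks = {name: [] for name in priority}
    for name, earlier in waiting_on.items():
        for other in earlier:
            unblocks[other].append(name)

    # Start with systems that have no prerequisites, lowest number first.
    ready = [(priority[n], n) for n, earlier in waiting_on.items() if not earlier]
    heapq.heapify(ready)

    order = []
    while ready:
        _, name = heapq.heappop(ready)
        order.append(name)
        for dependent in unblocks[name]:
            waiting_on[dependent].discard(name)
            if not waiting_on[dependent]:
                heapq.heappush(ready, (priority[dependent], dependent))

    if len(order) != len(priority):
        raise ValueError("circular dependency between systems")
    return order


# Hypothetical inventory: the database is the most painful to restore,
# but it cannot stop until the tiers in front of it are down.
example_priority = {"web": 2, "app": 2, "db": 1, "mail": 3, "dev": 9}
example_stop_after = {"app": {"web"}, "db": {"app"}}
print(shutdown_order(example_priority, example_stop_after))
```

Keeping the priorities and dependencies in two separate tables makes it
easy to review each one in the planning meeting. A circular dependency is
reported as an error rather than silently dropped from the list.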
Carefully consider the order for starting systems as well. The start
order will be slightly different, as there is no concern for failing
Create an operations guide for each system. Each OPS guide should be a
single point of reference. Detail stop/start procedures, where the
system is located and how to confirm it is providing services (versus
merely running from an operating system perspective). Keep in mind that
the guide may be used by a technologist who has little or no experience
with the system. Include a revision date at the bottom of each page.
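For a web-based service, the check that a system is "providing
services" rather than merely running can be as simple as requesting a
known page and looking for expected text. The sketch below assumes the
service speaks HTTP; the URL and the expected text are placeholders that
each OPS guide would supply for its own system.

```python
import http.client
import urllib.request


def http_service_ok(url, expected_text, timeout=5.0):
    """Return True if url answers with HTTP 200 and contains expected_text."""
    # Empty ProxyHandler: talk to the server directly and ignore any
    # proxy settings in the environment.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    try:
        with opener.open(url, timeout=timeout) as response:
            body = response.read().decode("utf-8", errors="replace")
            return response.status == 200 and expected_text in body
    except (OSError, http.client.HTTPException):
        # Connection refused, timeout, HTTP error status and similar
        # failures all mean the service is not usable right now.
        return False
```

A host that responds to ping but fails this check is running without
providing its service, which is exactly the case the guide needs to
catch. Other protocols (mail, database, file sharing) need their own
equivalent test.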
Policies and procedures should ensure current administrative passwords
are available and appropriately safeguarded. Maintain a recall roster so
that the infrastructure team can be contacted in the event of an emergency.
Label systems and racks for easy identification (front and back). If a
keyboard, video, mouse (KVM) device is in use, label it with the systems
it is connected to and the key sequence required to switch between them.
Hardware may fail once powered down. Ensure tech support contracts are
current and support phone numbers are documented. Current backups and
installation media must also be on hand at the time of the exercise.
Consider whether the computer room UPS system can handle the current
load. Have new systems been added in the past six months? It might make
sense to have a UPS technician on-site and test UPS capacity and system
Print the OPS guides and staple them separately. Separate guides enable
personnel to work without sharing documentation. Upon completion of a
task, they can return to the team lead to address any remaining systems.
This also helps track progress and makes efficient use of resources.
Meet again before the exercise and conduct a dry run-through. Take note
of any issues and fine-tune the documentation.
A senior team member should direct and monitor the progress of the
exercise. Coordinate and reassign resources as they become available.
Make use of available personnel and system keyboards. Take note of
elapsed time, discrepancies in documentation and issues as they arise.
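A paper log works for the note-taking described above, but the same
record can be kept with a few lines of script. The following is a
minimal sketch; the class, field and phase names are made up for
illustration. It measures wall-clock time per phase, since several
people usually work on different systems at once.

```python
import time
from typing import NamedTuple


class TaskRecord(NamedTuple):
    system: str
    phase: str   # for example "shutdown" or "startup"
    start: float
    end: float
    note: str    # documentation discrepancy or issue, if any


class ExerciseLog:
    def __init__(self, clock=time.monotonic):
        self._clock = clock   # replaceable so the log can be tested
        self._open = {}       # (system, phase) -> start time
        self.records = []

    def begin(self, system, phase):
        self._open[(system, phase)] = self._clock()

    def finish(self, system, phase, note=""):
        """Close a task and return its elapsed time in seconds."""
        start = self._open.pop((system, phase))
        record = TaskRecord(system, phase, start, self._clock(), note)
        self.records.append(record)
        return record.end - record.start

    def wall_clock(self, phase):
        """Seconds from the first task started to the last one finished."""
        rows = [r for r in self.records if r.phase == phase]
        if not rows:
            return 0.0
        return max(r.end for r in rows) - min(r.start for r in rows)

    def notes(self):
        """All recorded discrepancies and issues, in completion order."""
        return [(r.system, r.note) for r in self.records if r.note]
```

In use, the team lead calls begin() when a task is handed out and
finish() when the team member reports back, adding a note whenever the
documentation did not match what was found in the rack.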
Document functionality testing to ensure that once systems are powered
up they are providing the services required. Turn off internal
monitoring systems as late as possible. A shutdown exercise is the
perfect opportunity to test monitoring. Document notification from
external monitoring services as well.
At the conclusion of the exercise, the time required to shut down and
restart the enterprise systems will be known. The preparation required
keeps documentation current. The exercise itself provides valuable
on-the-job training. This continuity helps eliminate single points of failure.
Provide senior management with a formal report detailing the results of
the exercise. Powering down the computer room is one of the first steps
in taking ownership of the organization’s infrastructure. In my
experience, many issues surface during this exercise. It is better to
learn about them during a maintenance window rather than complicate an