The Practice of System and Network Administration, Second Edition (part 6)

486 Chapter 20 Maintenance Windows
20.1.7.2 KVM and Console Service
Two data center elements that make management easier are KVM switches
and serial console servers. Both can be instrumental in making maintenance
windows easier to run by making it possible to remotely access the console
of a machine.
A KVM switch permits multiple computers to all share the same key-
board, video display, and mouse. A KVM switch saves space in a data center—
monitors and keyboards take up a lot of space—and makes access more con-
venient; indeed, more sophisticated console access systems can be accessed
from anywhere in the network.
A serial console server connects devices with serial consoles (systems
without video output, such as network routers, switches, and many UNIX
servers) to one central device with many serial inputs. By connecting to the
console server, a user can then connect to the serial console of the other
devices. All the computer room equipment that is capable of supporting a
serial console should have its serial console connected to some kind of console
concentrator, such as a networked terminal server.
Much work during a maintenance window requires direct console ac-
cess. Using console access devices permits people to work from their own
desks rather than having to try to coordinate access for many people to the
very limited number of monitors in the computer room or having to waste
computer room space, power, and cooling with more monitors. It is also more
convenient for the individual SAs to work in their own workspace with their
preparatory notes and reference materials around them.
20.1.7.3 Radios
Because of the tight scheduling of the maintenance window, the high number
of dependencies, and the occasional unpredictability of system administration
work, all the SAs have to let the flight director know when they finish a
task and before they start a new one, to make sure that the prerequisite
tasks have all been completed. We recommend using handheld radios to
communicate within the group. Rather than seeking out the flight director,
an SA can simply call over the radio. Likewise, the flight director can contact
the SAs to find out status, and team members and team leaders can find one
another and coordinate over the radio. If SAs need extra help, they can also
ask for it over the radio. There are multiple radio channels, and long con-
versations can move to another channel to keep the primary one free. The
radios are also essential for systemwide testing at the end of the maintenance
window (see Section 20.1.9).
20.1 The Basics 487
It is useful to use radios, cellphones, or some other effective form of two-
way communication for campuswide instant communication between SAs.
We recommend radios because they are not billed by the minute and typically
work better in data center environments than do cellphones. Remember, any-
thing transmitted on the airwaves can be overheard by others, so sensitive
information, such as passwords, should not be communicated over radios,
cellphones, or pagers.
Several options exist for selecting the radios, and what you choose de-
pends on the coverage area that you need, the type of terrain in that area,
availability, and your skill level. It is useful to have multiple channels, or fre-
quencies, available on the handheld radios, so that long conversations can
switch to another channel and leave the primary hailing channel open for
others (see Table 20.3).
Line-of-sight radio communications are the most common and typically
have a range of around 15 miles, depending on the surrounding terrain and
buildings. Your retailer should be able to set you up with one or more fre-
quencies and a set of radios that use those frequencies. Make sure that the
retailer knows that you need the radios to work through buildings and the
coverage that you need.
Table 20.3 Comparison of Radio Technologies

Type           Requirements                Advantages                           Disadvantages
Line of sight  • Frequency license         • Simple                             • Limited range
                                           • Transmits through walls            • Doesn’t transmit through mountains
Repeater       • Frequency license         • Better range                       • More complex to run
               • Radio operator license    • Repeater on mountain enables       • Skill qualifications
                                             communication over mountain
Cellular       • Service availability      • Simple                             • Higher cost
                                           • Wide range                         • Available only in cellphone
                                           • Unaffected by terrain                providers’ coverage area
                                           • Less to carry                      • Company contracts may limit options
                                                                                • Multiple channels may not be available
Repeaters can be used to extend the range of a radio signal and are
particularly useful if a mountain between campus buildings would block
line-of-sight communication. It can be useful to have a repeater and an an-
tenna on top of one of the campus buildings in any case, for additional range,
with at least the primary hailing channel using the repeater. This configura-
tion usually requires that someone with a ham radio license set up and operate
the equipment. Check your local laws.
Some cellphone companies offer push-to-talk features on cellphones so
that phones work more like walkie-talkies. This option will work wherever
the telephones operate. The provider should be able to provide maps of the
coverage areas. The company should supply all SAs with a cellphone with
this service. This has the advantage that the SAs have to carry only the phone,
not a phone and radio. This can be a quick and convenient way to get a new
group established with radios but may not be feasible if it requires everyone
to change to the same cellphone provider.
If radios won’t work or work badly in your data center because of radio
frequency (RF) shielding, put an internal phone extension with a long cord at
the end of every row, as shown in Figure 6.14. That way, SAs in the data center
can still communicate with other SAs while working in the data center. At
worst, they can go outside the data center, contact someone on the radio, and
arrange to talk to that person on a specific telephone inside the data center.
Setting up a conference call bridge for everyone to dial in to can provide
the benefits of radio communication, with the added benefit that people
anywhere in the world can dial in to participate. Having a permanent bridge
number assigned to the group makes it easier to memorize and can save
critical minutes when needed for emergencies.
Communication During an Emergency
A major news web site was flooded by users during the attacks of September 11, 2001.
It took a long time to request and receive a conference call bridge and even longer for
all the key players to receive dialing instructions.
20.1.8 Deadlines for Change Completion
A critical role of the flight director is tracking how the various tasks are
progressing and deciding when a particular change should be aborted and
the back-out plan for that change executed. For a general task with no other
dependencies and for which those involved had no other remaining tasks,
that time would be 11 PM on Saturday evening, minus the time required to
implement the back-out plan, in the case of a weekend maintenance window.
The flight director should also consider the performance level of the SA team.
If the members are exhausted and frustrated, the flight director may decide
to tell them to take a break or to start the back-out process early if they won’t
be able to implement it as efficiently as they would when they were fresh.

If other tasks depend on that system or service being operational, it
is particularly critical to predefine a cut-off point for task completion. For
example, if a console server upgrade is going badly, it can run into the time
regularly allotted for moving large data files. Once you have overrun one time
boundary, the dependencies can cascade into a full catastrophe, which can
be fixed only at the next scheduled downtime, perhaps another week away.
Make note of what other tasks are regularly scheduled near or during your
maintenance window, so you can plan when to start backing out of a problem.
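The cut-off arithmetic described above is simple enough to write down before the window opens. The following Python sketch is illustrative only: the function names, the optional safety margin, and the sample date and durations are assumptions, not part of the procedure described in the text.

```python
from datetime import datetime, timedelta


def abort_deadline(hard_stop, backout_time, margin=timedelta(0)):
    """Latest moment a change may still be in progress before back-out must start."""
    return hard_stop - backout_time - margin


def should_abort(now, hard_stop, backout_time, margin=timedelta(0)):
    """True once there is no longer enough time to back out before hard_stop."""
    return now >= abort_deadline(hard_stop, backout_time, margin)


# Hypothetical example: work must end by 11 PM on a Saturday, the back-out
# plan takes 2 hours, and a 30-minute margin is kept for a tired team.
hard_stop = datetime(2024, 6, 1, 23, 0)
deadline = abort_deadline(hard_stop, timedelta(hours=2), timedelta(minutes=30))
print("Start backing out no later than", deadline.isoformat(sep=" "))
```

For a task that other work depends on, the same calculation applies with `hard_stop` set to the start time of the earliest dependent task or regularly scheduled job rather than the end of the window.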
20.1.9 Comprehensive System Testing
The final stage of a maintenance window is comprehensive system testing. If
the window has been short, you may need to test only the few components
that you worked on. However, if you have spent your weekend-long main-
tenance window taking apart various complicated pieces of machinery and
then putting them back together and all under a time constraint, you should
plan on spending all day Sunday doing system testing.
Sunday system testing begins with shutting down all of the machines in
the data center, so that you can then step through your ordered boot sequence.
Assign an individual to each machine on the reboot list. The flight director
announces the stages of the shutdown sequence over the radio, and each indi-
vidual responds when the machine under their responsibility has completely
shut down. When all the machines at the current stage have shut down, the
flight director announces the next stage. When everything is down, the order
is reversed, and the flight director steps everyone through the boot stages.
If any problems occur with any machine at any stage, the entire sequence is
halted until they are debugged and fixed. Each person assigned to a machine
is responsible for ensuring that it shut down completely before responding
and that all services have started correctly before calling it in as booted and
operational.
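The staged shutdown and reversed boot order can be modeled as a short sketch. Everything named here is hypothetical: the machine names, the stage grouping, and the `action` and `confirmed` callbacks, which stand in for the SA doing the work and then calling it in over the radio.

```python
def run_stages(stages, action, confirmed):
    """Apply `action` stage by stage and halt at the first unconfirmed stage.

    Returns (index_of_halted_stage, machines_needing_attention), or
    (len(stages), []) when every stage completed.
    """
    for index, stage in enumerate(stages):
        for machine in stage:
            action(machine)
        pending = [m for m in stage if not confirmed(m)]
        if pending:
            # The whole sequence stops until these are debugged and fixed.
            return index, pending
    return len(stages), []


# Hypothetical shutdown order: clients first, core services last.
SHUTDOWN_STAGES = [["desktop1", "desktop2"], ["app1"], ["db1", "nfs1"], ["dns1"]]
# Booting walks the same stages in reverse order.
BOOT_STAGES = list(reversed(SHUTDOWN_STAGES))
```

Keeping the boot list as the literal reverse of the shutdown list avoids maintaining two lists that can drift apart; a site whose boot order is not an exact mirror would need separate lists.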
Finally, when all the machines in the data center have been successfully
booted in the correct order, the flight director splits the SA team into groups.
Each group has a team leader and is assigned an area in one of the campus
buildings. The teams are given instructions about which machines they
are responsible for and which tests to perform on them. The instructions
always include rebooting every desktop machine to make sure that it comes
up cleanly. The tests could also include logging in, checking for a particular
service, or trying to run a particular application, for example. Each person
in the group has a stack of colored sticky tabs used for marking offices and
cubicles that have been completed and verified as working. The SAs also have
a stack of sticky tabs of a different color to mark cubicles that have a prob-
lem. When SAs run across a problem, they spend a short time trying to fix it
before calling it in to the central core of people assigned to stay in the main
building to help debug problems. As it finishes its area, a team is assigned
to a new area or to help another team to complete an area, until the whole
campus has been covered.
Meanwhile, the flight director and the senior SA troubleshooters keep
track of problems on a whiteboard and decide who should tackle each prob-
lem, based on the likely cause and who is available. By the end of testing,
all offices and cubicles should have tags, preferably all indicating success. If
any offices or cubicles still have tags indicating a problem, a note should be
left for that customer, explaining the problem; someone should be assigned
to meet with that person to try to resolve it first thing in the morning.
This systematic approach helps to find problems before people come in
to work the next day. If there is a bad network segment connection, a failed
software depot push, or problems with a service, you’ll have a good chance
to fix it before anyone else is inconvenienced. Be warned, however, that some
machines may not have been working in the first place. The reboot teams
should always make sure to note when a machine did not look operational
before they rebooted it. They can still take time to try to fix it, but it is
lower on the priority list and does not have to happen before the end of the
maintenance window.
Ideally, the system testing and sitewide rebooting should be completed
sometime on Sunday afternoon. This gives the SA team time to rest after a
stressful weekend before coming into work the next day.
20.1.10 Post-maintenance Communication
Once the maintenance work and system testing have been completed, the
flight director sends out a message to the company, informing everyone that
service should now be fully restored. The message briefly outlines the main
successes of the maintenance window and briefly lists any services that are
known not to be functioning and when they will be fixed.
This message should be in a fixed format and written largely in advance,
because the flight director will be too tired at the end of a long weekend to
write a coherent or upbeat message. There is also little chance that anyone
who proofreads the message at that point will be able to help.
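One way to prepare the message in advance is to keep a fixed template and fill in only the lists at the end of the weekend. The sketch below uses Python's standard `string.Template`; the wording of the template and the function name `render_message` are assumptions made for illustration.

```python
from string import Template

# Hypothetical fixed-format message, written before the window starts.
COMPLETION_MESSAGE = Template(
    "Subject: Maintenance window complete\n\n"
    "Service should now be fully restored.\n\n"
    "Completed:\n$successes\n\n"
    "Known problems:\n$problems\n"
)


def render_message(successes, known_problems):
    """Fill the template from a list of successes and (service, fix_time) pairs."""
    def bullets(items):
        return "\n".join(f"  - {item}" for item in items) or "  - none"

    problem_lines = [f"{service} (expected fix: {when})"
                     for service, when in known_problems]
    return COMPLETION_MESSAGE.substitute(
        successes=bullets(successes), problems=bullets(problem_lines)
    )
```

With the prose fixed ahead of time, the only work left for an exhausted flight director is supplying two short lists.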
Hidden Infrastructure
Sometimes, customers depend on a server or a service but neglect to inform us, perhaps
because they implemented it on their own. This is what we call hidden infrastructure.
A site had a planned outage, and all power to the building was shut off. Servers
were taken down in an orderly manner and brought back successfully. The following
morning, the following email exchange took place:
From: IT
To: Everyone in the company
All servers in the Burlington office are up and running. Should you have
any issues accessing servers, please open a helpweb ticket.
From: A Developer
To: IT
Devwin8 is down.
From: IT
To: Everyone in the company
Whoever has devwin8 under their desk, turn it on, please.
20.1.11 Re-enable Remote Access
The final act before leaving the building should be to reenable remote access
and restore the voicemail on the helpdesk phone to normal. Make sure that
this appears on the master plan and the individual plans of those responsible.
It can be very easily forgotten after an exhausting weekend, but it is a very
visible, inconvenient, and embarrassing thing to forget, especially because it
can’t be fixed remotely if all remote access was turned off successfully.
20.1.12 Be Visible the Next Morning
It is very important for the entire SA group to be in early and to be visible
to the company the morning after a maintenance window, no matter how
hard they have worked during the outage. If everyone has company or group
shirts, coordinate in advance of the maintenance window so that all the SAs
wear those shirts on the day after the outage. Have the people who look after
particular departments roam the corridors of those departments, keeping eyes
and ears open for problems.
Have the flight director and some of the senior SAs from the central core-
services group, if there is one, sit in the helpdesk area to monitor incoming
calls and listen for problems that may be related to the maintenance window.
These people should be able to detect and fix them sooner than the regu-
lar helpdesk staff, who won’t have such an extensive overview of what has
happened.
A large visible presence when the company returns to work sends the mes-
sage: “We care, and we are here to make sure that nothing we did disrupts
your working hours.” It also means that any undetected problems can be han-
dled quickly and efficiently, with all the relevant staff on-site and not having
to be paged out of their beds. Both of these factors are important in the overall
satisfaction of the company with the maintenance window. If the company is
not satisfied with how the maintenance windows are handled, the windows
will be discontinued, which will make preventive maintenance more difficult.
20.1.13 Postmortem
By about lunchtime of the day after the maintenance window, most of the
remaining problems should have been found. At that point, if it is sufficiently
quiet, the flight director and some of the senior SAs should sit down and talk
about what went wrong, why, and what can be done differently. That should
all be noted and discussed with the whole group later in the week. Over
time, with the postmortem process, the maintenance windows will become
smoother and easier. Common mistakes early on are taking on too much, not
doing enough work ahead of time, and underestimating how long something
will take.
20.2 The Icing
Although a lot of basics must be implemented for a successful large-scale
maintenance window, a few more things are nice to have. After completion
of some successful maintenance windows, you should start thinking about
the icing that will make your maintenance windows more successful.
20.2.1 Mentoring a New Flight Director
It can be useful to mentor new flight directors for future maintenance win-
dows. Therefore, flight directors must be selected far enough in advance so
that the one for the next maintenance window can work with the current
flight director.
The trainee flight director can produce the first draft of the master plan,
using the change requests that were submitted, adding in any dependencies
that are missing, and tagging those additions. The flight director then goes
over the plan with the trainee, adds or subtracts dependencies, and reorga-
nizes the tasks and personnel assignments as appropriate, explaining why. Al-
ternatively, the flight director can create the first draft along with the trainee,
explaining the process while doing so. The trainee flight director can also help
out during the maintenance window, time permitting, by coordinating with
the flight director to track status of certain projects and suggesting realloca-
tion of resources where appropriate. The trainee can also help out before the
downtime by discussing projects with some of the SAs if the flight director
has questions about the project and by ensuring that the prerequisites listed
in the change proposal are met in advance of the maintenance window.
20.2.2 Trending of Historical Data
It is useful to track how long particular tasks take and then analyze the
data later and improve on the estimates in the task submission and planning
process. For example, if you find that moving a certain amount of data be-
tween two machines took 8 hours and you have a large data move between
two similar machines on similar networks another time, you can more accu-
rately predict how long it will take. If a particular software package is always
difficult to upgrade and takes far longer than anticipated, that will be tracked,
anticipated, allowed for in the schedule, and watched closely during the main-
tenance interval.
Trending is particularly useful in passing along historical knowledge.
When someone who used to perform a particular function has left the group,
the person who takes over that function can look back at data from previous
maintenance windows to see what sorts of tasks are typically performed in
this area and how long they take. This data can give people new to the group
and to planning a maintenance window a valuable head start so that they
don’t waste a maintenance opportunity and fall behind.
For each change request, record actual time to completion for use when
calculating time estimates next time around. Also record any other notes that
will help improve the process next time.
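As a rough illustration of how recorded completion times can feed back into estimates, the following Python sketch scales a new estimate by the median overrun seen in past windows and extrapolates a data move linearly from a previous one. The function names and the linear-scaling assumption are illustrative choices, not something the text prescribes.

```python
from statistics import median


def overrun_ratio(history):
    """Median of actual/estimated hours over past changes of the same kind.

    `history` is a list of (estimated_hours, actual_hours) pairs.
    """
    return median(actual / estimated for estimated, actual in history)


def adjusted_estimate(raw_estimate_hours, history):
    """Scale a submitted estimate by how far similar tasks overran before."""
    if not history:
        return raw_estimate_hours  # nothing recorded yet; keep the raw estimate
    return raw_estimate_hours * overrun_ratio(history)


def transfer_hours(size_gb, past_size_gb, past_hours):
    """Extrapolate a data move, assuming similar machines and networks."""
    return past_hours * size_gb / past_size_gb
```

The median is used instead of the mean so that one disastrous upgrade does not inflate every later estimate; a site with very few data points may prefer to use the worst case.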
20.2.3 Providing Limited Availability
It is highly likely that at some point, you will be asked to keep service available
for a particular group during a maintenance window. It may be something
unforeseen, such as a newly discovered bug that engineering needs to work
on all weekend, or it may be a new mode of operation for a division, such
as customer support switching to 24/7 service and needing continuous access
to its systems to meet its contracts. Internet services, remote access, global
networks, and new-business pressure reduce the likelihood that a full and
complete outage will be permitted.
Planning for this requirement could involve rearchitecting some services
or introducing added layers of redundancy to the system. It may involve
making groups more autonomous or distinct from one another. Making
these changes to your network can be significant tasks by themselves,
likely requiring their own maintenance window; it is best to be prepared
for these requests before they arrive, or you may be left without time to
prepare.
To approach this task, find out what the customers will need to be
able to do during the maintenance window. Ask a lot of questions, and
use your knowledge of the systems to translate these needs into a set of
service-availability requirements. For example, customers will almost cer-
tainly need name service and authentication service. They may need to be
able to print to specific printers and to exchange email within the com-
pany or with customers. They may require access to services across wide-
area connections or across the Internet. They may need to use particular
databases; find out what those machines depend on. Look at ways to make
the database machines redundant so that they can also be properly main-
tained without loss of service. Make sure that the services they depend on
are redundant. Identify what pieces of the network must be available for
the services to work. Look at ways to reduce the number of networks that
must be available by reducing the number of networks that the group uses
and locating redundant name servers, authentication servers, and print
servers on the group’s networks. Find out whether small outages are
acceptable, such as a couple of 10-minute outages for reloading network
equipment. If not, the company needs to invest in redundant network
equipment.
Devise a detailed availability plan that describes exactly what services
and components must be available to that group. Try to simplify it by consol-
idating the network topology and introducing redundant systems for those
networks. Incorporate availability planning into the master plan by ensuring
that redundant servers are not down simultaneously.
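The check that redundant servers are never down at the same time can be run against the master plan before the window. This Python sketch assumes a simplified plan with one downtime interval per server, expressed as hours since the start of the window; the data layout, server names, and function name are all hypothetical.

```python
def redundancy_conflicts(groups, downtime):
    """Find periods when every member of a redundant group is down at once.

    groups:   {service: [server, ...]} listing redundant servers per service
    downtime: {server: (start_hour, end_hour)} from the master plan
    Returns a list of (service, overlap_start, overlap_end).
    """
    conflicts = []
    for service, servers in groups.items():
        if not all(server in downtime for server in servers):
            continue  # at least one member is never taken down
        start = max(downtime[server][0] for server in servers)
        end = min(downtime[server][1] for server in servers)
        if start < end:
            conflicts.append((service, start, end))
    return conflicts


# Hypothetical plan: the two DNS servers overlap between hours 1 and 2.
groups = {"dns": ["dns1", "dns2"], "auth": ["auth1", "auth2"]}
downtime = {"dns1": (0, 2), "dns2": (1, 3), "auth1": (0, 1), "auth2": (1, 2)}
print(redundancy_conflicts(groups, downtime))
```

Back-to-back intervals, such as the two authentication servers here, are not reported as conflicts; a real plan would probably want a gap between them to confirm that the first server is healthy before the second goes down.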
20.2.4 High-Availability Sites
By the very nature of their business, high-availability sites cannot afford to
have large planned outages.² This also means that they cannot afford not to
make the large investment necessary to provide high availability. Sites that
have high-availability requirements need to have lots of hot redundant systems
that continue providing service when any one component fails. The
higher the availability requirement, the more redundant systems are
required to achieve it.³
These sites still need to perform maintenance on the systems in service.
Although the availability guarantees that these sites make to their customers
typically exclude maintenance windows, they will lose customers if they have
large planned outages.
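To get a feel for what an availability target allows, the percentage can be converted into permitted downtime per year. The sketch below also includes a common textbook approximation for the availability of several redundant copies; that formula assumes failures are independent, which real systems often violate, and it is an illustration added here, not something stated in the text. All function names are hypothetical.

```python
HOURS_PER_YEAR = 365 * 24


def downtime_hours_per_year(availability_percent):
    """Hours of downtime per year permitted by an availability target."""
    return (1 - availability_percent / 100) * HOURS_PER_YEAR


def parallel_availability(single_availability, copies):
    """Availability of `copies` redundant components, any one of which suffices.

    Assumes independent failures (an idealization).
    """
    return 1 - (1 - single_availability) ** copies


for target in (99.9, 99.99, 99.999):
    print(f"{target}% allows {downtime_hours_per_year(target) * 60:.1f} minutes per year")
```

Under the independence assumption, two components that are each 99 percent available give about 99.99 percent together, which is one way to see why higher targets call for more redundancy.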
20.2.4.1 The Similarities
Most of the principles described here for maintenance windows at a corporate
site apply at high-availability sites.

• They need to schedule the maintenance window so that it has the least
  impact on their customers. For example, ISPs often choose 2 AM (local
  time) midweek; e-commerce sites need to choose a time when they do
  the least business. These windows will typically be quite frequent, such
  as once a week, and shorter, perhaps 4 to 6 hours in duration.
• They need to let their customers know when maintenance windows
  are scheduled. For ISPs, this means sending an email to the customers.
  For an e-commerce site, this means having a banner on the site. In
  both cases, it should be sent only to those customers who may be affected
  and should contain a warning that small outages or degraded
  service may occur during the maintenance window and give the times
  of that window. There should be only a single message about the
  window.
2. High availability is anything above 99.9 percent. Typically, sites will be aiming for three nines
(99.9 percent; about 9 hours of downtime per year), four nines (99.99 percent; about 1 hour per year),
or five nines (99.999 percent; about 5 minutes per year). Six nines (99.9999 percent; less than
1 minute per year) is more expensive than most sites can afford.
3. Recall that n + 1 redundancy is used for services such that any one component can fail without
bringing the service down, n + 2 means any two components can fail, and so on.

• Planning and doing as much as possible beforehand is critical because
  the maintenance windows should be as short as possible.
• There must be a flight director who coordinates the scheduling and
  tracks the progress of the tasks. If the windows are weekly, this may be
  a quarter-time or half-time job.
• Each item should have a change proposal. The change proposal should
  list the redundant systems and include a test to verify that the redundant
  systems have kicked in and that service is still available.
• They need to tightly plan the maintenance window. Maintenance windows
  are typically smaller in scope and shorter in time. Items scheduled
  by different people for a given window should not have dependencies
  on each other. There must be a small master plan that shows who has
  what tasks and their completion times.
• The flight director must be very strict about the deadlines for change
  completion.
• Everything must be fully tested before it is declared complete.
• Remote KVM and console access benefit all sites.
• The SAs need to have a strong presence when the site approaches and
  enters its busy time. They need to be prepared to deal quickly with any
  problems that may arise as a result of the maintenance.
• A brief postmortem the next day to discuss any remaining problems or
  issues that arose is useful.
20.2.4.2 The Differences
There also are several differences in maintenance windows for high-
availability sites.

• A prerequisite is that the site must have redundancy if it is to have high
  availability.
• It is not necessary to disable access. Services should remain available.
• It is not necessary to have a full shutdown/boot list, because a full
  shutdown/reboot does not happen. However, there should be a dependency
  list if there are any dependencies between machines.⁴
4. Usually, high-availability sites avoid dependencies between machines as much as possible.

• Because ISPs and e-commerce sites do not have on-site customers, being
  physically visible the morning after is irrelevant. However, being available
  and responsive is still important. Find ways to increase your visibility
  and ensure excellent responsiveness. Advertise what the change was,
  how to report problems, and so on. Maintain a blog, or put banner
  advertisements on your internal web sites advertising the newest features.
• A post-maintenance communication is usually not required, unless customers
  must be informed about remaining problems. Customers don’t
  want to be bombarded with email from their service providers.
• The most important difference is that the redundant architecture of the
  site must be taken into account during the maintenance window planning.
  The flight director needs to make sure that none of the scheduled
  work can take the service down. The SAs need to make sure that they
  know how long failover takes to happen. For example, how long does
  the routing system take to reach convergence when one of the routers
  goes down or comes back up? If redundancy is implemented within a
  single machine, the SA needs to know how to work on one part of the
  machine while keeping the system operating normally.
• Availability of the service as a whole must be closely monitored during
  the maintenance window. There should be a plan for how to deal
  with any failure that causes an outage as a result of temporary lack of
  redundancy.
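Monitoring during the window and measuring how long failover takes both reduce to the same bookkeeping: turning a series of up/down probe results into outage intervals. The Python sketch below assumes that some external probe, not shown here, has already produced time-ordered samples; the sample format and function name are hypothetical.

```python
def outage_windows(samples):
    """Convert time-ordered (timestamp_seconds, is_up) samples into outages.

    Returns a list of (down_since, back_up_at); back_up_at is None when the
    service was still down at the last sample.
    """
    outages = []
    down_since = None
    for timestamp, is_up in samples:
        if not is_up and down_since is None:
            down_since = timestamp          # first failed probe of an outage
        elif is_up and down_since is not None:
            outages.append((down_since, timestamp))
            down_since = None
    if down_since is not None:
        outages.append((down_since, None))
    return outages


# Hypothetical probe results taken every 5 seconds during a router reload.
samples = [(0, True), (5, False), (10, False), (15, True), (20, True), (25, False)]
print(outage_windows(samples))
```

The measured duration is only as precise as the probe interval, so a failover that is expected to complete in a few seconds needs correspondingly frequent probes.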
20.3 Conclusion
The basics for successfully executing a planned maintenance window fall
into three categories: preparation, execution, and post-maintenance customer
care. The advance preparation for a maintenance window has the most effect
on whether it will run smoothly. Planning and doing as much as possible in
advance are key. The group needs to appoint an appropriate flight director
for each maintenance window. Change proposals should be submitted to the
flight director, who uses them to build a master plan and set completion
deadlines for each task.
During the maintenance window, remote access should be disabled, and
infrastructure, such as console servers and radios, should be in place. The
plan needs to be executed with as few hiccups as possible. The timetable
must be adhered to rigidly; it must finish with complete system testing.
Good customer care after the maintenance window is important to its
success. Communication about the window and a visible presence the morn-
ing after are key.
Integrating a mentoring process, saving historical data and doing trend
analysis for better estimates, providing continuity, and providing limited
availability to groups that request it can be incorporated at a later date. Proper
planning, good back-out plans, strict adherence to deadlines for change com-
pletion, and comprehensive testing should avert all but some minor disas-
ters. Some tasks may not be completed, and those changes will need to be
backed out. In our experience, a well-planned, properly executed maintenance
window never leads to a complete disaster. A badly planned or poorly exe-
cuted one could, however. These kinds of massive outages are difficult and
risky. We hope that you will find the planning techniques in this chapter
useful.
Exercises

1. Read the paper on how the AT&T/Lucent network was split (Limoncelli
et al. 1997), and consider how having a weekend maintenance window
would have changed the process. What parts of that project would have
been performed in advance as preparatory work, what parts would have
been easier, and what parts would have been more difficult? Evaluate the
risks in your approach.
2. A case study in Section 20.1.1 describes the scheduling process for a
particular software company. What are the dates and events that you
would need to avoid for a maintenance window in your company? Try
to derive a list of dates, approximately 3 months apart, that would work
for your company.
3. Consider the SAs in your company. Who do you think would make good
flight directors and why?
4. What tasks or projects can you think of at your site that would be ap-
propriate for a maintenance window? Create and fill in a change-request
form. What preparation could you do for this change in advance of the
maintenance window?
5. Section 20.1.6 discusses disabling access to the site. What specific tasks
would need to be performed at your site, and how would you re-enable
that access?
6. Section 20.1.7.1 discusses the shutdown and reboot sequence. Build an
appropriate list for your site. If you have permission, test it.
7. Section 20.2.3 discusses providing limited availability for some people
to be able to continue working. What groups are likely to require 24/7
availability? What changes would you need to make to your network and
services infrastructure to keep services available to each of those groups?
8. Research the flight operations methodologies used at NASA. Relate what
you learned to the practice of system administration.

Chapter 21
Centralization and Decentralization
This chapter seeks to help an SA decide how much centralization is
appropriate for a particular site or service and how to transition between
more and less centralization.
Centralization means having one focus of control. One might have two
DNS servers in every department of a company, but they all might be con-
trolled by a single entity. Alternatively, decentralized systems distribute con-
trol to many parts. In our DNS example, each of those departments might
maintain and control its own DNS server, being responsible for maintain-
ing the skill set to stay on top of the technology as it changes, to architect
the systems as it sees fit, and to monitor the service. Centralization refers
to nontechnical control also. Companies can structure IT in a centralized or
decentralized manner.
Centralization is an attempt to improve efficiency by taking advantage
of potential economies of scale: improving the average; it may also improve
reliability by minimizing opportunities for error. Decentralization is an at-
tempt to improve speed and flexibility by reorganizing to increase local
control and execution of a service: improving the best case. Neither is al-
ways better, and neither is always possible in the purest sense. When each
is done well, it can also realize the benefits of the other: odd paradox,
isn’t it?
Decentralization means breaking away from the prevailing hegemony,
revolting against the frustrating bureaucratic ways of old. Traditionally, it
means someone has become so frustrated with a centralized service that “do
it yourself” has the potential of being better. In the modern environment
decentralization is often a deliberate response to the faster pace of business
and to customer expectations of increased autonomy.
Centralization means pulling groups together to create order and enforce
process. It is cooperation for the greater good. It is a leveling process. It seeks
to remove the frustrating waste of money on duplicate systems, extra work,
and manual processes. New technology paradigms often bring opportunities
for centralization. For example, although it may make sense for each depart-
ment to have slightly different processes for handling paper forms, no one de-
partment could fund building a pervasive web-based forms system. Therefore,
a disruptive technology, such as the web, creates an opportunity to replace
many old systems with a single, more efficient, centralized system. Conversely,
standards-based web technology can enable a high degree of local autonomy
under the aegis of a centralized system, such as delegated administration.
21.1 The Basics
At large companies in particular, it seems as if every couple of years, man-
agement decides to centralize everything that is decentralized and vice versa.
Smaller organizations encounter similar changes driven by mergers or opening
of new campuses or field offices. In this section, we discuss guiding principles
you should consider before making such broad changes. We then discuss some
services that are good candidates for centralization and decentralization.
21.1.1 Guiding Principles
There are several guiding principles related to centralization and decentral-
ization. They are similar to what anyone making large, structural changes
should consider.

Problem-Solving: Know what specific problem you are trying to solve.
Clearly define what problem you are trying to fix. “Reliability is in-
consistent because each division has different brands of hardware.”
“Services break when network connections to sales offices are down.”
Again, write down the specific problem or problems and communicate
these to your team. Use this list as a reality check later in the project to
make sure that you haven’t lost sight of the goal. If you are not solving
a specific problem, or responding to a direct management request, stop
right here. Why are you about to make these changes? Are you sure this
is a real priority?

Motivation: Understand your motivation for making the change. Maybe
you are seeking to save money, increase speed, or become more flexible.
Maybe your reasons are political: You are protecting your empire or
your boss, making your group look good, or putting someone’s personal
business philosophy into action. Maybe you are doing it simply to make
your own life easier; that’s valid too. Write down your motivation and
remind yourself of it from time to time to verify that you haven’t strayed.

Experience Counts: Use your best judgment. Sometimes, you must use
experience and a hunch rather than specific scientific measurements.
For example, when centralizing email servers, our experience has
produced these rules of thumb: Small companies (five
departments with 100 people) tend to need one email server. Larger
companies can survive with one email server for every few thousand people,
especially if there is one large headquarters and many smaller sales
offices. When the company grows to the point of having more than
one site, each site tends to require its own email server but is unlikely
to require its own Internet gateway. Extremely large or geographically
diverse companies start to require multiple Internet gateways at different
locations.

Involvement: Listen to the customers’ concerns. Consult with customers
to understand their expectations: Retain the good aspects and fix the
bad ones. Focus on the qualities that they mention, not the
implementation. People might say that they like the fact that “Karen was always
right there when we needed new desktop PCs installed.” That is an
implementation. The new system might not include on-site personnel.
What should be retained is that the new service has to be responsive—as
responsive as having Karen standing right there. That may mean the
use of overnight delivery services or preconfigured and “ready to eat”
systems¹ stashed in the building, or whatever it takes to meet that
expectation. Alternatively, you must set expectations if the new system
is not going to deliver on old expectations. Maybe people will have to
plan ahead and ask for workstations a day in advance.

Be Realistic: Be circumspect about unrealistic promises. You should
thoroughly investigate any claims that you will save money by decentral-
izing, add flexibility by centralization, or have an entirely new system
without pain: The opposite is usually the case. If a vendor promises
that a new product will perform miracles but requires you to centralize
(or decentralize) how something is currently organized, maybe the
benefits come from the organizational change, not the product!
1. Do not attempt to eat a computer. “Ready to eat” systems are hot spares that will be fully functional
when powered up: absolutely no configuration files to modify and so on.

Balance: Centralize as much as makes sense for today, with an eye
toward the future. You must find the balance between centralization
and decentralization. There are time considerations: Building the perfect
system will take forever. You must set realistic goals yet keep an eye to
future needs. For example, in 6 months, the new system will be complete
and then will be expected to process a million widgets per day. However,
a different architecture will be required to process 2 million widgets per
day, the rate that will be needed a year later, and will require considerably
more development time. You must weigh the advantage of having
a new system in 6 months, along with the drawback of needing to start
building the next-generation system immediately, against the advantage of
waiting longer for a system that will not need to be replaced so soon.

Access: The more centralized something is, the more likely it is that some
customers will need a special feature or some kind of customization.
An old business proverb is: “All of our customers are the same: They
each have unique requirements.” One size never fits all. You can’t do
a reasonable job of centralizing without being flexible; you’ll doom the
project if you try. Instead, look for a small number of models. Some
customers require autonomy. Some may require performing their own
updates, which means creating a system of access control so that cus-
tomers can modify their own segments without affecting others.

No Pressure: It’s like rolling out any new service. Although more emo-
tional impact may be involved than with other changes, both centraliza-
tion and decentralization projects have issues similar to building a new
service. That said, new services require careful coordination, planning,
and understanding of customer needs to succeed.

110 Percent: You have only one chance to make a good first impression.
A new system is never trusted until proven a success, and the first ex-
perience with the new system will set the mood for all future customer
interactions. Get it right the first time, even if it means spending more
money up front or taking extra time for testing. Choose test customers
carefully, making sure they trust you to fix any bugs found while testing,
and won’t gossip about it at the coffee machine. Provide superior service
the first month, and people will forgive later mistakes. Mess up right at
the start, and rebuilding a reputation is nearly impossible.

Veto Power: Listen to the customers, but remember that management
has the control. The organizational structure can influence the level of
centralization that is appropriate or possible. The largest impediment to
centralization often is management decisions or politics. Lack of trust
makes it difficult to centralize. If the SA team has not proved itself, man-
agement may be unwilling to support the large change. Management
may not be willing to fund the changes, which usually indicates that the
change is not important to them. For example, if the company hasn’t
funded a central infrastructure group, SAs will end up decentralized. It
may be better to have a central infrastructure group; lacking manage-
ment support, however, the fallback is to have each group make the best
subinfrastructure it can—ideally, coordinating formally or informally to
set standards, purchase in bulk, and so on. Either way, the end goal is
to provide excellent service to your customers.
21.1.2 Candidates for Centralization
SAs continually find new opportunities to centralize processes and services.
Centralization does not innately improve efficiency. It brings about the op-
portunity to introduce new economies of scale to a process. What improves
efficiency is standardization, which is usually a by-product of centralization.
The two go hand in hand.
The cost savings of centralization come from the presumption that there
will be less overhead than the sum of the individual overheads of each decen-
tralized item. Centralization can create a simpler, easier-to-manage architec-
ture. One SA can manage a lot more machines if the processes for each are
the same.
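The presumption above can be checked with simple arithmetic. What follows is a minimal sketch, not a costing method taken from the text: the function name, the use of SA hours as the unit, and every figure are hypothetical, chosen only to show a single centralized overhead being compared against the sum of the decentralized ones.

```python
def centralization_savings(decentralized_overheads, central_overhead):
    """Return the overhead saved by centralizing.

    A positive result means the centralized service carries less overhead
    than the sum of the individual services; a negative result means
    centralizing would cost more.
    """
    return sum(decentralized_overheads) - central_overhead


# Hypothetical annual overhead, in SA hours, for three departments that
# each run their own copy of a service.
department_overheads = [400, 350, 500]

# Hypothetical annual overhead for one standardized, centralized service.
central_overhead = 900

saving = centralization_savings(department_overheads, central_overhead)
print(f"Estimated saving: {saving} SA hours per year")
```

If the result is negative, the presumption does not hold for that service, which is a signal to revisit the motivation written down earlier.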
To the previous owners of the service being centralized, centralization is
about giving up control. Divisions that previously provided their own service
now have to rely on a centralized group for service. SAs who previously did
tasks themselves, their own way, now have to make requests of someone else
who has his or her own way to do things. The SAs will want to know whether
the new service provider can do things better.
Before taking control away from a previous SA or customer, ask yourself
what the customer’s psychological response will be. Will there be attempts to
sabotage the effort? How can you convince people that the new system will
be better than the old system? How will damage control and rumor control
be accomplished? What’s the best way to make a good first impression?
The best way to succeed in a centralization program is to pick the right
services for centralization. Here are some good candidates.

Distributed Systems: Management of distributed systems. Historically,
each department of an organization configured and ran its own web
servers. As the technology got more sophisticated, less customization of
each web server was required. Eventually, there was no reason not to
have each web server configured exactly the same way, and the need
for rapid updates of new binaries was becoming a security issue. The
motivation was to save money by not requiring each department to have
a high level of web server expertise. The problem being fixed was the
lack of similar configurations on each server. A system was designed to
maintain a central configuration repository that would update each of
the servers in a controlled and secure manner. The customers affected
were the departmental SAs, who were eager to give up a task that they
didn’t always understand. By centralizing web services, the organization
could also afford to have one or more SAs become better trained in that
particular service, to provide better in-house customer support.


Consolidation: Consolidate services onto fewer hosts. In the past, for
reliability’s sake, one service was put on each physical host. However,
as technology progresses, it can be beneficial to have many services on
one machine. The motivation is to decrease cost. The problem being
fixed is that every host has overhead costs, such as power, cooling, ad-
ministration, machine room space, and maintenance contracts. Usually,
a single, more powerful machine costs less to operate than several
smaller hosts. As services are consolidated, care must be taken to group
customers with similar needs.
Since the late 1990s, storage consolidation has been a big buzzword.
Building one large storage-area network (SAN) that each server accesses
leaves less “stranded storage” (partially full disks) on each server.
Often, storage consolidation involves decommissioning older, slower, or
soon-to-fail disks and moving the data onto the SAN, providing better
performance and reliability.
Server virtualization, a more recent trend, involves using virtual
hosts to save hardware and license costs. For example, financial insti-
tutions used to have expensive servers and multiple backup machines
to run a calculation at a particular time of the day, such as making
end-of-day transactions after the stock market closes. Instead, a virtual
machine can be spun up shortly before the market closes; the machine
runs its tasks, then spins down. Once it is done, the server is free to run
other virtual machines that do other periodic tasks.
By using a global file system, such as a SAN, a virtualization cluster
can be built. Since the virtual machine images—the data stored on disk
that defines the state of a virtual machine—can be accessed from many
hardware servers, advanced virtualization management software can
migrate virtual machines between physical machines with almost
unnoticeable switch-over time. Many times, sites realize that they need many
machines, each performing a particular function, none of which requires
enough CPU horsepower to justify the cost of dedicated hardware.
Instead, the virtual machines can share a farm, or cluster, of physical
machines, as needed. Since virtual machines can migrate between dif-
ferent hardware nodes, workload can be rebalanced. Virtual machines
can be moved off an overloaded physical machine. Maintenance be-
comes easier too. If one physical machine is showing signs of hardware
problems, virtual machines can be migrated off it onto a spare machine
with no loss of service; the physical machine can then be repaired or
upgraded.

Administration: System administration. When redesigning your organi-
zation (see Chapter 30), your motivation may be to reduce cost, improve
speed, or provide services consistently throughout the enterprise. The
problem may be the extra cost of having technical management for each
team or that the distributed model resulted in some divisions’ having
poorer service than others. Centralizing the SA team can fix these
problems.
To provide customization and the “warm fuzzies” of personal at-
tention, subteams might focus on particular customer segments. An ex-
cellent example of this is a large hardware company’s team of “CAD
ambassadors,” an SA group that specializes in cross-departmental sup-
port of CAD/CAM tools throughout the company. However, a common
mistake is to take this to an extreme. We’ve seen at least one amazingly
huge company that centralized to the point that “customer liaisons”
were hired to maintain a relationship with the customer groups, and the
customers hired liaisons to the centralized SA staff. Soon, these liaisons
numbered more than 100. At that point, the savings in reduced
overhead were surely diminished. A regular reminder of, and dedication to, the
original motivation might have prevented that problem.

Specialization: Expertise. In decentralized organizations, a few of the
groups are likely to have more expertise in particular areas than other
groups do. This is fine if they maintain casual relationships and help
one another. However, certain expertise can become critical to busi-
ness, and therefore an informal arrangement becomes an unacceptable
business risk. In that case, it may make sense to consolidate that exper-
tise into one group. The motivation is to ensure that all divisions have
access to a minimum level of expertise in one specific area or areas.
The problem is that the lack of this expertise causes uneven service lev-
els, for example, if one division had unreliable DNS but others didn’t
or if one division had superior Internet email service, whereas others
were still using UUCP-style addresses. (If you are too young to remem-
ber UUCP-style addresses, just count your blessings.) That would be
intolerable!
Establishing a centralized group for one particular service can bring
uniformity and improve the average across the entire company. Some
examples of this include such highly specialized skills as maintaining
an Internet gateway, a software depot, various security issues—VPN
service, intrusion detection, security-hole scanning, and so on—DNS,
and email service. A common pattern at larger firms is to create a “Core
Services” or “Infrastructure” team to consolidate expertise in these areas
and provide infrastructure across the organization.

Left Hand, Right Hand: Infrastructure decisions. The creation of infras-
tructure and platform standards can be done centrally. This is a subcase
of centralizing expertise. The motivation at one company was that
infrastructure costs were high and interoperability between divisions
was low. There were many specific problems to be solved. Every division
had a team of people researching new technologies and making
decisions independently. Each team’s research duplicated the effort of
the others. Volume-purchasing contracts could not be signed, because
each individual division was too small to qualify. Repair costs were high
because so many different spare parts had to be purchased. When di-
visions did make compatible purchasing decisions, multiple spare parts
were still being purchased because there was no coordination or coop-
eration. The solution was to reduce the duplication in effort by hav-
ing one standards committee for infrastructure and platform standards.
Previously, new technology was often adopted in pockets around the
company because some divisions were less averse to risk; these became
the divisions that performed product trials or became early adopters of
new technology.
This last example brings up another benefit of centralization. The
increased purchasing power should mean that better equipment can be
purchased for the same cost. Vendors may provide better service, as well
as preferred pricing, when they deal with a centralized purchasing group
that reflects the true volume of orders from that one source. Sometimes,
money can be saved through centralization. Other times, it is better to
use the savings to invest in better equipment.

Commodity: If it has become a commodity, consider centralization. A
good time to consider centralizing something is when the technology
path it has taken has made it a commodity. Network printing, file service,
email servers, and even workstation maintenance used to be unique, rare
technologies. However, now these things are commodities and excellent
candidates for centralization.
Case Study: Big, Honkin’ File Servers
Tom’s customers and even fellow SAs fought long and hard against the concept of
large, centralized file servers. The customers complained about the loss of control and
produced, in Tom’s opinion, ill-conceived pricing models that demonstrated that the
old UNIX-based file servers were the better way to go. What they were really fighting
was the notion that network file service was no longer very special; it had become
a commodity and therefore an excellent candidate for centralization. Eventually, an
apples-to-apples comparison was done. It used a total cost-of-ownership model
that included the SA time and energy to maintain the old-style systems. The value
of some unique features of the dedicated file servers, such as file system snapshot,
was difficult to quantify. However, even when the cost model showed the systems to
cost about the same per gigabyte of usable storage, the dedicated file servers had
an advantage over the old systems: consistency and support. The old systems were
a mishmash of various manufacturers for the host; for the RAID controllers; and for
the disk drives, cables, network interfaces, and, in some cases, even the racks they sat
in! Each of these items usually required a level of expertise and training to maintain
efficiently, and no single vendor would support these Frankenstein monsters. Usually,
when the SA who purchased a particular RAID device left the group, the expertise left
with the person. Standardizing on a particular product resulted in a higher level of
service because the savings were used to purchase top-of-the-line systems that had
fewer problems than inexpensive competitors. Also, having a single phone number
to call for support was a blessing.
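An apples-to-apples comparison of the kind described in the case study can be sketched as a cost per usable gigabyte that counts SA labor alongside hardware. This is only an illustration under stated assumptions: the function, its parameters, and all figures below are hypothetical and are not the numbers from Tom's site.

```python
def cost_per_usable_gb(hardware_cost, annual_sa_hours, sa_hourly_rate,
                       years, usable_gb):
    """Total cost of ownership per usable gigabyte.

    Adds the SA labor needed to maintain the system over its lifetime
    to the purchase price, then divides by usable capacity.
    """
    labor_cost = annual_sa_hours * sa_hourly_rate * years
    return (hardware_cost + labor_cost) / usable_gb


# Hypothetical figures: cheap hardware that needs a lot of SA attention...
old_style = cost_per_usable_gb(hardware_cost=20_000, annual_sa_hours=200,
                               sa_hourly_rate=50, years=3, usable_gb=1_000)

# ...versus a pricier dedicated file server that needs far less.
dedicated = cost_per_usable_gb(hardware_cost=44_000, annual_sa_hours=40,
                               sa_hourly_rate=50, years=3, usable_gb=1_000)

print(f"old-style: {old_style:.2f} per GB, dedicated: {dedicated:.2f} per GB")
```

With these made-up inputs the two options come out equal per gigabyte, which mirrors the situation in the case study: once the cost model is a tie, qualities the model does not capture, such as consistency and single-vendor support, decide the matter.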
Printing is another commodity service that has many opportunities for
centralization, both in the design of the service itself and when purchasing
supplies. Section 24.1.1 provides more examples.
21.1.3 Candidates for Decentralization
Decentralization does not automatically improve response times. When done
correctly, it creates an opportunity to do so. Even when the new process is
less efficient or is inefficient in different ways, people may be satisfied simply
to be in control. We’ve found that people are more tolerant of a mediocre
process if they feel they control it.
Decentralization often trades cost efficiency for something even more
valuable. In these examples, we decentralize to democratize control, gain
fault tolerance, acquire the ability to have a customized solution, or remove
ourselves from clue-lacking central authorities. (“They’re idiots, but they’re
our division’s idiots.”) One must seek to retain what was good about the old
system while fixing what was bad.
Decentralization democratizes control. The new people gaining control
may require training; this includes both the customers and the SAs. The goal
may be autonomy, the ability to control one’s own destiny, or the ability to
be functional when disconnected from the network. This latter feature is also
referred to as compartmentalization, the ability to achieve different reliability
levels for different segments of the community. Here are some good candidates
for decentralization.

Fault tolerance. The duplication of effort that happens with decentral-
ization can remove single points of failure. A company with growing
field offices required all employees to read email off servers located
in the headquarters. There were numerous complaints that during net-
work outages, people couldn’t read or even compose email, because
composition required access to directory servers that were also at the
headquarters. Divisions in other time zones were particularly upset that
maintenance times at the headquarters were their prime working hours.
The motivation was to increase reliability, in particular access during
outages. The problem was that people couldn’t use email when WAN
links were down. The solution was to install local LDAP caches and
email servers in each of the major locations. (It was convenient and
effective to also use this host for DNS, authentication, and other ser-
vices.) Although mail would not be transmitted site to site during an
