The Practice of System and Network Administration, Second Edition, part 5


14.2 The Icing
information in step 3, which makes further steps difficult and may require
contacting the customer unnecessarily. A written chart of who is responsible
for what, as well as a list of standard information to be collected for each
classification, will reduce these problems.
14.2.2 Holistic Improvement
In addition to focusing on improving each step, you may also focus on im-
proving the entire process. Transitioning to each new step should be fluid. If
the customer sees an abrupt, staccato handoff between steps, the process can
appear amateurish or disjointed.
Every handoff is an opportunity for mistakes and miscommunication.
The fewer handoffs, the fewer opportunities there are for mistakes.
A site small enough to have a single SA has zero opportunities for this
class of error. However, as systems and networks grow and become more
complicated, it becomes impossible for a single person to understand, main-
tain, and run the entire network. As a system grows, handoffs become a
necessary evil. This explains a common perception that larger SA groups are
not as effective as smaller ones. Therefore, when growing an SA group, you
should focus on maintaining high-quality handoffs. Or, you might choose
to develop a single point of contact or customer advocate for an issue.
That results in the customers’ seeing a single face for the duration of a
problem.
14.2.3 Increased Customer Familiarity
If a customer talks to the same person whenever calling for support, the
SA will likely become familiar with the customer’s particular needs and be
able to provide better service. There are ways to improve the chance of this
happening. For example, SA staff subteams may be assigned to particular
groups of customers rather than to the technology they support. Or, if the phone-answering staff is extremely large, the group may use a telephone
call center system, whereby customers call a single number and the call center
routes the call to an available operator. Modern call center systems can route calls based on caller ID, using this functionality, for example, to route the call
to the same operator the caller spoke to last time, if that person is available.
This means there will be a tendency for customers to be speaking to the
same person each time. It can be very comforting to speak to someone who
recognizes your voice.
Chapter 14 Customer Care
14.2.4 Special Announcements for Major Outages
During a major network outage, many customers may be trying to report
problems. If customers report problems through an automatic phone response
system (“Press 1 for ..., press 2 for ...”), such a system can usually be programmed to announce the network outage before listing the options. “Please note the network connection to Denver is currently experiencing trouble. Our service provider expects it to be fixed by 3 PM. Press 1 for ..., press 2 for ....”
14.2.5 Trend Analysis
Spend some time each month looking for trends, and take action based on
them. This does not have to be a complicated analysis, as this case study
describes.
Case Study: Who Generates the Most Tickets?
At one site, we simply looked at which customers opened the most tickets in the last
year. We found that 3 of the 600 people opened 10 percent of all tickets. That’s a
lot! It was easy to visit each person’s manager to discuss how we could provide better
service; if the person was generating so many tickets, we obviously weren’t meeting the person’s needs.
One person opened so many tickets because he was pestering the SAs for
workarounds to the bugs in the old version of the LaTeX typesetting package that he was using and refused to upgrade to the latest version, which fixed most of the problems he was reporting. This person’s manager agreed that the best solution would be
for him to require his employee to adopt the latest LaTeX and took responsibility for
seeing to it that the change was made.
The next manager felt that his employee was asking basic questions and decided
to send the customer for training to make him more self-sufficient.
The last manager felt that his employee was justified in making so many requests.
However, the manager did appreciate knowing how much the employee relied on us
to get his job done. The employee did become more self-sufficient in future months.
Here are some other trends to look for:

- Does a customer report the same issue over and over? Why is it recurring? Does the customer need training, or is that system really that broken?

- Are there many questions in a particular category? Is that system difficult to use? Could it be redesigned or replaced, or could the documentation be improved?

- Are many customers reporting the same issue? Can they all be notified at once? Should such problems receive higher priority?

- Can some categories of requests become self-service? Often, a customer request requires an SA only because it involves privileged access, such as superuser or administrator access. Look for ways to empower customers to help themselves. Many of these requests can become self-service with a little bit of web programming. The UNIX world has the
concept of set user ID (SUID) programs, which, when properly admin-
istered, permit regular users to run a program that performs privileged
tasks but then lose the privileged access once the program is finished ex-
ecuting. Individual SUID programs can give users the ability to perform
a particular function, and SUID wrapper programs can be constructed
that gain the enhanced privilege level, run a third-party program, and
then reduce the privileges back to normal levels. Writing SUID programs
is very tricky, and mistakes can turn into security holes. Systems such
as
sudo (Snyder, Miller et al. 1986) let you manage SUID privilege on
a per user and per command basis and have been analyzed by enough
security experts to be considered a relatively safe way to provide SUID
access to regular users.

- Who are your most frequent customers? Calculate which department generates the most tickets or who has the highest average tickets per member. Calculate which customers make up your top 20 percent of requests. Do these ratios match your funding model, or are certain customer groups more “expensive” than others?

- Is a particular time-consuming request one of your frequent requests? If customers often accidentally delete files and you waste a lot of time each week restoring files from tape, you can invest time in helping the user learn about rm -i or use other safe-delete programs. Or, maybe it would be appropriate to advocate for the purchase of a system that supports snapshots or lets users do their own restores. If you can generate a report of the number and frequency of restore requests, management can make a more informed decision or decide to talk to certain users about being more careful.
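As a concrete sketch of the sudo approach to self-service, a sudoers fragment like the following grants one specific privileged command per group of customers instead of general root access. The group name and script path here are invented for illustration:

```
# /etc/sudoers fragment.
# Members of group "helpdesk" may run one specific restore script as
# root, without a password; they get no other root access.
%helpdesk  ALL = (root) NOPASSWD: /usr/local/sbin/restore-file

# Any user may run lprm as root to clear stuck print jobs.
ALL  ALL = (root) /usr/bin/lprm
```

Each entry names who may run what, as whom, so the privilege is scoped per user (or group) and per command, as described above.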
This chapter does not discuss metrics, but a system of metrics grounded
in this model might be the best way to detect areas needing improvement. The
nine-step process can be instrumented easily to collect metrics. Developing
metrics that drive the right behaviors is difficult. For example, if SAs are rated
by how quickly they close tickets, one might accidentally encourage the closer
behavior described earlier. As SAs proactively prevent problems, reported
problems will become more serious and time consuming. If average time to
completion grows, does that mean that minor problems were eliminated or
that SAs are slower at fixing all problems?[3]
14.2.6 Customers Who Know the Process
A better-educated customer is a better customer. Customers who understand
the nine steps that will be followed can be better prepared when reporting the
problem. These customers can provide more complete information when they
call, because they understand the importance of complete information in solv-
ing the problem. In gathering this information, they will have narrowed the
focus of the problem report. They might have specific suggestions on how to
reproduce the problem. They may have narrowed the problem down to a spe-
cific machine or situation. Their additional preparation may even lead them
to solve the problem on their own! Training for customers should include
explaining the nine-step process to facilitate interaction between customers
and SAs.
Preparing Customers at the Department of Motor Vehicles
Tom noticed that the New Jersey Department of Motor Vehicles had recently changed its
“on hold” message to include what four documents should be on hand if the person was
calling to renew a vehicle registration. Now, rather than waiting to speak to a person only to find out that you didn’t have, say, your insurance ID number, there was a better
chance that once connected, you had everything needed to complete the transaction.
14.2.7 Architectural Decisions That Match the Process
Architectural decisions may impede or aid the classification process. The more
complicated a system is, the more difficult it can be to identify and duplicate
the problem. Sadly, some well-accepted software design concepts, such as
delineating a system into layers, are at odds with the nine-step process. For
example, a printing problem in a large UNIX network could be a problem with DNS, the server software, the client software, a misconfigured user environment, the network, DHCP, the printer’s configuration, or even the printing hardware itself. Typically, many of those layers are maintained by separate groups of people. Diagnosing the problem accurately requires the SAs to be experts in all those technologies or requires the layers to cross-check one another.

3. Strata advocates SA-generated tickets for proactive fixes and planned projects, to make the SA contributions clearer.
You should keep in mind how a product will be supported when you are
designing a system. The electronics industry has the concept of “design for
manufacture”; we should think in terms of “design for support.”
14.3 Conclusion
This chapter is about communication. The process helps us think about how
we communicate with customers, and it gives us a base of terminology to use
when discussing our work. All professionals have a base of terminology to
use to communicate effectively with one another.
This chapter presents a formal, structured model for handling requests
from customers. The process has four phases: greeting, problem identifica-
tion, planning and execution, and fix and verify. Each phase has distinct steps,
summarized in Table 14.1.
Following this model makes the process more structured and formalized. Once it is in place, it exposes areas for improvement within your organization.
Table 14.1 Overview of Phases for Problem Solution

Phase                       Step                         Role
Phase A: “Hello!”           1. The greeting              Greeter
Phase B: “What’s wrong?”    2. Problem classification    Classifier
                            3. Problem statement         Recorder
                            4. Problem verification      Reproducer
Phase C: “Fix it”           5. Solution proposals        Subject matter expert
                            6. Solution selection        Subject matter expert
                            7. Execution                 Craft worker
Phase D: “Verify it”        8. Craft verification        Craft worker
                            9. Customer verification     Customer
You can integrate the model into training plans for SAs, as well as educate
customers about the model so they can be better advocates for themselves. The
model can be applied for the gathering of metrics. It enables trend analysis,
even if only in simple, ad hoc ways, which is better than nothing.
We cannot stress enough the importance of using helpdesk issue-tracking
software rather than trying to remember the requests in your head, using
scraps of paper, or relying on email boxes. Automation reduces the tedium
of managing incoming requests and collecting statistics. Software that tracks
tickets for you saves time in real ways. Tom once measured that a group of
three SAs was spending an hour a day per person to track issues. That is a
loss of 2 staff days per week!
The process described in this chapter brings clarity to the issue of cus-
tomer support by defining what steps must be followed for a single successful
call for help. We show why these steps are to be followed and how each step
prepares you for future steps.

Although knowledge of the model can improve an SA’s effectiveness by
leveling the playing field, it is not a panacea; nor is it a replacement for cre-
ativity, experience, or having the right resources. The model does not replace
the right training, the right tools, and the right support from management,
but it must be part of a well-constructed helpdesk.
Many SAs are naturally good at customer care and react negatively to
structured techniques like this one. We’re happy for those who have found
their own structure and use it with consistently great results. We’re sure it
has many of the rudiments discussed here. Do what works for you. To grow
the number of SAs in the field, more direct instruction will be required. For
the millions of SAs who have not found the perfect structure for themselves,
consider this structure a good starting point.
Exercises
1. Are there times when you should not use the nine-step model?
2. What are the tools used in your environment for processing customer
requests, and how do they fit into the nine-step model? Are there ways
they could fit better?
3. What are all the ways to greet customers in your environment? What
ways could you use but don’t? Why?
4. In your environment, you greet customers by various methods. How do
the methods compare by cost, speed (faster completion), and customers’
preference? Is the most expensive method the one that customers prefer
the most?
5. Some problem statements can be stated concisely, such as the routing
problem example in step 3. Dig into your trouble tracking system to find
five typically reported problems. What is the shortest problem statement
that completely describes the issue?
6. Query your ticket-tracking software and determine who were your top
ten ticket creators overall in the last 12 months; then sort by customer group or department. Then determine what customer groups have the
highest per customer ticket count. Which customers make up your top
20 percent? Now that you have this knowledge, what will you do?
Examine other queries from Section 14.2.5.
7. Which is the most important of the nine steps? Justify your answer.
Part III
Change Processes
Chapter 15
Debugging
In this chapter, we dig deeply into what is involved in debugging problems.
In Chapter 14, we put debugging into the larger context of customer care.
This chapter, on the other hand, is about you and what you do when faced
with a single technical problem.
Debugging is not simply making a change that fixes a problem. That’s
the easy part. Debugging begins by understanding the problem, finding its
cause, and then making the change that makes the problem disappear for
good. Temporary or superficial fixes, such as rebooting, that do not fix the
cause of the problem only guarantee more work for you in the future. We
continue that theme in Chapter 16.
Since anyone reading this book has a certain amount of smarts and a
level of experience,[1] we do not need to be pedantic about this topic. You’ve
debugged problems; you know what it’s like. We’re going to make you con-
scious of the finer points of the process and then discuss some ways of making
it even smoother. We encourage you to be systematic. It’s better than randomly
poking about.
15.1 The Basics

This section offers advice for correctly defining the problem, introduces two
models for finding the problem, and ends with some philosophy about the
qualities of the best tools.
1. And while we’re at it, you’re good looking and a sharp dresser.
15.1.1 Learn the Customer’s Problem
The first step in fixing a problem is to understand, at a high level, what the
customer is trying to do and what part of it is failing. In other words, the
customer is doing something and expecting a particular result, but something
else is happening instead.
For example, customers may be trying to read their email and aren’t able
to. They may report this in many ways: “My mail program is broken” or
“I can’t reach the mail server” or “My mailbox disappeared!” Any of those
statements may be true, but the problem also could be a network problem,
a power failure in the server room, or a DNS problem. These issues may be
beyond the scope of what the customer understands or should need to under-
stand. Therefore, it is important for you to gain a high-level understanding
of what the customer is trying to do.
Sometimes, customers aren’t good at expressing themselves, so care and
understanding must prevail. “Can you help me understand what the docu-
ment should look like?”
It’s common for customers to use jargon but in an incorrect way. They
believe that’s what the SA wants to hear. They’re trying to be helpful. It’s very
valid for the SA to respond along the lines of, “Let’s back up a little; what
exactly is it you’re trying to do? Just describe it without being technical.”
The complaint is usually not the problem. A customer might complain
that the printer is broken. This sounds like a request to have the printer
repaired. However, an SA who takes time to understand the entire situation
might learn that the customer needs to have a document printed before a shipping deadline. In that case, it becomes clear that the customer’s complaint
isn’t about the hardware but about needing to print a document. Printing to
a different printer becomes a better solution.
Some customers provide a valuable service by digging into the problem
before reporting it. A senior SA partners with these customers but understands
that there are limits. It can be nice to get a report such as, “I can’t print; I
think there is a DNS problem.” However, we do not believe in taking such
reports at face value. You understand the system architecture better than the
customers do, so you still need to verify the DNS portion of the report, as
well as the printing problem. Maybe it is best to interpret the report as two
possibly related reports: Printing isn’t working for a particular customer, and
a certain DNS test failed. For example, a printer’s name may not be in DNS,
depending on the print system architecture. Often, customers will ping a host’s
name to demonstrate a routing problem but overlook the error message and
the fact that the DNS lookup of the host’s name failed.
Find the Real Problem
One of Tom’s customers was reporting that he couldn’t ping a particular server
located about 1,000 miles away in a different division. He provided traceroutes, ping
information, and a lot of detailed evidence. Rather than investigating potential DNS,
routing, and networking issues, Tom stopped to ask, “Why do you require the abil-
ity to ping that host?” It turned out that pinging that host wasn’t part of the person’s
job; instead, the customer was trying to use that host as an authentication server. The
problem to be debugged should have been, “Why can’t I authenticate off this host?”
or even better, “Why can’t I use service A, which relies on the authentication server on
host B?”
By contacting the owner of the server, Tom found the very simple answer: Host B
had been decommissioned, and properly configured clients should have automatically
started authenticating off a new server. This wasn’t a networking issue at all but a
matter of client configuration. The customer had hard-coded the IP address into his configuration but shouldn’t have. A lot of time would have been wasted if the problem
had been pursued as originally reported.
In short, a reported problem is about an expected outcome not happen-
ing. Now let’s look at the cause.
15.1.2 Fix the Cause, Not the Symptom
To build sustainable reliability, you must find and fix the cause of the problem,
not simply work around the problem or find a way to recover from it quickly.
Although workarounds and quick recovery times are good things, fixing the
root cause of a problem is better.
Often, we find ourselves in a situation like this: A coworker reports that
there was a problem and that he fixed it. “What was the problem?” we
inquire.
“The host needed to be rebooted.”
“What was the problem?”
“I told you! The host needed to be rebooted.”
A day later, the host needs to be rebooted again.
A host needing to be rebooted isn’t a problem but rather a solution. The
problem might have been that the system froze, buggy device drivers were
malfunctioning, a kernel process wasn’t freeing memory and the only choice
was to reboot, and so on. If the SA had determined what the true problem
was, it could have been fixed for good and would not have returned.
The same goes for “I had to exit and restart an application or service” and
other mysteries. Many times, we’ve seen someone fix a “full-disk” situation
by deleting old log files. However, the problem returns as the log files grow
again. Deleting the log files fixes the symptoms, but activating a script that
would rotate and automatically delete the logs would fix the problem.
Even large companies get pulled into fixing symptoms instead of root
causes. For example, Microsoft got a lot of bad press when it reported that
a major feature of Windows 2000 would be to make it reboot faster. In fact, users would prefer that it didn’t need to be rebooted so often in the
first place.
15.1.3 Be Systematic
It is important to be methodical, or systematic, about finding the cause and
fixing it. To be systematic, you must form hypotheses, test them, note the
results, and make changes based on those results. Anything else is simply
making random changes until the problem goes away.
The process of elimination and successive refinement are commonly used
in debugging. The process of elimination entails removing different parts of
the system until the problem disappears. The problem must have existed in
the last portion removed. Successive refinement involves adding components
to the system and verifying at each step that the desired change happens.
The process of elimination is often used when debugging a hardware
problem, such as replacing memory chips until a memory error is elimi-
nated or pulling out cards until a machine is able to boot. Elimination is
used with software applications, such as eliminating potentially conflict-
ing drivers or applications until a failure disappears. Some OSs have tools
that search for possible conflicts and provide test modes to help narrow
the search.
Successive refinement is an additive process. To diagnose an IP routing
problem,
traceroute reports connectivity one network hop away and then
reports connectivity two hops away, then three, four, and so on. When the
probes no longer return a result, we know that the router at that hop wasn’t
able to return packets. The problem is at that last router. When connectivity
exists but there is packet loss, a similar methodology can be used. You can
send many packets to the next router and verify that there is no packet loss.
You can successively refine the test by including more distant routers until
the packet loss is detected. You can then assert that the loss is on the most recently added segment.
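As a sketch of successive refinement with traceroute, the probes succeed hop by hop until they reach the failing router. The host names and timings below are invented for illustration:

```
$ traceroute fileserver.example.com
 1  gw-floor2.example.com   0.5 ms   0.4 ms   0.5 ms
 2  core1.example.com       1.2 ms   1.1 ms   1.2 ms
 3  * * *
 4  * * *
```

Connectivity is confirmed one hop away, then two; at hop 3, the probes no longer return, so the router at that hop is the first one unable to return packets, and that is where to look.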
Sometimes, successive refinement can be thought of as follow-the-path
debugging. To do this, you must follow the path of the data or the problem,
reviewing the output of each process to make sure that it is the proper input
to the next stage. It is common on UNIX systems to have an assembly-line
approach to processing. One task generates data, the others modify or process
the data in sequence, and the final task stores it. Some of these processes may
happen on different machines, but the data can be checked at each step. For
example, if each step generates a log file, you can monitor the logs at each
step. When debugging an email problem that involves a message going from
one server to a gateway and then to the destination server, you can watch
the logs on all three machines to see that the proper processing has happened
at each place. When tracing a network problem, you can use tools that let
you snoop packets as they pass through a link to monitor each step. Cisco
routers have the ability to collect packets on a link that match a particular
set of qualifications, UNIX systems have tcpdump, and Windows systems have Ethereal. When dealing with UNIX software that uses the shell | (pipe) facility to send data through a series of programs, the tee command can save a copy at each step.
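A sketch of the tee technique, using a toy three-stage pipeline invented for illustration (normalize case, extract the first field, de-duplicate):

```shell
# tee saves the data flowing between stages so each step's output can
# be checked against what the next stage expects as input.
printf 'Host1 up\nhost2 down\nHOST1 up\n' \
  | tr 'A-Z' 'a-z'   | tee /tmp/stage1.out \
  | awk '{print $1}' | tee /tmp/stage2.out \
  | sort -u > /tmp/final.out

cat /tmp/final.out
```

If the final output is wrong, inspecting /tmp/stage1.out and /tmp/stage2.out shows which stage introduced the error.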
Shortcuts and optimizations can be used with these techniques. Based
on past experience, you might skip a step or two. However, this is often a
mistake, because you may be jumping to a conclusion.
We’d like to point out that if you are going to jump to a conclusion, the
problem is often related to the most recent change made to the host, network,
or whatever is having a problem. This usually indicates a lack of testing.
Therefore, before you begin debugging a problem, ponder for a moment what changes were made recently: Was a new device plugged into the network?
What was the last configuration change to a host? Was anything changed on
a router or a firewall? Often, the answers direct your search for the cause.
15.1.4 Have the Right Tools
Debugging requires the right diagnostic tools. Some tools are physical devices;
others, software tools that are either purchased or downloaded; and still
others, homegrown. However, knowledge is the most important tool.
Diagnostic tools let you see into a device or system to see its inner work-
ings. However, if you don’t know how to interpret what you see, all the data
in the world won’t help you solve the problem.
Training usually involves learning how the system works, how to view
its inner workings, and how to interpret what you see. For example, when
training someone how to use an Ethernet monitor (sniffer), teaching the
person how to capture packets is easy. Most of the training time is spent
explaining how various protocols work so that you understand what you
see. Learning the tool is easy. Getting a deeper understanding of what the
tool lets you see takes considerably longer.
UNIX systems have a reputation for being very easy to debug, most likely because so many experienced UNIX SAs have in-depth knowledge of the inner workings of the system. Such knowledge is easily gained. UNIX systems come with documentation about their internals; early users had access to the source code itself. Many books dissect the source code of UNIX kernels
(Lions 1996; McKusick, Bostic, and Karels 1996; Mauro and McDougall
2000) for educational purposes. Much of the system is driven by scripts that
can be read easily to gain an understanding of what is going on behind the scenes.
Microsoft Windows developed an early reputation for being a difficult
system to debug when problems arose. Rhetoric from the UNIX community
claimed that it was a black box with no way to get the information you needed
to debug problems on Windows systems. In reality, there were mechanisms,
but the only way to learn about them was through vendor-supplied training.
From the perspective of a culture that is very open about information, this
was difficult to adjust to. It took many years to disseminate the information
about how to access Windows’ internals and how to interpret what was
found.
Know Why a Tool Draws a Conclusion
It is important to understand not only the system being debugged but also the tools being
used to debug it. Once, Tom was helping a network technician with a problem: A PC
couldn’t talk to any servers, even though the link light was illuminated. The technician
disconnected the PC from its network jack and plugged in a new handheld device that
could test and diagnose a large list of LAN problems. However, the output of the device
was a list of conclusions without information about how it was arriving at them. The
technician was basing his decisions on the output of this device without question. Tom
kept asking, “The device claims it is on network B, but how did it determine that?” The
technician didn’t know or care. Tom stated, “I don’t think it is really on network B!
Network B and C are bridged right now, so if the network jack were working, it should
claim to be on network B and C at the same time.” The technician disagreed, because
the very expensive tool couldn’t possibly be wrong, and the problem must be with
the PC.
It turned out that the tool was guessing the IP network after finding a single host on
the LAN segment. This jack was connected to a hub, which had another workstation
connected to it in a different office. The uplink from the hub had become disconnected
from the rest of the network. Without knowing how the tool performed its tests, there was no way to determine why a tool would report such a claim, and further debugging
would have been a wild goose chase. Luckily, there was a hint that something was
suspicious—it didn’t mention network C. The process of questioning the conclusion
drew them to the problem’s real cause.
What makes a good tool? We prefer minimal tools over large, complicated
tools. The best tool is one that provides the simplest solution to the problem
at hand. The more sophisticated a tool is, the more likely it will get in its own
way or simply be too big to carry to the problem.
NFS mounting problems can be debugged with three simple tools:
ping,
traceroute, and rpcinfo. Each does one thing and does that one thing well.
If the client can’t mount from a particular server, make sure that they can
ping each other. If they can’t, it’s a network problem, and
traceroute can
isolate the problem. If
ping succeeded, connectivity is good, and there must
be a protocol problem. From the client, the elements of the NFS protocol
can be tested with rpcinfo.[2] You can test the portmap function, then mountd, nfs, nlockmgr, and status. If any of them fail, you can deduce
that the appropriate service isn’t working. If all of them succeed, you can
deduce that it is an export permission problem, which usually means that
the name of the host listed in the export list is not exactly what the server
sees when it performs a reverse DNS lookup. These are extremely powerful
diagnostics that are done with extremely simple tools. You can use
rpcinfo
for all Sun RPC-based protocols (Stern 1991).
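Put together, a session with this triad might look like the following. The server name is invented, and the output lines show the usual rpcinfo success message for each service:

```
$ ping nfsserver                    # basic connectivity first
nfsserver is alive
$ rpcinfo -u nfsserver portmap      # then each RPC service in turn
program 100000 version 2 ready and waiting
$ rpcinfo -u nfsserver mountd
program 100005 version 1 ready and waiting
$ rpcinfo -u nfsserver nfs
program 100003 version 2 ready and waiting
$ rpcinfo -u nfsserver nlockmgr
program 100021 version 1 ready and waiting
$ rpcinfo -u nfsserver status
program 100024 version 1 ready and waiting
```

If ping fails, run traceroute to isolate the network problem; if every rpcinfo probe succeeds yet the mount still fails, suspect the export list, as described above.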

2. For example, rpcinfo -T udp servername portmap in Solaris or rpcinfo -u servername portmap in Linux.

Protocols based on TCP often can be debugged with a different triad of tools: ping, traceroute/tracert, and telnet. These tools are available on every platform that supports TCP/IP (Windows, UNIX, and others). Again,
ping and traceroute can diagnose connectivity problems. Then telnet can
be used to manually simulate many TCP-based protocols. For example, email
administrators know enough of SMTP (Crocker 1982) to TELNET to port
25 of a host and type the SMTP commands as if they were the client; you can
diagnose many problems by watching the results. Similar techniques work
for NNTP (Kantor and Lapsley 1986), FTP (Postel and Reynolds 1985), and
other TCP-based protocols. TCP/IP Illustrated, Volume 1, by W. Richard
Stevens (Stevens 1994) provides an excellent view into how the protocols
work.
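Such a manual SMTP session might look like this; the host name and addresses are invented, and the lines you type alternate with the server’s numbered replies:

```
$ telnet mailhost 25
220 mailhost ESMTP
HELO client.example.com
250 mailhost
MAIL FROM:<sa@example.com>
250 ok
RCPT TO:<user@example.com>
250 ok
DATA
354 End data with <CR><LF>.<CR><LF>
Subject: test

test message body
.
250 ok: queued
QUIT
221 mailhost
```

A 4xx or 5xx reply at any step, for example 550 in response to RCPT TO, shows exactly which stage of delivery is failing.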
Sometimes, the best tools are simple homegrown tools or the combination
of other small tools and applications, as the following anecdote shows.
Find the Latency Problem
Once, Tom was tracking reports of high latency on a network link. The problem happened only occasionally. He set up a continuous (once per second) ping between two machines that should demonstrate the problem and recorded this output for several hours.
He observed consistently good (low) latency, except that occasionally, there seemed to
be trouble. A small
perl program was written to analyze the logs and extract pings
with high latency—latency more than three times the average of the first 20 pings—and
highlight missed pings. He noticed that no pings were being missed but that every so
often, a series of pings took much longer to arrive. He used a spreadsheet to graph the
latency over time. Visualizing the results helped him notice that the problem occurred
every 5 minutes, within a second or two. It also happened at other times, but every
5 minutes, he was assured of seeing the problem. He realized that some protocols do
certain operations every 5 minutes. Could a route table refresh be overloading the CPU
of a router? Was a protocol overloading a link?
By process of elimination, he isolated the problem to a particular router. Its CPU was
being overloaded by routing table calculations, which happened every time there was
a real change to the network plus every 5 minutes during the usual route table refresh.
This agreed with the previously collected data. The fact that it was an overloaded CPU
and not an overloaded network link explained why latency increased but no packets
were lost. The router had enough buffering to ensure that no packets were dropped.
Once he fixed the problem with the router, the ping test and log analysis were used again
to demonstrate that the problem had been fixed.
The customer who had reported the problem was a scientist with a particularly condescending attitude toward SAs. After the SAs confirmed with him that the problem had been resolved, they showed him their methodology, including the graphs of timing data. His attitude improved significantly once he developed respect for their methods.
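The log analysis in this anecdote was a small perl program; here is a sketch of the same idea in Python, assuming typical Unix ping output lines:

```python
# Sketch: find high-latency and missing pings in saved ping output.
# Baseline latency is the average of the first 20 pings; anything more
# than three times the baseline is flagged, and gaps in icmp_seq are
# reported as missed pings. The regex assumes typical Unix ping output.
import re

PING_LINE = re.compile(r"icmp_seq=(\d+) .*time=([\d.]+) ms")

def analyze(log_lines, baseline_count=20, factor=3.0):
    pings = []                            # (sequence number, latency in ms)
    for line in log_lines:
        m = PING_LINE.search(line)
        if m:
            pings.append((int(m.group(1)), float(m.group(2))))
    if not pings:
        return 0.0, [], []
    head = pings[:baseline_count]
    baseline = sum(t for _, t in head) / len(head)
    slow = [(seq, t) for seq, t in pings if t > factor * baseline]
    seqs = [seq for seq, _ in pings]
    missed = sorted(set(range(seqs[0], seqs[-1] + 1)) - set(seqs))
    return baseline, slow, missed
```

Feeding it hours of output collected with something like ping host | tee log makes periodic spikes easy to spot, especially once the flagged latencies are graphed over time.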
Regard Tuning as Debugging
Six classes of bugs limit network performance:
1. Packet losses, corruption, congestion, bad hardware
2. IP routing, long round-trip times
3. Packet reordering
4. Inappropriate buffer space
5. Inappropriate packet sizes
6. Inefficient applications
Any one of these problems can hide all other problems. This is why solving performance problems requires a high level of expertise. Because debugging tools are rarely very good, it is "akin to finding the weakest link of an invisible chain" (Mathis 2003).
Therefore, if you are debugging any of these problems and are not getting anywhere,
pause a moment and consider that it might be one of the other problems.
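Two of these classes, losses and reordering, can be spotted directly from the sequence numbers in a packet trace. A minimal sketch; real diagnosis would parse the numbers out of an actual capture:

```python
# Sketch: given packet sequence numbers in arrival order, estimate loss
# (sequence numbers never seen) and reordering (arrivals whose sequence
# number is lower than one already seen).

def loss_and_reordering(arrivals):
    seen = set()
    highest = None
    reordered = 0
    for seq in arrivals:
        if highest is not None and seq < highest:
            reordered += 1                # arrived after a later packet
        if highest is None or seq > highest:
            highest = seq
        seen.add(seq)
    expected = set(range(min(arrivals), max(arrivals) + 1))
    lost = sorted(expected - seen)
    return lost, reordered

print(loss_and_reordering([1, 2, 5, 4, 6, 8]))   # packets 3 and 7 lost; one reordering
```

Separating the two matters: loss points at classes 1 and 4, whereas reordering (class 3) often implicates multipath routing rather than a bad link.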
15.2 The Icing
The icing improves on those basics: better tools, better knowledge about how to use the tools, and a better understanding of the system being debugged.
15.2.1 Better Tools
Better tools are, well, better! There is always room for new tools that improve
on the old ones. Keeping up to date on the latest tools can be difficult, preventing you from being an early adopter of new technology. Several forums,
such as USENIX and SAGE conferences, as well as web sites and mailing
lists, can help you learn of these new tools as they are announced.
We are advocates for simple tools. Improved tools need not be more
complex. In fact, a new tool sometimes brings innovation through its
simplicity.
When evaluating new tools, assess them based on what problems they
can solve. Try to ignore the aspects that are flashy, buzzword-compliant,
and full of hype. Buzzword-compliant is a humorous term meaning that the product claims compliance with all the current industry buzzwords one might see in the headlines of trade magazines, whether or not such compliance has any benefit.
Ask, “What real-life problem will this solve?” It is easy for salespeople
to focus you on flashy, colorful output, but does the flash add anything to
the utility of the product? Is the color used intelligently to direct the eye at
important details, or does it simply make the product pretty? Are any of the
buzzwords relevant? Sure, it supports SNMP, but will you integrate it into
your SNMP monitoring system? Or is SNMP simply used for configuring the
device?
Ask for an evaluation copy of the tool, and make sure that you have time
to use the tool during the evaluation. Don’t be afraid to send it back if you
didn't find it useful. Salespeople have thick skins, and the feedback you give will help them make the product better in future releases.
15.2.2 Formal Training on the Tools
Although manuals are great, formal training can be the icing that sets you
apart from others. Formal training has a number of benefits.
• Off-site training is usually provided, which takes you away from the interruptions of your job and lets you focus on learning new skills.

• Formal training usually covers all the features, not only the ones with which you've had time to experiment.

• Instructors often will reveal bugs or features that the vendor may not want revealed in print.

• Often, you have access to a lab of machines where you can try things that, because of production requirements, you couldn't otherwise try.

• You can list the training on your resume; this can be more impressive to prospective employers than actual experience, especially if you receive certification.
15.2.3 End-to-End Understanding of the System
Finally, the ultimate debugging icing is to have at least one person who understands, end to end, how the system works. On a small system, that's easy. As
systems grow larger and more complex, however, people specialize and end
up knowing only their part of the system. Having someone who knows the
entire system is invaluable when there is a major outage. In a big emergency,
it can be best to assemble a team of experts, each representing one layer of
the stack.
Case Study: Architects
How do you retain employees who have this kind of end-to-end knowledge? One way
is to promote them.
Synopsys had "architect" positions in each technology area, held by this kind
of end-to-end person. The architects knew more than simply their technology area
in depth, and they were good crossover people. Their official role was to track the
industry direction: predict needs and technologies 2 to 5 years out and start preparing
for them (prototyping, getting involved with vendors as alpha/beta customers, and
helping to steer the direction of the vendors’ products); architecting new services;
watching what was happening in the group; steering people toward smarter, more
scalable solutions; and so on. This role ensured that such people were around when
end-to-end knowledge was required for debugging major issues.
Mystery File Deletes
Here’s an example of a situation in which end-to-end knowledge was required to fix
a problem. A customer revealed that some of his files were disappearing. To be more
specific, he had about 100MB of data in his home directory, and all but 2MB had
disappeared. He had restored his files. The environment had a system that let users
restore files from backups without SA intervention. However, a couple of days later,
the same thing happened; this time, a different set of files remained. Again, the files
remaining totaled 2MB. He then sheepishly revealed that this had been going on for a
couple of weeks, but he found it convenient to restore his own files and felt embarrassed
to bother the SAs with such an odd problem.
The SA’s first theory was that there was a virus, but virus scans revealed nothing. The
next theory was that someone was playing pranks on him or that there was a badly
written cron job. He was given pager numbers to call the moment his files disappeared
again. Meanwhile, network sniffers were put in place to monitor who was deleting files
on that server. The next day, the customer alerted the SAs that his files were disappearing.
“What was the last thing you did?” Well, he had simply logged in to a machine in a
lab to surf the web. The SAs were baffled. The network-monitoring tools showed that
the deletions were not coming from the customer’s PC or from a rogue machine or
misprogrammed server. The SAs had done their best to debug the problem using their
knowledge of their part of the system, yet the problem remained unsolved.
Suddenly, one of the senior SAs with end-to-end knowledge of the system, including both Windows and UNIX and all the various protocols involved, realized that web
browsers keep a cache that gets pruned to stay lower than a certain limit, often 2MB.
Could the browser on this machine be deleting the files? Investigation revealed that
the lab machine was running a web browser configured with an odd location for its
cache. The location was fine for some users, but when this user logged in, the location
was equivalent to his home directory because of a bug (or feature?) related to how
Windows parsed directory paths that involved nonexistent subdirectories. The browser
was finding a cache with 100MB of data and deleting files until the space used was less
than 2MB. That explained why every time the problem appeared, a different set of files
remained. After the browser’s configuration was fixed, the problem disappeared.
The initial attempts at solving the problem—virus scans, checking for cron jobs,
watching protocols—had proved fruitless because they were testing the parts. The problem was solved only by someone having end-to-end understanding of the system.
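The pruning behavior at the heart of this story is simple to sketch: a cache kept under a size limit deletes its oldest entries first, so whichever files happen to be newest survive, which is why a different set of files remained each time. (The file names, sizes, and 2MB limit below are illustrative.)

```python
# Sketch of size-limited cache pruning: delete the oldest entries until
# the total size is at or below the limit, and return what survives.
# Pointed at a 100MB home directory with a 2MB limit, this deletes
# almost everything, and the surviving set depends only on timestamps.

def prune(entries, limit_bytes):
    """entries: list of (name, size_bytes, mtime); returns surviving names."""
    survivors = sorted(entries, key=lambda e: e[2])   # oldest first
    total = sum(size for _, size, _ in survivors)
    while survivors and total > limit_bytes:
        _, size, _ = survivors.pop(0)                 # "delete" the oldest
        total -= size
    return {name for name, _, _ in survivors}
```

Run against the same directory twice with different access times, the function keeps two different subsets, matching the symptom the customer reported.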
Knowledge of Physics Sometimes Helps
Sometimes, even having end-to-end knowledge of a system is insufficient. In two famous
cases, knowledge of physics was required to track down the root cause of a problem.
The Cuckoo’s Egg (Stoll 1989) documents the true story of how Cliff Stoll tracked
down an intruder who was using his computer system. By monitoring network delay
and applying some physics calculations, Stoll was able to accurately predict where in
the world the intruder was located. The book reads like a spy novel but is all real!
The famous story “The Case of the 500-Mile Email” and associated FAQ (Harris
2002) documents Trey Harris’s effort to debug a problem that began with a call from the
chairman of his university’s statistics department, claiming, “We can’t send mail more
than 500 miles.” After explaining that “email really doesn’t work that way,” Harris
began a journey that discovered that, amazingly enough, this in fact was the problem.
A timeout was set too low, causing failures when connecting to servers far enough away that the round-trip delay exceeded it. The distance that light travels in that time, 3 millilightseconds, is about 558 miles.
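The arithmetic is easy to check. A sketch, using the speed of light in vacuum (signals in real cable are slower, so the true cutoff is shorter still):

```python
# Sketch of the arithmetic behind the 500-mile limit: how far does
# light travel in the roughly 3-millisecond timeout?
C_KM_PER_S = 299_792.458      # speed of light in vacuum, km/s
KM_PER_MILE = 1.609_344

timeout_s = 0.003             # the too-low timeout: about 3 milliseconds
distance_km = C_KM_PER_S * timeout_s
distance_miles = distance_km / KM_PER_MILE
print(round(distance_miles))  # 559; the story rounds to "about 558 miles"
```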
15.3 Conclusion
Every SA debugs problems and typically develops a mental catalog of standard
solutions to common problems. However, debugging should be a systematic,
or methodical, process that involves understanding what the customer is trying to do and fixing the root cause of the problem, rather than smoothing
over the symptoms. Some debugging techniques are subtractive—process of
elimination—and others are additive—successive refinement. Fixing the root
cause is important because if the root problem isn’t repaired, the problem
will recur, thus generating more work for the SA.
Although this chapter strongly stresses fixing the root problem as soon as
possible, you must sometimes provide a workaround quickly and return later
to fix the root problem (see Chapter 14). For example, you might prefer quick
fixes during production hours and have a maintenance window (Chapter 20)
reserved for more permanent and disruptive fixes.
Better tools let you solve problems more efficiently without adding undue
complexity. Formal training on a tool provides knowledge and experience
that you cannot get from a manual. Finally, in a major outage or when a
problem seems peculiar, nothing beats one or more people who together have
end-to-end knowledge of the system.
Simple tools can solve big problems. Complicated tools sometimes
obscure how they draw conclusions.
Debugging is often a communication process between you and your customers. You must gain an understanding of the problem in terms of what the customer is trying to accomplish, as well as the symptoms discovered so far.
Exercises
1. Pick a technology that you deal with as part of your job function. Name
the debugging tools you use for that technology. For each tool, is it
homegrown, commercial, or free? Is it simple? Can it be combined with
other tools? What formal or informal training have you received on this
tool?
2. Describe a recent technical problem that you debugged and how you
resolved it.
3. In an anecdote in Section 15.1.4, the customer was impressed by the methodology used to fix his problem. How would the situation be different if the customer were a nontechnical manager rather than a scientist?
4. What tools do you not have that you wish you did? Why?
5. Pick a tool you use often. In technical terms, how does it do its job?
Chapter 16
Fixing Things Once
Fixing something once is better than fixing it over and over again. Although this sounds obvious, it sometimes isn't possible, given other constraints; sometimes you find yourself fixing something over and over without realizing it; and sometimes the quick fix is simply emotionally easier. By being conscious of these things, you can achieve several goals. First, you can manage your time better. Second, you can become a better SA. Third, if necessary, you can explain better to the customer why you are taking longer than expected to fix something.
Chapter 15 described a systematic process for debugging a problem. This
chapter is about a general day-to-day philosophy.
16.1 The Basics
One of our favorite mantras is “fix things once.” If something is broken,
it should be fixed once, such that it won’t return. If a problem is likely to
appear on other machines, it should be tested for and repaired on all other
machines.
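That "test for it everywhere" step is a pattern worth automating. A sketch, in which the inventory, the check, and the repair are all placeholders; in practice they might wrap ssh commands or a configuration-management run:

```python
# Sketch of "fix it everywhere": run a check against every host in the
# inventory and apply the repair only where the check fails. The host
# names and the check/repair callables here are invented stand-ins.

def fix_everywhere(hosts, is_broken, repair):
    """Return the list of hosts that needed (and received) the fix."""
    fixed = []
    for host in hosts:
        if is_broken(host):
            repair(host)
            fixed.append(host)
    return fixed

# Example with a fake inventory: two of three hosts have the bad setting.
state = {"web1": "bad", "web2": "ok", "db1": "bad"}
repaired = fix_everywhere(
    state,
    is_broken=lambda h: state[h] == "bad",
    repair=lambda h: state.update({h: "ok"}),
)
print(repaired)          # the hosts that were actually repaired
```

Keeping the check separate from the repair also lets you rerun the check afterward to verify that the fix took hold everywhere.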
16.1.1 Don’t Waste Time
Sometimes, particularly for something that seems trivial or that affects you only for the time being, it can seem easier to do a quick fix that doesn't solve the problem permanently. It may not even cross your mind that you are fixing the same problem multiple times when you could have fixed it once with a little more effort.