Chapter 9: Testing
Embedded systems software testing shares much in common with application
software testing. Thus, much of this chapter is a summary of basic testing
concepts and terminology. However, some important differences exist between
application testing and embedded systems testing. Embedded developers often
have access to hardware-based test tools that are generally not used in application
development. Also, embedded systems often have unique characteristics that
should be reflected in the test plan. These differences tend to give embedded
systems testing its own distinctive flavor. This chapter covers the basics of testing
and test case development and points out details unique to embedded systems
work along the way.
Why Test?
Before you begin designing tests, it’s important to have a clear understanding of
why you are testing. This understanding influences which tests you stress and
(more importantly) how early you begin testing. In general, you test for four
reasons:
To find bugs in software (testing is the only way to do this)
To reduce risk to both users and the company
To reduce development and maintenance costs
To improve performance
To Find the Bugs
One of the earliest important results from theoretical computer science is a proof
(the undecidability of the Halting Problem) which implies that no general procedure
can prove an arbitrary program correct. Given the right test, however, you can prove that a program is
incorrect (that is, it has a bug). It’s important to remember that testing isn’t about
proving the “correctness” of a program but about finding bugs. Experienced
programmers understand that every program has bugs. The only way to know how
many bugs are left in a program is to test it with a carefully designed and
measured test plan.
To Reduce Risk
Testing minimizes risk to yourself, your company, and your customers. The
objective in testing is to demonstrate to yourself (and regulatory agencies, if
appropriate) that the system and software work correctly and as designed. You
want to be assured that the product is as safe as it can be. In short, you want to
discover every conceivable fault or weakness in the system and software before
it’s deployed in the field.
Developing Mission-Critical Software Systems
Incidents such as the Therac-25 radiation machine malfunction — in which several
patients died due to a failure in the software monitoring the patients — should
serve as a sobering reminder that the lives of real people might depend on the
quality of the code that you write. I’m not an expert on writing safety-critical code,
but I’ve identified some interesting articles on mission-critical software
development:
Brown, Doug. “Solving the Software Safety Paradox.” Embedded
Systems Programming, December 1998, 44.
Cole, Bernard. “Reliability Becomes an All-Consuming Goal.” Electronic
Engineering Times, 13 December 1999, 90.
Douglass, Bruce Powel. “Safety-Critical Embedded Systems.”
Embedded Systems Programming, October 1999, 76.
Knutson, Charles and Sam Carmichael. “Safety First: Avoiding Software
Mishaps.” Embedded Systems Programming, November 2000, 28.
Murphy, Niall. “Safe Systems Through Better User Interfaces.”
Embedded Systems Programming, August 1998, 32.
Tindell, Ken. “Real-Time Systems Raise Reliability Issues.” Electronic
Engineering Times, 17 April 2000, 86.
To Reduce Costs
The classic argument for testing comes from Quality Wars by Jeremy Main.
In 1990, HP sampled the cost of errors in software development during the year.
The answer, $400 million, shocked HP into a completely new effort to eliminate
mistakes in writing software. The $400M waste, half of it spent in the labs on
rework and half in the field to fix the mistakes that escaped from the labs,
amounted to one-third of the company’s total R&D budget…and could have
increased earnings by almost 67%.[5]
The earlier a bug is found, the less expensive it is to fix. The cost of finding errors
and bugs in a released product is significantly higher than during unit testing, for
example (see Figure 9.1).
Figure 9.1: The cost to fix a problem.
Simplified graph showing the cost to fix a problem as a function of the
time in the product life cycle when the defect is found. The costs
associated with finding and fixing the Y2K problem in embedded systems
are a close approximation to an infinite cost model.
To Improve Performance
Testing maximizes the performance of the system. Finding and eliminating dead
code and inefficient code can help ensure that the software uses the full potential
of the hardware and thus avoids the dreaded “hardware re-spin.”
When to Test?
It should be clear from Figure 9.1 that testing should begin as soon as feasible.
Usually, the earliest tests are module or unit tests conducted by the original
developer. Unfortunately, few developers know enough about testing to build a
thorough set of test cases. Because carefully developed test cases are usually not
employed until integration testing, many bugs that could be found during unit
testing are not discovered until integration testing. For example, a major network
equipment manufacturer in Silicon Valley did a study to figure out the key sources
of its software integration problems. The manufacturer discovered that 70 percent
of the bugs found during the integration phase of the project were generated by
code that had never been exercised before that phase of the project.
Unit Testing
Individual developers test at the module level by writing stub code to substitute for
the rest of the system hardware and software. At this point in the development
cycle, the tests focus on the logical performance of the code. Typically, developers
test with some average values, some high or low values, and some out-of-range
values (to exercise the code’s exception processing functionality). Unfortunately,
these “black-box” derived test cases are seldom adequate to exercise more than a
fraction of the total code in the module.
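As a minimal sketch of this kind of unit test, written in Python for brevity, the module function below is exercised against a stub that stands in for an ADC driver. The names StubAdc and scale_reading, the 10-bit range, and the 3300 mV full scale are all assumptions made for illustration; they do not come from any particular project.

```python
class StubAdc:
    """Stub standing in for the real ADC driver (hypothetical interface)."""

    def __init__(self, raw_value):
        self.raw_value = raw_value

    def read(self):
        # The real driver would touch hardware; the stub returns a canned value.
        return self.raw_value


def scale_reading(adc, full_scale_mv=3300, max_raw=1023):
    """Module under test: convert a raw ADC count to millivolts."""
    raw = adc.read()
    if raw < 0 or raw > max_raw:
        # Exception path for out-of-range input.
        raise ValueError(f"raw reading {raw} out of range")
    return raw * full_scale_mv // max_raw


def run_unit_tests():
    """Try average, extreme, and out-of-range values; return any failures."""
    failures = []
    # (raw input, expected millivolts): one average value and both extremes.
    for raw, expected in [(512, 1651), (0, 0), (1023, 3300)]:
        actual = scale_reading(StubAdc(raw))
        if actual != expected:
            failures.append((raw, expected, actual))
    # Out-of-range values must trigger the exception path.
    for raw in (-1, 1024):
        try:
            scale_reading(StubAdc(raw))
            failures.append((raw, "ValueError", "no exception"))
        except ValueError:
            pass
    return failures


print("failures:", run_unit_tests())
```

Even this small set of cases is chosen purely from the function's interface, which is why such tests can leave much of a larger module unexecuted.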
Regression Testing
It isn’t enough to pass a test once. Every time the program is modified, it should
be retested to assure that the changes didn’t unintentionally “break” some
unrelated behavior. Called regression testing, these tests are usually automated
through a test script. For example, if you design a set of 100 input/output (I/O)
tests, the regression test script would automatically execute the 100 tests and
compare the output against a “gold standard” output suite. Every time a change is
made to any part of the code, the full regression suite runs on the modified code
base to ensure that something else wasn’t broken in the process.
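A regression script of this kind can be sketched in a few lines. The example below is only an illustration: the checksum function stands in for the code under test, and the test names and gold values are hypothetical.

```python
def run_regression(tests, gold):
    """Run every test and compare its output with the recorded gold output.

    tests: dict mapping a test name to a zero-argument callable.
    gold:  dict mapping the same names to the expected ("gold standard") output.
    Returns a sorted list of the names of tests whose output differs.
    """
    regressions = []
    for name, test in tests.items():
        actual = test()
        if name not in gold or actual != gold[name]:
            regressions.append(name)
    return sorted(regressions)


# Hypothetical code under test: an 8-bit additive checksum.
def checksum(data):
    return sum(data) % 256


tests = {
    "empty": lambda: checksum(b""),
    "single": lambda: checksum(b"\x01"),
    "wrap": lambda: checksum(b"\xff\x02"),
}
# Outputs recorded from a known-good build.
gold = {"empty": 0, "single": 1, "wrap": 1}

failed = run_regression(tests, gold)
print("regressions:", failed)
```

Running the same script after every change, and treating any non-empty result as a failed build, is what turns a collection of tests into a regression suite.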
From the Trenches
I try to convince my students to apply regression testing to their course projects;
however, because they are students, they never listen to me. I’ve had more than a
few projects turned in that didn’t work because the student made a minor change
at 4:00AM on the day it was due, and the project suddenly unraveled. But, hey,
what do I know?
Which Tests?
Because no practical set of tests can prove a program correct, the key issue
becomes what subset of tests has the highest probability of detecting the most
errors, as noted in The Art of Software Testing by Glenford Myers[6]. The
problem of selecting appropriate test cases is known as test case design.
Although dozens of strategies exist for generating test cases, they tend to fall into
two fundamentally different approaches: functional testing and coverage testing.
Functional testing (also known as black-box testing) selects tests that assess how
well the implementation meets the requirements specification. Coverage testing
(also known as white-box testing) selects cases that cause certain portions of the
code to be executed. (These two strategies are discussed in more detail later.)
Both kinds of testing are necessary to test your embedded design rigorously. Of
the two, coverage testing implies that your code is stable, so it is reserved for
testing a completed or nearly completed product. Functional tests, on the other
hand, can be written in parallel with the requirements documents. In fact, by
starting with the functional tests, you can minimize any duplication of efforts and
rewriting of tests. Thus, in my opinion, functional tests come first. Everyone agrees
that functional tests can be written first, but Ross[7], for example, clearly believes
they are most useful during system integration … not unit testing.
The following is a simple algorithm for integrating your functional and
coverage testing strategies:
1. Identify which of the functions have not been fully covered by the
functional tests.
2. Identify which sections of each function have not been executed.
3. Identify which additional coverage tests are required.
4. Run new additional tests.
5. Repeat.
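The loop above can be sketched as follows. This is one possible reading, not a prescribed procedure: coverage is modeled as a set of labeled points, and the greedy choice of the next test (the candidate that reaches the most uncovered points) is an assumption added for illustration. The loop also stops when no remaining candidate makes progress, which anticipates the question of when to stop.

```python
def integrate_tests(all_points, functional_hits, candidate_tests):
    """Add coverage tests until no uncovered point can be reached.

    all_points:       set of coverage points (for example, statement labels).
    functional_hits:  set of points already executed by the functional tests.
    candidate_tests:  dict mapping a candidate test name to the set of
                      points it executes.
    Returns (names of tests added, points still uncovered).
    """
    covered = set(functional_hits)
    added = []
    while True:
        uncovered = all_points - covered          # steps 1 and 2
        if not uncovered:
            break
        # Step 3: pick the unused candidate that reaches the most
        # uncovered points (a greedy heuristic, assumed here).
        best = max(
            (name for name in candidate_tests if name not in added),
            key=lambda name: len(candidate_tests[name] & uncovered),
            default=None,
        )
        if best is None or not candidate_tests[best] & uncovered:
            break                                 # no test makes progress
        added.append(best)                        # step 4: run the new test
        covered |= candidate_tests[best]
    return added, all_points - covered            # step 5 is the loop itself


points = {"a", "b", "c", "d", "e"}
candidates = {"t1": {"c"}, "t2": {"c", "d"}, "t3": {"a"}}
print(integrate_tests(points, {"a", "b"}, candidates))
```

In this example only t2 is added, and point "e" remains uncovered because no candidate test reaches it; that leftover set is the list of code still waiting for a test to be written.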
Infamous Software Bugs
The first known computer bug came about in 1947 when a primitive computer
used by the Navy to calculate the trajectories of artillery shells shut down when a
moth got stuck in one of its computing elements, a mechanical relay. Hence, the
name bug for a computer error.[1]
In 1962, the Mariner 1 mission to Venus failed because the rocket went off course
after launch and had to be destroyed at a project cost of $80 million.[2] The
problem was traced to a typographical error in the FORTRAN guidance code. The
FORTRAN statement written by the programmer was
DO 10 I=1.5
Because FORTRAN ignores blanks, this was interpreted as an assignment
statement, DO10I = 1.5.
The statement should have been
DO 10 I=1,5
This statement begins a DO loop: execute the statements through the line
labeled 10 for the values of I from one to five.
Perhaps the most sobering embedded systems software defect was the deadly
Therac-25 disaster in 1987. Four cancer patients receiving radiation therapy died
from radiation overdoses. The problem was traced to a failure in the software
responsible for monitoring the patients’ safety.[4]
When to Stop?
The algorithm from the previous section has a lot in common with the instructions
on the back of every shampoo bottle. Taken literally, you would be testing (and
shampooing) forever. Obviously, you’ll need to have some predetermined criteria
for when to stop testing and to release the product.
If you are designing your system for mission-critical applications, such as the
navigational software in a commercial jetliner, the degree to which you must test
your code is painstakingly spelled out in documents such as RTCA DO-178B, the FAA-recognized
specification. Unless you can certify and demonstrate that your code has met the
requirements set forth in this document, you cannot deploy your product. For most
others, the criteria are less fixed.
The most commonly used stop criteria (in order of reliability) are:
When the boss says
When a new iteration of the test cycle finds fewer than X new bugs
When a certain coverage threshold has been met without uncovering
any new bugs
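The two measurable criteria can be written down as a simple check. The sketch below is illustrative only: the threshold values (X = 3 new bugs, 90 percent coverage) are placeholders, not recommendations, and each project would choose its own.

```python
def should_stop(new_bugs, coverage, max_new_bugs=3, coverage_threshold=0.90):
    """Return the reason to stop testing, or None to keep going.

    new_bugs: number of new bugs found in the latest test cycle.
    coverage: fraction of the code exercised so far (0.0 to 1.0).
    The default thresholds are placeholders chosen for illustration.
    """
    if coverage >= coverage_threshold and new_bugs == 0:
        return "coverage threshold met with no new bugs"
    if new_bugs < max_new_bugs:
        return "fewer than X new bugs in this cycle"
    return None


print(should_stop(new_bugs=5, coverage=0.95))
print(should_stop(new_bugs=2, coverage=0.50))
print(should_stop(new_bugs=0, coverage=0.95))
```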
Regardless of how thoroughly you test your program, you can never be certain you
have found all the bugs. This brings up another interesting question: How many
bugs can you tolerate? Suppose that during extreme software stress testing you
find that the system locks up about every 20 hours of testing. You examine the
code but are unable to find the root cause of the error. Should you ship the
product?
How much testing is “good enough”? I can’t tell you. It would be nice to have
some time-tested rule: “if method Z estimates there are fewer than X bugs in Y
lines of code, then your program is safe to release.” Perhaps some day such
standards will exist. The programming industry is still relatively young and hasn’t
yet reached the level of sophistication, for example, of the building industry. Many
thick volumes of building handbooks and codes have evolved over the years that
provide the architect, civil engineer, and structural engineer with all the
information they need to build a safe building on schedule and within budget.
Occasionally, buildings still collapse, but that’s pretty rare. Until programming
produces a comparable set of standards, it’s a judgment call.
Choosing Test Cases
In the ideal case, you want to test every possible behavior in your program. This
implies testing every possible combination of inputs or every possible decision path
at least once. This is a noble, but utterly impractical, goal. For example, in The Art
of Software Testing, Glenford Myers[6] describes a small program with only five
decisions that has 10^14 unique execution paths. He points out that if you could
write, execute, and verify one test case every five minutes, it would take one
billion years to test this program exhaustively. Obviously, the ideal situation is
beyond reach, so you must use approximations to this ideal. As you’ll see, a
combination of functional testing and coverage testing provides a reasonable
second-best alternative. The basic approach is to select the tests (some functional,
some coverage) that have the highest probability of exposing an error.
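The one-billion-year figure is easy to check. The short calculation below uses only the numbers quoted from Myers (10^14 paths, one test case every five minutes) and assumes a 365-day year with testing around the clock.

```python
PATHS = 10**14                    # unique execution paths in Myers's example
MINUTES_PER_TEST = 5              # write, execute, and verify one test case
MINUTES_PER_YEAR = 60 * 24 * 365  # assumes nonstop testing, 365-day year

years = PATHS * MINUTES_PER_TEST / MINUTES_PER_YEAR
print(f"{years:.2e} years")       # prints "9.51e+08 years"
```

The result, roughly 950 million years, rounds to the one billion years cited above.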
Functional Tests
Functional testing is often called black-box testing because the test cases for
functional tests are devised without reference to the actual code — that is, without
looking “inside the box.” An embedded system has inputs and outputs and
implements some algorithm between them. Black-box tests are based on what is
known about which inputs should be acceptable and how they should relate to the
outputs. Black-box tests know nothing about how the algorithm in between is
implemented. Example black-box tests include:
Stress tests: Tests that intentionally overload input channels, memory
buffers, disk controllers, memory management systems, and so on.
Boundary value tests: Inputs that represent “boundaries” within a
particular range (for example, largest and smallest integers together with –1,
0, +1, for an integer input) and input values that should cause the output to
transition across a similar boundary in the output range.
Exception tests: Tests that should trigger a failure mode or exception
mode.
Error guessing: Tests based on prior experience with testing software
or from testing similar programs.
Random tests: Generally, the least productive form of testing but still
widely used to evaluate the robustness of user-interface code.
Performance tests: Because performance expectations are part of the
product requirement, performance analysis falls within the sphere of functional
testing.
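Boundary value inputs are mechanical enough to generate automatically. The helper below is a minimal sketch for an integer input with a valid range of lo to hi; the function name is hypothetical, and including the neighbors on each side of the limits is an assumption that goes slightly beyond the description above.

```python
def boundary_values(lo, hi):
    """Return boundary test inputs for an integer input valid in [lo, hi].

    Includes each limit, its neighbors on both sides, and -1, 0, +1.
    Values outside [lo, hi] are kept on purpose: they should exercise
    the error handling.
    """
    candidates = {lo - 1, lo, lo + 1, hi - 1, hi, hi + 1, -1, 0, 1}
    return sorted(candidates)


# Example: a signed 8-bit input.
print(boundary_values(-128, 127))
# prints [-129, -128, -127, -1, 0, 1, 126, 127, 128]
```

Each generated input still needs an expected output taken from the requirements; the generator only supplies the stimulus side of the test case.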
Because black-box tests depend only on the program requirements and its I/O
behavior, they can be developed as soon as the requirements are complete. This
allows black-box test cases to be developed in parallel with the rest of the system
design.
Like all testing, functional tests should be designed to be destructive, that is, to
prove the program doesn’t work. This means overloading input channels, beating
on the keyboard in random ways, purposely doing all the things that you, as a
programmer, know will hurt your baby. When I was an R&D product manager, this was one
of my primary test methodologies. If 40 hours of abuse testing could be completed
with no serious or critical defects logged against the product, the product could be
released. If a significant defect was found, the clock started over again after the
defect was fixed.
Coverage Tests
The weakness of functional testing is that it rarely exercises all the code. Coverage
tests attempt to avoid this weakness by (ideally) ensuring that each code
statement, decision point, or decision path is exercised at least once. (Coverage
testing also can show how much of your data space has been accessed.) Also
known as white-box tests or glass-box tests, coverage tests are devised with full
knowledge of how the software is implemented, that is, with permission to “look
inside the box.” White-box tests are designed with the source code handy. They
exploit the programmer’s knowledge of the program’s APIs, internal control
structures, and exception handling capabilities. Because white-box tests depend on
specific implementation decisions, they can’t be designed until after the code is
written.
From an embedded systems point of view, coverage testing is the most important
type of testing because the degree to which you can show how much of your code
has been exercised is an excellent predictor of the risk of undetected bugs you’ll be
facing later.
Example white-box tests include:
Statement coverage: Test cases selected because they execute every
statement in the program at least once.
Decision or branch coverage: Test cases chosen because they cause
every branch (both the true and false path) to be executed at least once.
Condition coverage: Test cases chosen to force each condition (term)
in a decision to take on all possible logic values.
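The difference between these three levels is easiest to see on a small example. The sketch below instruments a hypothetical function by hand (real projects would use a coverage tool or hardware trace instead) and reports which level a given set of test inputs achieves. The function clamp, the limit of 100, and the trace layout are assumptions for illustration.

```python
def clamp(value, enabled, limit, trace):
    """Hypothetical function under test, instrumented by hand."""
    over = value > limit
    # Record the logic value taken by each condition in the decision.
    trace["conditions"] |= {("enabled", enabled), ("over", over)}
    taken = enabled and over
    trace["branches"].add(taken)          # which way the decision went
    if taken:
        trace["statements"].add("value = limit")
        value = limit
    trace["statements"].add("return value")
    return value


def measure(test_inputs, limit=100):
    """Run (value, enabled) test inputs and report the coverage reached."""
    trace = {"statements": set(), "branches": set(), "conditions": set()}
    for value, enabled in test_inputs:
        clamp(value, enabled, limit, trace)
    return {
        "statement": len(trace["statements"]) == 2,    # both tracked statements
        "branch": trace["branches"] == {True, False},  # true and false paths
        "condition": len(trace["conditions"]) == 4,    # each term seen both ways
    }


print(measure([(200, True)]))
print(measure([(200, True), (200, False)]))
print(measure([(200, True), (200, False), (50, True)]))
```

A single input, (200, True), executes every statement yet never takes the false path. Adding (200, False) completes branch coverage, but the term value > limit has still only ever been true; the third input is needed before every condition has taken both logic values.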
Theoretically, a white-box test can exploit or manipulate whatever it needs to
conduct its test. Thus, a white-box test might use the JTAG interface to force a
particular memory value as part of a test. More practically, white-box testing might
analyze the execution path reported by a logic analyzer.
Gray-Box Testing
Because white-box tests can be intimately connected to the internals of the code,
they can be more expensive to maintain than black-box tests. Whereas black-box
tests remain valid as long as the requirements and the I/O relationships remain
stable, white-box tests might need to be re-engineered every time the code is
changed. Thus, the most cost-effective white-box tests generally are those that
exploit knowledge of the implementation without being intimately tied to the
coding details.
Tests that only know a little about the internals are sometimes called gray-box
tests. Gray-box tests can be very effective when coupled with “error guessing.” If
you know, or at least suspect, where the weak points are in the code, you can
design tests that stress those weak points. These tests are gray box because they
cover specific portions of the code; they are error guessing because they are