Yes, You Can Test Every Line of Code!

Bugs, Bugs, Bugs

How many software bugs have you triggered today? Sometimes it seems that they have become part of the fabric of our lives. I'm still fuming about Wells Fargo Bank's bill-pay system, which automatically defaulted to deducting from the account with the least amount of money and didn't bother to tell me. Two bills paid, two overdrafts, no notices from Wells Fargo until it was too late, and I'm out $50.00 in overdraft fees. To add insult to injury, they charged me a monthly fee too.

I can (and did) cancel the buggy bill-pay service, but you can't always cancel the software you use. I can't easily replace Android on my smartphone, so I dutifully reboot it every few days. Otherwise, odd things happen, or it reboots itself at an inopportune moment. I don't run a lot of apps on it (I use it mostly to manage passwords and make phone calls), so what could be scrambling the operating system? A bug, or two, or three. Rebooting a phone - what a concept.

Keeping Your Bugs from Getting Out

All software has bugs, especially when it is first written. Even if there is a formal proof that the implementation meets the specification, the specification can have bugs. The only way to be sure that the software does what you want is to test it with real data.

Testing is awful. Testing is boring. Testing is useless - except when it finds a bug. Testing keeps programmers from doing real work.

Testing improves code. Testing documents the code. Testing increases confidence in the code. Testing early saves time later. Testing keeps customers happy (and programmers paid).

Which of the above two paragraphs do you relate to more? I'll bet that the first one made you nod your head in agreement, at least until you read the last sentence of the second. Testing is in fact work that we wish we didn't have to do, but software never starts perfect and never stays perfect. The sooner you find problems, the better. As a result, robust testing has significant benefits - see The Cost of Debugging Software.

Full testing has its proponents, of course; Extreme Programming is one example of a methodology that promotes full testing as software is written. High-performance scientific and engineering code does not lend itself well to a philosophy that every feature should be tested and implemented (in that order) in a single day. The methods I use are closer to (and actually pre-date) the methods used to test digital designs prior to logic synthesis: typical examples plus boundary conditions plus directed white-box tests. They've served me well for 20 years.

White Box vs. Black Box Testing

Testing every line of code is "white box" testing by definition - someone has to look at the code to determine which cases need coverage. Although it's possible to have a programmer who did not write the code look at it to select the test cases, most often the original developer writes most if not all of the tests.

"Black box" testing, on the other hand, works only from a specification. The code is supposed to have a certain set of features, and an input range in which it is supposed to deliver those features, and you could argue that nothing else needs testing. Unfortunately, the implementation of the code is not always obvious from the specification, and there may be special case code or optimizations inside which could have bugs that are not obvious corruptions of the specification.

In practice, all software testing is a mix of "white box" and "black box" testing. "Black box" tests exercise basic functionality just to see if the code works. "White box" tests on top of that ensure that all special case code is run at least once - that every line of code is executed.
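To make the distinction concrete, here is a minimal sketch in Python. The routine interpolate and its table values are hypothetical, invented only for illustration. The "black box" checks come straight from the specification; the "white box" checks exist only because reading the code reveals two special-case branches that the typical examples never reach.

```python
import bisect


def interpolate(x, xs, ys):
    """Piecewise-linear interpolation over a sorted table, clamped at both ends."""
    if x <= xs[0]:                      # special case: at or below the table
        return ys[0]
    if x >= xs[-1]:                     # special case: at or above the table
        return ys[-1]
    i = bisect.bisect_right(xs, x)      # general case: xs[i-1] <= x < xs[i]
    t = (x - xs[i - 1]) / (xs[i] - xs[i - 1])
    return ys[i - 1] + t * (ys[i] - ys[i - 1])


XS = [0.0, 1.0, 3.0]
YS = [0.0, 10.0, 30.0]

# "Black box": typical values taken from the specification.
black_box = [(0.5, 5.0), (2.0, 20.0)]

# "White box": cases chosen by reading the code, so that both clamp
# branches and an exact table entry are executed at least once.
white_box = [(-1.0, 0.0), (5.0, 30.0), (1.0, 10.0)]

for x, expected in black_box + white_box:
    actual = interpolate(x, XS, YS)
    print(x, actual, "ok" if actual == expected else "ERROR")
```

With only the two "black box" cases, both clamp branches would ship without ever having been run.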

Standalone Test Programs

The goals of testing are different from the goals of the final program, so naturally testing code will be structured differently from the final program. I use independent standalone test programs so that there are no conflicts with the final program and the test programs stay reasonably small.

You want to test each layer of the target code independently, so usually there will be at least one standalone test program per code layer. At the very least you will want one standalone program for every layer that implements a subsystem (typically, code with a published API, and perhaps with some routines hidden from users to reduce program coupling). I normally have one test program per code module, so that I always know where to find the test code.

Multiple programs in a build sequence imply that you cannot use an IDE to run all of the tests at once; I always have build scripts (usually makefiles) for my code. When developing in an IDE, my "build target" is a single test program, and the build command within the IDE invokes the external build script. As a bonus, this eases platform-independent development; every platform uses the same build strategy.

For full testing, each test program should have a routine which tests one or a small number of routines in the target code. This is of course easier if each routine in the target code has an interface in the header file. Sometimes this is not practical, so you will have to access hidden routines through the code that calls them. Controllability and observability (two concepts from digital design) are greatly restricted in this case, so hide routines with caution. At some point you may have to publish low-level routines and trust users of your code not to abuse them. Of course, you will want to add comments to the header files notifying users that a set of routines is "public for testing only."

Test the code in dependency order - lower layers first, higher layers afterward. If you have a problem in a lower layer, it is likely that results from higher layers will not be valid, so there isn't much point in looking at them.

There is a benefit to testing in dependency order: the possibility of bugs in lower layers can usually be excluded when you analyze failures in upper layer tests - problems are most likely in the module being tested directly. If a test program for an upper layer of code triggers a problem in a lower layer, I will often include a test for the problem in the upper layer test program - there is no reason to remove test code. The "official" test case for the bug is in the test program for the failing module.
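A driver that enforces dependency order can be very small. The sketch below is a Python stand-in for a makefile or build script, not a description of any particular build system. The layer names and the fake_test_program helper are hypothetical; in practice each entry would be the command line of a real standalone test program.

```python
import subprocess
import sys


def run_in_dependency_order(test_commands):
    """Run (name, argv) pairs in order and stop at the first failure.

    Returns the name of the failing layer, or None if every program passed.
    """
    for name, argv in test_commands:
        status = subprocess.run(argv).returncode
        if status != 0:
            print(f"FAIL {name} (exit code {status}); higher layers not run")
            return name
        print(f"pass {name}")
    return None


def fake_test_program(exit_code):
    """Hypothetical stand-in for a real test program: just exits with a code."""
    return [sys.executable, "-c", f"import sys; sys.exit({exit_code})"]


# Lowest layer first. The middle layer is made to fail on purpose.
layers = [
    ("vector", fake_test_program(0)),
    ("matrix", fake_test_program(1)),
    ("solver", fake_test_program(0)),
]
failed = run_in_dependency_order(layers)
```

Because the run stops at the first failing layer, the results that do get reported are never polluted by a known problem underneath them.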

The flow of each test routine is the same: set up parameters for the call to the routine under test, call the routine, and check the results (return values, side effects in the parameters). Because the calls will often be very similar, I usually export the setup code to a routine within the test program rather than paste inline code multiple times. Often the test setup code forms the basis for the next layer of code - the same things need to be set up, but in a more general sense.
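The set up, call, check flow might look like the following minimal Python sketch. Everything in it is hypothetical: scale_in_place is an invented routine under test, make_ramp is the shared setup code, and check is a small helper that reports a mismatch and returns an error count.

```python
def scale_in_place(values, factor):
    """Hypothetical routine under test: scales a list in place, returns its length."""
    for i in range(len(values)):
        values[i] *= factor
    return len(values)


def make_ramp(n):
    """Shared setup code: a fresh list [1.0, 2.0, ..., n] for each test."""
    return [float(i) for i in range(1, n + 1)]


def check(label, actual, expected):
    """Report one comparison; return 1 on a mismatch so callers can count errors."""
    if actual != expected:
        print(f"ERROR {label}: expected {expected!r}, got {actual!r}")
        return 1
    return 0


def test_scale_in_place():
    errors = 0

    # Typical case.
    values = make_ramp(3)                                  # set up
    count = scale_in_place(values, 2.0)                    # call
    errors += check("typical: return value", count, 3)     # check return value
    errors += check("typical: side effect", values, [2.0, 4.0, 6.0])

    # Boundary case: empty input.
    values = make_ramp(0)
    count = scale_in_place(values, 2.0)
    errors += check("empty: return value", count, 0)
    errors += check("empty: side effect", values, [])

    return errors


print("errors:", test_scale_in_place())
```

Note that both the return value and the side effect on the parameter are checked after every call.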

Sometimes one test will exercise multiple routines at once. For example, there might be unique object setup, followed by a mutator call. The test block would interleave calls to tested routines with checks.

A warning: if you run several tests in a row with similar parameter values, don't reuse the values (typically constants) very many times. If you find that you need to modify a test by changing a parameter value, you could easily invalidate several tests below it. Every so often, reassign all of the parameter values inline. It will slightly increase the number of lines of code in the test program but will make it much easier to understand and maintain.
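As an illustration of that warning, the sketch below restates the parameter values inline every couple of tests. The clip routine and its limits are hypothetical. The point is only that editing the values for one test cannot silently change the meaning of the tests that follow it.

```python
def clip(x, lo, hi):
    """Hypothetical routine under test: limit x to the range lo..hi."""
    return max(lo, min(x, hi))


def test_clip():
    errors = 0

    # Tests 1 and 2 share one assignment of the limits.
    lo, hi = 0.0, 10.0
    errors += int(clip(5.0, lo, hi) != 5.0)     # inside the range
    errors += int(clip(-1.0, lo, hi) != 0.0)    # below the range

    # Test 3: a degenerate range, with values local to this test.
    lo, hi = 3.0, 3.0
    errors += int(clip(7.0, lo, hi) != 3.0)

    # Tests 4 and 5: the limits are restated rather than inherited, so
    # changing the values used by test 3 cannot invalidate them.
    lo, hi = 0.0, 10.0
    errors += int(clip(11.0, lo, hi) != 10.0)   # above the range
    errors += int(clip(10.0, lo, hi) != 10.0)   # exactly at the upper limit

    return errors


print("errors:", test_clip())
```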

At the top level, the test program counts the errors detected and returns a failure code to the driver script if any are found. Messages are printed to the standard output as well.
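A top level along these lines could be sketched as follows in Python. The two test routines are hypothetical stand-ins that simply return fixed error counts; a real test program would list its actual test routines instead.

```python
def run_tests(test_routines):
    """Call each routine, total the errors, and return a process exit code."""
    total_errors = 0
    for routine in test_routines:
        errors = routine()
        status = "ok" if errors == 0 else f"{errors} error(s)"
        print(f"{routine.__name__}: {status}")
        total_errors += errors
    if total_errors:
        print(f"FAILED: {total_errors} error(s) detected")
        return 1
    print("PASSED")
    return 0


def test_layer_ok():
    """Hypothetical test routine that finds no errors."""
    return 0


def test_layer_broken():
    """Hypothetical test routine that reports two errors."""
    return 2


exit_code = run_tests([test_layer_ok])
print("exit code:", exit_code)
```

In a real test program the result would be handed to sys.exit, so that the driver script or makefile sees a nonzero status whenever any error was counted.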

Note that none of the test programs are interactive in any way. Usually they will not even accept parameters; everything necessary to run the tests should be included in (or directly referenced by) the test programs. This allows automated nightly builds and code commit regression checks.

Most of the software I write is for engineering design and analysis, which tends to have deep algorithms with many layers of code. You might argue that these methods are overkill for shallow programs with only a few layers of code and short runtimes. In theory these programs can be tested non-invasively with scripts - run the program with a few parameters, then examine the output. "Black box" testing like this is tempting, but it means you can't test anything until everything is done. You're going to have a lengthy period in which nothing works - one trifling bug after another. And who wants that?

Don't think that you can test lower layers with temporary dummy drivers, either. While you have the dummy drivers, you are in effect doing what I recommend here. Then you are throwing them away, which is a mistake. See Never Throw Test Software Away.

Direct Costs of Full Testing

Software testing is traditionally done after the code is written. Extreme Programming turns this on its head and writes tests before the code is written. The goal is to prevent cuts to testing when (not if) a software schedule starts to slip and hits a hard deadline. If you have already tested all of the features you added, a hard deadline means fewer features, not less testing.

In my R&D work, full testing of a module has typically required about one line of test code for each line of product code. This roughly doubles the "development time" for that module, but I get some of it back when writing and testing the next module - development of the new code can reuse some of the setup code from the previous layer, and testing of the new code avoids the tedium of tracing through multiple layers of older, buggy code.

Your management may complain about development schedules seemingly doubling. If you can't convince them that full testing will pay off quickly, you can still use some of the methods:

Write standalone test programs anyway, but write fewer tests. Use the "black box" approach and write a few typical examples plus a few unusual examples (parameters at one limit or another).
Exercise fewer of the error returns. Unless your program deals with inputs directly from people, many of the error returns will be run rarely if at all.

See When to Stop Adding Tests During Development for more details.

The point is that once you have the test program with the infrastructure to exercise your code, you can easily add new tests to it; you will not be starting from scratch when problems arise. Every time the traditional testing method finds a problem in your code, add a directed test case to the test program. You can build code coverage quickly this way, and if your management agrees to full testing, all you need to do is expand on what you have built.
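Adding a directed test case to an existing test program can be as small as the sketch below. The normalize routine and the bug it once had (dividing by zero for a zero-length vector) are hypothetical, chosen only to show the pattern: the original typical example stays, and the reported failure gets its own permanent test.

```python
import math


def normalize(v):
    """Hypothetical routine under test: scale a 2-D vector to unit length."""
    length = math.hypot(v[0], v[1])
    if length == 0.0:        # fix for the reported bug: no division by zero
        return (0.0, 0.0)
    return (v[0] / length, v[1] / length)


def test_normalize():
    errors = 0

    # Original "black box" test: one typical example.
    x, y = normalize((3.0, 4.0))
    if not (math.isclose(x, 0.6) and math.isclose(y, 0.8)):
        print("ERROR typical: got", (x, y))
        errors += 1

    # Directed test added after the bug report: a zero-length vector.
    try:
        result = normalize((0.0, 0.0))
    except ZeroDivisionError:
        print("ERROR zero vector: division by zero")
        errors += 1
    else:
        if result != (0.0, 0.0):
            print("ERROR zero vector: got", result)
            errors += 1

    return errors


print("errors:", test_normalize())
```

The directed test stays in the program after the fix, so the same problem cannot quietly return.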

Last but not least, standalone test programs are an instant regression test system. Engineering and scientific software tends to run a long time (hours to days), and adding real customer test cases to the regression suite makes suite run times explode. With well-architected software, you can manually set up the data configuration that causes an error, fix the problem, and then verify quickly that it does not cause any other problems.

I once worked at a company where the developers' regression suite ran for an hour. Developers had to run this suite before committing changes. If another developer committed code during this interval, the suite had to be run again. Ideally the rerun would happen before the commit, but that was not always practical. If the two changes did not in fact work together, everyone who got both of them would be dead in the water until the problem was fixed or the last commit was reverted.

Even the hour-long suite was far from complete, and the nightly full regression runs (dozens of CPU hours) would sometimes show problems. With several developers committing code each day, it was not always easy to track problems, and commits would sometimes have to be blocked for hours at a time until the bad commit could be isolated and reverted.

Now imagine running the entire regression suite for a computationally intensive software package in a few minutes...

Direct Benefits of Full Testing

There are many direct benefits of my full testing methods:

Immediate testing gets it done while the code is still familiar. There is no need to relearn the code to write the tests. This is cheaper in the long run.
Full testing is much more systematic; you won't have multiple people chasing the same bug if partially tested code is committed and it breaks something.
Tested modules are milestones; there is no indefinite integration period when "nothing works." As with Extreme Programming, this makes testing much more predictable.
Test programs often serve as prototypes for the next layer of code. If you have problems writing tests for a code layer, you will probably have problems using the code in your application. When you start writing the next layer of code, you can copy and adapt the test drivers, saving some typing.
Test programs directed at a specific code layer are much faster than application-level testing. Most of my test programs run in a fraction of a second - less time than it takes to compile them. They're about the same size as the modules they test, so compile time doubles, but this is much better than a one-hour mini-regression suite that doesn't even have full coverage.

Indirect Benefits of Full Testing

Some benefits of full testing are a little harder to quantify, but are still evident:

The usage examples within the test programs help other programmers understand the code and how it is to be used.
Comments within the test programs provide additional documentation of developer intent.
The coding discipline encouraged by per-module testing helps reduce coupling, leading to code that is more reusable (see Testable Is Reusable) and more reliable.
You can really say "it has all been tested" rather than "the bug rate has leveled off, so it must be acceptable now." You have the confidence to say that it works and customers can use it.

Conclusions

Your management may not yet be willing to commit to testing every line of code up front, but there are still things you can do. You can get fairly good coverage by testing just the API routines (see When to Stop Adding Tests During Development); a layered software architecture makes addition of new tests easier (see Writing Code Layer by Layer); and a robust testing methodology (see Software Design for Testability) allows you to test bug fixes quickly. All of the above let you make enhancements easily and know that you are not going to find problems as the software is about to be shipped.

The traditional method of testing software, which accepts uncertainty about its quality, is unacceptable. I really shouldn't have to reboot my phone, or spend an hour of every workday tracking problems reported in the previous night's regression runs. It is long past time to drive towards fully tested software that is fully usable the first day it is delivered.

Chapman Consulting

Software Development Done Right.