There are multiple ways to optimize software that is too slow or too big (uses too much memory): make individual pieces of code more efficient, speed up computation of common cases, or store results for reuse. Here I describe ways to speed up programs by storing previously computed results. See Software Optimization by Code Rewriting for ways to speed up small sections of code and Software Optimization by Pattern Matching for ways to specialize code for common cases.
A first implementation of an algorithm is often simplified in some ways to reduce development time. For most programs, where the performance "hotspots" are not known ahead of time, this is the best strategy. "Make it right before you make it fast." Highly optimized code tends to be more complex, and increased complexity leads to a higher bug rate for two reasons: there are more lines of code, and the number of errors per line of code tends to be higher for code that is not straightforward.
Once you have working software, profiling will tell you where the program spends most of its time (or uses the most memory). This original implementation provides a reference against which you can compare the improved software for correctness and efficiency. If there are differences, you can single-step through the implementations in parallel to identify the point at which the differences begin. If a trial solution is slower or bigger, you can use this information to refine your understanding of the software.
Caching Is All Around You
Simply put, caching is the saving of results that are likely to be needed again. If you had infinite memory, you could save everything your program ever computed and never have to compute anything again. As a rule, however, you never have enough memory for this, and so results must be computed again.
Decades of studies have proven that caching even a subset of expected future results results in significant runtime or delay improvements:
- Every high-performance central processing unit (CPU) has at least one level of memory cache, and sometimes up to three levels of memory cache;
- Most rotating disk drives include a track buffer that stores an entire track from the disk, so that successive reads from a file on that same track do not need to wait for the disk to rotate back under the read head;
- Web servers on the Internet store copies of frequently-requested pages rather than regenerate them; and
- The Internet itself uses caching so that frequently-requested pages are fewer hops away from users (and thus can be delivered faster).
These are just some of the more prominent examples of data caching; there are many more that are not as well known.
Caching works best when past results provide some clue about future results:
- Computers spend most of their time executing loops a few dozen instructions long;
- Operating systems try to store disk data on sequential blocks of a disk drive;
- Today's headlines (or celebrity meltdown) will show up in countless search engine requests;
- Hit songs will be played over and over again; and
- The same movies will be streamed to destinations all over the world.
In all of these examples, each new request has a reasonably high probability of being equivalent to a very recent request. Storing copies of a number of recent requests for a little while will thus save the time and effort required to retrieve or recompute them.
Identifying Opportunities for Caching
These examples are well-known to anyone with a college-level computer science background. Processor design, operating system principles, and Web server organization all have well-defined best practices that are taught in Computer Science and Electrical Engineering classes everywhere. But what about your own software?
As I discussed in Optimizing with ^C, interrupting your program from time to time as it runs will give you some insight as to where it spends its time. If you look up the call chain, eventually you will reach a higher-level function that is computing or searching for a result. If caching is to work, the result for a given set of parameters passed to the next level down must always be the same. You will want to store these parameters along with the results.
It is quite likely that caching will increase overall memory consumption in your program - this is usually not a strategy to reduce memory consumption (unless the computations themselves use a lot of memory to compute a small result). The amount of information stored per cached result will directly impact how much additional memory you will need. As a result, you may need to pick your caching strategy carefully.
Simple Caching: Storing Commonly Computed Values
I once wrote a generalized radix sort function. Given an offset to an integer key within an anonymous data structure, it made four passes through a list of these data structures, dividing the list into 256 slices in O(n) time. In each pass, a different byte of the key was used. The lists being sorted were large, so the sort function was very fast compared to a "normal" sort which was O(n * log2 n) or O(n * sqrt(n)). Even so, someone challenged me to make it faster: in each pass, the function was extracting the key again. Storing the key within the temporary data structure used for sorting sped up the code by 6%.
This is a crude sort of cache, roughly equivalent in scope to common subexpression removal by a compiler. The memory increase and performance improvement are often relatively small, but improvements like this can add up.
Identifying opportunities to save commonly used values requires a fairly deep understanding of the code you are working with. The commonly used value might have been omitted for a reason. If the value is used during a short period of time but the data structure is long-lived, its cost/benefit ratio may be too low. In the radix sort example, the data structures being sorted were long-lived, but new ones were created just for sorting. It was a trivial matter to add the key just for the duration of the sort. Adding the key to the application's data structure as a base class would have wasted memory for the remainder of execution.
Full Caching: Identifying the Easy Cases
If the returned result is not trivial like the radix sort key, or is computed over several functions, simple caching won't be effective. You will need to study some of the higher-level functions to identify where the results could be stored. The ideal location is an Applications Program Interface (API) of some kind, because it is a choke point: everything goes through it. It has many functions calling it, and it calls many functions. Not only that, an API is an abstraction layer, meaning that callers don't usually know how it works inside, and it can be changed without much impact on callers. Focus your attention on functions like this first.
Sometimes you can determine the most frequent patterns with interrupt-based sampling in the debugger. Often this is not enough, and you will need to annotate a program run. Add temporary code that prints all of its input parameters, then run the program with a moderately sized data set. You want data that is representative but not so extensive that the annotation file is too large to analyze.
Skim the annotation file manually before running statistical analyses. Human eyes are good at seeing subtle patterns that statistics may mask: clusters of repeated results or trends over time. Typical caching strategies allow shifts in stored data over time, so you want to focus on short-term effects rather than statistics gathered over an entire run.
Full Caching: Dealing with the Hard Cases
If your candidate function passes many parameters (or large data structures), or the result it gets is large, caching at this level may not be appropriate unless a large fraction of calls use the same few combinations of parameters. If the number of combinations is too high, you will need to examine the program flow in the functions below it to determine whether one or more subsidiary caches are feasible.
An annotation file can help here. You may find clusters of common parameters within the larger set. This is easier if you modify the annotation report to group parameters according to their use by lower-level code. A high fraction of repeated parameters for a given lower-level function means that a cache for that function can pay off.
You may find that the common parameter subsets are not being passed to lower-level functions, but are being used instead in the higher-level function. Refactor the code at that level to split off the subset calculations, verify that your results do not change, and then add a cache for the parameter subset.
Problems arise if you cannot use a smaller data set for your annotation runs, or if you cannot write a large annotation file. You can modify the annotation code to examine a fraction of the calls (every tenth call, for example) or perform the statistical analysis of parameters within the annotation code. Be careful not to throw too much information away; you don't want to miss improvement opportunities.
You may also have trouble when the results are not quite the same each time, even when the parameters appear to be the same. This can happen in optimization programs, and it means the context you are examining is not broad enough. Usually it is better to look at lower levels of code rather than increase the number of parameters (context) you are examining and storing.
Optimization code can sometimes use caching even when the results are not quite the same each time. The answer you store might be an approximate answer that is close to the best local solution, but is computed much more quickly than the best solution. Although you won't reach the best solution in the local calculation, the time you save here may allow you to try more potential global solutions at the top level of the optimization algorithm, and overall results may improve. But that's the subject of another article.
Implementation of Full Caching
Implementation of an object cache is fairly straightforward, but there are some considerations. If the cache is effective, most of the runtime will be object lookup time. This means lookups must be fast, and typically a hash over the function parameters is the only method fast enough. If there are many parameters, a complex hash function may take too much time to compute, negating your advantage. Keep it simple.
Expect the cache to fill, and have a fast strategy for releasing cache entries when it does fill. Least Recently Used (LRU) algorithms are good, and they can be implemented quickly by linking cached objects together in a doubly linked list. Whenever you find an object in the cache, remove it from its current location (unlink it) and then relink it at the start of the LRU list. If you don't find the object in the cache, delete the last item in the LRU list to make room for the new object.
Depending on the access patterns in your program, you may need to tune the basic LRU algorithm. If many objects are used only once but there is a set of popular objects used repeatedly, you may wish to add an access count to the cache management data structure and use the count as a secondary criteria. If objects keep getting flushed (thrashing), your working set of objects may be too small. Look at the annotated program runs you made earlier (or turn that code back on!) to estimate how many objects you'll need in memory to avoid flushing the object you'll need next.
The cache size is typically a program parameter. You won't always know how much memory will be available for your cache, so leave that task for the system administrator who is installing your program. You can certainly provide good guesses as the default, but you cannot assume that you will have gigabytes of memory available for the cache. Even if the machine has lots of memory, the system administrator might not want to give all of it to your program. Run some sample data sets with varying cache sizes, then plot the runtime savings vs. cache size in the documentation. This will help system administrators allocate resources.
Data caching tends to be a simpler optimization strategy than code rewriting or pattern matching, because you can leave the computation code alone. All you need to do is store the results you are already computing.
Unlike the other two methods, when you cache data you are making an explicit tradeoff between runtime and memory usage. This often requires some balancing on the part of your users, who know their resources better than you will.