Often minor changes will fix any performance problems in well-designed code, while badly-designed code will require major rewriting.
7.1 A Bottleneck
When solving problems, it's important to ask the right question.
7.2 Timing and Profiling
Automate timing measurements.
Most systems have a command to measure how long a program takes. On Unix, the command is called time:
% time slowprogram
real 7.0
user 6.2
sys 0.1
%
This runs the command and reports three numbers, all in seconds: "real" time, the elapsed time for the program to complete; "user" CPU time, time spent executing the user's program; and "system" CPU time, time spent within the operating system on the program's behalf. If your system has a similar command, use it; the numbers will be more informative, reliable, and easier to track than time measured with a stopwatch. And keep good notes.
C and C++ provide a standard routine, clock, that reports how much CPU time the program has consumed so far. It can be called before and after a function to measure CPU usage:
#include <time.h>
#include <stdio.h>
...
clock_t before;
double elapsed;
before = clock();
long_running_function();
elapsed = clock() - before;
printf("function used %.3f seconds\n", elapsed/CLOCKS_PER_SEC);
The scaling term, CLOCKS_PER_SEC, records the resolution of the timer as reported by clock. If the function takes only a small fraction of a second, run it in a loop, but be sure to compensate for loop overhead if that is significant:
before = clock();
for (i = 0; i < 1000; i++)
    short_running_function();
elapsed = (clock()-before) / (double)i;
In Java, functions in the Date class give wall clock time, which is an approximation to CPU time:
Date before = new Date();
long_running_function();
Date after = new Date();
long elapsed = after.getTime() - before.getTime();
The return value of getTime is in milliseconds.
Use a profiler.
Besides a reliable timing method, the most important tool for performance analysis is a system for generating profiles. A profile is a measurement of where a program spends its time. Some profiles list each function, the number of times it is called, and the fraction of execution time it consumes. Others show counts of how many times each statement was executed. Statements that are executed frequently contribute more to run-time, while statements that are never executed may indicate useless code or code that is not being tested adequately.
Profiling is an effective tool for finding hot spots in a program, although the statistics in a profile can be only approximate: compilers are sophisticated, caching and memory effects are complex, and profiling a program affects its performance.
The way to use profiling is to identify the critical time-consuming parts of the program, improve them to the degree possible, and then measure again to see if a new hot spot has surfaced. Eventually, often after only one or two iterations, there is no obvious hot spot left.
Profiling is usually enabled with a special compiler flag or option. The program is run, and then an analysis tool shows the results. On Unix, the flag is usually -p and the tool is called prof:
% cc -p spamtest.c -o spamtest
% spamtest
% prof spamtest
Concentrate on the hot spots.
When a single function is so overwhelmingly the bottleneck, there are only two ways to go: improve the function to use a better algorithm, or eliminate the function altogether by rewriting the surrounding program.
Draw a picture.
Pictures are especially good for presenting performance measurements. They can convey information about the effects of parameter changes, compare
algorithms and data structures, and sometimes point to unexpected behavior.
7.3 Strategies for Speed
Use a better algorithm or data structure.
Enable compiler optimizations.
One zero-cost change that usually produces a reasonable improvement is to turn on whatever optimization the compiler provides.
Optimization is usually off by default because it can interfere with source-level debugging, so programmers must enable the optimizer explicitly once they believe the program has been debugged.
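On many Unix compilers, for example, the optimizer is requested with the -O flag; the file name here simply repeats the earlier profiling example for illustration:
% cc -O spamtest.c -o spamtest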
One thing to be aware of is that the more aggressively the compiler optimizes, the more likely it is to introduce bugs into the compiled program. After enabling the optimizer, re-run your regression test suite, as you should for any other modification.
Tune the code.
If the right algorithm is in place but speed is still an issue, the next thing to try is tuning the code: adjusting the details of loops and expressions to make things go faster.
Don't optimize what doesn't matter.
7.4 Tuning the Code
Collect common subexpressions.
If an expensive computation appears multiple times, do it in only one place and remember the result.
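As an illustrative sketch (the function and variable names are invented), the repeated subexpression here is hoisted into a temporary, so the square root is computed once per call instead of twice:

#include <math.h>

/* before: if (sqrt(dx*dx + dy*dy) > limit) total += sqrt(dx*dx + dy*dy); */

/* after: compute the common subexpression once and reuse it */
double add_if_far(double dx, double dy, double limit, double total)
{
    double dist = sqrt(dx*dx + dy*dy);

    if (dist > limit)
        total += dist;
    return total;
}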
Replace expensive operations by cheap ones.
In olden times, this used to mean replacing multiplications by additions or shifts, but that rarely buys much now. Division and remainder are much slower than multiplication, however, so there may be improvement if a division can be replaced with multiplication by the inverse, or a remainder by a masking operation if the divisor is a power of two. Replacing array indexing by pointers in C or C++ might speed things up, although most compilers do this automatically.
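A small sketch of the masking trick, assuming a hash-table size that is fixed at a power of two (the names are invented):

enum { NHASH = 1024 };          /* must be a power of two */

/* bucket: h % NHASH computed with a mask instead of a division */
unsigned int bucket(unsigned int h)
{
    return h & (NHASH - 1);
}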
Unroll or eliminate loops.
There is a certain overhead in setting up and running a loop. If the body of the loop isn't too long and doesn't iterate too many times, it can be more efficient to write out each iteration in sequence.
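For instance, a copy loop unrolled by four might look like this sketch, which assumes for simplicity that n is a multiple of four:

/* copy4: copy n items, unrolled by four; assumes n % 4 == 0 */
void copy4(int *dst, const int *src, int n)
{
    int i;

    for (i = 0; i < n; i += 4) {   /* four copies per loop test */
        dst[i]   = src[i];
        dst[i+1] = src[i+1];
        dst[i+2] = src[i+2];
        dst[i+3] = src[i+3];
    }
}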
Cache frequently-used values.
Cached values don't have to be recomputed. Caching takes advantage of locality, the tendency of programs to re-use recently accessed or nearby items. It's best if the caching operation is invisible from outside, so that it doesn't affect the rest of the program except for making it run faster.
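A minimal sketch of one form of caching, remembering the argument and result of the most recent call (the computation shown is only a stand-in):

#include <math.h>

/* cached_f: recompute only when the argument changes */
double cached_f(double x)
{
    static double last_x, last_result;
    static int valid = 0;

    if (!valid || x != last_x) {
        last_result = exp(x) * sin(x);   /* stand-in for expensive work */
        last_x = x;
        valid = 1;
    }
    return last_result;
}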
Write a special-purpose allocator.
Often the single hot spot in a program is memory allocation, which manifests itself as lots of calls on malloc or new. When most requests are for blocks of the same size, substantial speedups are possible by replacing calls to the general-purpose allocator by calls to a special-purpose one. The special-purpose allocator makes one call to malloc to fetch a big array of items, then hands them out one at a time as needed, a cheaper operation. Freed items are placed back in a free list so they can be reused quickly.
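A sketch of such an allocator for one fixed-size type; the structure name and batch size are invented, but the shape is as described above:

#include <stdlib.h>

typedef struct Item Item;
struct Item {
    Item *next;                 /* links free items together */
    /* ... user data ... */
};

enum { NALLOC = 1000 };         /* how many items to grab per malloc */

static Item *freelist = NULL;

/* newitem: hand out one item, refilling from malloc in big batches */
Item *newitem(void)
{
    Item *p;
    int i;

    if (freelist == NULL) {
        p = malloc(NALLOC * sizeof(Item));
        if (p == NULL)
            return NULL;
        for (i = 0; i < NALLOC-1; i++)   /* chain the batch together */
            p[i].next = &p[i+1];
        p[NALLOC-1].next = NULL;
        freelist = p;
    }
    p = freelist;
    freelist = p->next;
    return p;
}

/* freeitem: put an item back on the free list for quick reuse */
void freeitem(Item *p)
{
    p->next = freelist;
    freelist = p;
}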
Some algorithms can use stack-based allocation, where a whole sequence of allocations is done, and then the entire set is freed at once. The allocator obtains one big chunk for itself and treats it as a stack, pushing allocated items on as needed and popping them all off in a single operation at the end. Some C libraries offer a function alloca for this kind of allocation, though it is not standard. It uses the local call stack as the source of memory, and frees all the items when the function that calls alloca returns.
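A portable sketch of the same stack-like idea without alloca: an arena that hands out pieces of one big chunk and frees them all with a single reset (the size and names are invented):

#include <stddef.h>

enum { ARENA_SIZE = 64 * 1024 };

static char arena[ARENA_SIZE];
static size_t used = 0;

/* arena_alloc: push n bytes onto the arena; no per-item free */
void *arena_alloc(size_t n)
{
    void *p;

    n = (n + 7) & ~(size_t)7;   /* keep 8-byte alignment */
    if (used + n > ARENA_SIZE)
        return NULL;
    p = arena + used;
    used += n;
    return p;
}

/* arena_reset: pop everything at once */
void arena_reset(void)
{
    used = 0;
}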
Buffer input and output.
Buffering batches transactions so that frequent operations are done with as little overhead as possible, and the high-overhead operations are done only when necessary. The cost of an operation is thereby spread over multiple data values. The drawback is the need to flush output buffers to make data visible; in the worst case, information still in a buffer will be lost if a program crashes.
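A minimal sketch of output buffering on a Unix system, in the spirit of what the stdio library does internally (the buffer size and names are invented):

#include <unistd.h>

enum { BUFSIZE = 8192 };

static char buf[BUFSIZE];
static int nbuf = 0;

/* flushbuf: the expensive operation, done only when necessary */
void flushbuf(void)
{
    if (nbuf > 0) {
        write(1, buf, nbuf);    /* one system call for many characters */
        nbuf = 0;
    }
}

/* putchar_buf: the frequent operation, usually just a store */
void putchar_buf(char c)
{
    if (nbuf == BUFSIZE)
        flushbuf();
    buf[nbuf++] = c;
}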
Handle special cases separately.
By handling same-sized objects in separate code, special-purpose allocators reduce time and space overhead in the general allocator and incidentally reduce fragmentation.
Precompute results.
Sometimes it is possible to make a program run faster by precomputing values so they are ready when they are needed.
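One sketch of the idea: fill in a table once, up front, so that later queries are a single lookup; here the table counts the one-bits in each possible byte (the names are invented):

static int bitcount[256];       /* bits set in each possible byte value */

/* init_bitcount: fill in the table once, before it is needed */
void init_bitcount(void)
{
    int i;

    for (i = 1; i < 256; i++)
        bitcount[i] = (i & 1) + bitcount[i/2];
}

/* count_byte: a lookup replaces a loop over the bits */
int count_byte(unsigned char b)
{
    return bitcount[b];
}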
Use approximate values.
If accuracy isn't an issue, use lower-precision data types. On older or smaller machines, or machines that simulate floating point in software, single-precision floating-point arithmetic is often faster than double-precision, so use float instead of double to save time. Some modern graphics processors use a related trick. The IEEE floating-point standard requires "graceful underflow" as calculations approach the low end of representable values, but this is expensive to compute. For images, the feature is unnecessary, and it is faster and perfectly acceptable to truncate to zero. This not only saves time when the numbers underflow, it can simplify the hardware for all arithmetic. The use of integer sin and cos routines is another example of using approximate values.
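A sketch of the integer sin idea: precompute scaled values once, so later calls avoid floating point entirely (the scale factor and names are invented):

#include <math.h>

#define PI 3.14159265358979323846

enum { SCALE = 1024 };          /* entries are sin(d degrees) * 1024 */

static int isin[360];

/* init_isin: compute the table once; lookups are then integer-only */
void init_isin(void)
{
    int d;

    for (d = 0; d < 360; d++)
        isin[d] = (int)floor(sin(d * PI / 180.0) * SCALE + 0.5);
}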
Rewrite in a lower-level language.
Lower-level languages tend to be more efficient, although at a cost in programmer time. Thus rewriting some critical part of a C++ or Java program in C or replacing an interpreted script by a program in a compiled language may make it run much faster.
Occasionally, one can get significant speedups with machine-dependent code. This is a last resort, not a step to be taken lightly, because it destroys portability and makes future maintenance and modifications much harder. Almost always, operations to be expressed in assembly language are relatively small functions that should be embedded in a library; memset and memmove, or graphics operations, are typical examples. The approach is to write the code as cleanly as possible in a high-level language and make sure it's correct by testing it, as we described for memset in Chapter 6. This is your portable version, which will work everywhere, albeit slowly. When you move to a new environment, you can start with a version that is known to work. Now when you write an assembly-language version, test it exhaustively against the portable one. When bugs occur, non-portable code is always suspect: it's comforting to have a comparison implementation.
7.5 Space Efficiency
Save space by using the smallest possible data type.
This might mean replacing int with short if the data will fit; this is a common technique for coordinates in 2-D graphics systems, since 16 bits are likely to handle any expected range of screen coordinates.
The logical extension of this approach is to encode information in a byte or even fewer bits, say a single bit where possible. Don't use C or C++ bitfields; they are highly non-portable and tend to generate voluminous and inefficient code. Instead, encapsulate the operations you want in functions that fetch and set individual bits within words or an array of words with shift and mask operations.
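A sketch of such functions, storing the bits in an array of unsigned longs (the names are invented):

#include <limits.h>

enum { WORDBITS = CHAR_BIT * sizeof(unsigned long) };

/* getbit: return bit n of the array as 0 or 1 */
int getbit(unsigned long *a, int n)
{
    return (a[n / WORDBITS] >> (n % WORDBITS)) & 1;
}

/* setbit: set bit n of the array to v (0 or 1) */
void setbit(unsigned long *a, int n, int v)
{
    if (v)
        a[n / WORDBITS] |= 1UL << (n % WORDBITS);
    else
        a[n / WORDBITS] &= ~(1UL << (n % WORDBITS));
}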
Don't store what you can easily recompute.
Changes like these are minor, however; they are analogous to code tuning. Major improvements are more likely to come from better data structures, perhaps coupled with algorithm changes.
The basic idea is to store the common value or values implicitly or in a compact form, and to spend more time and space on the remaining values. If the most common values are really common, this is a win.
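For instance, if nearly all entries of a large array are zero, this sketch stores only the exceptions as (index, value) pairs, sorted by index, and reports the common value for everything else (the names and sizes are invented):

typedef struct {
    int index;
    int value;
} Pair;

static Pair rare[1000];         /* exceptions, kept sorted by index */
static int nrare = 0;

/* lookup: binary search the exceptions; the common value 0 is implicit */
int lookup(int index)
{
    int lo = 0, hi = nrare - 1;

    while (lo <= hi) {
        int mid = (lo + hi) / 2;
        if (rare[mid].index < index)
            lo = mid + 1;
        else if (rare[mid].index > index)
            hi = mid - 1;
        else
            return rare[mid].value;
    }
    return 0;                   /* the common value, stored implicitly */
}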
Space efficiency concerns sometimes manifest themselves in the external representation of information as well, both conversion and storage. In general, it is best to store information as text wherever feasible rather than in some binary representation. Text is portable, easy to read, and amenable to processing by all kinds of tools; binary representations have none of these advantages. The argument in favor of binary is usually based on "speed," but this should be treated with some skepticism, since the disparity between text and binary forms may not be all that great.