Overview of Code Optimization Objectives
In Group4Layers, whenever the client gives us freedom or we carry out internal projects, we base our developments on several fundamental pillars, enhancing from code quality and scalable architectures to maintenance and easy traceability, but of course, efficiency. Our libraries and released software, such as ExImageInfo, are characterized by these fundamentals. The result? Years go by and we are still the creators of the fastest (and most supported) image processing library in the Erlang/Elixir ecosystem. No wonder it has positioned itself as the most popular library, used by individuals and companies around the world. Articles like this one introduce the complex but interesting work of code optimization.
The fundamental goal when optimizing a program is to identify those events or regions that significantly impact the performance of a program. This will determine which ones we will need to consider, capture and evaluate. However, we must also keep in mind that we are interested in those events that occur more frequently and that can be conveniently manipulated through code variations, as long as they do not alter the semantics of our programs.
We consider the execution time of a program, defined as the product of the number of instructions executed, the average number of cycles per instruction (CPI) and the clock period:
Since the clock period is generally fixed (except for configurations or systems with dynamic variations), code optimization is usually dependent on reducing the CPI and the number of instructions executed.
The reduction of the number of instructions executed is generally determined by the reduction of repeated and unnecessary code. Some examples are storing and reusing the result of expressions to be executed; extraction of invariant code that is internal to loops; or refactoring functions and methods, moving common regions to reusable zones between them. Generally the compiler is able to automatically eliminate such redundancies in the analysis of high-level code. It is true that there are occasions when the compiler will generate more instructions than necessary when transforming high-level code into assembly code, and this is where performance can be improved by inserting special assembly code regions. This practice is common for performance-critical regions, usually where the repetition (execution) rate is highest, such as inside nested loops or in constantly used functions and operators.
It is of vital importance to have objective performance metrics, such as counters of number of instructions executed (retired), although it is important to keep in mind that processors are very complex pieces, and in most occasions we may be before an estimated and not exact number (e.g. jump predictor and jump speculation), but sufficient for most optimization cases and most common purposes. There are many tools to study and analyze the behavior of our programs. Some of the ones we have used range from open source software such as Linux Perf, GProf or Valgrind, to proprietary and vendor-specific ones such as AMD uProf, AMD CodeXL or Intel VTune, among others.
Analyzing retired instructions, repetition of code regions and bottlenecks of a module focused on cryptographic computation. Using KCacheGrind to inspect a set of execution traces offered by Valgrind's Callgrind tool.
On the other hand, reducing the average number of CPIs is somewhat more complex. First of all, processors vary in their hardware designs, and not only manufacturers but each architecture and series comes to work in quite different ways. Moreover, the theoretical model and specification of each processor may indicate one CPI number, but the actual execution conditions may indicate a quite different one, usually lower. This is due to stall events, i.e. cases in which an instruction is temporarily not executed in the processor due to a series of restrictions. The processor remains in a kind of wait, and execution slows down, decreasing the maximum number of instructions launched in subsequent cycles. Stalls are mainly produced by four types of cases:
- Failures in jump prediction (branch predictor).
- Failures in the indirection of calls and jumps during execution (branch target buffer, BTB), typical in high levels of abstraction, pointers to functions and inheritance in OOP.
- Data dependencies, caused by RAW dependencies (read after write) and with great penalty in instructions with high latencies (e.g. floating point).
- Load/store, involving all kinds of cache misses, especially when external (off-chip) memory access is necessary.
Now that we have introduced the major goals and problems in code optimization, we can focus on strategies on the two major fronts mentioned here. It opens up an interesting field of possibilities to exploit, however, it will be better to detail it in future articles.