The relevance of parallel programming and SIMD vectorization (I)
Computer design is strongly driven by technological, scientific and industrial innovation, constantly pushing toward more efficient and versatile architectures. Two features stand out for their impact on virtually all programs, since they determine both potential performance and energy consumption.
The memory hierarchy is one of the most important features of computer engineering, since it addresses the bottleneck of von Neumann architectures: the transfer of data between processors and system memory. Parallelism is another feature that has become increasingly important in computer architectures, especially in the last decade. Modern microprocessors incorporate several levels of parallelism to improve performance. This occurs in practically all the processors we know, from CPUs and GPUs to vector devices such as the Intel Xeon Phi many-core or the NEC SX-Aurora TSUBASA Vector Engine accelerators.
One way to classify the different types of parallelism is Flynn's taxonomy, which identifies four classes of architectures according to the instruction and data streams of a computer: SISD, MISD, SIMD and MIMD.
- Single Instruction, Single Data (SISD): the traditional von Neumann model, where a single processing unit operates on a single data stream. This is the simplest case, typical of single-core machines in more conventional computers.
- Single Instruction, Multiple Data (SIMD): a single processing unit operates on multiple data elements simultaneously. This is the case we will explore in this article, through vector operations. Another example is GPUs with their thread-level parallelism (SIMT).
- Multiple Instruction, Multiple Data (MIMD): multiple processing units execute different instructions on multiple data sets. A particularity of GPUs is that they contain many independent cores (hundreds or thousands) that can execute different instructions on different data (usually in groups, wavefronts/warps), although each GPU core operates in SIMD fashion on its data.
- Multiple Instruction, Single Data (MISD): multiple processing units execute different instructions on a single data stream. Systolic arrays and pipelined architectures are sometimes cited as examples, although this is debatable because the data is transformed at each stage. This is the hardest class to find in the real world.
Considering the characteristics and strategies applied in modern processors, and their impact on the exploitation of the different levels of parallelism, three paths can be distinguished:
- Instruction Level Parallelism (ILP). Pipelined and superscalar execution facilitate ILP exploitation. Pipelining overlaps the execution of different stages of several instructions, such as fetch, decode, register fetch or memory access. Superscalar execution allows multiple execution units to run independent instructions simultaneously. Instructions are often reordered by the hardware, performing out-of-order execution. However, data dependencies must be respected to produce correct programs (typically read-after-write, RAW, hazards).
- Vector units. Current processors generally include SIMD units in each core, so they can execute vector instructions that operate on data sets simultaneously. For example, a 256-bit vector unit can operate in parallel on 8 pairs of single-precision floating-point numbers.
- Multiple cores. Modern microprocessors exploit MIMD parallelism by incorporating cores or streaming multiprocessors that allow threads to execute asynchronously and independently. On modern machines this is the mechanism with the greatest potential benefits in performance, power consumption and scalability. However, programmers need to account for it in the programming model, as it generally complicates the program logic due to non-determinism and the set of concurrent execution streams. Such programs are often complex, affecting all layers of the software design.
Assuming that we want to focus on improving a sequential program, without modifying its execution flow, we will perform an optimization based on the use of vector units. SIMD architectures exploit data parallelism by issuing the same instruction to all PUs or ALUs (Arithmetic Logic Units) in each clock cycle, so they only require a single control unit to dispatch instructions. This simplifies the design of the PUs, since they do not require additional logic for program control.
For example, suppose we want to adapt to a SIMD machine a code that performs an element-wise subtraction of two vectors. All iterations of the loop are independent, so it is very easy to apply SIMD:
```c
for (i = 0; i < N; ++i)
    o[i] = a[i] - b[i];
```
Considering a control unit operating with a set of M ALUs, the value o[i] can be computed by first loading the a[i] and b[i] values into the internal registers A and B, and then performing the subtraction A - B in parallel. This system can perform M operations simultaneously (data loading, execution and storing), which determines the increment of the loop iterator.
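That idea can be sketched as a strip-mined loop, where the iterator advances by M, the number of elements the machine handles per step (the value of M here is illustrative):

```c
#define M 8  /* illustrative: number of ALUs / vector lanes */

/* Strip-mined subtraction: each outer step corresponds to one parallel
 * SIMD step over M elements, so the iterator advances by M. */
static void sub_stripmined(const float *a, const float *b, float *o, int n) {
    int i = 0;
    for (; i + M <= n; i += M)
        for (int j = 0; j < M; ++j)  /* conceptually, the M lanes of one step */
            o[i + j] = a[i + j] - b[i + j];
    for (; i < n; ++i)               /* remainder handled sequentially */
        o[i] = a[i] - b[i];
}
```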
However, not all algorithms map easily to SIMD mode. Typical obstacles are conditional statements, dynamic iterators or variable-stride jumps, function calls (when not inlined) and, in general, control flow of considerable complexity.
In a case like the following, some ALUs must stay idle while others operate. Conditional statements penalize the efficiency of a system built on vector units, since the divergence must be evaluated in each case. That is, some elements a[i] may have the value 1 and others may not, making it necessary to execute both branches and keep wait states between them. In general theoretical terms, in this case we would achieve an efficiency of around 0.5. As a curiosity, this is a well-known problem in GPU architectures, where it is always necessary to try to get the wavefront/warp to execute all its threads in the same divergent branch in order to improve efficiency (see the scientific literature on branch divergence).
```c
for (i = 0; i < N; ++i)
    if (a[i] != 1)
        o[i] = a[i] + b[i];
    else
        o[i] = a[i] - b[i];
```
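A common way around this divergence, sketched here as an illustration rather than taken from the text, is predication: compute both branches for every element and select the result with the condition, which is essentially what masked SIMD instructions and GPU hardware do:

```c
/* Branchless (predicated) version: both results are computed for every
 * element and the condition selects one, so all lanes follow one path. */
static void add_or_sub(const float *a, const float *b, float *o, int n) {
    for (int i = 0; i < n; ++i) {
        float sum  = a[i] + b[i];            /* "then" branch, always computed */
        float diff = a[i] - b[i];            /* "else" branch, always computed */
        o[i] = (a[i] != 1.0f) ? sum : diff;  /* per-element selection */
    }
}
```

The cost is that every element pays for both branches, which is exactly the ~0.5 efficiency mentioned above; the gain is that the loop has no control divergence and vectorizes cleanly.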
So far we have seen the classification of computer architectures and a first taste of vectorization (SIMD), highlighting the impact of the existing levels of parallelism. In the following articles of this series we will see how to get the most out of these parallel architectures and vector units, with practical cases of parallel and vector programming on current computer architectures.
It is becoming increasingly common to find devices with more cores, a greater diversity of specialized computing units and a variety of coprocessors. The trend is clear: more heterogeneity and more potential for exploiting parallelism. It is therefore essential that we are aware of their capabilities and advantages, but also of their complexities.
A fundamental aspect is that the software, frameworks and algorithms we implement guarantee performance portability, taking advantage of the characteristics of each computer regardless of its number of cores, vector unit width or memory hierarchy configuration. For this reason, programming models and parallel and vector computing techniques are crucial to enable programmers to develop scalable, fast and energy-efficient programs.