Loop unrolling, also known as loop unwinding, is one of the most fundamental techniques in parallel programming, and it is also a powerful one. The idea is to replicate the loop body as a sequence of identical operations and adjust the data index for each copy accordingly.

On a serial processor, loop unrolling can speed up execution because it reduces loop control overhead, such as pointer arithmetic and branch instructions. In many cases, this form of software pipelining also helps pipelined processors hide the latency of memory reads and writes.

Sample snippet:

  for (int i = 0; i < N_DATA; i++) {
      a[i] = alpha * b[i] + c[i];
  }

After applying loop unrolling (by a factor of 4):

  for (int i = 0; i < N_DATA; i+=4) {
      a[i+0] = alpha * b[i+0] + c[i+0];
      a[i+1] = alpha * b[i+1] + c[i+1];
      a[i+2] = alpha * b[i+2] + c[i+2];
      a[i+3] = alpha * b[i+3] + c[i+3];
  }
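
Note that the unrolled loop above assumes N_DATA is a multiple of 4. If that is not guaranteed, a common pattern is to add a small cleanup loop for the leftover iterations; a minimal sketch:

  int i;
  for (i = 0; i + 3 < N_DATA; i += 4) {
      a[i+0] = alpha * b[i+0] + c[i+0];
      a[i+1] = alpha * b[i+1] + c[i+1];
      a[i+2] = alpha * b[i+2] + c[i+2];
      a[i+3] = alpha * b[i+3] + c[i+3];
  }
  /* cleanup loop for the remaining 0-3 elements */
  for (; i < N_DATA; i++) {
      a[i] = alpha * b[i] + c[i];
  }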

On a device where many processing elements can execute the same instruction at the same time (i.e. a parallel processor such as a GPU), each iteration of the loop is handed to a single processing element, and the remaining iterations are enqueued until all of them have been processed.

The most important thing to watch for when dividing the workload among threads is the infamous data dependency. If you are not careful when resolving dependencies, you may end up with a wrong result or a slow execution. In this article, each iteration of the sample snippet does not depend on any previous iteration, so I will keep this topic for later; a quick illustration of what a dependency looks like is shown below.
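
For contrast, here is a small sketch of my own (not part of the SAXPY example): a prefix-sum loop carries a dependency from one iteration to the next, so its iterations cannot simply be split across threads.

  /* Each iteration reads the result of the previous one (s[i-1]),
     so the iterations cannot be distributed naively across threads. */
  s[0] = b[0];
  for (int i = 1; i < N_DATA; i++) {
      s[i] = s[i-1] + b[i];
  }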

OpenMP implementation:

#pragma omp parallel for
  for (int i = 0; i < N_DATA; i++) {
      a[i] = alpha * b[i] + c[i];
  }
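
The pragma tells the compiler to distribute the loop iterations across a team of threads. Assuming GCC or Clang and a source file named saxpy.c (the file name is just an example), the program must be built with OpenMP support enabled:

  gcc -O2 -fopenmp saxpy.c -o saxpy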

CUDA implementation:

__global__
void saxpy(float *a, float *b, float *c, float alpha, int n) {
     int i = threadIdx.x + blockDim.x * blockIdx.x;
     /* guard against threads whose index falls past the end of the arrays,
        which happens whenever n is not a multiple of the block size */
     if (i < n) {
         a[i] = alpha * b[i] + c[i];
     }
}
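
The kernel alone is not enough: the host still has to allocate device memory, copy the input data, and launch the kernel with enough threads to cover N_DATA. A minimal host-side sketch (the variable names and the 256-thread block size are my own choices, not fixed by the kernel):

  int n = N_DATA;
  size_t bytes = n * sizeof(float);
  float *d_a, *d_b, *d_c;

  cudaMalloc(&d_a, bytes);
  cudaMalloc(&d_b, bytes);
  cudaMalloc(&d_c, bytes);
  cudaMemcpy(d_b, b, bytes, cudaMemcpyHostToDevice);
  cudaMemcpy(d_c, c, bytes, cudaMemcpyHostToDevice);

  /* one thread per element, rounded up to whole blocks of 256 threads */
  int blockSize = 256;
  int gridSize = (n + blockSize - 1) / blockSize;
  saxpy<<<gridSize, blockSize>>>(d_a, d_b, d_c, alpha, n);

  cudaMemcpy(a, d_a, bytes, cudaMemcpyDeviceToHost);
  cudaFree(d_a);
  cudaFree(d_b);
  cudaFree(d_c);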

OpenCL implementation:

__kernel
void saxpy(__global float *a, __global float *b, __global float *c, float alpha) {
     int i = get_global_id(0);
     a[i] = alpha * b[i] + c[i];
}
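
As with CUDA, the OpenCL kernel has to be launched from the host. Assuming a context, command queue, built kernel object, and device buffers (cl_mem objects named d_a, d_b, d_c below; these names are my own) have already been created, enqueueing the kernel looks roughly like this, with the global work size set to N_DATA so every element gets exactly one work-item:

  size_t global_size = N_DATA;

  clSetKernelArg(kernel, 0, sizeof(cl_mem), &d_a);
  clSetKernelArg(kernel, 1, sizeof(cl_mem), &d_b);
  clSetKernelArg(kernel, 2, sizeof(cl_mem), &d_c);
  clSetKernelArg(kernel, 3, sizeof(float), &alpha);

  /* one work-item per element; get_global_id(0) in the kernel
     returns the index of each work-item */
  clEnqueueNDRangeKernel(queue, kernel, 1, NULL,
                         &global_size, NULL, 0, NULL, NULL);

  /* read the result back into the host array a */
  clEnqueueReadBuffer(queue, d_a, CL_TRUE, 0,
                      N_DATA * sizeof(float), a, 0, NULL, NULL);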