
CV-Sandbox

Just a random assortment of computer vision projects.

Basic Matrix Convolution Implementation Suite

With Benchmarking Tools


Matrix Convolution

Introduction

Convolution is an operation involving two matrices: the image and the kernel. Each element of the output (convolved) matrix is the sum of the nearby values in the image matrix, weighted by the kernel. Usually, the kernel matrix is significantly smaller than the image.

To demonstrate the process of convolution, I’ll convolve two small, 1-dimensional matrices in the following figures. The input matrix (green) is [3,1,4,1,5,9,2], and the filter (red) is [1,2,-1].

To get the first element of the output matrix, sum the corresponding element of the input array and its neighbors, each weighted by the kernel.

Element 1:

Note that the output matrix must be shifted to the right so that its corresponding element in the input matrix has neighbors in both directions.

The process is repeated for each element in the output matrix by sliding the filter over the input image.

Element 2:

Element 3:

Element 4:

Element 5:
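To make the arithmetic concrete, below is a minimal C sketch of this 1-D example. It assumes the kernel is applied without flipping (cross-correlation, the convention used by the neural-network benchmark later in this document) and computes only the five fully overlapping output elements; it is an illustration, not code from this project.

#include <stdio.h>

int main(void) {
    /* The example input and filter from the figures above */
    const float input[7]  = {3, 1, 4, 1, 5, 9, 2};
    const float kernel[3] = {1, 2, -1};
    float output[5]; /* 7 - 3 + 1 = 5 fully overlapping output elements */

    for (int i = 0; i < 5; i++) {
        output[i] = 0;
        /* Each output element is the kernel-weighted sum of the three
         * input elements the filter currently covers. */
        for (int k = 0; k < 3; k++)
            output[i] += input[i + k] * kernel[k];
        printf("Element %d: %g\n", i + 1, output[i]);
    }
    return 0;
}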

Applications

Two common uses for matrix convolution are in image processing and artificial intelligence. Both of these applications are described in the sections below:

Image Processing

Many common image filters use matrix convolution. They do so by converting the image into a 3D matrix (width, height, and channels; these dimensions are explained in greater detail in the next section).

A small filter (also known as a kernel) is slid across an image to produce an output image. Cleverly constructed filters can blur and sharpen the image, find corners and edges, and more.
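As an illustration (not code from this project), here is a small C sketch that applies a standard 3×3 sharpening kernel to a single-channel, row-major 8-bit image; border pixels are skipped for brevity.

void sharpen(const unsigned char *in, unsigned char *out, int width, int height) {
    /* A common 3x3 sharpening kernel; other kernels blur, detect edges, etc. */
    const float k[3][3] = { { 0, -1,  0},
                            {-1,  5, -1},
                            { 0, -1,  0} };
    for (int y = 1; y < height - 1; y++) {
        for (int x = 1; x < width - 1; x++) {
            float sum = 0;
            for (int dy = -1; dy <= 1; dy++)
                for (int dx = -1; dx <= 1; dx++)
                    sum += k[dy + 1][dx + 1] * in[(y + dy) * width + (x + dx)];
            /* Clamp to the valid 8-bit range before writing the result */
            if (sum < 0)   sum = 0;
            if (sum > 255) sum = 255;
            out[y * width + x] = (unsigned char)sum;
        }
    }
}

A color image is handled the same way, one channel at a time, which is where the third (channel) dimension of the 3D matrix comes from.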

Artificial Intelligence

Matrix convolution allows efficient implementations of neural network approximations. Each neuron is an element in a matrix. Convolving the matrix simulates the synapses between neurons. The filter represents the sensitivity of each simulated synapse. By adjusting the values in the filter, each synapse’s sensitivity changes, affecting the output of the neural network. Software like TensorFlow contains tools that update these filters, a process known as training.


Benchmarking

Algorithm

The convolution performed in this benchmark extends the example in the previous section in a few ways. These extensions were included to increase the complexity of parallelizing it for the final project and to support common features in convolutional neural networks.

The diagram below highlights the cells in each matrix that are used during the first operation.

Because this convolution algorithm is targeted towards convolutional neural networks, terminology related to neural networks will be used in the program and explanation. For example, “input matrix” becomes “input feature map,” “filter” or “kernel” can become “weights,” etc.


Pseudocode

convolve(I, W, B, O):
    Parameters:
        I: the input feature map, a four-dimensional array with dimensions [N][S+Q*U][R+P*U][C]
        W: the weights, a four-dimensional array with dimensions [S][R][C][M]
        B: the bias, a one-dimensional array with dimension [M]
    Returns:
        O: the output feature map, a four-dimensional array with dimensions [N][Q][P][M]

    Reset O so that each element is 0

    For each n in N:
        For each q in Q:
            For each p in P:
                For each m in M:
                    For each s in S:
                        For each r in R:
                            For each c in C:
                                Increase O[n][q][p][m] by I[n][q*U+s][p*U+r][c] * W[s][r][c][m]
                    Increase O[n][q][p][m] by B[m]  (Bias)
                    O[n][q][p][m] = f(O[n][q][p][m]) (Apply the activation function)
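For reference, here is a direct C translation of the pseudocode. The dimensions are compile-time constants, the arrays are plain static row-major arrays, and ReLU stands in for the activation function. The real benchmark stores these in the structures defined in conv4D_data_structures.h, so all names and sizes here are illustrative only.

#include <string.h>

/* Illustrative dimensions; the benchmark reads its real sizes from the layer data */
#define N 1   /* batch size       */
#define Q 8   /* output height    */
#define P 8   /* output width     */
#define M 4   /* output channels  */
#define S 3   /* filter height    */
#define R 3   /* filter width     */
#define C 4   /* input channels   */
#define U 1   /* stride           */

static float I[N][Q*U + S][P*U + R][C]; /* input feature map  */
static float W[S][R][C][M];             /* weights            */
static float B[M];                      /* bias               */
static float O[N][Q][P][M];             /* output feature map */

static float f(float x) { return x > 0 ? x : 0; } /* ReLU stand-in activation */

void convolve(void) {
    memset(O, 0, sizeof O); /* reset O so that each element is 0 */
    for (int n = 0; n < N; n++)
        for (int q = 0; q < Q; q++)
            for (int p = 0; p < P; p++)
                for (int m = 0; m < M; m++) {
                    for (int s = 0; s < S; s++)
                        for (int r = 0; r < R; r++)
                            for (int c = 0; c < C; c++)
                                O[n][q][p][m] += I[n][q*U + s][p*U + r][c] * W[s][r][c][m];
                    O[n][q][p][m] += B[m];            /* bias */
                    O[n][q][p][m] = f(O[n][q][p][m]); /* activation function */
                }
}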

Benchmarking Algorithms

The benchmarking framework contains nine variations of the convolution algorithm from the previous section, each described below:

void conv4d_convolve_serial_naive();

This algorithm convolves the input feature map by following the pseudocode verbatim, ignoring the machine’s hardware.

void conv4d_convolve_serial_discrete();

This algorithm improves performance by splitting the convolution and bias loops, giving the compiler more flexibility to perform its assembly-level optimizations.
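A rough sketch of that split, reusing the arrays, dimensions, and f() from the sketch above (this is my interpretation of the loop restructuring; the project's actual code may differ):

void convolve_discrete(void) {
    memset(O, 0, sizeof O);

    /* Pass 1: convolution only, leaving a simpler loop body to vectorize */
    for (int n = 0; n < N; n++)
        for (int q = 0; q < Q; q++)
            for (int p = 0; p < P; p++)
                for (int m = 0; m < M; m++)
                    for (int s = 0; s < S; s++)
                        for (int r = 0; r < R; r++)
                            for (int c = 0; c < C; c++)
                                O[n][q][p][m] += I[n][q*U + s][p*U + r][c] * W[s][r][c][m];

    /* Pass 2: bias and activation in their own loop nest */
    for (int n = 0; n < N; n++)
        for (int q = 0; q < Q; q++)
            for (int p = 0; p < P; p++)
                for (int m = 0; m < M; m++)
                    O[n][q][p][m] = f(O[n][q][p][m] + B[m]);
}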

void conv4d_convolve_serial_tiled(int block_size);

This algorithm attempts to further improve the performance by tiling the arrays, theoretically increasing cache hits.
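Below is one way such tiling could look, blocking the q and p loops so a small window of the output (and the input region it touches) stays in cache. It assumes the same arrays as above and sketches only the idea, not the project's exact loop order.

void convolve_tiled(int block_size) {
    memset(O, 0, sizeof O);
    for (int qq = 0; qq < Q; qq += block_size)         /* tile over output rows    */
        for (int pp = 0; pp < P; pp += block_size)     /* tile over output columns */
            for (int n = 0; n < N; n++)
                for (int q = qq; q < qq + block_size && q < Q; q++)
                    for (int p = pp; p < pp + block_size && p < P; p++)
                        for (int m = 0; m < M; m++)
                            for (int s = 0; s < S; s++)
                                for (int r = 0; r < R; r++)
                                    for (int c = 0; c < C; c++)
                                        O[n][q][p][m] += I[n][q*U + s][p*U + r][c] * W[s][r][c][m];

    /* Bias and activation are applied in a separate pass, as in the discrete version */
    for (int n = 0; n < N; n++)
        for (int q = 0; q < Q; q++)
            for (int p = 0; p < P; p++)
                for (int m = 0; m < M; m++)
                    O[n][q][p][m] = f(O[n][q][p][m] + B[m]);
}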

void conv4d_convolve_threads_discrete();

This algorithm is similar to conv4d_convolve_serial_discrete, except that the output feature map’s rows are distributed among threads using the Pthread library.
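A sketch of the row-distribution idea with Pthreads, reusing the arrays and f() from the serial sketches. The thread count and the way ranges are passed are illustrative; the project's thread code may be organized differently.

#include <pthread.h>

#define NUM_THREADS 8 /* the benchmark machine exposes 8 hardware threads */

typedef struct { int q_start, q_end; } row_range;

/* Each worker convolves a contiguous band of output rows (q values), so no
 * two threads ever write the same output element. */
static void *worker(void *arg) {
    row_range *range = (row_range *)arg;
    for (int n = 0; n < N; n++)
        for (int q = range->q_start; q < range->q_end; q++)
            for (int p = 0; p < P; p++)
                for (int m = 0; m < M; m++) {
                    float acc = 0;
                    for (int s = 0; s < S; s++)
                        for (int r = 0; r < R; r++)
                            for (int c = 0; c < C; c++)
                                acc += I[n][q*U + s][p*U + r][c] * W[s][r][c][m];
                    O[n][q][p][m] = f(acc + B[m]); /* bias + activation */
                }
    return NULL;
}

void convolve_threads(void) {
    pthread_t threads[NUM_THREADS];
    row_range ranges[NUM_THREADS];
    int rows_per_thread = (Q + NUM_THREADS - 1) / NUM_THREADS;
    for (int t = 0; t < NUM_THREADS; t++) {
        ranges[t].q_start = t * rows_per_thread;
        ranges[t].q_end   = (t + 1) * rows_per_thread > Q ? Q : (t + 1) * rows_per_thread;
        pthread_create(&threads[t], NULL, worker, &ranges[t]);
    }
    for (int t = 0; t < NUM_THREADS; t++)
        pthread_join(threads[t], NULL);
}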

void conv4d_convolve_threads_tiled(int block_size);

This algorithm is similar to conv4d_convolve_serial_tiled, except that the output feature map’s rows are distributed among threads using the Pthread library.

void conv4d_convolve_OpenMP_discrete();

This algorithm is similar to conv4d_convolve_serial_naive, except that it uses OpenMP to distribute the workload between threads. Each element in the output feature map is its own task and is run in parallel with other elements.

It was based on the naive version because the discrete version created a race condition when parallelized.
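A sketch of how that can look with OpenMP (compiled with OpenMP enabled, e.g. gcc -fopenmp), again reusing the arrays and f() from the serial sketches; the project's actual pragmas and scheduling clauses may differ. Collapsing the four output loops makes every output element an independent iteration, and each iteration writes only its own element, so no critical region is needed.

void convolve_openmp(void) {
    /* Every (n, q, p, m) output element is an independent parallel iteration */
    #pragma omp parallel for collapse(4)
    for (int n = 0; n < N; n++)
        for (int q = 0; q < Q; q++)
            for (int p = 0; p < P; p++)
                for (int m = 0; m < M; m++) {
                    float acc = 0;
                    for (int s = 0; s < S; s++)
                        for (int r = 0; r < R; r++)
                            for (int c = 0; c < C; c++)
                                acc += I[n][q*U + s][p*U + r][c] * W[s][r][c][m];
                    O[n][q][p][m] = f(acc + B[m]); /* bias + activation, as in the naive version */
                }
}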

void conv4d_convolve_OpenMP_tiled(int block_size);

This algorithm is similar to conv4d_convolve_serial_tiled, except that it uses OpenMP to distribute the workload between threads. Each element in the output feature map is its own task and is run in parallel with other elements.

However, instead of tiling over q and p, tiling is performed over c so that the first three for loops can be collapsed. This also helps remove the race condition, avoiding the need for a critical region.
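My reading of that description, as a sketch (reusing the arrays and f() from the serial sketches; the project's real loop order may differ): the n, q, and p loops are collapsed and parallelized, so each thread owns its output elements outright, while the channel loop is walked in blocks of block_size.

void convolve_openmp_tiled(int block_size) {
    memset(O, 0, sizeof O);

    /* Collapse the first three loops; each (n, q, p) position is owned by one
     * thread, so accumulating into O needs no critical region. */
    #pragma omp parallel for collapse(3)
    for (int n = 0; n < N; n++)
        for (int q = 0; q < Q; q++)
            for (int p = 0; p < P; p++)
                for (int cc = 0; cc < C; cc += block_size)      /* tile over channels */
                    for (int m = 0; m < M; m++)
                        for (int s = 0; s < S; s++)
                            for (int r = 0; r < R; r++)
                                for (int c = cc; c < cc + block_size && c < C; c++)
                                    O[n][q][p][m] += I[n][q*U + s][p*U + r][c] * W[s][r][c][m];

    /* Bias and activation are applied in a separate pass, as in the discrete sketch */
}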

void conv4d_convolve_CUDA_discrete(int block_size, int grid_size);

This algorithm is similar to conv4d_convolve_serial_discrete, except that it uses CUDA to distribute the workload between threads on the GPU. Each element in the output feature map is its own task and is run in parallel with other elements. Blocking happens naturally due to Nvidia GPUs’ architectures.
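A CUDA sketch of the one-element-per-thread idea, reusing the dimension macros from the CPU sketch and assuming the feature maps are stored as flat row-major device arrays; the project's actual kernel and launch configuration may differ. A grid-stride loop lets any block_size/grid_size combination cover every output element.

__global__ void conv4d_kernel(const float *I, const float *W, const float *B,
                              float *O, int total_outputs)
{
    /* Grid-stride loop: each thread handles every (gridDim.x * blockDim.x)-th
     * output element of the flattened [N][Q][P][M] output feature map. */
    for (int idx = blockIdx.x * blockDim.x + threadIdx.x;
         idx < total_outputs;
         idx += gridDim.x * blockDim.x) {
        int m = idx % M;
        int p = (idx / M) % P;
        int q = (idx / (M * P)) % Q;
        int n = idx / (M * P * Q);

        float acc = B[m];
        for (int s = 0; s < S; s++)
            for (int r = 0; r < R; r++)
                for (int c = 0; c < C; c++)
                    acc += I[((n*(Q*U + S) + (q*U + s))*(P*U + R) + (p*U + r))*C + c]
                         * W[((s*R + r)*C + c)*M + m];
        O[idx] = acc > 0 ? acc : 0; /* ReLU stand-in activation */
    }
}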

void conv4d_convolve_CUDA_discrete_rewrite_gpu_data(int block_size, int grid_size);

This algorithm is conv4d_convolve_CUDA_discrete, except that its memory policy is a bit different.

The CUDA functions use their own global, device-specific memory. To convolve a matrix in CPU memory with CUDA, this program copies that matrix and the filter over to the device memory. Then a GPU kernel can run on the GPU memory. Finally, after the kernel completes, the output matrix (in GPU memory) is copied back to the CPU.

This function copies the input feature map and layer from the CPU memory into the GPU before running conv4d_convolve_CUDA_discrete, costing time.
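For reference, the host-side flow described above might look like the following sketch (CUDA runtime API, error checking omitted; the buffer names, sizes, and kernel reuse the earlier sketches and are illustrative only):

void convolve_cuda(int block_size, int grid_size) {
    float *d_I, *d_W, *d_B, *d_O;
    size_t in_bytes  = sizeof(float) * N * (Q*U + S) * (P*U + R) * C;
    size_t w_bytes   = sizeof(float) * S * R * C * M;
    size_t b_bytes   = sizeof(float) * M;
    size_t out_bytes = sizeof(float) * N * Q * P * M;

    cudaMalloc((void **)&d_I, in_bytes);
    cudaMalloc((void **)&d_W, w_bytes);
    cudaMalloc((void **)&d_B, b_bytes);
    cudaMalloc((void **)&d_O, out_bytes);

    /* 1. Copy the input feature map and layer data into device memory */
    cudaMemcpy(d_I, I, in_bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_W, W, w_bytes,  cudaMemcpyHostToDevice);
    cudaMemcpy(d_B, B, b_bytes,  cudaMemcpyHostToDevice);

    /* 2. Run the kernel entirely on device memory */
    conv4d_kernel<<<grid_size, block_size>>>(d_I, d_W, d_B, d_O, N * Q * P * M);
    cudaDeviceSynchronize();

    /* 3. Copy the output feature map back to CPU memory */
    cudaMemcpy(O, d_O, out_bytes, cudaMemcpyDeviceToHost);

    cudaFree(d_I); cudaFree(d_W); cudaFree(d_B); cudaFree(d_O);
}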

Benchmarking Framework

In addition to all nine functions, the benchmarking framework also contains tools to profile each algorithm and verify its correctness.

File Structure

The project contains seven files:

All files are placed in the folder src/10_Convolution_Benchmark.

Data Structures

The convolutional layer and each feature map have their own data type and are stored in global memory. They’re defined in conv4D_data_structures.h.

Benchmarking Build Procedure

Process

Step 1: Download a copy of the full project from the GitHub repository: https://github.com/m516/CV-Sandbox

Step 2: In the root directory of the repository, run the following command:

cmake .

Step 3: Build the tool with make if you’re using Linux or Mac:

$ make 10_CONVOLUTION_BENCHMARK

Step 4: Run the program (contained in the bin subdirectory of the project folder)

$ bin/10_CONVOLUTION_BENCHMARK

Hardware Used for Benchmarking

I used a Lenovo Yoga 730 to perform benchmarking. It has the following hardware components:

The benchmark attempts to use as many resources as possible; the OpenMP and Pthread implementations both use all 8 hardware threads.

Software Used for Benchmarking

Results

TinyImageNet Neural Network Layer

The first benchmark is based on an intermediate feature map that was extracted from a TinyImageNet neural network running on Tensorflow. The extracted feature map and convolutional layer data were placed in binary files under the media/dnn directory in the project folder. To ensure that all algorithms work as expected, the output feature map was also extracted into a file, and the values in the calculated feature map are compared with the corresponding values in the file.

These are the dimensions of the input feature map:

Here is the time required to compute each algorithm with this data (the highest-performing algorithm is marked with an asterisk):

Algorithm          | Average Execution Time (s) | Error per Element in Output Feature Map | Speedup from serial_naive | Speedup from serial_discrete | GFLOPS
serial_naive       | 0.0858                     | 0                                       | 1.00                      | 0.09                         | 1.87
serial_discrete    | 0.0080                     | 0                                       | 10.77                     | 1.00                         | 20.16
serial_tiled       | 0.0550                     | 0                                       | 1.56                      | 0.14                         | 2.92
threads_discrete   | 0.0039                     | 0                                       | 22.19                     | 2.06                         | 41.52
threads_tiled      | 0.0035                     | 0                                       | 24.60                     | 2.28                         | 46.03
OpenMP_discrete *  | 0.0022                     | 0                                       | 38.18                     | 3.54                         | 71.43
OpenMP_tiled       | 0.0170                     | 0                                       | 5.06                      | 0.47                         | 9.46
cuda_discrete      | 0.0369                     | 0                                       | 2.32                      | 0.22                         | 4.35

According to this table, OpenMP and threading both significantly speed up the convolution program when used effectively, but OpenMP was more effective.

Tiled Performance

Most implementations of tiled convolution had worse performance than their discrete counterparts. When GCC optimized the for loops, it tried to distribute the instructions to minimize cache misses. It was able to optimize the chain of for loops very well, and it automatically vectorized operations with SIMD. My tiling interfered with many of those optimizations, reducing the overall performance of the algorithm.

In other words, GCC was much better at optimizing the loops for my machine than I was. Boy, I still have a lot to learn!

However, tiled convolution performed better than discrete convolution on Pthreads. I believe this is because the DRAM bandwidth imposes a bottleneck on this program, and tiling gave the program the opportunity to use data in the cache multiple times. That would explain the spike that occurs when block size is 6.

OpenMP Performance

Auto, static, and guided scheduling methods all performed about the same on this benchmark, and dynamic scheduling took significantly longer than the other three. That is expected with this algorithm because every task takes almost the same amount of time, since conditional branching is minimal.

I was unable to use Allinea or MAP because I don’t have the programs on my machine. I was able to profile the CUDA algorithms because the CUDA development kit came with its own tool: nvprof.

CUDA Performance

CUDA did not perform very well, but that’s likely because I don’t have much experience with the language or framework. However, the algorithm was parallelizable, since latency generally decreased as grid_size and block_size increased.

Profiling

Below is the output from profiling the discrete CUDA algorithm with a block size of 4 and a grid size of 4.

According to the tool, synchronizing all the data between the CPU and GPU memory took about 10 ms. Convolution took 120x longer on average than memory synchronization.

The average throughput was 2.17 GFLOPS, while the card’s peak theoretical performance is 1.8 TFLOPS (source: TechPowerUp). This suggests that my algorithm doesn’t use the graphics card’s hardware resources efficiently.

Conclusions

There is much room for improvement in all the algorithms that were benchmarked in this project.

Memory is a bottleneck for these algorithms because discrete convolution doesn’t allow stride-1 array accesses to all matrices. At least two advanced optimization techniques, not covered in this project, can dramatically improve memory access times by enforcing stride-1 array accesses and increasing data reuse.

Toeplitz Matrices

Matrix convolution can be transformed into matrix multiplication by converting one of the matrices into its Toeplitz matrix (source: Wikipedia). The conversion incurs additional latency, but for sufficiently large matrices the benefit of more stride-1 array accesses outweighs that cost.
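To illustrate the idea on the 1-D example from the introduction (an illustration only, not code from this project): the kernel [1, 2, -1] can be unrolled into a 5×7 matrix T, and the convolution becomes the matrix-vector product output = T · input, which walks both operands with stride 1 (no kernel flip, matching the earlier 1-D sketch).

#include <stdio.h>

int main(void) {
    const float input[7]  = {3, 1, 4, 1, 5, 9, 2};
    const float kernel[3] = {1, 2, -1};
    float T[5][7] = {{0}};
    float output[5];

    /* Build the Toeplitz-style matrix: row i holds the kernel shifted right by i */
    for (int i = 0; i < 5; i++)
        for (int k = 0; k < 3; k++)
            T[i][i + k] = kernel[k];

    /* The convolution is now an ordinary matrix-vector multiplication */
    for (int i = 0; i < 5; i++) {
        output[i] = 0;
        for (int j = 0; j < 7; j++)
            output[i] += T[i][j] * input[j];
        printf("Element %d: %g\n", i + 1, output[i]);
    }
    return 0;
}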

Fourier Transform

Convolutions of large matrices are usually implemented most quickly using the Fourier transform (source: ccrma.stanford.edu) because only element-wise multiplication is necessary in the Fourier domain. In other words, to convolve two matrices, it’s possible to take the Fourier transform of both, multiply them element-wise, then take the inverse Fourier transform of the product (http://pillowlab.princeton.edu).