# Fast convolutions (II)

I will analyze the algorithmic complexity of the convolution algorithms described in my previous posts.

To make things simpler, let’s assume that the dimensions of the image are `>=`

the dimensions of the convolution kernel, and that both are square, with dimensions `S·S`

and `s·s`

, respectively.

## Naive algorithm - O(n^4)

*“wxh operations for each of the W·H image pixels”.*

*i.e.,* `S·S·s·s`

operations. This is quadratic with a heavy constant for tiny kernels, but quickly becomes quartic for medium-to-large kernels.

The auxiliary buffer can be identical in size and bit-depth to the original image. So the memory usage factor is 2x.

## Separable convolution - O(n^3)

*“One 1D convolution for each row + One 1D convolution for each column”.*

*i.e.,* `2·S·S·s`

operations. This is quadratic with a bad constant for small kernels, but becomes cubic for large kernels.

Remarkably, the total running time depends on the dimensions of the image -and- the dimensions of the kernel.

Again, the auxiliary buffer can be identical in size and bit-depth to the original image. So the memory usage factor is 2x.

## Convolution theorem - O(n^2·log(n))

*“Two FFTs + One point-wise product + One IFFT”.*

Let’s call `S’`

to the closest power-of-2 such that `S’>=S`

. Then a proper implementation of the FFT/IFFT does (approx.) `2·S’·S’·log(S)`

operations, while the point-wise product does `S’·S’`

operations. This makes the algorithm `O(S’·S’·log(S’))`

with some heavy (quadratic) overhead due to the memory copying, padding, and the point-wise product.

Remarkably, the total running time is independent of the size of the kernel.

This algorithm is quite memory hungry, though, because two `S’·S’`

complex-number buffers are required. This means two floating-point numbers per entry (*i.e.,* the real/imaginary coefficients). The algorithm starts by copying the image/kernel to the real part of the corresponding complex-number buffer, leaving the imaginary coefficient and the surface excess filled with zeroes. Then the FFTs/product/IFFT happen in-place.

So the auxiliary memory required is `4·S’·S’`

floating-point numbers.

In the worst case where `S`

is a power-of-2-plus-1, `S’`

gets nearly twice as large as `S`

. If the original image is 8-bit and we are using single-precision floating-point math for the FFT/IFFT, this means a memory usage factor of 64x. In the case of an HDR (single-precision floating-point) grayscale image, we are speaking of a worst case scenario of 16x. In average, however, the memory usage factor is around 8x. If `S`

is a power-of-2, then the memory usage factor goes down to 4x.

*Heavy glare using the FFT-based method in an HDR image by Paul Debevec*

This image with heavy glare has been output with some ArionFX-related experimental tonemapping code I am working on these days.

## Conclusions:

Assuming that we are only interested in sheer performance:

- The FFT-based method is (by far) the fastest for large images/kernels. Interestingly, the algorithm is not affected by the size of the kernel, which can be as large as the (padded) image itself without a penalty.
- The FFT-based method becomes even faster if the same kernel is applied multiple times. The kernel FFT can be calculated just once, and then be re-used.
- Due to the heavy setup overhead in the FFT-based method, the separable method can be faster for small (separable) kernels where
`s`

is in the range of`log(S’)`

.

Last, but certainly not least, **there is a much faster and more light-weight algorithm for the special case of Box/Gaussian Blur**. I will talk about this in a separate blog entry.