# Fast Zero-Latency Convolution

Convolution is the process of filtering one sound using another — the impulse response. As its name transparently suggests, an “impulse response” is the output of a digital filter when fed with an impulse, i.e., a 1 followed by a string of 0’s. And because a digital audio signal can be represented as a sum of scaled and delayed impulses, we can compute the output of a digital filter to an arbitrary input by scaling, delaying, and summing its impulse response (fine print: this only applies to linear time-invariant filters).

The most common use of convolution is for applying reverb to dry recordings. This requires that we first estimate the impulse response of an acoustic space, which can be done in many ways, most naïvely by recording an impulsive sound, say a balloon popping. We can then use the recorded impulse response together with convolution to transfer the spatial impression of a space — for example, a cathedral or concert hall — onto a recording. A great resource for creative commons licensed impulse response recordings is the Open AIR database.

As demonstrated by sound designer Diego Stocco in his excellent feedforward sound design series, convolution processing holds great creative potential as well:

Convolution processing is a technique you can use to enrich and transform any kind of sound

— Diego Stocco

For example, Diego ingeniously uses the sounds of crumpled-up paper and rickety oven doors closing to dynamically shape the timbre of his sounds, leading to beautiful and often cinematic results.

Here I am noodling around with my Korg Volca FM running through the Space Designer plugin in Logic, loaded with the impulse response of some rustling leaves:

And here’s a fun fact: there is a convolution engine built into most modern web browsers, which can be used to render convolution effects client-side using the Web Audio API.

In this post I discuss fast, low-latency approaches that can be used to implement convolution.

## Direct-form Convolution

As I alluded to above, we can write any digital audio signal as a sum of scaled impulses: $$x[n] = \sum_{k=-\infty}^{\infty} x[k] \delta[n-k]$$ where the impulse, $\delta[n]$, is 0 everywhere except when $n=0$, and there it equals 1. If a linear time-invariant system, $T$, has impulse response $h[n] = T(\delta)[n]$, then by the principle of superposition the response to an arbitrary input $x$ will be: $$y[n] = (x * h)[n] \triangleq \sum_{k=-\infty}^{\infty} x[k] h[n-k]$$ The symbol $*$ is the convolution operator. To gain further insight into this equation let’s replace $x[k]$ with its inverse discrete-time Fourier transform (DTFT):

$$y[n] = \sum_{k=-\infty}^{\infty} \Big(\int_{-\pi}^{\pi} X(\omega) e^{j\omega k}d\omega \Big) h[n-k]$$

Rearranging terms we get:

$$y[n] = \int_{-\pi}^{\pi} X(\omega) \Big(\sum_{k=-\infty}^{\infty} h[n-k] e^{j\omega k}\Big) d\omega$$

Making the change of variables $k’ = n-k$ we arrive at:

$$y[n] = \int_{-\pi}^{\pi} X(\omega) \Big(\sum_{k’=-\infty}^{\infty} h[k’] e^{-j\omega k’}\Big) e^{j\omega n} d\omega$$

The term in brackets is nothing more than the DTFT of $h$. Therefore

$$y[n] = \int_{-\pi}^{\pi} X(\omega) H(\omega) e^{j\omega n} d\omega$$

This shows that convolution in the time-domain is equivalent to a multiplication in the frequency-domain, i.e., $Y(\omega) = X(\omega)H(\omega)$. We can think of $H(\omega)$ as eigenvalues of the system since the output at any given frequency is equal to the input at that frequency scaled by $H(\omega)$. This frequency-domain point-of-view underpins techniques for fast convolution.

### Implementation

Direct form application of the convolution equation is only applicable with finite impulse response (FIR) filters, since we can only compute finite sums in practice. For a causal, N-sample FIR filter $h[n] = 0$ if $n < 0$ or $n \geq N$. The convolution equation can then be re-written as:

$$y[n] = \sum_{k=n-N+1}^{n} x[k] h[n-k] = \sum_{k’=0}^{N-1} h[k’]x[n-k’]$$

The equation on the right hand side is perhaps more intuitive from an implementation point of view, and is embodied by the following block diagram:

The following Python code implements the direct form convolution outlined in the block diagram above:

import numpy as np

class DelayLine():

def __init__(self, N):
self.N = N
self.data = np.zeros(N)
self.ptr = 0

return self.data[(self.ptr + ind) % self.N]

def write(self, val):
self.ptr -= 1
if self.ptr<0:
self.ptr = self.N-1

self.data[self.ptr] = val

class DirectFormConvolution():

def __init__(self, hs):
self.hs = hs
self.delayLine = DelayLine(len(hs))

def tick(self, x):
self.delayLine.write(x)

yout = 0
for i,h in enumerate(self.hs):
return yout


The above code could have been written in a much more succinct manner, but I have approached it this way to mimic how I might write the code for real-time implementation (e.g., to prototype the algorithm before porting to C). If we set conv = DirectFormConvolution(hs) then we can process a single output sample by calling y = conv.tick(x). This implementation has zero-latency, but the average computational cost per output sample grows linearly with the length of the impulse response. We can do a lot better than this by using the frequency-domain point of view for convolution mentioned above.

### Overlap-Save Convolution

Using the fact the convolution in the time-domain corresponds to multiplication in the frequency-domain we can dramatically speed up convolution. Let’s assume that we have a length-$M$ impulse response, which has been zero-padded with $L$ samples such that the total length $N = M+L$ is a power of 2. We will perform the convolution on blocks of the input $x_r[n] = x[n - rN]$ for $n \in [0,1,\dots,N-1]$. If we multiply the DFTs of $x_r$ and $h$ and take the inverse transform we get: $$y_r[n] = \sum_{k=0}^{N-1}H[k]X_r[k]e^{j2\pi kn/N}$$ We can see that $y_r[n] = y_r[n+N]$ is an $N$-periodic sequence, which is not quite what we want. This is because the DFT treats all time-domain signals as though they are periodic, which leads to circular (not linear) convolution. The periodic extension of the inputs means that the first $M$-samples of $y_r$ are different from what we would get when performing linear convolution. The remaining $L$-samples, however, are correct. Therefore, in the Overlap-Save method we advance the input and output blocks by $L$-samples at a time, i.e., $x_r[n] = x[n-rL]$, and save $M$-samples from the last block to replace the $M$ incorrect samples at the start of the current block. This is illustrated in the following diagram:

And here is some Python code for the Overlap-Save method:

def nextpow2(x):
return int(2**np.ceil(np.log2(x)))

class OverlapSaveConvolution():

def __init__(self, hs, N):
self.hs = hs
self.M = len(hs)
self.N = nextpow2(N)

assert self.N > self.M
self.L = self.N - self.M
self.Hs = np.fft.fft(hs, self.N)

self.inDelayLine = DelayLine(self.N)
self.outDelayLine = DelayLine(self.L)
self.counter = self.L

def tick(self, x):

if self.counter==0:
# FFT
buf = np.array([self.inDelayLine.read(ind) for ind in range(self.N-1,-1,-1)])
fftBuf = np.fft.fft(buf)

# circular convolution
ifftBuf = np.real(np.fft.ifft(fftBuf * self.Hs))

# copy data into output buffer, discarding the first M samples
for val in ifftBuf[self.M:]:
self.outDelayLine.write(val)

# reset counter
self.counter = self.L

# ---- clock in input samples ----
self.inDelayLine.write(x)

# ---- clock out output samples ----
self.counter -= 1

return yout


Note that we must acquire $L$ samples before we can compute the next $N$-point FFT. While we wait to acquire these samples the code outputs the previous $L$ samples, which means there is an overall latency of $L$ samples. This method requires an average of around $(N/L)\log_2 N$ computations per output sample. For example, if $M=256$, $N=4096$ then we require roughly 12.8 computations per sample, which is about 20 times more efficient that linear convolution.

Unfortunately, as $N$ grows the latency of the system increases. This means that long impulse responses (e.g., for convolution reverb) cannot be used with acceptable latency in the Overlap-Save method. We must turn to partitioned convolution to get the job done.

### Partitioned Convolution

With a uniform partitioning we split the impulse response up into equal size chunks, and run the Overlap-Save method on each chunk in parallel. To get the output to line-up correctly we must delay the input into each chunk by the partition size. The following code illustrates this:

class UniformPartitionedConvolution():

def __init__(self, hs, M, N):
# M is the partition size
# N is the fft size
assert M < N

self.M = M
self.N = nextpow2(N)
self.nparts = int(np.ceil(len(hs)/float(M)))

self.Hs = []
for n in range(self.nparts):
h = hs[n * self.M : (n+1) * self.M]
H = np.fft.fft(h, self.N)
self.Hs.append(H)

self.L = self.N - self.M

self.inDelayLine = DelayLine(self.N * self.nparts)
self.outDelayLine = DelayLine(self.L)
self.counter = self.L

def tick(self, x):

if self.counter==0:

fftBuf = np.zeros(self.N, dtype='complex')

for n in range(self.nparts):
# FFT
buf = np.array([self.inDelayLine.read(ind + n*self.M) for ind in range(self.N-1,-1,-1)])

# circular convolution
fftBuf += np.fft.fft(buf) * self.Hs[n]

ifftBuf = np.real(np.fft.ifft(fftBuf))

# copy data into output buffer
for val in ifftBuf[self.M:]:
self.outDelayLine.write(val)

# reset counter
self.counter = self.L

# ---- clock in input samples ----
self.inDelayLine.write(x)

# ---- clock out output samples ----
self.counter -= 1

return yout


The uniform partitioned convolution has a input output latency equal to the FFT size minus the partition size. This will be a big improvement on the Overlap-Save method for long impulse responses. But what if we want zero-latency? The code below shows one way to achieve this:

class NonUniformPartitionedConvolution():

def __init__(self, hs, M, N):

# Take M to be the size of the first stage
self.M = nextpow2(M)
self.nparts = int(np.ceil(np.log2(len(hs))) - np.log2(self.M))
self.hs = np.hstack([hs, np.zeros(self.M * 2**(self.nparts) - len(hs))])

self.stages = [DirectFormConvolution(hs[:self.M])]
for n in range(self.nparts):

blockSize = self.M * 2**n
sup = range(blockSize, 2*blockSize)
self.stages.append(OverlapSaveConvolution(self.hs[sup], 2*blockSize))

def tick(self, x):

yout = 0
for s in self.stages:
yout += s.tick(x)
return yout


In essence we apply direct form convolution on the first partition (which should be chosen to have a very small size). Running direct form convolution on the first partition buys us time to collect samples for the 2nd partition, which is performed using the Overlap-Save method. Similarly, while the data from the first 2 partitions is being output, we have time to collect samples for the 3rd partition, and so on. This idea was presented in1, which is worth reading because it also discusses how to better distribute the CPU load across the partitions.

### Convolution Showdown

So what can we expect in terms of CPU usage for each of these techniques? The following graph illustrates the results of a convolution showdown, i.e., the CPU time versus impulse response length for a single run of each method.

As we can see, for all but very short impulse responses direct form convolution is very slow. Overlap-Save is the fastest algorithm, but suffers from the greatest latency too (its latency is proportional to the impulse response length). The zero-latency and uniform partitioned convolution offer a compromise between these two extremes: fast convolution with fixed (or no) latency.

— Corey

P.s. you can get in touch with me at: info _at_ reverberate.ca

1. Gardner, W.G., 1994, November. Efficient convolution without input/output delay. In Audio Engineering Society Convention 97. Audio Engineering Society. [return]