Finite Impulse Response Filters Using Apple's Accelerate Framework - Part II

March 26, 2014

In Part I of this 3-part series we created a simple FIR filter. In this part we will look at some ways of improving the implementation using some functions from the Accelerate Framework.

The Accelerate Framework

The accelerate framework provides a C API with functions for linear algebra, matrix math, DSP, and image processing. In this case we will specifically be focusing on the vDSP functions. The functions provide vectorized implementations of common functions using specialized instruction sets (SSE, NEON, etc) or DSP hardware on your processor.

Improving The Filter

The basis for our filter is the convolution function. This is an extremely slow function, it runs in quadratic time (O(N²)). Looking at the nested loops doing multiplication and accumulation should give you some intuition for why this is the case. This is where we are going to focus our attention. Fortunately, the accelerate framework provides a convolution function, using vectorized multiply-accumulate instructions. The function we are going to use is vDSP_conv (or its double-precision equivalent vDSP_convD

The prototype for vDSP_conv is as follows:

void vDSP_conv (
   const float __vDSP_signal[],
   vDSP_Stride __vDSP_signalStride,
   const float __vDSP_filter[],
   vDSP_Stride __vDSP_strideFilter,
   float __vDSP_result[],
   vDSP_Stride __vDSP_strideResult,
   vDSP_Length __vDSP_lenResult,
   vDSP_Length __vDSP_lenFilter
);

This matches up fairly closely to our convolution function prototype, with the addition of the Stride arguments, which is used to set the direction and step size when looping through the input and filter arrays. There are a few things we need to look out for when using this function in our filter code. Per the vDSP_conv documentation, The input signal length must be at least (length of filter + length of result – 1)[1]. We need to set __vDSP_strideFilter to -1 [2], as well as pointing __vDSP_filter at the end of our filter array in order to perform convolution [3].

// Pointer to end of filter for use with vDSP_conv [3]
float    *h_end = h + (h_length - 1);
 
// Length of signal passed to vDSP_conv [1]
unsigned signal_length = (h_length + result_length);
    
// Create an array to store the signal passed to vDSP_conv, padded with zeros[1]
float padded[signal_length];
 
// fill padded buffer with zeros
unsigned i; 
for (i = 0; i < signal_length; ++i)
{
    padded[i] = 0.0;
}
 
// Copy input into padded buffer
for (i = 0; i < x_length; ++i)
{
    padded[i] = x[1];
}    
 
 
// use the Accelerate convolution function [2]
vDSP_conv(padded, 1, h_end, -1, dest, 1, result_length, h_length);

We can eliminate some more of the loops in this code using some more vDSP functions as well a BLAS function from the Accelerate Framework. The first loop which fills our padded buffer with zeros can be replaced with vDSP_vfill. Here is the code we can use to replace the first loop:

// vDSP_vfill takes a pointer to the value used to fill the destination array. Create a variable
// holding 0.0 which we can point to
float zero = 0.0;
 
// Fill the padded buffer with zeros
vDSP_vfill(&zero, padded, 1, signal_length);

The second loop can be replaced with a call to cblas_scopy, which copies a vector from one location to another:

// use cblas_scopy to copy the input vector into our padded vector.
cblas_scopy(x_length, x, 1, padded, 1);

Here is our updated FIR filter using our vectorized convolution code. I’ve also replaced a loop which adds two vectors with vDSP_vadd, which is a fairly straightforward replacement:

#include <Accelerate/Accelerate.h>
 
/* filter signal x with filter h and store the result in output. 
 * output must be at least as long as x
 */
 
void         
fir_filter( const float *x, 
            unsigned x_length, 
            const float *h, 
            unsigned h_length, 
            float *output)
{
    // Create buffer to store overflow across calls
    static float overflow[h_length - 1] = {0.0};
 
    // The length of the result from linear convolution is one less than the
    // sum of the lengths of the two inputs.
    unsigned result_length = x_length + h_length - 1;
    unsigned overlap_length = result_length - x_length;
 
    // Create a temporary buffer to store the entire convolution result
    float temp_buffer[result_length];
 
    // Pointer to end of filter for use with vDSP_conv
    float    *h_end = h + (h_length - 1);
 
    // Length of signal passed to vDSP_conv
    unsigned signal_length = (h_length + result_length);
    
    // Create an array to store the signal passed to vDSP_conv, padded with zeros
    float padded[signal_length];
 
    // fill padded buffer with zeros
    float zero = 0.0;
    vDSP_vfill(&zero, padded, 1, signal_length);
 
    // Copy input into padded buffer
    cblas_scopy(x_length, x, 1, padded, 1)
 
    // use the Accelerate convolution function
    vDSP_conv(padded, 1, h_end, -1, temp_buffer, 1, result_length, h_length);
       
    // Add the overlap from the previous run
    // use vDSP_vadd instead of loop
    vDSP_vadd(temp_buffer, overflow, buffer, overlap_length);
    
    // Copy overlap into overlap buffer
    // use BLAS copy instead of loop
    cblas_scopy(overlap_length, temp_buffer + x_length, 1, overflow, 1);
 
    // write the final result to the output. use BLAS copy instead of loop
    cblas_scopy(x_length, temp_buffer, 1, output, 1);
}

In Part III, I will walk through another improvement which leverages the fast fourier transform to take our convolution from \(O(N^2)\) to \(O(N*log(N))\).