The Design and Implementation of FFTW3

Back

About

Paper on the implementation of FFTW3.

FFTW3 is a DFT algorithm that computes FFT in $O(NlogN)$

DFT

DFT is given as $Y[k] = \sum_{j=0}^{n-1}X[j]e_n^{2 \pi i / n}$

The complexity of DFT is $O(N^2)$

FFTW

I/O tensors

DFT is expressed in I/O tensors, which is in turn expressed in terms of IO dimensions.

I/O dimension d is a triple $d = (n, i, o)$, where n is non-negative integer length, i is the input stride and o is the output stride
I/O tensor is a set of I/O dimensions $t = \{d_1, d_2, ... dn\}$

$dft(N, V, I, O)$ consists of 2 I/O tensors and two pointers I, O.

$dft(\{\}, \{\}, I, O)$
$O(0) := I(0)$

$dft(N, \{(n, i, o)\} \cup V, I, O)$ is a loop of n problems $0 \le k \lt n$ in $dft(N, V, I + k . i, O + k .o)$

N is the size of the problem, V is the vector size.
Or $|V|$ nested loops of $|N|-dimensional FFT$

$copy\_i(t)$ is a I/O tensor $${(n, i, i)	(n, i, o) \in t}$$
$copy\_o(t)$ is a I/O tensor $${(n, o, o)	(n, i, o) \in t}$$

The two pointers I and O specify memory locations of input and output. if I = O then the problem is in place. A I/O tensor for which $i_{k}$ = $o_{k}$ for all k is said to be in-place

DFT as I/O tensors

A $n_1 \times n_2$ matrix is stored as row major $n_2$ contiguous arrays for each row, stored in $n_1$ consecutive blocks, it can be represented as $X = \{(n_1, n_2, n_2), (n_2, 1, 1)\}$

2D DFT for this matrix is finding the rank-2, vector-rank-0 $dft(X, \{\}, I, O)$ or $dft(\{(n_1, n_2, n_2), (n_2, 1, 1)\}, \{\}, I, O)$

$dft(\{\}, X, I, O)$ denotes copy from input to output, or $dft(\{\}, \{(n_1n_2, 1, 1)\}, I, O)$

Plans in FFTW

Solve a DFT represented as I/O tensors

Reduce a problem of arbitary vector rank to vector rank 0
Reduce multi dimensional DFT to rank-1 problems(1D FFT problems).
Solve rank 1, vector 0 problem by means of some algorithm

A plan involves finding the best permutation of these steps

no-op plans

No operations are done

rank-0 plans

$$ V = 1, I \ne O $$ input is copied to output
$$ V = 2, I = O$$ matrix transposition problem

rank-1 plans: 1D Fourier transforms

direct plans

When n is small enough direct plan is used

$dft(\{n,i,o\}, V, I, O) where |V| \le 1, n \in {2, 4, ... 64}$ FFTW has codelets to solve for these low n values. User to add hand-written machine-specific codelets if desired

=0$is straight line code.$

=1$$ is loop wrapped straight line code

Cooley-Tukey Plans

For the type $dft(\{n,i,o\},V,I,O), n=rm$ radix-r Cooley-Tukey plan.

Planner generates several values of r and chooses the best one.

Decimation in Time plan: radix $n_2$ plan
1. Solve $dft(\{(m,r.i,o)\},V \cup \{(r,i,m.o)\},I,O)$
2. Multiply output O by twiddle factors
3. Solve $dft(\{(r,m.o,m.o)\},V \cup \{(m,o,o)\},O,O)$ 2 and 3 are fused together in a codelet
Decimation in Frequency plan:
1. Solve $dft(\{(r,m.i,m.i)\},V \cup \{(m,i,i)\},I,I)$
2. Multiply input I by twiddle factors
3. Solve $dft(\{(m,i,r.o)\},V \cup \{(r,m.i,o)\},I,O)$ 1 and 2 are fused together in a codelet
DIF destroys input buffer

Higher ranks

Reduce to lower ranks and solve recursively

To solve $dft(N,V,I,O)$, where $$N=N_1 \cup N_2,

N_1

\ge 1

N_2

\ge 1$$

Solve $dft(N_1, V \cup N_2, I, O)$, then solve $dft(copy\_o(N_2), copy\_o(V \cup N_1), O, O)$

Higher vector ranks

To solve $dft(N,V,I,O)$, where $V=\{(n,i,o)\} \cup V_1$

Loop over $0 \le k < n$ and solve $dft(N,V_1,I+k.i,O+k.i)$

Indirect plans

Indirect plans transform DFT problems that require shuffling into a problem that requires no shuffling + rank-0 problem that requires shuffling

$dft(N,V,I,O) |N| \ge 0$ can be solved by solving $dft(\{\}, V, I, O)$ and then solving $dft(copy\_o(N), copy\_o(V), O, O)$

Other plans

Buffer plans solve out-of-place or non-contiguous memory problems.
Real/Imaginary plans - solving read and imaginary separate then combine them
Parallel - Distributed and Threaded

Avenues to explores

User codelets

\(copy\_i(t)\) is a I/O tensor $${(n, i, i)	(n, i, o) \in t}$$
\(copy\_o(t)\) is a I/O tensor $${(n, o, o)	(n, i, o) \in t}$$