Case Study Gridding
SoA and AoS
AoS
std::complex<float> arr[N];
Modern compilers have techniques for vectorizing AoS implementations
Pros:
- Easy for developers to use the structure
Cons:
- Harder vectorization
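To make the "easy to use" point concrete, here is a minimal sketch of an AoS-style accumulation loop; the names (accumulate_row, gridRow, convRow, cVis) are placeholders rather than the actual ASKAP/yandasoft code.
#include <complex>

// AoS: grid and convolution function are arrays of std::complex<float>,
// so the accumulation reads naturally, but real/imag stay interleaved in
// memory and the compiler has to untangle that to vectorize.
void accumulate_row(std::complex<float>* gridRow,
                    const std::complex<float>* convRow,
                    std::complex<float> cVis, int width) {
    for (int u = 0; u < width; ++u) {
        gridRow[u] += cVis * convRow[u];
    }
}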
SoA
template <typename T> struct SOAComplex {
  T *real;   // contiguous array of real parts
  T *imag;   // contiguous array of imaginary parts
};
An SoA implementation is generally faster than AoS because the vectorization is straightforward.
Pros:
- Easy vectorization
Cons:
- Harder for developers to use the structure
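The same accumulation on the SoA layout might look like the following sketch (placeholder names again); with separate real and imaginary arrays each statement is a plain lane-wise multiply-add, which is why auto-vectorization is easy, but the caller now has to juggle two pointers per array of complex numbers.
// SoA: real and imaginary parts live in separate contiguous arrays.
void accumulate_row_soa(float* gridRe, float* gridIm,
                        const float* convRe, const float* convIm,
                        float visRe, float visIm, int width) {
    for (int u = 0; u < width; ++u) {
        gridRe[u] += visRe * convRe[u] - visIm * convIm[u];
        gridIm[u] += visIm * convRe[u] + visRe * convIm[u];
    }
}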
Tiled SoA or AoSoA
This is a compromise between vectorization and user experience
template <typename T> struct complex2 {
  alignas(8) T imag[2];   // two imaginary parts
  alignas(8) T real[2];   // two real parts
};
complex2<float> arr[N];
Pros:
- Easy vectorization
Cons:
- Slower than SoA
- The CPU has to load the entire structure, which may cause cache-related issues
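As an illustration (placeholder names, not the production kernel), accumulating with complex2 tiles keeps the loop body short while giving the compiler two-wide chunks to work with:
// AoSoA: each complex2 tile carries two complex values.
void accumulate_row_tiled(complex2<float>* gridRow,
                          const complex2<float>* convRow,
                          float visRe, float visIm, int widthTiles) {
    for (int t = 0; t < widthTiles; ++t) {
        for (int k = 0; k < 2; ++k) {
            gridRow[t].real[k] += visRe * convRow[t].real[k] - visIm * convRow[t].imag[k];
            gridRow[t].imag[k] += visIm * convRow[t].real[k] + visRe * convRow[t].imag[k];
        }
    }
}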
SIMD Tiled SoA
In this approach we do the tiling for Tiled SoA ourselves, with intrinsics, instead of leaving it to the compiler.
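As a rough illustration of what a hand-tiled kernel looks like (an SSE sketch with placeholder names, not necessarily the exact code benchmarked below), the lane-wise accumulation can be written directly with intrinsics:
#include <xmmintrin.h>

// Manual SSE tiling of the SoA accumulation; width is assumed to be a multiple of 4.
void accumulate_row_sse(float* gridRe, float* gridIm,
                        const float* convRe, const float* convIm,
                        float visRe, float visIm, int width) {
    const __m128 vr = _mm_set1_ps(visRe);
    const __m128 vi = _mm_set1_ps(visIm);
    for (int u = 0; u < width; u += 4) {
        __m128 cr = _mm_loadu_ps(convRe + u);
        __m128 ci = _mm_loadu_ps(convIm + u);
        __m128 gr = _mm_loadu_ps(gridRe + u);
        __m128 gi = _mm_loadu_ps(gridIm + u);
        // grid += vis * conv (complex multiply, lane-wise)
        gr = _mm_add_ps(gr, _mm_sub_ps(_mm_mul_ps(vr, cr), _mm_mul_ps(vi, ci)));
        gi = _mm_add_ps(gi, _mm_add_ps(_mm_mul_ps(vi, cr), _mm_mul_ps(vr, ci)));
        _mm_storeu_ps(gridRe + u, gr);
        _mm_storeu_ps(gridIm + u, gi);
    }
}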
(base) [m.nimalan@gondor build]$ ./stsoa
Access with CASA
CPP Wallclock Std gridding 44999.9 ms
CPP Wallclock SIMD 2 Tiled gridding 3229.8 ms
CPP Wallclock SIMD 4 Tiled gridding 4435.18 ms
All assertions passed for SSE
All assertions passed for AVX
Pros:
- No change to developers code
- Vectorization would be more efficient
Cons:
- Intrinsics code has to be written separately for each CPU feature (SSE, AVX, FMA)
-fcx-limited-range
yandasoft uses -fcx-limited-range, which informs the compiler that multiplication, division, and absolute value of complex numbers may use the simplified mathematical formulas, e.g. (x+iy)×(u+iv) = (xu-yv)+i(yu+xv). This greatly optimises the gridding code.
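In code, the simplification the flag permits is just the textbook formula with no extra overflow or NaN handling; a minimal sketch (mul_limited is a hypothetical helper, not part of yandasoft):
#include <complex>

// (x+iy)*(u+iv) = (xu - yv) + i(yu + xv), with no special treatment of Inf/NaN operands.
static inline std::complex<float> mul_limited(std::complex<float> a, std::complex<float> b) {
    return { a.real() * b.real() - a.imag() * b.imag(),
             a.imag() * b.real() + a.real() * b.imag() };
}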
On gondor, enabling the flag makes the standard gridding about 24x faster, which means we have to scale up our STSoA implementation to match.
It also means that the pointer implementation and the AoS implementation have the same performance.
With -fcx-limited-range
(base) [m.nimalan@gondor build]$ ./stsoa
Access with CASA
CPP Wallclock Std gridding 1839.24 ms
CPP Wallclock SIMD 2 Tiled gridding 3188.1 ms
CPP Wallclock SIMD 4 Tiled gridding 3578.35 ms
All assertions passed for SSE
All assertions passed for AVX
OpenMP simd pragma
The omp simd pragma can collapse nested loops into a single vectorizable iteration space:
#pragma omp simd collapse(2)
for (int voff = 0; voff <= 2*support; voff++) {
    for (int suppu = -support; suppu <= support; suppu++) {
        const int suppv = voff - support;
        const int uoff = suppu + support;
        CFloat wt = convFunc(voff, uoff);
        grid(iv + suppv, iu + suppu) += cVis * wt;
    }
}
Impact of loop collapse
(base) [m.nimalan@gondor build]$ ./stsoa
Access with CASA
CPP Wallclock Std gridding 44949.5 ms
CPP Wallclock Std gridding variant 14765 ms
CPP Wallclock SIMD 2 Tiled gridding 3222.85 ms
CPP Wallclock SIMD 4 Tiled gridding 4427.25 ms
The loop-collapse variant is about 3x faster than the normal std::complex gridding.
With -fcx-limited-range
we are still slower than the plain std::complex gridding
(base) [m.nimalan@gondor build]$ ./stsoa
Access with CASA
CPP Wallclock Std gridding 1838.88 ms
CPP Wallclock Std gridding variant 17223.8 ms
CPP Wallclock SIMD 2 Tiled gridding 2735.88 ms
CPP Wallclock SIMD 4 Tiled gridding 3590.65 ms
SIMD Tiled SoA with OpenMP simd pragma
(base) [m.nimalan@gondor build]$ ./stsoa
Access with CASA
CPP Wallclock Std gridding 44905 ms
CPP Wallclock Std gridding omp variant 14517.2 ms
CPP Wallclock SIMD 2 Tiled gridding 3192.87 ms
CPP Wallclock SIMD 2 Tiled gridding omp variant 6820.49 ms
CPP Wallclock SIMD 4 Tiled gridding 4393.77 ms
Single header libraries
A bottleneck in our implementation is the call into a function defined in another library.
Both the C++ standard library and that library come in single-file header form, i.e. the implementation is in the header, so the compiler can optimize the header's code within the current compilation unit (CU).
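A minimal sketch of the idea (file and function names are hypothetical): because the definition is visible in the including translation unit, the compiler can inline it straight into the gridding loop.
// conv_lookup.h -- header-only: the full definition is available to every including CU.
#pragma once
#include <complex>

inline std::complex<float> convWeight(const std::complex<float>* cf,
                                      int stride, int v, int u) {
    return cf[v * stride + u];   // trivially inlinable at every call site
}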
Before inlining
(base) [m.nimalan@gondor build]$ ./stsoa
Access with CASA
CPP Wallclock SIMD 2 Tiled gridding 670.14 ms
After inlining
(base) [m.nimalan@gondor build]$ ./stsoa
Access with CASA
CPP Wallclock SIMD 2 Tiled gridding 367.054 ms
The time taken by our implementation improves by close to 2x from this:
(base) [m.nimalan@gondor build]$ ./stsoa
Access with CASA
CPP Wallclock Std gridding 44131.6 ms
CPP Wallclock ptr gridding 1672.64 ms
CPP Wallclock SIMD 2 Tiled gridding 1911.09 ms
CPP Wallclock gridding avx 2734.79 ms
Our SSE implementation is now close to the pointer implementation. The AVX implementation is still performing worse than the SSE implementation.
Why is AVX performing so poorly?
A perf report shows why the AVX implementation is performing so poorly:
+ 77.92% 77.92% stsoa stsoa [.] gridding_casa_simd_4
+ 21.26% 21.26% stsoa libgcc_s.so.1 [.] __mulsc3
__mulsc3 is the libgcc routine for complex float multiplication; it is added to make sure that the float multiplication does not overflow into NaN.
In our AVX implementation the last grid points are gridded with the same method as the default std::complex implementation. If we add the -fcx-limited-range flag to the source we can simplify this:
+ 99.26% 98.75% stsoa stsoa [.] gridding_casa_simd_4
Alternatively we can use #pragma STDC CX_LIMITED_RANGE if we don't want to enable -fcx-limited-range for the whole code.
(base) [m.nimalan@gondor build]$ ./stsoa
Access with CASA
CPP Wallclock Std gridding 1649.07 ms
CPP Wallclock ptr gridding 1648.19 ms
CPP Wallclock SIMD 2 Tiled gridding 1891.82 ms
CPP Wallclock gridding avx 1934.41 ms
Bottleneck of our intrinsic code
_mm256_storeu_ps is transformed into a vextractf128 for the upper 128 bits and a vmovups for the lower 128 bits, rather than a single vmovaps with a ymmword ptr.
extern __inline void __attribute__((__gnu_inline__, __always_inline__, __artificial__))
_mm256_storeu_ps (float *__P, __m256 __A)
{
  *(__m256_u *)__P = __A;
}
36.58 vextractf128 $0x1,%ymm0,0x10(%r11,%rcx,1)
19.06 vmovups %xmm0,(%r11,%rcx,1)
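One possible mitigation, assuming the grid rows can be allocated with 32-byte alignment (an assumption, not something the current code guarantees), is to use the aligned store, which leaves the compiler free to emit a single 256-bit vmovaps:
#include <immintrin.h>

// Sketch: with a 32-byte aligned destination there is no reason for the store to be split.
// Passing an unaligned pointer here is undefined behaviour, so the alignment has to be
// enforced when the grid buffer is allocated.
void store_row_aligned(float* dst /* must be 32-byte aligned */, __m256 v) {
    _mm256_store_ps(dst, v);
}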
Zen 3 micro architecture
AMD pushed many improvements from Zen 1 to Zen 2, and further improvements in the Zen 3 architecture. On Zen 3, our SIMD Tiled SoA implementation is faster than the ASKAP ptr gridding:
Std gridding 1774.829 ms
ASKAP ptr gridding 1760.225 ms
SIMD SSE Tiled gridding 1406.519 ms
SIMD AVX Tiled gridding 1079.274 ms
Case Study into x86
One question that arises: with -Ofast and all unsafe math optimizations enabled, will std::complex outperform TSoA and SoA?
No, -fcx-limited-range is the only flag needed.
Is our SIMD Tiled implementation limited by our intrinsics usage?
There are possibilities with vmovddup, _mm256_dp_ps, _mm256_movehdup_ps, and _mm256_insertf128_ps.
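For reference, the classic moveldup/movehdup/addsub idiom multiplies four interleaved complex floats per __m256 without __mulsc3 or dot-product instructions; whether it would beat the current tiled kernel is untested here, and it assumes an interleaved (re, im) layout rather than the tiled SoA one:
#include <immintrin.h>

// a and b each hold four complex floats as interleaved (re, im) pairs.
static inline __m256 cmul4(__m256 a, __m256 b) {
    __m256 br    = _mm256_moveldup_ps(b);         // duplicate real parts of b
    __m256 bi    = _mm256_movehdup_ps(b);         // duplicate imaginary parts of b
    __m256 aswap = _mm256_permute_ps(a, 0xB1);    // swap re/im within each pair
    // even lanes: a.re*b.re - a.im*b.im, odd lanes: a.im*b.re + a.re*b.im
    return _mm256_addsub_ps(_mm256_mul_ps(a, br), _mm256_mul_ps(aswap, bi));
}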