Performance differences in SIMD operations across different CPU architectures

I see a significant performance difference between a SIMD-based sum reduction and its scalar counterpart across different CPU architectures.

The function in question is simple: it receives a 16-byte-aligned vector B of uint8_t elements and a range B[l,r], where l and r are multiples of 16, and returns the sum of the elements within B[l,r].

This is my code:

// SIMD version of the reduction.
// _mm_sad_epu8(zero, x) sums the absolute differences against zero, i.e. it
// adds each group of 8 bytes into the two 64-bit halves of the register.
#include <immintrin.h>   // on Apple M1 I include sse2neon.h instead (see below)
#include <cstddef>
#include <cstdint>

inline int simd_sum_red(size_t l, size_t r, const uint8_t* B) {
    __m128i zero = _mm_setzero_si128();
    __m128i sum0 = _mm_sad_epu8(zero, _mm_load_si128(reinterpret_cast<const __m128i*>(B + l)));
    l += 16;
    while (l <= r) {
        __m128i sum1 = _mm_sad_epu8(zero, _mm_load_si128(reinterpret_cast<const __m128i*>(B + l)));
        sum0 = _mm_add_epi32(sum0, sum1);
        l += 16;
    }
    // Bring the upper partial sum (32-bit lane 2) down to lane 0 and add.
    __m128i totalsum = _mm_add_epi32(sum0, _mm_shuffle_epi32(sum0, 2));
    return _mm_cvtsi128_si32(totalsum);
}

// Regular (scalar) reduction.
inline size_t reg_sum_red(size_t l, size_t r, const uint8_t* B) {
    size_t acc = 0;
    for (size_t i = l; i <= r; i++) {
        acc += B[i];
    }
    return acc;
}
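
For completeness, here is a small sanity check of the kind I use to confirm that both functions agree (a sketch only; the buffer contents and the range are arbitrary, and it assumes the two functions above and their headers are in scope). Note that simd_sum_red consumes the whole 16-byte block that starts at r, so the equivalent scalar range ends at r + 15.

// Illustrative sanity check (hypothetical buffer and range, not part of the
// benchmark). simd_sum_red sums the full 16-byte block starting at r, so the
// matching scalar range runs to r + 15.
#include <cstdio>

int main() {
    static alignas(16) uint8_t B[512];
    for (size_t i = 0; i < 512; ++i) B[i] = static_cast<uint8_t>(i * 7 + 3);

    size_t l = 32, r = 240;                     // both multiples of 16
    size_t scalar = reg_sum_red(l, r + 15, B);  // sums B[l .. r+15]
    int simd      = simd_sum_red(l, r, B);      // blocks at l, l+16, ..., r
    std::printf("scalar = %zu, simd = %d\n", scalar, simd);
    return scalar == static_cast<size_t>(simd) ? 0 : 1;
}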

It is worth mentioning that I built my SIMD function using the answers to another question I asked a couple of days ago:

Accessing the fields of a __m128i variable in a portable way

For the experiments, I took random ranges of B of at most 256 elements (16 SIMD registers) and measured the average number of nanoseconds each function spends per symbol of B[l,r]. I compared two CPU architectures: Apple M1 and Intel(R) Xeon(R) Silver 4110. I used the same source code in both cases and the same compiler (g++) with the flags -std=c++17 -msse4.2 -O3 -funroll-loops -fomit-frame-pointer -ffast-math. The only difference is that for the Apple M1 I had to include an extra header, sse2neon.h, which translates Intel intrinsics into NEON intrinsics (the SIMD extension for ARM-based architectures), and I omitted the -msse4.2 flag in that case.
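
I have not pasted the benchmark driver itself; the sketch below shows roughly what I measure (the buffer size, query count, and seed are arbitrary placeholders, and reg_sum_red is timed the same way by swapping the call):

// Simplified sketch of the timing harness (not the exact code that produced
// the numbers below): random 16-aligned ranges of at most 256 bytes, total
// elapsed time divided by the total number of symbols processed.
#include <chrono>
#include <cstdio>
#include <random>
#include <utility>
#include <vector>

int main() {
    constexpr size_t N = 1 << 20;                 // placeholder buffer size
    static alignas(16) uint8_t B[N];
    std::mt19937_64 rng(42);
    for (size_t i = 0; i < N; ++i) B[i] = static_cast<uint8_t>(rng());

    // Pre-generate random 16-aligned ranges of at most 16 blocks (256 bytes)
    // so the RNG cost stays out of the timed loop.
    constexpr size_t QUERIES = 1'000'000;
    std::uniform_int_distribution<size_t> first_blk(0, N / 16 - 16);
    std::uniform_int_distribution<size_t> n_blks(1, 16);
    std::vector<std::pair<size_t, size_t>> ranges(QUERIES);
    size_t total_syms = 0;
    for (auto& [l, r] : ranges) {
        l = first_blk(rng) * 16;
        r = l + (n_blks(rng) - 1) * 16;           // l and r are multiples of 16
        total_syms += r - l + 16;                 // 16 bytes per processed block
    }

    volatile size_t sink = 0;                     // keep the calls from being optimized away
    auto t0 = std::chrono::steady_clock::now();
    for (const auto& [l, r] : ranges)
        sink += simd_sum_red(l, r, B);            // swap in reg_sum_red(l, r, B) to compare
    auto t1 = std::chrono::steady_clock::now();

    double ns = std::chrono::duration<double, std::nano>(t1 - t0).count();
    std::printf("nanosecs/sym : %g (checksum %zu)\n", ns / total_syms, static_cast<size_t>(sink));
}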

These are the results I obtained with the Apple M1 processor:

nanosecs/sym for reg_sum_red : 1.16952
nanosecs/sym for simd_sum_red : 0.383278

As you can see, there is a significant difference between using SIMD instructions and not using them.

These are the results with the Intel(R) Xeon(R) Silver 4110 processor:

nanosecs/sym for reg_sum_red : 6.01793
nanosecs/sym for simd_sum_red : 5.94958

In this case, the difference is negligible.

I suppose the reason is the compilers I am using: GNU g++ on Intel versus Apple's g++. What compiler flags should I pass to GNU g++ (on Intel) to see a performance gap between the SIMD reduction and the regular reduction as large as the one I see on the Apple M1?

Update:

I realized that g++ on macOS is an alias for Clang (as also pointed out by @CodyGray), so my previous experiments actually used different compilers. I have now tried Clang on the Intel architecture, and indeed I obtain speedups similar to those on the Apple M1. However, the question remains: is there any modification I can make to either the source code or the compiler flags so that my GCC-compiled binary is as efficient as the Clang-compiled one?

Tags: x86, cpu-architecture, simd, apple-m1, intrinsics
