Hi, folks. I wanted to put together a more coherent version of a random Twitter conversation I had with Sasha Goldshtein.
First, you should go read Sasha’s excellent write-up on SIMD: http://blogs.microsoft.co.il/sasha/2014/04/22/c-vectorization-microsoft-bcl-simd/. He’s done an excellent job of talking about how scalar & vector operations compare, and really dug in deep to performance of the generated code.
Okay, now let me explain a few things. Sasha’s looking at the code quality coming out of the preview: we know this code quality is less than optimal. Here’s what I’m seeing from the same code with a compiler I built about 30 minutes ago (Siva, one of the JIT devs, just got CopyTo recognized as an intrinsic function):
; parameters: RCX = a, RDX = b (as before) sub rsp,28h xor eax,eax ;rax = i (init to zero) mov r8d,dword ptr [rcx+8] ;r8d = a.Length test r8d,r8d jle bail_out ;r8d <= 0? mov r9d,dword ptr [rdx+8] ;r9d = b.Length loop: lea r10d,[rax+3] ;r10d = i + 3 cmp r10d,r8d ;we still have to range check the read from the array of &a[i+3] jae range_check_fail mov r11,rcx movupd xmm0,xmmword ptr [r11+rax*4+10h] ;xmm0 = {a[i], a[i+1], a[i+2], a[i+3]} cmp r10d,r9d ;and we have to range check b[i+3], too... jae range_check_fail ;this one too :-( movupd xmm1,xmmword ptr [rdx+rax*4+10h] ;xmm1 = {b[i], b[i+1], b[i+2], b[i+3]} addps xmm0,xmm1 ;xmm0 += xmm1 movupd xmmword ptr [r11+rax*4+10h],xmm0 ;{a[i], a[i+1], a[i+2], a[i+3]} = xmm0 add eax,4 ;i += 1 * sizeof(float) cmp r8d,eax ;a.Length > i? jg loop bail_out: add rsp,28h ret range_check_fail: call clr!JIT_RngChkFail (00007ff9`a6d46590) int 3
So the most egregious code quality issue is gone (we’re going to try and get an update out quickly: I’m saying mid-May, but things could change :-/). That gives me numbers like this:
Scalar adding numbers: 765msec
SIMD adding numbers: 210msec
MUCH better. But Sasha makes another interesting point: why did we create two versions of the scalar loop, but left those pesky bounds checks in the SIMD loop? Well, for starters, the two functions aren’t identical. The SIMD version will actually throw a bounds check failure exception. To fix that, we actually need to change (some might say “fix” :-)) the code:
static void AddPointwiseSimd(float[] a, float[] b)
{
int simdLength = Vector<float>.Length;
int i = 0;
for (i = 0; i < a.Length - simdLength; i += simdLength)
{
Vector<float> va = new Vector<float>(a, i);
Vector<float> vb = new Vector<float>(b, i);
va += vb;
va.CopyTo(a, i);
}
for (; i < a.Length; ++i)
{
a[i] += b[i];
}
}
And now you might be expecting some great revelation. Nope. Sasha’s right: we’re missing some fairly straight-forward bound checks here. And we’re not dual-versioning the loop, like RyuJIT does for normal scalar loops. Both items are on our to-do list. But the original code Sasha wrote would probably never have those optimizations. Hopefully, we’ll get our bounds-check elimination work strengthened enough to make this code truly impressive. Until then, you’ll have to limp along with only a 3.6x performance boost.