Quick info about a great SIMD writeup

Hi, folks. I wanted to put together a more coherent version of a random Twitter conversation I had with Sasha Goldshtein.

First, you should go read Sasha’s excellent write-up on SIMD: http://blogs.microsoft.co.il/sasha/2014/04/22/c-vectorization-microsoft-bcl-simd/. He’s done an excellent job of talking about how scalar & vector operations compare, and really dug in deep to performance of the generated code.

Okay, now let me explain a few things. Sasha’s looking at the code quality coming out of the preview: we know this code quality is less than optimal. Here’s what I’m seeing from the same code with a compiler I built about 30 minutes ago (Siva, one of the JIT devs, just got CopyTo recognized as an intrinsic function):

; parameters: RCX = a, RDX = b (as before)
    sub    rsp,28h
    xor    eax,eax                          ;rax = i (init to zero)
    mov    r8d,dword ptr [rcx+8]            ;r8d = a.Length
    test   r8d,r8d
    jle    bail_out                         ;r8d <= 0?
    mov    r9d,dword ptr [rdx+8]            ;r9d = b.Length
loop:
    lea    r10d,[rax+3]                     ;r10d = i + 3
    cmp    r10d,r8d                         ;we still have to range check the read from the array of &a[i+3]
    jae    range_check_fail
    mov    r11,rcx
    movupd xmm0,xmmword ptr [r11+rax*4+10h] ;xmm0 = {a[i], a[i+1], a[i+2], a[i+3]}
    cmp    r10d,r9d                         ;and we have to range check b[i+3], too...
    jae    range_check_fail                 ;this one too :-(
    movupd xmm1,xmmword ptr [rdx+rax*4+10h] ;xmm1 = {b[i], b[i+1], b[i+2], b[i+3]}
    addps  xmm0,xmm1                        ;xmm0 += xmm1
    movupd xmmword ptr [r11+rax*4+10h],xmm0 ;{a[i], a[i+1], a[i+2], a[i+3]} = xmm0
    add    eax,4                            ;i += 1 * sizeof(float)
    cmp    r8d,eax                          ;a.Length > i?
    jg     loop
bail_out:
    add    rsp,28h
    ret
range_check_fail:
    call   clr!JIT_RngChkFail (00007ff9`a6d46590)
    int    3

So the most egregious code quality issue is gone (we’re going to try and get an update out quickly: I’m saying mid-May, but things could change :-/). That gives me numbers like this:

Scalar adding numbers: 765msec

SIMD adding numbers: 210msec

MUCH better. But Sasha makes another interesting point: why did we create two versions of the scalar loop, but left those pesky bounds checks in the SIMD loop? Well, for starters, the two functions aren’t identical. The SIMD version will actually throw a bounds check failure exception. To fix that, we actually need to change (some might say “fix” :-)) the code:

static void AddPointwiseSimd(float[] a, float[] b)
{
 int simdLength = Vector<float>.Length;
 int i = 0;
 for (i = 0; i < a.Length - simdLength; i += simdLength)
 {
 Vector<float> va = new Vector<float>(a, i);
 Vector<float> vb = new Vector<float>(b, i);
 va += vb;
 va.CopyTo(a, i);
 }
 for (; i < a.Length; ++i)
 {
 a[i] += b[i];
 }
}

And now you might be expecting some great revelation. Nope. Sasha’s right: we’re missing some fairly straight-forward bound checks here. And we’re not dual-versioning the loop, like RyuJIT does for normal scalar loops. Both items are on our to-do list. But the original code Sasha wrote would probably never have those optimizations. Hopefully, we’ll get our bounds-check elimination work strengthened enough to make this code truly impressive. Until then, you’ll have to limp along with only a 3.6x performance boost.

Quick info about a great SIMD writeup

Trending Articles

Practice Sheet of Right form of verbs for HSC Students

Download: FK ft Shenky – Nakuyewa ”Prod by: Shenky”

How to win at Markstrat (Markstrat Tips and Tricks) – Vodites

Ominde Commission Report and Recommendations – Ominde Report of 1964

Bureau of Internal Revenue: Regional Offices (Directory)

GO 53 on Enhancement of Ex-gratia upto 5 Lakhs Toddy Tappers in Telangana

Cakewalk CA-2A Leveling Amplifier v2.0.1.97 WiN, v2.0.1.96 OSX Incl Keygen

Mp3 Download: Mdu - Kunjenjenjena

How the kill the job , when DTP request running for long hours.

Microsoft Intune から展開しているアプリのアップデートについて

18-year-old girl was beaten for half an hour by two Northampton men in 'an...

Car crash in Dunton Bassett leaves driver in critical condition

Macky 2, Two Others In Road Accident

Application log 00000000000000089514: Could not convert queue DLVST90CLNT

Detroit mafia: D’Anna Brothers agree to plea deal

Delivery block field greyed out using VA02

Muloraki Au

【個人撮影】スマホのプライベート映像♪「中に出さないで///」カラオケ屋での生ハメ撮りが流出ｗ【リベンジポルノ】＠PornHub

BREAKING NEWS: Diamond Platnumz Is Reported Dead After Ghastly Car Accident

FIAT 500 B0111 B0112