JIT, NGen, and other Managed Code Generation Stuff


Floating Point: A Dark & Scary Corner of Code Generation

or

Why using == with float & double is always wrong

As the dev lead for the .NET JIT compiler, I tend to see bugs coming from customers before anyone else. At least once a month, I get a bug from a customer that tends to look something like this:

This code works fine when it runs (debug, retail, x86, x64), but when it runs on (retail, debug, x64, x86) it doesn’t behave the way my good solid understanding of mathematics tells me it should. Please fix this obvious bug in the compiler immediately!

This is followed by some code snippet that includes a float or double, along with either an equality comparison, or base-10 output, or both. I’ve written a whole lot of responses which have run the gamut from “please go read the Wikipedia entry about floating point” to a detailed explanation about the binary representation of IEEE754 floating point. Let me first apologize to anyone who's gotten the "please go read Wikipedia" response. That's generally only sent when I'm tired & crabby. That said, it's an excellent article, and if you find this information interesting, you should definitely go read it.

So, in a completely self-serving effort to have a prebaked response to these bugs, let's discuss what makes floating point so fundamentally confusing & difficult to use properly. There are basically 3 characteristics of FP, all somewhat related, that cause untold confusion. The first is the concept of significant digits. The second is the fact that floating point is not a base 10 representation, but a base 2 representation. And the third is that floating point is really a polynomial representation of a number where the exponent is also variable. And because this is my story, I'm going to begin in the middle.

Problem #2: Base 10 vs. Base 2.

Back when you were learning long division in grade school/primary school/evil villain preschool, you learned that only certain rational numbers (values that can be represented as the quotient of two integers) can be accurately represented as a decimal value. Then you learned to draw bars over the repeating digits, and for most people things continued on. But if you're like me, you dug in a little deeper and figured out that the only fractions that are accurately representable in base 10 are the ones where, when fully simplified, the divisor consists of only 5's and 2's. Any other factor sitting in there turns into repeating decimal values. This is because 5's and 2's are special in base 10: they're the prime factors of 10. If I had stuck with math in college, I could probably prove this mathematically, but I went for computer science, where if it holds for all integers from -2^31 to 2^31, it must be true ("Proof by Exhaustion" is what my scientific computation professor called it). So when you're representing values in base 2, you can only accurately represent a fraction if the divisor's factors are 2's and nothing else, for the exact same reason. So first, the only way a number can be accurately represented in floating point is if its representation involves nothing but sums of powers of two. That's a high-level overview of Problem #2. Let's move on to Problem #3.
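
Here's a quick way to see that for yourself (just an illustration I'm adding; the "G17" format asks a double for enough digits to expose exactly what it's storing):

using System;

// 1/4 simplifies to a divisor made of nothing but 2's, so it survives the trip
// into binary unchanged; 1/10 has a factor of 5 in the divisor, so it can't.
Console.WriteLine(0.25.ToString("G17"));   // prints 0.25
Console.WriteLine(0.1.ToString("G17"));    // prints 0.10000000000000001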

Problem #3: Variable exponents

The IEEE floating point representation contains 3 components: a sign bit, a mantissa, and an exponent. The mantissa is basically the set of constants sitting in each entry of a very simple polynomial. I'm not really a math guy (college math cured me of that problem), but let's for the sake of discussion use m0 through m22 for the digits in the mantissa, and e for the value of the exponent. Normal IEEE single-precision floating point has a sign bit plus 31 bits of 'numeric' value expression: 7 bits for the exponent (plus a sign) and 24 digits for the mantissa. Double precision is the same representation, but with more bits of accuracy in both the exponent and the mantissa, so we'll just stick with single-precision. The algebraic expression of a single-precision floating point value is represented like this:

1*2^e + m0*2^(e-1) + m1*2^(e-2) + … + m21*2^(e-21) + m22*2^(e-22)

If you consider the mostly standard way of doing multi-digit addition, this is a very similar representation, with one very important difference: you only have 24 columns to use! So if you're adding 2 different numbers that have the same exponent, everything works as you would expect. But if your numbers have different values in the 'exponent' field, the smaller one loses precision. As an example, let's design a similar base 10 system with 4 digits of accuracy. Adding the numbers 1999 and 5432000, both of which can be represented accurately in 4 digits (with an exponent), looks like this:

5432000
+  1999
-------
5433000

So I just lost the 999, because my representation doesn't allow me to represent the result any more accurately than that. You might hope it would at least round up to 5434000, but the extra digits needed to do that rounding simply don't exist in the representation (and I'm a little fuzzy on this particular area, but hopefully my point is clear).
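
The binary version of the same effect is easy to demonstrate (another small illustration of mine, with values chosen to line up with the example above):

using System;

// A float has 24 significant bits, so once the exponents of two operands are
// far enough apart, the low bits of the smaller one simply fall off the end.
float big = 16777216f;                            // 2^24: the last power of two below which every integer still fits
Console.WriteLine(big + 1f == big);               // True: the added 1 is lost entirely
Console.WriteLine((double)(5432000f + 1999f));    // 5433999: still fits in 24 bits
Console.WriteLine((double)(543200000f + 1999f));  // 543201984: most of the 1999 was rounded away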

Okay, now stick with me, because we're headed back to the first point.

Problem #1: Significant Digits

In high school chemistry, my teacher spent at least half the time drilling into our heads the idea of 'significant digits'. We were only allowed to report data to the right number of significant digits, which reflected the accuracy of our measuring equipment. If we had measuring equipment that would report accuracy to the gram and milliliter, we were expected to estimate the next digit. So you'd report that the goo had a mass of 21.3g and a volume of 32.1mL, but if you were reporting density, you didn't report that it was .6635514g/mL: you only had 3 significant digits of accuracy, so reporting anything beyond 3 digits is noise. The density was .664 g/mL. The final digit was actually expected to fluctuate, simply because you were estimating it based on the quality of your instruments, and your eyeballs. The same is true for IEEE floating point. The last digit of the value will fluctuate, because it doesn't round in the normal fashion, because there's nothing to round it with. So error begins to grow from that single bit, on up. The interesting thing about significant digits is that some operations drive accuracy down quickly. Multiplying large matrices can result in accuracy being completely eliminated quickly. Why do I know this? Because a couple engineers at Boeing told me many years ago, and I believed them because if you can make almost a million pounds fly, I figure you know your stuff.
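
Here's the floating point version of that chemistry lesson, as a little illustration I've added (the exact output can vary a bit by runtime, which is rather the point):

using System;

// A float carries roughly 7 significant decimal digits; everything past that is noise.
Console.WriteLine((1f / 3f).ToString("G9"));   // 0.333333343: only the first ~7 digits mean anything

// And the noise compounds: every addition rounds, and the error in that last
// digit grows as the operations pile up.
float sum = 0f;
for (int i = 0; i < 10000; i++)
    sum += 0.1f;
Console.WriteLine(sum);                        // noticeably off from 1000 (around 999.9 on typical hardware)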

Wait, so how does this affect me?

Putting these three things together lands you in a place where your numeric/algebraic intuition is wrong for the abstraction. The single & double precision floating point types are abstractions, and they provide operators that mostly work the way you might expect. But those operators, when used with integer types, have a much clearer, more direct link to algebra. Integers are easy & make sense. Equality is not only possible, but easily, provably, correct. When accuracy is an issue, well, truncation occurs in the obvious place: division. Overflow wraps around, just like you figured out in CS201. The values that can be accurately represented land in a nice, regular, smooth cadence. And if you've got overflow & truncation understood, you're golden. Floating point, however, doesn't have a smooth, logical representation. Overflow rarely occurs, but truncation and rounding errors occur everywhere. The values that can be accurately represented are kind of irregular, and depend on the most significant digit of the value being calculated. And then there are things like NaNs, where fundamental concepts like reflexivity fail (NaN isn't even equal to itself), and denormals, where accuracy becomes even weirder. Ugh!
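
To make the contrast concrete, here's a small illustration (mine, not from any bug report):

using System;

// Integers misbehave in a predictable way...
Console.WriteLine(unchecked(int.MaxValue + 1) == int.MinValue);   // True: overflow wraps, just like CS201 said

// ...floating point does not.
double nan = double.NaN;
Console.WriteLine(nan == nan);              // False: == isn't even reflexive for NaN
Console.WriteLine(double.IsNaN(nan));       // True: the only reliable way to ask the question
Console.WriteLine(double.MaxValue * 2.0);   // positive infinity, not a wrapped-around value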

Okay, but why shouldn’t I use ==?

Everything in our floating point representation is an approximation. So whenever you use an equality expression, the question actually being answered is "does the approximation of the first expression exactly equal the approximation of the second expression"? And how often is that actually the question you want answered? I believe the only people who ever write code where that's really what they mean are engineers validating either hardware or compilers.
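
Here's the canonical illustration (not from a customer bug, though it might as well be):

using System;

// Neither 0.1 nor 0.2 nor 0.3 is exactly representable, and the two
// approximations don't land on the same bits.
Console.WriteLine(0.1 + 0.2 == 0.3);               // False
Console.WriteLine((0.1 + 0.2).ToString("G17"));    // 0.30000000000000004
Console.WriteLine(0.3.ToString("G17"));            // 0.29999999999999999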

Doesn't liberal use of epsilon make the problem go away?

This is sort of true, but despite the fact that the .NET framework exposes a value called "Double.Epsilon" as well as "Single.Epsilon", their values are mostly useless. The "correct" way to write an equality comparison is to do something like this:

if (Math.Abs(a - b) < epsilon)
    Console.WriteLine("Good enough for government work!");
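
To see just how useless those built-in Epsilon values are as that epsilon, here's a quick check of mine:

using System;

double a = 0.1 + 0.2;
double b = 0.3;
// Double.Epsilon is the smallest positive denormal (~4.9e-324), which is far
// smaller than the rounding error of almost any real computation.
Console.WriteLine(Math.Abs(a - b) < double.Epsilon);   // False: the actual difference is about 5.5e-17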

but thinking back to our variable exponent problem, the value of epsilon needs to be a function of the least significant digit representable in the larger of the two values. Something more like this:

if (Math.Abs(a - b) < LeastSignificantDigit(Math.Max(a, b)))
    Console.WriteLine("Actually good enough, really!");

So why don't we just compile that in, instead of doing the silly bit-wise comparison that we do? Because there are a small number of people who understand this stuff far better than me (or my team), and they Know What They're Doing. And sometimes the algorithm where 'approximate equality' is needed cares about the difference between a and b even when they're orders of magnitude apart. So instead, we just pretend like algebra still works on this misleading abstraction over the top of some very complex polynomial arithmetic, and I continue to resolve bug reports that arise from developers stumbling across this representation as "By Design".
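
If you do need an approximate comparison in your own code, a common shape for it looks something like this. The method name and the default tolerance here are mine, purely for illustration; the right tolerance depends entirely on your problem:

using System;

static bool RoughlyEqual(double a, double b, double relativeTolerance = 1e-9)
{
    // Scale the allowed difference by the larger operand, so the comparison
    // tracks the "least significant digit" of the bigger value.
    double scale = Math.Max(Math.Abs(a), Math.Abs(b));
    return Math.Abs(a - b) <= relativeTolerance * scale;
}

// RoughlyEqual(1e9 + 1.0, 1e9)  -> true   (the difference is below the noise floor at that magnitude)
// RoughlyEqual(1e-9, 2e-9)      -> false  (at that magnitude, that's a 2x difference)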

In Conclusion

Well, this has been fun. I hope it's helped folks understand a little bit more about why Floating Point Math is Hard (do NOT tell my 12 year old daughter I said that!) I'd like to apologize to the well-informed floating-point people out there. I'm absolutely certain that I screwed up a number of details. Some were intentional (the implied 1 on the mantissa, though it snuck into the algebraic expression anyway, and the way the biased exponent is actually represented), others weren't (and if I knew what they were, I wouldn't have screwed them up, so I don't have any examples). Thanks for sticking with me, and good luck in your future floating point endeavors!

Kevin Frei

.NET Code Gen Dev Lead


RyuJIT FAQ is on the .NET Blog


Go forth and read it to see if it answers your questions: http://blogs.msdn.com/b/dotnet/archive/2013/11/18/ryujit-net-jit-compiler-ctp1-faq.aspx

One thing that's buried in the FAQ: we fixed bugs that people reported, so if you uninstall RyuJIT CTP1, re-download it and reinstall it, you'll get an ever so slightly less buggy version!

If the FAQ doesn't answer your questions, I'll answer other questions over here. Fewer people read this one, so I'll get myself into less trouble than my previous Answers to Questions :-)

-Kev

 

Lies, damn lies, and benchmarks...


Hi, Folks! We just released RyuJIT CTP2, complete with a magical graph indicating the performance of the new 64 bit JIT compiler as compared to JIT64. I figured I'd describe the benchmarks we're currently tracking in a little more detail, and maybe include some source code where it's okay to share. Before that, though, let's see that magical graph again (positive numbers indicate CTP2 does better than JIT64):

And, just for fun, let’s include one more (positive numbers indicate CTP2 does better than CTP1):

First things first: there is no rhyme or reason to the order of the benchmarks we're running. They're in a list in a file, so that's the order they run. The methodology used to measure performance is pretty straightforward: each benchmark is run once as a "warm-up" run (get input files in the cache, etc...), then run 25 times. The average of those subsequent 25 runs is what's reported. The standard deviation is also calculated, in an effort to make it easier to distinguish noise from actual performance differences. The benchmarks are run on a Core i7 4850HQ, 4 GB RAM, an SSD, on-board video, running Windows 8.1. Nothing too fancy, just a relatively up-to-date piece of hardware that does a reasonable job of spanning laptop/mobile performance and workstation/server performance. Every benchmark is an IL-only, non-32-bit-preferred binary, so they'll all run with either a 32 or 64 bit runtime.
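
For the curious, the shape of that measurement loop is roughly this (a sketch I've written from the description above, not our actual harness):

using System;
using System.Diagnostics;

static class BenchRunner
{
    public static void Measure(Action benchmark)
    {
        benchmark();                      // warm-up run: prime file caches, JIT the code, etc.

        const int runs = 25;
        var times = new double[runs];
        for (int i = 0; i < runs; i++)
        {
            var sw = Stopwatch.StartNew();
            benchmark();
            sw.Stop();
            times[i] = sw.Elapsed.TotalMilliseconds;
        }

        double mean = 0.0;
        foreach (var t in times) mean += t;
        mean /= runs;

        double variance = 0.0;
        foreach (var t in times) variance += (t - mean) * (t - mean);

        // Report the average of the 25 runs, plus the standard deviation so
        // noise is easier to tell apart from real differences.
        Console.WriteLine("mean: {0:F1} ms  stddev: {1:F1} ms", mean, Math.Sqrt(variance / runs));
    }
}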

Now that you have a crystal clear understanding of how we're running the benchmarks, let's talk about them in left-to-right order. I'll warn you before I start: some benchmarks are better than others, and I have spent more time looking at some benchmarks than others. That will become incredibly obvious as the discussion continues.

Roslyn:
this one is hopefully pretty self-explanatory. We perform a “self-build” of the Roslyn C# compiler: the Roslyn C# compiler reads in the C# source code of Roslyn, and generates the Roslyn compiler binaries. It’s using bits later than what are publicly available, and the source code isn’t publicly available, so this one is pretty hard for other folks to reproduce :-(. The timer is an 'external' timer: the self-build is launched by a process that has a timer, so the time reported includes process launch, JIT time, and code execution time. This one is probably the single largest benchmark we have. The improvement in JIT compile time (Throughput) accounts for more than half of the improved performance. Outside of improved throughput, there are places where we generate better code for Roslyn (code quality: CQ for short) and places where we’re worse. We’re continuing to look into performance of this code. Since the Roslyn team works closely with the .NET Runtime & Framework team, we have lots of experts in that code base nearby. One final note on Roslyn: it may appear at first glance that we did nothing here between CTP1 and CTP2, but the CTP1 build didn’t work properly. So we fixed a few bugs there to get where we are now.

System.XML:
this one should probably not really be included for RyuJIT CTP runs. It's running the XML parser against an input XML file. The reason the data isn't particularly interesting is because the XML parser is in System.XML.dll, which is NGen'ed, which means that it's actually just running code that JIT64 produced, where RyuJIT is only compiling the function that's calling the parser. Internally, we can use RyuJIT for NGen as well, so that's why it's there, but it's not showing anything observable for CTP releases of RyuJIT.

SharpChess:
This is an open source C# chess playing application written by Peter Hughes. You can download the latest version from http://www.sharpchess.com. We’re using version 2.5.2. It includes a very convenient mode for benchmarking which reads in a position, then prints out how many milliseconds it takes to calculate the next move. RyuJIT does a respectable job, keeping pace with JIT64 just fine, here.

V8-*:
these benchmarks are transliterations of JavaScript benchmarks that are part of the V8 JavaScript performance suite. A former JIT developer did the transliteration many years ago, and that's about the extent of my understanding of these benchmarks, except that I assume v8-crypto does something with cryptography. Of the 3, Richards and Crypto have fairly innocuous licenses, so I've put them on my CodePlex site for all to enjoy. DeltaBlue can't be hosted on CodePlex due to licensing restrictions.

Bio-Mums:
This is a benchmark picked up from Microsoft Research’s “Biological Foundation” system several years ago. Since we grabbed it, they’ve open-sourced the work. Beyond that, I know it has something to do with biology, but just saying that word makes my brain shudder, so you’ll have to poke around on the web for more details.

Fractals:
Matt Grice, one of the RyuJIT developers, wrote this about a year ago. It calculates the Mandelbrot set and a specific Julia set using a complex number struct type. With Matt's permission, I've put the source code on my CodePlex site, so you can download it and marvel at its beauty, its genius. This is a reasonable micro-benchmark for floating point basics, as well as abstraction cost. RyuJIT is pretty competitive with JIT64 on floating point basics, but RyuJIT demolishes JIT64 when it comes to abstraction costs, primarily due to Struct Promotion. The graph isn't accurate for this benchmark, because RyuJIT is just over 100% (2X) faster than JIT64 on this one. But if that value were actually visible, nothing else would look significant, so I just cropped the Y axis.

BZip2:
This is measuring the time spent zip’ing the IL disassembly of mscorlib.dll. RyuJIT is a few percent slower than JIT64. This benchmark predates the .NET built-in support for ZIP, so it uses the SharpZip library.

Mono-*:
We grabbed these benchmarks after CTP1 shipped. A blog commenter pointed us at them, and they were pretty easy to integrate into our test system. They come from "The Computer Language Benchmarks Game". When we started measuring them, we were in pretty bad shape, but most have gone to neutral. We do still have a handful of losses, but we're still working on them. For details about each benchmark, just hit up the original website. They're all pretty small applications, generally stressing particular optimization paths, or particular functionality. Back when JIT64 started, it was originally tuned to optimize C++/CLI code, and focused on the Spec2000 set of tests, compiled as managed code. That tuning generally pays off for this set of benchmarks as well. The only benchmark out of the batch that you won't find on the debian website is pi-digits. The implementation from the original site used p/invoke to call the GMP high-precision numeric package. I implemented it using .NET's BigInteger class instead. Again, you can find the code on my CodePlex site.

SciMark:
This one is pretty self-explanatory. SciMark’s been around a long time, and there’s a pretty reasonable C# port done back in Rotor days that now lives here. It’s a set of scientific computations, which means heavy floating point, and heavy array manipulation. Again, we’re doing much better than we were with CTP1, but RyuJIT still lags behind JIT64 by a bit. We’ve got some improvements that didn’t quite make it in for CTP2, but will be included in future updates.

FSharpFFT:
Last, but not least. When we released CTP1, there was some consternation about how F# would perform, because JIT64 did a better job of optimizing tail-calls than JIT32 did. Jon Harrup gave me an F# implementation of a Fast Fourier Transform, which I dumped into our benchmark suite with virtually no understanding of the code at all. We’ve made some headway, and are now beating JIT64 by a reasonable margin. One item worth mentioning for you F# fans out there: We’ve got the F# self-build working with RyuJIT, which was pretty excellent. Maybe we’ll start running that one as a benchmark, too!

There you have it: a quick run-through of the benchmarks we run. Having read through it, anyone who paid attention should be able to tell that I don't write code, I just run the team :-). If folks would like more details about anything, post questions below, and I'll see what I can do to get the engineers that actually write the code to answer them.

Happy benchmarking!

-Kev

RyuJIT CTP3: How to use SIMD


SIMD details will be forthcoming from the .NET blog, but Soma's already shown some sample code, and everything's available, so until those details arrive, here are directions for how to kick the SIMD tires:

  1. Go get RyuJIT CTP3 and install it (requires x64 Windows 8.1 or Windows Server 2012 R2, same as before)
  2. Set the "use RyuJIT" environment variable: set COMPLUS_AltJit=*
     (Update: make sure you un-set this when you are done testing RyuJIT! See here for more details.)
  3. Now pay attention, because things diverge here a bit:
  4. Set a new (and temporary) “enable SIMD stuff” environment variable: set COMPLUS_FeatureSIMD=1
  5. Add the Microsoft.Bcl.Simd NuGet package to your project (you must select “include Prerelease” or use the –Pre option)
  6. Tricky thing necessary until RyuJIT is final: Add a reference to Microsoft.Numerics.Vectors.Vector<T> to a class constructor that will be invoked BEFORE your methods that use the new Vector types. I’d suggest just putting it in your program’s entry class’s constructor. It must occur in the class constructor, not the instance constructor.
  7. Make sure your application is actually running on x64. If you don't see protojit.dll loaded in your process (tasklist /M protojit.dll) then you've missed something here.

You can take a gander at some sample code sitting over here. It's pretty well commented; at least, the VectorFloat.cs file in the Mandelbrot demo is. I spent almost as much time making sure comments were good as writing those 24 slightly different implementations of the Mandelbrot calculation. Take a look at the Microsoft.Numerics.Vectors namespace: there's some stuff in Vector<T>, some stuff in VectorMath, and some stuff in plain Vector. And there are also "concrete types" for 2, 3, and 4 element floats.
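
If you just want something small to paste in and step through, here's a sketch of my own in the same style. I'm assuming the preview package's API matches the sample code above and below (Vector<float>.Length, the array constructor, CopyTo), plus a broadcasting Vector<float>(scalar) constructor and operator *:

using System;
using Microsoft.Numerics.Vectors;   // from the Microsoft.Bcl.Simd NuGet package

static class SimdSample
{
    // Scale an array in place: a SIMD loop over whole vectors, then a scalar
    // loop for whatever's left over at the end.
    public static void Scale(float[] data, float factor)
    {
        var vfactor = new Vector<float>(factor);          // broadcast the scalar into every lane
        int i = 0;
        for (; i <= data.Length - Vector<float>.Length; i += Vector<float>.Length)
        {
            var v = new Vector<float>(data, i) * vfactor; // one multiply covers 4 floats on SSE2
            v.CopyTo(data, i);
        }
        for (; i < data.Length; i++)                      // scalar remainder
            data[i] *= factor;
    }
}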

One quick detail, for those of you that are trying this stuff out immediately: our plan (and we've already prototyped this) is to have Vector<T> automatically use AVX SIMD types on hardware where the performance of AVX code should be better than SSE2. Doing that properly, however, requires some changes to the .NET runtime. We're only able to support SSE2 for the CTP release because of that restriction (and time wasn't really on our side, either).

I’ll try to answer questions down below. I know this doesn’t have much detail. We’ve been really busy trying to get CTP3 out the door (it was published about 45 seconds before Jay’s slide went up). Many more details are coming. One final piece of info: SIMD isn’t the only thing we’ve been doing over the past few weeks. We also improved our benchmarks a little bit:

The gray bars are standard deviation, so my (really horrible) understanding of statistics means that deltas within the gray bars are pretty much meaningless. I'm not sure if we really did regress mono-binarytrees or not, but I'm guessing probably not. I do know that we improved a couple things that should help mono-fasta and mono-knucleotide, but the mono-chameneos-redux delta was unexpected.

Anyway, I hope this helps a little. I'm headed home for the evening, as my adrenaline buzz from trying to get everything done in time for Jay's talk is fading fast, and I want to watch Habib from the comfort of my couch, rather than be in traffic. Okay, I'm exaggerating: my commute is generally about 10 minutes, but I'm tired, and whiny, and there are wolves after me.

-Kev

RyuJIT CTP3 minor fix


Hi, folks. I'm on vacation, but I figured I'd let everyone know what the RyuJIT team has been up to since the beginning of the week. There was a pretty nasty bug that slipped through into CTP3 that I'll let someone else explain in more detail because I'm tired (something about tail-calling delegates). If you have troubles, you should uninstall CTP3, then go hit http://aka.ms/RyuJIT to download CTP3b, which includes the fix for the PowerShell Bug that folks reported.

And quickly, no, I'm not normally a workaholic: I didn't check e-mail once from the time I left work last Friday until I needed a break from my vacation Wednesday, after using a one-man auger to dig 16 holes in my back yard. So having the engineers that actually make the stuff (and fix the bugs) call me and ask me to sign off & whatnot gave my aging back a nice break from hefting around 55 lb. pier blocks and 80 lb. sacks of gravel.

-Kev

Quick info about a great SIMD writeup


Hi, folks. I wanted to put together a more coherent version of a random Twitter conversation I had with Sasha Goldshtein.

First, you should go read Sasha's excellent write-up on SIMD: http://blogs.microsoft.co.il/sasha/2014/04/22/c-vectorization-microsoft-bcl-simd/. He's done an excellent job of talking about how scalar & vector operations compare, and really dug in deep to performance of the generated code.

Okay, now let me explain a few things. Sasha's looking at the code quality coming out of the preview: we know this code quality is less than optimal. Here's what I'm seeing from the same code with a compiler I built about 30 minutes ago (Siva, one of the JIT devs, just got CopyTo recognized as an intrinsic function):

; parameters: RCX = a, RDX = b (as before)
    sub    rsp,28h
    xor    eax,eax                          ;rax = i (init to zero)
    mov    r8d,dword ptr [rcx+8]            ;r8d = a.Length
    test   r8d,r8d
    jle    bail_out                         ;r8d <= 0?
    mov    r9d,dword ptr [rdx+8]            ;r9d = b.Length
loop:
    lea    r10d,[rax+3]                     ;r10d = i + 3
    cmp    r10d,r8d                         ;we still have to range check the read from the array of &a[i+3]
    jae    range_check_fail
    mov    r11,rcx
    movupd xmm0,xmmword ptr [r11+rax*4+10h] ;xmm0 = {a[i], a[i+1], a[i+2], a[i+3]}
    cmp    r10d,r9d                         ;and we have to range check b[i+3], too...
    jae    range_check_fail                 ;this one too :-(
    movupd xmm1,xmmword ptr [rdx+rax*4+10h] ;xmm1 = {b[i], b[i+1], b[i+2], b[i+3]}
    addps  xmm0,xmm1                        ;xmm0 += xmm1
    movupd xmmword ptr [r11+rax*4+10h],xmm0 ;{a[i], a[i+1], a[i+2], a[i+3]} = xmm0
    add    eax,4                            ;i += 4 (advance one 4-float vector)
    cmp    r8d,eax                          ;a.Length > i?
    jg     loop
bail_out:
    add    rsp,28h
    ret
range_check_fail:
    call   clr!JIT_RngChkFail (00007ff9`a6d46590)
    int    3

So the most egregious code quality issue is gone (we're going to try and get an update out quickly: I'm saying mid-May, but things could change :-/). That gives me numbers like this:

Scalar adding numbers: 765msec
SIMD adding numbers: 210msec

MUCH better. But Sasha makes another interesting point: why did we create two versions of the scalar loop, but leave those pesky bounds checks in the SIMD loop? Well, for starters, the two functions aren't identical. The SIMD version will actually throw a bounds-check exception when the array length isn't a multiple of the vector length, because it reads past the end of the array. To fix that, we actually need to change (some might say "fix" :-)) the code:

static void AddPointwiseSimd(float[] a, float[] b)
{
    int simdLength = Vector<float>.Length;
    int i = 0;
    for (i = 0; i < a.Length - simdLength; i += simdLength)
    {
        Vector<float> va = new Vector<float>(a, i);
        Vector<float> vb = new Vector<float>(b, i);
        va += vb;
        va.CopyTo(a, i);
    }
    for (; i < a.Length; ++i)
    {
        a[i] += b[i];
    }
}

And now you might be expecting some great revelation. Nope. Sasha's right: we're missing some fairly straightforward bounds-check eliminations here. And we're not dual-versioning the loop, like RyuJIT does for normal scalar loops. Both items are on our to-do list. But the original code Sasha wrote would probably never have those optimizations. Hopefully, we'll get our bounds-check elimination work strengthened enough to make this code truly impressive. Until then, you'll have to limp along with only a 3.6x performance boost.

RyuJIT CTP4: Now with more SIMD types, and better OS support!


Hi, folks. It’s been a busy month around here. We’ve been working on all sorts of stuff that I can’t talk about right now, but in the meantime, we’ve also been responding to feedback on the SIMD types. So, since it’s busy, I’m just going to list off the details, and link to other places for more information.

  1. Probably the biggest news is that if you install the 4.5.2 runtime (check the .NET blog for details on that), you can use RyuJIT CTP4 on Windows Vista, 7, and 8, as well as Windows Server 2008, 2008 R2, and 2012. In the CTP1 FAQ, I made mention that 4.5.1 on “downlevel” OS’es looked different from a code generation perspective. Well, that’s been addressed in the 4.5.2 update, so we’re happy to support RyuJIT CTP4 across all platforms that support 4.5.2.
  2. Nearly all the available Vector<T> types are now accelerated! The only ones missing are Vector<uint> and Vector<ulong>. In addition, there are a handful of other methods that we are now accelerating, including the CopyTo() method, which means any performance you have measured is now completely invalidated! Wait, no I mean, any performance you measured could potentially be faster!
  3. The fixed-size vector types are all mutable, now. This was the single biggest piece of feedback we received, so we took it.

There you have it. For now, you can download the CTP4 bits from here. The BCL SIMD NuGet package has also been updated, so update that, and you should be good to go. Same directions as before for how to use the types, enable RyuJIT, all that stuff. As always, send feedback to ryujit@microsoft.com. Happy RyuJIT-ing!

-Kev

RyuJIT CTP5: Getting closer to shipping, and with better SIMD support


Hi Folks!  Yes, we understand it’s been a while since we shipped the last RyuJIT CTP.  We have been working hard on improving our SIMD support and getting RyuJIT to ship quality for the next version of the .NET Framework.  So without further ado, here’s a quick description of what you can expect from RyuJIT CTP5.

 

Correctness

We have spent a lot of time finding and fixing those last few pesky corner-case functional issues in RyuJIT.  Fortunately, we have the luxury of having many internal partners with a significant managed codebase, making it easy to throw as much managed code as we can find at RyuJIT.  While some of the issues we have found are legitimate bugs, others are not so clear cut.  For example, we have found that JIT64 accommodates some illegal IL disallowed by the ECMA spec.  Since backward compatibility is a major concern for us, we evaluate these issues on a case-by-case basis to decide if we should quirk RyuJIT to accommodate the same illegal IL.

 

Real-World Throughput Wins

In case you missed the original blog post announcing the first CTP: RyuJIT beats JIT64 handily in terms of throughput (JIT compile time) while staying very competitive in terms of code quality (CQ).  Recently the Bing team tried using RyuJIT on top of 4.5.1 in some of their processing, and they saw a 25% reduction in startup time in their scenario.  This is the most significant real-world throughput win we have witnessed on RyuJIT thus far.  :)

 

Code Quality

We didn’t publish any benchmark results with RyuJIT CTP4, so here are some graphs to show that we haven’t regressed CQ in RyuJIT CTP5.  However, since CQ hasn’t been the focus for this CTP, we also haven’t made any significant improvements either.

These graphs follow the same basic format as previous ones.  The higher the bar, the better RyuJIT CTP5 is at that benchmark.  The grey area is the standard deviation, so any benchmark falling in the grey area is just noise.

What’s New in JIT Support for SIMD types?

RyuJIT CTP5 supports acceleration of the latest version of the Vector APIs available via NuGet here.  This version contains a number of changes that were requested by developers.

One of the most popular requests was to publicly expose the fields of the fixed-size vector types (e.g. Vector2.X).  Why wasn’t this done originally?  The short answer is that it was for performance, but really it was to make it easier for the JIT to handle all the references to these types as intrinsics, and to transform them into the appropriate target instructions.  It’s a tricky business, however, to determine where to allocate a local Vector instance for best efficiency:

  • If the instance will be primarily used in Vector intrinsics, putting it in an xmm/ymm register is the best option.
  • If the instance will primarily be referenced via its fields, then either putting it in memory, or separately allocating its fields to registers, is the best option.
  • If the instance is larger than 8 bytes (i.e. not a Vector2), and it is primarily passed as a method argument, then putting it in memory is the best option.

With CTP5 we have made the JIT a bit smarter about identifying these field accesses, analyzing the usage of the vector instance, and selecting among these options, but there is still room for improvement, so you may find that some SIMD code runs more slowly with this new release.

We’ve also improved register allocation for SIMD types, reducing a number of cases where we had unnecessary copies of vector registers.

Since we are talking about SIMD performance, it wouldn't be fair not to include any SIMD benchmark results.  We are using the sample code here as our SIMD benchmarks.  (However, note that we are using an updated version of RayTracer which uses our latest Vector APIs.  We'll update the sample shortly.)

Stay tuned – we are continuing to work on performance for SIMD types, including tuning of inlining heuristics for SIMD methods, and improved dead store elimination.  We’ll also be diving into the usage data from Bing and other internal partners to see how we can improve the performance of RyuJIT even more on both throughput and CQ.

In case you need them again, you can refer to this blog post for the instructions to turn on RyuJIT, and this blog post for instructions on using SIMD.  Note that if you are running on the 4.5.2 version of the .NET Framework, you can use RyuJIT CTP5 on Windows Vista, 7, 8, and 8.1 as well as Windows Server 2008, 2008 R2, 2012, and 2012 R2.  However, RyuJIT CTP5 currently doesn't work on Visual Studio "14" CTP4.  You don't need it anyway, since RyuJIT is enabled by default on Visual Studio "14" CTP4.  :)  (The version of RyuJIT in Visual Studio "14" CTP4 is slightly older than this CTP, but not by much.)


RyuJIT and .NET 4.6


It’s been a while since we’ve posted here, and a lot has happened with RyuJIT since the last post.

First and foremost, RyuJIT is the default x64 JIT for .NET 4.6! As a result, there is no need to use a RyuJIT CTP to try out RyuJIT. If you install the .NET 4.6 Preview (included in Visual Studio 2015 Preview), or later build, such as Visual Studio 2015 RC, you will be using it. See here or the .NET blog for more details.

One thing we need to point out, though: if you have previously configured your machine to use a RyuJIT CTP (using the instructions here), and you update to .NET 4.6, you need to make sure the COMPLUS_AltJit environment variable and the HKLM\SOFTWARE\Microsoft\.NETFramework\AltJit registry value do not exist. If they do exist, every .NET application will fail to run. You don't want that!

Update: the specific failure you will get is exception number -2146233082 (in hex: 0x80131506), otherwise known as COR_E_EXECUTIONENGINE or "Internal CLR error". This will be the process exit code. For console applications, you can see this by examining %ERRORLEVEL% after the process terminates (if you are working in a CMD.EXE window). You may also see a dialog box stating "<application> has stopped working", and if you expand "View problem details", you will see "Fault Module Name: clr.dll" and "Exception Code: 80131506". You will also see an entry in the Event Viewer, Application log, source ".NET Runtime", listing this error code. Note that there are other reasons this exception number could be generated; a stray AltJit configuration is not the only reason for this exception.

Note that this information has also been published as KB3065367 here.

Secondly, RyuJIT is now open source! See here for details. 

We hope you enjoy the benefits of all the work we’ve put into RyuJIT. If you have any feedback on RyuJIT, send us mail at ryujit@microsoft.com. We want to hear from you!

 

Announcing the release of RyuJIT for x64!


After many years of work, RyuJIT for x64 has now been released! Thanks to the many of you who tried it out over the course of the last couple of years via our CTP releases and gave us valuable feedback.

You can read all about the .NET Framework 4.6 release (which includes RyuJIT) here: http://blogs.msdn.com/b/dotnet/archive/2015/07/20/announcing-net-framework-4-6.aspx.

You can see Soma's blog post on the Visual Studio 2015 and .NET Framework 4.6 releases here: http://blogs.msdn.com/b/somasegar/archive/2015/07/20/visual-studio-2015-and-net-4-6-available-for-download.aspx.

Enjoy these releases, and keep the feedback coming!

 

RyuJIT tutorial at CGO and PLDI conferences


For those of you interested in learning more about the internals of RyuJIT, Carol Eidt will be giving a tutorial at CGO in Barcelona on March 13. She will also be giving the tutorial at PLDI in Santa Barbara, CA in June. She'll describe the internal representation and compilation phases, and then walk through the process of adding a simple feature.

The early registration deadline for CGO is tomorrow, February 3, so if you are interested in attending, now's the time to act.

     
