Introduction
One argument often made by those who dislike managed code is along the lines of “managed code can never be as fast as native code, because managed code has to do array bounds checks.” Of course, this isn’t precisely true – it would be more accurate to say that “managed code must ensure that any indexing outside of an array’s bounds raises an appropriate exception.” If a compiler can prove statically that an array index operation is safe, it doesn’t need to generate a dynamic test.
We’re not starting from scratch here. There’s been a lot of academic (and industrial) research on bounds check elimination, and various managed code systems have implemented some subset of these techniques. Putting “array bounds check elimination” into bing.com yielded a large number of relevant papers, many of which I’ve read and enjoyed; I’d imagine a competitor’s search site would do the same :-).
This blog post will explore what the CLR’s just-in-time compilers do and do not do in this area. I’ll of course highlight the good cases, but I’m also going to be brutally honest, and expose many examples where we could potentially eliminate a range check, but don’t. The reader (and, for that matter, the author, who didn’t implement this stuff himself) should keep in mind an important constraint: these are, in fact, dynamic JIT compilers, so any extra optimization that slows the compiler down must be balanced against the gains of that optimization. Even when we run the JIT “offline”, via the NGEN tool, users are sensitive to compiler throughput. So there are many things we could do, but all take CLR developer effort, and some of them use up our precious compilation time budget. That excuse being made, it’s up to us to be clever and figure out how to do some of these optimizations efficiently in the compiler, and we’ll certainly try to do more of that in the future.
The JIT compilers for x86 and x64 are currently quite different code bases, and I’ll describe the behavior of each here. The reader should note, however, that we intend to unify them at some point in the not-too-distant future. The x86 JIT is faster in terms of compilation speed; the x64 JIT is slower, but does more interesting optimizations. Our plan is to extend the x86 codebase to generate x64 code, and incorporate some of the x64 JIT’s optimizations without unduly increasing compilation time. In any case, the performance characteristics of JITted code on x64 platforms are likely to change significantly when this unification is achieved.
When I show examples where we don’t eliminate bounds checks, I will, when possible, give advice that will help you stay within the boundary of idioms for which we can. I’ll discuss things we might be able to do in the future, but I’m not in a position to give any scheduling commitments on when these might be done. I can say that any reader feedback on prioritization will be taken into account.
Code Gen for Range Checks
Before we start considering when we eliminate range checks, let’s see what the code generated for a range check looks like. Here is bounds-check code generated by the CLR’s x86 JIT for an example array index expression a[i]:
IN0001: 000003 cmp EDX, dword ptr [ECX+4] // a in ECX, i in EDX
IN0002: 000006 jae SHORT G_M60672_IG03 // unsigned comparison
In the first instruction, EDX contains the array index, and ECX + 4 is the address of the length field of the array. We compare these, and jump if the index is greater than or equal to the length. The jump target, not shown, raises a System.IndexOutOfRangeException. A sharp-eyed reader might wonder: the semantics require not only that the index value is less than the array length, but also that it is at least zero. Isn’t that two checks? How did they get away with only one comparison and branch? The answer is that we (like many other systems) take advantage of the wonders of unsigned arithmetic – the x86 “jae” instruction interprets its arguments as unsigned integers (it’s the unsigned equivalent of “jge”, if that’s more familiar to some readers). The type of the length of an array, and of an expression used to index into an array, is Int32, not UInt32. So the maximum value for either of these is 2^31-1. Further, we know that the array length will be non-negative. So if we convert the array length to a UInt32, it doesn’t affect its value. The index value, however, might be negative. If it is, casting its bit pattern to UInt32 yields a value that is at least 2^31, which is larger than any possible array length. So both cases, when the index value is negative, or when it is greater than or equal to the array length, are handled by the same test.
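To make the single-comparison trick concrete, here is a minimal C# sketch. The names UnsignedCheckDemo, InBoundsTwoTests and InBoundsOneTest are made up for illustration, and this is a source-level analogue of the check, not the code the JIT emits.

```csharp
using System;

static class UnsignedCheckDemo
{
    // The "obvious" check: two signed comparisons.
    public static bool InBoundsTwoTests(int index, int length)
    {
        return index >= 0 && index < length;
    }

    // The single-comparison form: reinterpret both values as unsigned.
    // A negative index becomes a value of at least 2^31, which exceeds
    // any legal array length (at most 2^31 - 1), so the test fails.
    public static bool InBoundsOneTest(int index, int length)
    {
        return unchecked((uint)index < (uint)length);
    }

    static void Main()
    {
        int length = 10;
        foreach (int index in new[] { -1, 0, 9, 10, int.MinValue, int.MaxValue })
        {
            // The two columns agree for every index.
            Console.WriteLine("{0}: {1} {2}", index,
                InBoundsTwoTests(index, length), InBoundsOneTest(index, length));
        }
    }
}
```

The two methods agree only because the length is known to be non-negative, which is exactly the property of array lengths the paragraph above relies on.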
In an NGEN image, we try to separate out code we expect to never be executed (code that is so “cold” that it’s at absolute zero!), hoping to increase working set density, especially during startup. We expect bounds-check failures to be in this category, so we put the basic blocks for failure cases on cold pages.
Bounds-check removal cases
Now we’ll examine some test cases, starting with some simple ones.
Simple cases
The good news is that we do eliminate bounds checks for what we believe to be the most common form of array accesses: accesses that occur within a loop over all the indices of the array. So, there is no dynamic range check for the array access in this program:
static void Test_SimpleAscend(int[] a) {
    for (int i = 0; i < a.Length; i++)
        a[i] = i; // We get this.
}
Unfortunately, we do not eliminate the range check for the descending version of this loop:
static void Test_SimpleDescend(int[] a) {
    for (int i = a.Length - 1; i >= 0; i--)
        a[i] = i; // We DO NOT get this.
}
Some older programmers learned to write loops like this, because on some early architectures (e.g., DEC PDP-8, if memory serves) there was a hardware addressing mode that did auto-decrement, but not auto-increment. They may have passed this habit down to middle-aged programmers (of which I am one), and so on. There’s also a somewhat more currently-valid argument that hardware generally supports a comparison to zero without requiring the value zero to be placed in a register. In any case, while the JIT compiler should arguably eliminate the bounds check for the descending form of the loop, we don’t today, and the cost of the bounds check probably outweighs any of the other advantages. So:
· Advice 1: if you have the choice between using an ascending or descending loop to access an array, choose ascending.
I’ve put the array access on the left-hand side of an assignment in both these examples, but it works independently of the context in which the array index expression appears (as long as it’s within the loop, of course).
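As a small illustration of Advice 1, the sketch below shows a descending loop and its ascending equivalent for a case where iteration order does not affect the result. The class and method names are hypothetical; only the loop shapes come from the examples above.

```csharp
using System;

static class AscendingLoopDemo
{
    // Descending form: per the discussion above, the bounds check on a[i] stays.
    public static int SumDescending(int[] a)
    {
        int sum = 0;
        for (int i = a.Length - 1; i >= 0; i--)
            sum += a[i];
        return sum;
    }

    // Ascending form bounded by a.Length: the same shape as Test_SimpleAscend.
    public static int SumAscending(int[] a)
    {
        int sum = 0;
        for (int i = 0; i < a.Length; i++)
            sum += a[i];
        return sum;
    }

    static void Main()
    {
        int[] data = { 3, 1, 4, 1, 5 };
        // Both loops visit the same elements, so the sums are equal.
        Console.WriteLine(SumDescending(data) == SumAscending(data));
    }
}
```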
Do we track equalities with the length of a newly allocated array?
Here is a case in which the x86 JIT does not eliminate the bounds check:
static int[] Test_ArrayCopy1(int n) {
    int[] ia = new int[n];
    for (int i = 0; i < n; i++)
        ia[i] = i; // We do not get this one.
    return ia;
}
No excuses here: there’s no reason not to get this, the JIT compiler ought to know that n is the length of the newly allocated array in ia. In fact, the author of such code might think he was doing the JIT compiler a favor, since comparison with a local variable n might seem cheaper than comparison with ia.Length (though this isn’t really true on Intel machines). But in our system, at least today, this sort of transformation is counterproductive for the x86 JIT, since it prevents the bounds check from being eliminated. We may well extend our compiler(s) to track this sort of value equivalence in the future. For now, though, you should follow this piece of practical advice:
· Advice 2: When possible, use “a.Length” to bound a loop whose index variable is used to index into “a”.
The x64 JIT does eliminate the range checks here, by hoisting a test outside the loop, comparing n with ia.Length. If this check fails, it throws an IndexOutOfRangeException. This is somewhat problematic, since without this optimization the program would execute ia.Length iterations of the loop before throwing an exception, and strict language semantics would require those to be executed if they could possibly have a side-effect visible outside the method (which this example does not in fact have – though proving it requires your compiler to do enough escape analysis to know that the allocated array that is written to has not leaked outside the method). This semantic ambiguity is the subject of some internal debate, and we’ll eventually reach consensus on how/when to incorporate such tests in a unified JIT, or whether we need to ensure strict semantics, perhaps by generating multiple copies of loop bodies, as we’ll discuss below. (It’s interesting to note that the hoisted test and throw would be justified by assuming the CompilationRelaxationsAttribute defined in section I.12.6.4 of the ECMA CLI specification for bounds-check error exceptions everywhere – whereas the specification requires it to be given explicitly.) In any case, we should emphasize that, as far as we know, this is a “theoretical” concern only – we don’t know of any actual customer code whose correctness is affected by this issue.
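Returning to Advice 2, here is one way Test_ArrayCopy1 might be rewritten so that the loop bound is the array’s own Length. The names LengthBoundDemo and FillWithIndices are illustrative only; the loop itself has the same shape as Test_SimpleAscend.

```csharp
using System;

static class LengthBoundDemo
{
    // Same result as Test_ArrayCopy1, but the loop is bounded by ia.Length
    // rather than by the local n.
    public static int[] FillWithIndices(int n)
    {
        int[] ia = new int[n];
        for (int i = 0; i < ia.Length; i++)
            ia[i] = i;
        return ia;
    }

    static void Main()
    {
        Console.WriteLine(string.Join(",", FillWithIndices(5)));
    }
}
```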
Redundant array accesses
OK, while we’re slightly embarrassed by the previous “multiple names for the length” case, let’s cheer ourselves up with something we do well. We’re pretty good at eliminating redundant bounds checks. In the method:
static int k; // assumed: a static field that receives the results

static void Test_SimpleRedundant(int[] a, int i) {
    k = a[i];
    k = k + a[i];
}
bounds-check code is generated for the first instance of “a[i]”, but not the second. In fact, the x86 JIT treats it as a common subexpression, and the first result is re-used. And this works not just within “basic blocks” – it can work across control flow, as demonstrated by:
static void Test_RedundantExBB(int[] a, int i, bool b) {
    // k is a static int field of the enclosing class (assumed)
    k = a[i];
    if (b) {
        k = k + a[i];
    } else {
        k = k - a[i];
    }
}
As before, the first “a[i]” gets a bounds check, but the two subsequent occurrences of “a[i]” do not. The x86 JIT also treats the expression as a common subexpression, re-using the result from the first read of “a[i]”.
It is not the case that bounds check elimination only works in the x86 JIT when the result is a common subexpression. Consider this variation of the first case:
static void Test_RedundantNotCse(int[] a, int i, int j) {
    // k is a static int field of the enclosing class (assumed)
    k = a[i];
    a[j] = i;
    k = k + a[i];
}
The JIT compiler obviously can’t tell whether “i” and “j” will have the same value at runtime. But it can tell that they might, and that if they do, the “a[i]” on the last line will return the value written there on the second line. So we cannot treat the “a[i]” expressions on the first and last lines of the body as common subexpressions. But the assignment on the second line can’t affect the length of the array “a,” so in fact the bounds check for the first line “covers” the “a[i]” on the third line – the generated code accesses the array without a bounds check (in both JITs).
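The aliasing argument can be observed directly. The sketch below is a variant of Test_RedundantNotCse that returns the result so the two cases can be compared; the class name, the return value and the static field declaration are additions made for illustration.

```csharp
using System;

static class AliasingDemo
{
    static int k;

    public static int ReadWriteRead(int[] a, int i, int j)
    {
        k = a[i];      // the bounds check for index i happens here
        a[j] = i;      // may overwrite a[i] when i == j, but cannot change a.Length
        k = k + a[i];  // the value must be reloaded; the earlier check still covers i
        return k;
    }

    static void Main()
    {
        Console.WriteLine(ReadWriteRead(new[] { 10, 20, 30 }, 1, 2)); // i != j: prints 40
        Console.WriteLine(ReadWriteRead(new[] { 10, 20, 30 }, 1, 1)); // i == j: prints 21
    }
}
```

When i and j differ, the second read sees the original 20; when they are equal, it sees the 1 just stored. Reusing the first loaded value would therefore be wrong, even though reusing the first bounds check is fine.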
Arrays as IEnumerable
Arrays implement the IEnumerable interface, which raises a reasonable question: if you enumerate over the elements of an array using C#’s foreach construct, do you get bounds checks? For example:
static int Test_Foreach(int[] ia) {
    int sum = 0;
    foreach (int i in ia) {
        sum += i;
    }
    return sum;
}
Happily, we do eliminate the bounds checks in this case. However, there is a little quirk here: of the cases listed, this one is the only one that is sensitive to whether the original source program (C#, in this case) was compiled to IL with the /optimize flag. The default for csc, the C# compiler, is not to optimize, and in this mode it produces somewhat more verbose IL for the range check that doesn’t fit the pattern that the JIT compiler looks for. So:
· Advice 3: if you’re worried about performance, and your compiler has an optimization flag, uh, use it!
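For intuition about why foreach over an array can match the JIT’s loop pattern, here is a hand-written approximation of how the C# compiler expands such a loop: the array reference is copied to a temporary, and an index runs from 0 to the temporary’s Length. The exact IL depends on the compiler version and flags, and the names below are illustrative.

```csharp
using System;

static class ForeachLoweringDemo
{
    public static int SumForeach(int[] ia)
    {
        int sum = 0;
        foreach (int x in ia)
            sum += x;
        return sum;
    }

    // Approximate expansion of the loop above, written out by hand.
    public static int SumLowered(int[] ia)
    {
        int sum = 0;
        int[] temp = ia;                                   // array reference read once
        for (int index = 0; index < temp.Length; index++)  // bounded by temp.Length
        {
            int x = temp[index];
            sum += x;
        }
        return sum;
    }

    static void Main()
    {
        int[] data = { 1, 2, 3, 4 };
        Console.WriteLine(SumForeach(data) == SumLowered(data));
    }
}
```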
Arrays in global locations; concurrency
Here’s a case where we don’t eliminate the bounds check, but where we aren’t too embarrassed by this failure:
static int[] v;
…
static void Test_ArrayInGlobal() {
    for (int i = 0; i < v.Length; i++)
        v[i] = i;
}
At first glance, this seems exactly the same as our first, simplest example, Test_SimpleAscend. The difference is that Test_SimpleAscend took an array argument, whereas Test_ArrayInGlobal’s array is accessed via a static variable, accessible to other threads. This makes static elimination of the bounds check for “v[i]” at the very least dicey. Let’s say we did, and that “v” initially holds an array of length 100. On the iteration when “i” reaches (say) 80, we check “i < v.Length”, and it’s still true. Now another thread sets “v” to an array whose length is only 50. If we go ahead with the array store without a dynamic check, we’re writing off the end of the array – type-safety and security are lost, game over. (Obviously, the same reasoning would apply for an array held in any location accessible to multiple threads – an object field, element of another array, anything not local to the running thread.)
So we don’t do this, for good and solid reasons. If we cared enough, there is a technique that would allow us to eliminate these bounds checks. But it would require us to couple otherwise-unrelated optimizations. As it happens, the code for accessing a static variable in the presence of app domains can be moderately costly, so it’s good to treat those as common-subexpression candidates, and the x86 JIT does in this case (the x64 JIT does not). So the optimizer in essence synthesizes a local variable to hold the array. If we do this, then we are back in the Test_SimpleAscend situation, and the bounds-check elimination is legal. But doing the bounds-check elimination requires that the static variable be read once into a local. So it’s at least a bit complicated.
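A programmer can perform the “read the static once into a local” step by hand. The sketch below does so; StaticArrayDemo and FillViaLocal are hypothetical names, and the claim that the resulting loop is treated like Test_SimpleAscend is an inference from the examples in this post, not something verified against a particular JIT version.

```csharp
using System;

static class StaticArrayDemo
{
    static int[] v = new int[8];

    // Reads the static field exactly once. Another thread may still assign a
    // new array to v, but this loop keeps using the array it captured, so the
    // length it tests and the array it indexes cannot get out of sync.
    public static int[] FillViaLocal()
    {
        int[] local = v;
        for (int i = 0; i < local.Length; i++)
            local[i] = i;
        return local;
    }

    static void Main()
    {
        Console.WriteLine(string.Join(",", FillViaLocal()));
    }
}
```

Note that this changes behavior slightly under concurrent modification: all writes go to the captured array, whereas the original loop would follow v to whatever array it currently refers to.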
Parallel arrays
Next we consider a case that involves what are sometimes called “parallel arrays” (in the sense of their structure, not in the sense that they will be used by multiple threads):
static int Test_TwoArrays(int[] ia1, int[] ia2) {
    // The programmer knows a precondition: ia1.Length == ia2.Length
    int sum = 0;
    for (int i = 0; i < ia1.Length; i++) {
        // Below we eliminate the ia1 check, but not the one for ia2.
        sum += (ia1[i] + ia2[i]);
    }
    return sum;
}
Much as with Test_ArrayCopy1, the x64 JIT hoists a test comparing ia2.Length and ia1.Length, immediately throwing the bounds-check exception if the test fails. If the test succeeds, range checks for both array accesses in the loop are eliminated. The same comments about the semantic issues with such a test apply. The x86 JIT takes a more “purist” approach: it does not hoist a test, so it only eliminates the bounds check for the access to the array ia1 whose Length bounds the index variable.
We could resolve the two approaches. The mechanisms proposed for this sort of problem in the research literature have the common property that they require, at least in some cases, generating code for the loop multiple times, under different assumptions, and synthesizing some sort of test to determine which version of the loop should be executed – this is essentially the test that the x64 JIT is already creating. Generally, bounds check exceptions are rare – if the programmer wrote the code above, he or she had some reason to believe that the index expression “ia2[i]” was safe. So we could synthesize a test on that basis. In our case above, if the compiler proved that neither argument variable “ia1” nor “ia2” was modified in the loop, then a test “ia2.Length >= ia1.Length” (the one the x64 JIT generates) outside the loop would allow us to execute an optimized version of the loop, with no bounds checks for either array access. If this test failed, however, we’d need to execute an unoptimized version of the loop to be completely semantically correct. You’d have to evaluate this test carefully, since it’s code that doesn’t appear in the original program. In particular in this case, you’d have to worry about whether either “ia1” or “ia2” was null. If one is, you want the null pointer exception to occur at the appropriate point in execution, not in some code the compiler made up. So the synthesized test would have to include null tests, and take the unoptimized path if either argument is null.
As we’ve discussed, the x64 JIT generates the test, but not the unoptimized version of the loop – it throws the exception “early” in that case. Under the “purist” viewpoint, this is incorrect because if the test fails, the semantics require the program to execute some number of loop iterations before throwing the exception, and those iterations might have side effects. In many cases, we might be able to prove that the loop body does not have side effects, and therefore use the x64 JIT’s strategy with semantic blessing. For example, a loop that computed the sum of the array elements into a local would side-effect only that local variable – if the exception causes control flow to leave the method, the value of that local becomes meaningless.
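Written out at the source level, the loop-versioning scheme described above might look roughly like the following. This is a sketch of the idea only: a compiler would do this on its internal representation, and the names are hypothetical. The two loop bodies are textually identical; the difference is that, under the guard, a compiler could prove every access in the first copy safe and drop its checks.

```csharp
using System;

static class LoopVersioningDemo
{
    public static int SumTwoArrays(int[] ia1, int[] ia2)
    {
        int sum = 0;
        // Synthesized guard: both arrays non-null, and ia2 at least as long as ia1.
        if (ia1 != null && ia2 != null && ia2.Length >= ia1.Length)
        {
            // "Optimized" copy: every ia1[i] and ia2[i] is provably in range here.
            for (int i = 0; i < ia1.Length; i++)
                sum += ia1[i] + ia2[i];
        }
        else
        {
            // "Unoptimized" copy: the original loop, so any null-reference or
            // bounds exception is raised at exactly the point the source implies.
            for (int i = 0; i < ia1.Length; i++)
                sum += ia1[i] + ia2[i];
        }
        return sum;
    }

    static void Main()
    {
        Console.WriteLine(SumTwoArrays(new[] { 1, 2 }, new[] { 10, 20, 30 }));
    }
}
```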
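The observable difference between the strict and “early throw” behaviors can be imitated in source code. In the sketch below, CopyHoisted is a hand-written stand-in for a hoisted test (it is not what any JIT literally produces), and all names are hypothetical.

```csharp
using System;

static class EarlyThrowDemo
{
    // Original loop: writes dst[i] until src runs out, then throws.
    public static void CopyStrict(int[] src, int[] dst)
    {
        for (int i = 0; i < dst.Length; i++)
            dst[i] = src[i];
    }

    // Imitation of the hoisted test: throws before any iteration runs.
    public static void CopyHoisted(int[] src, int[] dst)
    {
        if (src.Length < dst.Length)
            throw new IndexOutOfRangeException();
        for (int i = 0; i < dst.Length; i++)
            dst[i] = src[i];
    }

    // Runs one variant, swallows the exception, and counts how many elements
    // of dst were modified: the side effect a caller could observe.
    public static int ElementsWritten(bool hoisted)
    {
        int[] src = { 7, 7 };
        int[] dst = new int[4];
        try
        {
            if (hoisted) CopyHoisted(src, dst); else CopyStrict(src, dst);
        }
        catch (IndexOutOfRangeException) { }

        int written = 0;
        foreach (int x in dst)
            if (x == 7) written++;
        return written;
    }

    static void Main()
    {
        Console.WriteLine(ElementsWritten(false)); // strict: prints 2
        Console.WriteLine(ElementsWritten(true));  // hoisted: prints 0
    }
}
```

A caller that catches the exception and then inspects dst sees different contents in the two cases, which is the kind of side effect the “purist” argument is concerned with.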
Many other patterns are amenable to this sort of synthesized test. An alternative form of Test_TwoArrays might have passed the shared length of both arrays as a separate argument, and used that as the loop bound. We could do something similar, synthesizing a test of that loop bound vs. both array lengths.
Explicit Assertions
Another suggestion that has been made is to allow the programmer to provide the relevant test in the form of a contract assertion (of a flavor that would be executed in all execution modes, not just in a debug mode). This would essentially provide semantic “permission” to fail immediately if the test is violated, avoiding the need to have an unoptimized version of the loop. There are many things to be said for this sort of proposal: such assertions can allow bounds checks to be eliminated in situations more complicated than those for which the compiler could easily infer a test, and the invariants they express are often useful program documentation as well. Still, I also worry somewhat about such proposals. In many common cases, it’s easy enough to infer the test, so we should avoid requiring the programmer to add assertions in the easy cases. More importantly, if the programmer adds an assertion expecting it to eliminate a bounds check, how does the tool chain indicate whether he or she has been successful? And, if not, why not? These sorts of issues merit some more thought.
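As an illustration of what such a programmer-supplied test could look like, here is Test_TwoArrays with an explicit precondition check that runs in every build mode. This only shows the shape of the proposal: nothing in this post says that either JIT uses such a check to remove the bounds check on ia2. The class name and the choice of ArgumentException are assumptions.

```csharp
using System;

static class ExplicitPreconditionDemo
{
    public static int SumTwoArrays(int[] ia1, int[] ia2)
    {
        // Programmer-stated precondition; failing fast here is now part of
        // the method's documented behavior rather than a compiler liberty.
        if (ia1.Length != ia2.Length)
            throw new ArgumentException("ia1 and ia2 must have the same length");

        int sum = 0;
        for (int i = 0; i < ia1.Length; i++)
            sum += ia1[i] + ia2[i];
        return sum;
    }

    static void Main()
    {
        Console.WriteLine(SumTwoArrays(new[] { 1, 2, 3 }, new[] { 10, 20, 30 }));
    }
}
```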
Still another path would be to have a custom annotation like [OptimizeForSpeedNotSpace], allowing the programmer to tell us that the performance of this method is important enough that we should apply optimizations that we wouldn’t generally apply because they increase code size – i.e., especially aggressive inlining, loop unrolling, loop body replication/specialization for the reasons discussed here, or for other forms of specialization.
The right strategy in this area is obviously a little muddled. Constructive feedback is welcome!
Copy loop
Here’s another example, somewhat similar to the Test_TwoArrays case:
static int[] Test_ArrayCopy2(int[] ia1) {
    // An array copy loop operation.
    int[] res = new int[ia1.Length];
    for (int i = 0; i < res.Length; i++) {
        res[i] = ia1[i];
    }
    return res;
}
As you might expect from previous examples, since we use the length of “res” as the loop bound, we eliminate the bounds check for the access to “res”. But we do not eliminate the check for the access to “ia1”. As in Test_ArrayCopy1, to eliminate this we’d need to do a better job of tracking equivalences of array lengths with local variables or other array lengths. We don’t do this today, but it’s certainly a plausible enhancement we might do. The x86 JIT leaves the bounds check in for the access “ia1[i]”, while the x64 JIT hoists a bounds-check out of the loop, as discussed above (and with the same difficulties discussed above).
While it would be nice for us to eliminate the bounds checks in cases like this, if you’re copying arrays, there are many reasons to use the built-in Array.Copy routine rather than writing an explicit copy loop like those that appear in these examples:
· Advice 4: when you’re copying medium-to-large arrays, use Array.Copy, rather than explicit copy loops. First, all your range checks will be “hoisted” to a single check outside the loop. If the arrays contain object references, you will also get efficient “hoisting” of two more expenses related to storing into arrays of object types: the per-element “store checks” related to array covariance can often be eliminated by a check on the dynamic types of the arrays, and garbage-collection-related write barriers will be aggregated and become much more efficient. Finally, we will be able to use more efficient “memcpy”-style copy loops. (And in the coming multicore world, perhaps even employ parallelism if the arrays are big enough!)
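Applied to Test_ArrayCopy2, Advice 4 might look like the sketch below, which uses the Array.Copy(Array, Array, int) overload. The names ArrayCopyDemo and CopyOf are illustrative.

```csharp
using System;

static class ArrayCopyDemo
{
    // Same result as Test_ArrayCopy2, with the element loop replaced by a
    // single library call that validates the ranges once, up front.
    public static int[] CopyOf(int[] source)
    {
        int[] result = new int[source.Length];
        Array.Copy(source, result, source.Length);
        return result;
    }

    static void Main()
    {
        Console.WriteLine(string.Join(",", CopyOf(new[] { 4, 5, 6 })));
    }
}
```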
Multi-dimensional Arrays
The CLR, and C#, support real multi-dimensional arrays – in contrast to C++ or Java, which (directly) support only one-dimensional arrays. To get two-dimensional arrays in those languages, you have to simulate them, either through classes that represent the 2-d array as a large 1-d array and do the appropriate index arithmetic, or as an “array-of-arrays.” In the latter case, even if the inner arrays are allocated originally to form a “rectangular” 2-d array, it’s hard for a compiler to prove that the array stays rectangular, so bounds checks on accesses to the “inner” arrays are hard to eliminate.
With true multi-dimensional arrays, the array lengths in each dimension are immutable (just as the length of a regular 1-d array is). This makes removal of bounds checks in each dimension more tractable. A related advantage is that indexing calculations become easier when the array is known to be “rectangular.” (With a good optimizer and appropriately aggressive inlining, C++ template-class-based simulations of multidimensional arrays can get similar indexing calculation code.)
Unfortunately, we aren’t yet able to remove any range checks for accesses in multi-dimensional arrays, even in simple cases like this:
static int Test_2D(int[,] mat) {
    int sum = 0;
    for (int i = 0; i < mat.GetLength(0); i++) {
        for (int j = 0; j < mat.GetLength(1); j++) {
            sum += mat[i, j];
        }
    }
    return sum;
}
The “mat.GetLength(k)” method returns the length of “mat” in the kth dimension. We’ll clearly need to eliminate bounds checks for multi-dimensional array accesses if we want to generate reasonable code for, say, a matrix multiplication.
· Advice 5: Until we get this right, I would suggest that .NET users do what many C++ numerical programmers do: write a class to implement your n-dimensional array. This would be represented as a 1-dimensional array, and the relevant accessors would convert n indices into 1 via appropriate multiplications. We almost certainly wouldn’t eliminate the bounds check into the 1-d array, but at least we’d only do one check!
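A minimal sketch of the kind of class Advice 5 has in mind, for the two-dimensional case, is shown below. The type name Matrix2D, its members, and the row-major layout are all choices made for this illustration.

```csharp
using System;

sealed class Matrix2D
{
    private readonly int[] data;   // row-major storage: element (r, c) is at r * Cols + c
    public int Rows { get; private set; }
    public int Cols { get; private set; }

    public Matrix2D(int rows, int cols)
    {
        if (rows < 0 || cols < 0) throw new ArgumentOutOfRangeException();
        Rows = rows;
        Cols = cols;
        data = new int[rows * cols];
    }

    // One multiplication and one addition per access, and a single bounds
    // check on the underlying 1-d array. Note that this check alone does not
    // catch an out-of-range column that still maps inside the array.
    public int this[int row, int col]
    {
        get { return data[row * Cols + col]; }
        set { data[row * Cols + col] = value; }
    }
}

static class MatrixDemo
{
    public static int Sum(Matrix2D mat)
    {
        int sum = 0;
        for (int i = 0; i < mat.Rows; i++)
            for (int j = 0; j < mat.Cols; j++)
                sum += mat[i, j];
        return sum;
    }

    static void Main()
    {
        var mat = new Matrix2D(2, 3);
        mat[0, 0] = 1;
        mat[1, 2] = 5;
        Console.WriteLine(Sum(mat));
    }
}
```

The trade-off is the one flagged in the comment: with a single check, an index such as (0, Cols) silently aliases element (1, 0), so callers who need per-dimension validation would have to add it themselves.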
Conclusions
First, let’s accentuate the positive: we do eliminate bounds checks in some very common cases, and the costs of bounds checks usually aren’t that great when we don’t eliminate them. And, as I mentioned at the beginning, we have to keep in mind that the compiler we’re talking about is a dynamic JIT compiler, so we must carefully balance adding extra optimization that slows the compiler against the gains of that optimization. Still, if we don’t eliminate a bounds check that we should have in a small, tight loop that’s important to the performance of your program, I doubt you’ll find these excuses very satisfying. I hope this blog post convinces you that we’re well aware of the problems. The future almost certainly holds some mechanism for applying extra compilation effort to methods whose performance matters a lot, either by doing the extra work in some form of offline build-lab compilation, or by using profile-directed feedback, user annotations of hot methods, or other heuristics. When we can do extra compiler work, bounds-check elimination will be one of the problems we address.