It is well known that RyuJIT-generated multi-dimensional (hereafter, MD) array access is inefficient; specifically, true MD array access, not "jagged" array access. The issues tagged "theme:md-arrays" track the known problems, for example #8569 and #5481.
Here's a simple example of a function copying a 3-dimensional array, assuming zero-based indices and length n in each dimension:
static void copy(int[,,] a, int[,,] b, int n)
{
    for (int i = 0; i < n; i++) {
        for (int j = 0; j < n; j++) {
            for (int k = 0; k < n; k++) {
                a[i,j,k] = b[i,j,k];
            }
        }
    }
}
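(A listing like the one below can be reproduced by enabling JIT disassembly for the method, e.g. by setting `DOTNET_JitDisasm=copy` in the environment; on older runtimes the variable uses the `COMPlus_` prefix instead.)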
The x64 code generated for the inner loop is currently:
G_M4797_IG05:
       mov      r11d, eax
       sub      r11d, dword ptr [rcx+28]
       cmp      r11d, dword ptr [rcx+16]
       jae      G_M4797_IG09
       mov      esi, r9d
       sub      esi, dword ptr [rcx+32]
       cmp      esi, dword ptr [rcx+20]
       jae      G_M4797_IG09
       mov      edi, dword ptr [rcx+20]
       imul     rdi, r11
       mov      r11, rsi
       add      r11, rdi
       mov      esi, r10d
       sub      esi, dword ptr [rcx+36]
       cmp      esi, dword ptr [rcx+24]
       jae      G_M4797_IG09
       mov      edi, dword ptr [rcx+24]
       imul     rdi, r11
       mov      r11, rsi
       add      r11, rdi
       lea      r11, bword ptr [rcx+4*r11+40]
       mov      esi, eax
       sub      esi, dword ptr [rdx+28]
       cmp      esi, dword ptr [rdx+16]
       jae      SHORT G_M4797_IG09
       mov      edi, r9d
       sub      edi, dword ptr [rdx+32]
       cmp      edi, dword ptr [rdx+20]
       jae      SHORT G_M4797_IG09
       mov      ebx, dword ptr [rdx+20]
       imul     rbx, rsi
       mov      rsi, rdi
       add      rsi, rbx
       mov      edi, r10d
       sub      edi, dword ptr [rdx+36]
       cmp      edi, dword ptr [rdx+24]
       jae      SHORT G_M4797_IG09
       mov      ebx, dword ptr [rdx+24]
       imul     rbx, rsi
       mov      rsi, rdi
       add      rsi, rbx
       mov      esi, dword ptr [rdx+4*rsi+40]
       mov      dword ptr [r11], esi
       inc      r10d
       cmp      r10d, r8d
       jl       G_M4797_IG05
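To make the listing easier to follow, here is the computation it performs for each element access, written as a C# sketch (the names are illustrative, not JIT internals; the `[rcx+16]`-style loads correspond to the per-dimension length and lower-bound fields in the array header):

static int MdOffset(int[,,] a, int i, int j, int k)
{
    // Per dimension: subtract the lower bound, then bounds-check the result.
    int effI = i - a.GetLowerBound(0);
    if ((uint)effI >= (uint)a.GetLength(0)) throw new IndexOutOfRangeException();
    int effJ = j - a.GetLowerBound(1);
    if ((uint)effJ >= (uint)a.GetLength(1)) throw new IndexOutOfRangeException();
    int effK = k - a.GetLowerBound(2);
    if ((uint)effK >= (uint)a.GetLength(2)) throw new IndexOutOfRangeException();
    // Row-major flattening; the element lives at dataStart + offset * sizeof(int).
    return (effI * a.GetLength(1) + effJ) * a.GetLength(2) + effK;
}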
Things to note:
- There are 6 bounds checks, one for each dimension of each array. 4 of these (for `i` and `j`) should be hoisted out of the inner `k` loop. The other 2 could be removed by loop cloning.
- MD arrays, in contrast to single-dimension so-called "SZ" arrays, can have a non-zero lower bound in each dimension. Thus, the index must subtract that lower bound before being compared against the dimension length. This feature is rarely used (C# doesn't have syntax to create such an array; you need to use the `System.Array.CreateInstance` overload that allows specifying the lower bounds, shown in the snippet after this list), but it must be supported.
- The base address of the `k` dimension data (a byref into the array) could be hoisted out of the inner loop.
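As a reminder of what that rarely-used feature looks like, here is a minimal sketch of creating an MD array with non-zero lower bounds (the values are arbitrary):

Array arr = Array.CreateInstance(
    typeof(int),
    lengths:     new int[] { 3, 4 },
    lowerBounds: new int[] { 10, 20 });
arr.SetValue(42, 10, 20);                 // the first element is [10, 20], not [0, 0]
Console.WriteLine(arr.GetValue(10, 20));  // 42
Console.WriteLine(arr.GetLowerBound(0));  // 10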
Thus, we should be able to generate something like this in the inner loop:
G_M4797_IG05:
       mov      esi, dword ptr [rdx+4*r10]
       mov      dword ptr [rcx+4*r10], esi
       inc      r10d
       cmp      r10d, r8d
       jl       G_M4797_IG05
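In source-level terms, that corresponds to flattening the arrays and hoisting the row offsets, roughly as in the following sketch. This is an illustration, not what the JIT would literally emit: it assumes zero lower bounds, non-empty arrays, and in-range indices, and uses `MemoryMarshal.CreateSpan` only to model the flat, contiguous, row-major element storage:

using System;
using System.Runtime.InteropServices;

static void copyFlattened(int[,,] a, int[,,] b, int n)
{
    // Flat views over the contiguous row-major element storage.
    Span<int> aData = MemoryMarshal.CreateSpan(ref a[0, 0, 0], a.Length);
    ReadOnlySpan<int> bData = MemoryMarshal.CreateReadOnlySpan(ref b[0, 0, 0], b.Length);
    for (int i = 0; i < n; i++) {
        for (int j = 0; j < n; j++) {
            // Invariant in the k loop: the (i, j) row start for each array.
            int rowA = (i * a.GetLength(1) + j) * a.GetLength(2);
            int rowB = (i * b.GetLength(1) + j) * b.GetLength(2);
            for (int k = 0; k < n; k++) {
                aData[rowA + k] = bData[rowB + k]; // one load/store pair, as above
            }
        }
    }
}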
Currently, after importation, the array access looks like:
[000033] -A-X-------- * ASG int
[000032] ---X---N---- +--* IND int
[000031] ---X-------- | \--* ARR_ELEM[,,] byref
[000021] ------------ | +--* LCL_VAR ref V00 arg0
[000022] ------------ | +--* LCL_VAR int V03 loc0
[000023] ------------ | +--* LCL_VAR int V04 loc1
[000024] ------------ | \--* LCL_VAR int V05 loc2
[000030] ---X-------- \--* IND int
[000029] ---X-------- \--* ARR_ELEM[,,] byref
[000025] ------------ +--* LCL_VAR ref V01 arg1
[000026] ------------ +--* LCL_VAR int V03 loc0
[000027] ------------ +--* LCL_VAR int V04 loc1
[000028] ------------ \--* LCL_VAR int V05 loc2
This form persists until lowering, when it is expanded to:
N001 ( 1, 1) [000021] ------------ t21 = LCL_VAR ref V00 arg0 u:1 $80
N002 ( 1, 1) [000022] ------------ t22 = LCL_VAR int V03 loc0 u:3 $140
N003 ( 1, 1) [000023] ------------ t23 = LCL_VAR int V04 loc1 u:3 $141
N004 ( 1, 1) [000024] ------------ t24 = LCL_VAR int V05 loc2 u:3 $142
[000092] -c---------- t92 = CNS_INT long 0
/--* t21 ref
+--* t22 int
[000093] ---X-------- t93 = * ARR_INDEX[i, , ] int
[000094] ------------ t94 = LCL_VAR ref V00 arg0
/--* t92 long
+--* t93 int
+--* t94 ref
[000095] ---X-------- t95 = * ARR_OFFSET[i, , ] long
[000096] ------------ t96 = LCL_VAR ref V00 arg0
/--* t96 ref
+--* t23 int
[000097] ---X-------- t97 = * ARR_INDEX[*,j, ] int
[000098] ------------ t98 = LCL_VAR ref V00 arg0
/--* t95 long
+--* t97 int
+--* t98 ref
[000099] ---X-------- t99 = * ARR_OFFSET[*,j, ] long
[000100] ------------ t100 = LCL_VAR ref V00 arg0
/--* t100 ref
+--* t24 int
[000101] ---X-------- t101 = * ARR_INDEX[*,*,k] int
[000102] ------------ t102 = LCL_VAR ref V00 arg0
/--* t99 long
+--* t101 int
+--* t102 ref
[000103] ---X-------- t103 = * ARR_OFFSET[*,*,k] long
[000104] ------------ t104 = LCL_VAR ref V00 arg0
/--* t104 ref
+--* t103 long
[000105] ---X-------- t105 = * LEA(b+(i*4)+40) byref
N007 ( 1, 1) [000025] ------------ t25 = LCL_VAR ref V01 arg1 u:1 $81
N008 ( 1, 1) [000026] ------------ t26 = LCL_VAR int V03 loc0 u:3 $140
N009 ( 1, 1) [000027] ------------ t27 = LCL_VAR int V04 loc1 u:3 $141
N010 ( 1, 1) [000028] ------------ t28 = LCL_VAR int V05 loc2 u:3 $142
[000106] -c---------- t106 = CNS_INT long 0
/--* t25 ref
+--* t26 int
[000107] ---X-------- t107 = * ARR_INDEX[i, , ] int
[000108] ------------ t108 = LCL_VAR ref V01 arg1
/--* t106 long
+--* t107 int
+--* t108 ref
[000109] ---X-------- t109 = * ARR_OFFSET[i, , ] long
[000110] ------------ t110 = LCL_VAR ref V01 arg1
/--* t110 ref
+--* t27 int
[000111] ---X-------- t111 = * ARR_INDEX[*,j, ] int
[000112] ------------ t112 = LCL_VAR ref V01 arg1
/--* t109 long
+--* t111 int
+--* t112 ref
[000113] ---X-------- t113 = * ARR_OFFSET[*,j, ] long
[000114] ------------ t114 = LCL_VAR ref V01 arg1
/--* t114 ref
+--* t28 int
[000115] ---X-------- t115 = * ARR_INDEX[*,*,k] int
[000116] ------------ t116 = LCL_VAR ref V01 arg1
/--* t113 long
+--* t115 int
+--* t116 ref
[000117] ---X-------- t117 = * ARR_OFFSET[*,*,k] long
[000118] ------------ t118 = LCL_VAR ref V01 arg1
/--* t118 ref
+--* t117 long
[000119] -c-X-------- t119 = * LEA(b+(i*4)+40) byref
/--* t119 byref
N012 ( 21, 14) [000030] ---X-------- t30 = * IND int <l:$103, c:$102>
/--* t105 byref
+--* t30 int
[000084] -A-X-------- * STOREIND int
There are several things to note here:
- This expansion happens in Lower, which is after all the optimization phases (VN, CSE, hoisting). Thus, this directly represents the code that will be generated. The expansion is done this way mostly to ensure the register allocator has visibility into all the required register lifetimes.
- The `ARR_INDEX` node both creates an "effective index" (the program index minus the dimension's lower bound) and performs a bounds check. This means that the bounds check can't be eliminated or hoisted (assuming any optimization was done on this form), as the `ARR_INDEX` node is still required to create the effective index value.
In addition (not shown here), System.Array has a number of methods used on multi-dimensional arrays that are not well optimized, e.g., they are not always inlined or treated as invariant: specifically `Rank`, `GetLowerBound`, `GetUpperBound`, and `GetLength`.
Also, loop cloning does not currently implement support for cloning loops with MD indexing expressions. It's interesting to note that MD arrays are simpler in one way compared to jagged arrays: the bounds for each dimension are invariant for a given array object, whereas for jagged arrays, the bounds of the "child" arrays might change. That is, for `a[i][j]`, an array bounds check on `a[i].Length` must be done for every varying `i` in a loop before accessing `a[i][j]`. However, for `a[i,j]` it may be possible to hoist the `a[i,]` dimension check.
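A minimal sketch of the difference (the arrays here are illustrative):

// Jagged: each "row" is an independent array object, so the inner bound
// varies with i (and a store to jagged[i] could even replace a row),
// forcing a re-check of jagged[i].Length for each new i.
int[][] jagged = { new int[2], new int[5] };  // rows of different lengths

// MD: the per-dimension lengths are fixed at allocation and stored once in
// the array header, so md.GetLength(1) is invariant for the array's lifetime.
int[,] md = new int[2, 5];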
For example, in the loop nest above, cloning could generate a single cloned "slow path" loop with all the cloning condition checks (essentially, bounds checks) outside the outer loop. In pseudo-code, this would be (ignoring non-zero lower bounds):
if ((n <= a.GetLength(0)) && (n <= a.GetLength(1)) && (n <= a.GetLength(2)) &&
    (n <= b.GetLength(0)) && (n <= b.GetLength(1)) && (n <= b.GetLength(2))) {
    for (int i = 0; i < n; i++) {
        for (int j = 0; j < n; j++) {
            for (int k = 0; k < n; k++) {
                a[i,j,k] = b[i,j,k]; // no bounds checks in loop
            }
        }
    }
} else {
    for (int i = 0; i < n; i++) {
        for (int j = 0; j < n; j++) {
            for (int k = 0; k < n; k++) {
                a[i,j,k] = b[i,j,k]; // ALL bounds checks
            }
        }
    }
}
Note that the slow path might be slower than optimal if only some cloning conditions fail, but code expansion would be less, and we assume that the slow path is rarely if ever run.
Currently, with jagged arrays, loop cloning might clone at every level of the loop nest, which creates far too much code expansion for nested loops.
Thus, to improve code generation:
1. MD array expansion should happen earlier, either in morph, or possibly as a separate phase after all the loop optimization phases. We don't want the loop optimizer (or any phase, for that matter) to do pattern matching on these complex lowered forms to reconstruct an understanding of the operation.
2. Split the `ARR_INDEX` node into one node that does a bounds check and one that computes the effective index. To avoid duplicating the effective-index calculation (needed by both parts), this might require a temp and a comma, e.g.,
   comma(
       tmp1 = dim-effective-index(array, dim, index),
           // tmp1 = index - array.GetLowerBound(dim)
           // If the lower bound is known to be zero, this is just `tmp1 = index`. In that case, if `index` is
           // simple (lcl_var/lcl_field/constant), we don't need tmp1 or this comma.
       comma(
           dim-bounds-check(array, tmp1),
               // Compare tmp1 against array.GetLength(dim). If we know it's in range, it can be removed.
               // It can be CSE'ed/hoisted.
           tmp1 // result is tmp1, the effective index
       )
   )
3. (1) and (2) should allow CSE and hoisting to kick in on common and loop-invariant sub-expressions.
4. Loop cloning should implement cloning of loops with MD array accesses, to eliminate unnecessary bounds checks. It should leverage the invariance of all loop dimension bounds to reduce code expansion.
5. Consider marking MD array helpers as AggressiveInlining to improve their optimization (as prototyped here).
You can see a description of the internal data layout format in the comments for the RawArrayData type and the CORINFO_Array type.
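For orientation, the rank-3 `int[,,]` header layout those comments describe lines up with the offsets in the x64 listing above; roughly, on 64-bit (a sketch; the linked comments are authoritative):

   +0   method table pointer
   +8   total element count (Length)
   +16  length of dimension 0        <- the [rcx+16]/[rcx+20]/[rcx+24] loads above
   +20  length of dimension 1
   +24  length of dimension 2
   +28  lower bound of dimension 0   <- the [rcx+28]/[rcx+32]/[rcx+36] loads above
   +32  lower bound of dimension 1
   +36  lower bound of dimension 2
   +40  element data (row-major)     <- the [rcx+4*r11+40] addressing above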
category:cq
theme:md-arrays
skill-level:expert
cost:large