Skip to content

RyuJIT: Improving multi-dimensional array accesses #60785

@BruceForstall

Description

@BruceForstall

It is well known that the RyuJIT generated multi-dimensional (hereafter, MD) array access is inefficient. Specifically, true MD array access, not "jagged" array access. This is a query of the properly "theme:md-arrays" tagged issues. For example, #8569 and #5481.

Here's a simple example of a function copying a 3-dimensional array, assuming zero-based indices and a length in each dimension of n:

static void copy(int[,,] a, int[,,] b, int n)
{
    for (int i = 0; i < n; i++) {
        for (int j = 0; j < n; j++) {
            for (int k = 0; k < n; k++) {
                a[i,j,k] = b[i,j,k];
            }
        }
    }
}

The x64 code generated for the inner loop is currently:

G_M4797_IG05:
       mov      r11d, eax
       sub      r11d, dword ptr [rcx+28]
       cmp      r11d, dword ptr [rcx+16]
       jae      G_M4797_IG09
       mov      esi, r9d
       sub      esi, dword ptr [rcx+32]
       cmp      esi, dword ptr [rcx+20]
       jae      G_M4797_IG09
       mov      edi, dword ptr [rcx+20]
       imul     rdi, r11
       mov      r11, rsi
       add      r11, rdi
       mov      esi, r10d
       sub      esi, dword ptr [rcx+36]
       cmp      esi, dword ptr [rcx+24]
       jae      G_M4797_IG09
       mov      edi, dword ptr [rcx+24]
       imul     rdi, r11
       mov      r11, rsi
       add      r11, rdi
       lea      r11, bword ptr [rcx+4*r11+40]
       mov      esi, eax
       sub      esi, dword ptr [rdx+28]
       cmp      esi, dword ptr [rdx+16]
       jae      SHORT G_M4797_IG09
       mov      edi, r9d
       sub      edi, dword ptr [rdx+32]
       cmp      edi, dword ptr [rdx+20]
       jae      SHORT G_M4797_IG09
       mov      ebx, dword ptr [rdx+20]
       imul     rbx, rsi
       mov      rsi, rdi
       add      rsi, rbx
       mov      edi, r10d
       sub      edi, dword ptr [rdx+36]
       cmp      edi, dword ptr [rdx+24]
       jae      SHORT G_M4797_IG09
       mov      ebx, dword ptr [rdx+24]
       imul     rbx, rsi
       mov      rsi, rdi
       add      rsi, rbx
       mov      esi, dword ptr [rdx+4*rsi+40]
       mov      dword ptr [r11], esi
       inc      r10d
       cmp      r10d, r8d
       jl       G_M4797_IG05

Things to note:

  1. There are 6 bounds checks, one for each dimension of each array. 4 of these (for i and j) should be hoisted out of the inner k loop. The other 2 could be removed by loop cloning.
  2. MD arrays, in contrast to single-dimension so-called "SZ" arrays, can have a non-zero lower bound in each dimension. Thus, the index must subtract that before comparing against the dimension length. This feature is rarely used (C# doesn't have syntax to create it; you need to use the one System.Array.CreateInstance function that allows for specifying the lower bounds), but must be supported.
  3. The base address of the k dimension data (a byref into the array) could be hoisted out of the inner loop.

Thus, we should be able to generate something like this in the inner loop:

G_M4797_IG05:
       mov      esi, dword ptr [rdx+4*r10]
       mov      dword ptr [rcx+4*r10], esi
       inc      r10d
       cmp      r10d, r8d
       jl       G_M4797_IG05

Currently, after importation, the array access looks like:

[000033] -A-X--------              *  ASG       int                     
[000032] ---X---N----              +--*  IND       int                  
[000031] ---X--------              |  \--*  ARR_ELEM[,,] byref          
[000021] ------------              |     +--*  LCL_VAR   ref    V00 arg0
[000022] ------------              |     +--*  LCL_VAR   int    V03 loc0
[000023] ------------              |     +--*  LCL_VAR   int    V04 loc1
[000024] ------------              |     \--*  LCL_VAR   int    V05 loc2
[000030] ---X--------              \--*  IND       int                  
[000029] ---X--------                 \--*  ARR_ELEM[,,] byref          
[000025] ------------                    +--*  LCL_VAR   ref    V01 arg1
[000026] ------------                    +--*  LCL_VAR   int    V03 loc0
[000027] ------------                    +--*  LCL_VAR   int    V04 loc1
[000028] ------------                    \--*  LCL_VAR   int    V05 loc2

This form persists until lowering, when it is expanded to:

N001 (  1,  1) [000021] ------------        t21 =    LCL_VAR   ref    V00 arg0         u:1 $80
N002 (  1,  1) [000022] ------------        t22 =    LCL_VAR   int    V03 loc0         u:3 $140
N003 (  1,  1) [000023] ------------        t23 =    LCL_VAR   int    V04 loc1         u:3 $141
N004 (  1,  1) [000024] ------------        t24 =    LCL_VAR   int    V05 loc2         u:3 $142
               [000092] -c----------        t92 =    CNS_INT   long   0
                                                  /--*  t21    ref    
                                                  +--*  t22    int    
               [000093] ---X--------        t93 = *  ARR_INDEX[i, , ] int   
               [000094] ------------        t94 =    LCL_VAR   ref    V00 arg0         
                                                  /--*  t92    long   
                                                  +--*  t93    int    
                                                  +--*  t94    ref    
               [000095] ---X--------        t95 = *  ARR_OFFSET[i, , ] long  
               [000096] ------------        t96 =    LCL_VAR   ref    V00 arg0         
                                                  /--*  t96    ref    
                                                  +--*  t23    int    
               [000097] ---X--------        t97 = *  ARR_INDEX[*,j, ] int   
               [000098] ------------        t98 =    LCL_VAR   ref    V00 arg0         
                                                  /--*  t95    long   
                                                  +--*  t97    int    
                                                  +--*  t98    ref    
               [000099] ---X--------        t99 = *  ARR_OFFSET[*,j, ] long  
               [000100] ------------       t100 =    LCL_VAR   ref    V00 arg0         
                                                  /--*  t100   ref    
                                                  +--*  t24    int    
               [000101] ---X--------       t101 = *  ARR_INDEX[*,*,k] int   
               [000102] ------------       t102 =    LCL_VAR   ref    V00 arg0         
                                                  /--*  t99    long   
                                                  +--*  t101   int    
                                                  +--*  t102   ref    
               [000103] ---X--------       t103 = *  ARR_OFFSET[*,*,k] long  
               [000104] ------------       t104 =    LCL_VAR   ref    V00 arg0         
                                                  /--*  t104   ref    
                                                  +--*  t103   long   
               [000105] ---X--------       t105 = *  LEA(b+(i*4)+40) byref 
N007 (  1,  1) [000025] ------------        t25 =    LCL_VAR   ref    V01 arg1         u:1 $81
N008 (  1,  1) [000026] ------------        t26 =    LCL_VAR   int    V03 loc0         u:3 $140
N009 (  1,  1) [000027] ------------        t27 =    LCL_VAR   int    V04 loc1         u:3 $141
N010 (  1,  1) [000028] ------------        t28 =    LCL_VAR   int    V05 loc2         u:3 $142
               [000106] -c----------       t106 =    CNS_INT   long   0
                                                  /--*  t25    ref    
                                                  +--*  t26    int    
               [000107] ---X--------       t107 = *  ARR_INDEX[i, , ] int   
               [000108] ------------       t108 =    LCL_VAR   ref    V01 arg1         
                                                  /--*  t106   long   
                                                  +--*  t107   int    
                                                  +--*  t108   ref    
               [000109] ---X--------       t109 = *  ARR_OFFSET[i, , ] long  
               [000110] ------------       t110 =    LCL_VAR   ref    V01 arg1         
                                                  /--*  t110   ref    
                                                  +--*  t27    int    
               [000111] ---X--------       t111 = *  ARR_INDEX[*,j, ] int   
               [000112] ------------       t112 =    LCL_VAR   ref    V01 arg1         
                                                  /--*  t109   long   
                                                  +--*  t111   int    
                                                  +--*  t112   ref    
               [000113] ---X--------       t113 = *  ARR_OFFSET[*,j, ] long  
               [000114] ------------       t114 =    LCL_VAR   ref    V01 arg1         
                                                  /--*  t114   ref    
                                                  +--*  t28    int    
               [000115] ---X--------       t115 = *  ARR_INDEX[*,*,k] int   
               [000116] ------------       t116 =    LCL_VAR   ref    V01 arg1         
                                                  /--*  t113   long   
                                                  +--*  t115   int    
                                                  +--*  t116   ref    
               [000117] ---X--------       t117 = *  ARR_OFFSET[*,*,k] long  
               [000118] ------------       t118 =    LCL_VAR   ref    V01 arg1         
                                                  /--*  t118   ref    
                                                  +--*  t117   long   
               [000119] -c-X--------       t119 = *  LEA(b+(i*4)+40) byref 
                                                  /--*  t119   byref  
N012 ( 21, 14) [000030] ---X--------        t30 = *  IND       int    <l:$103, c:$102>
                                                  /--*  t105   byref  
                                                  +--*  t30    int    
               [000084] -A-X--------              *  STOREIND  int   

There are several things to note, here:

  1. This expansion happens in Lower, which is after all the optimization phases (VN, CSE, hoisting). Thus, this directly represents the code that will be generated. The expansion is done this way mostly to ensure the register allocator has visibility into all the required register lifetimes.
  2. The ARR_INDEX node both creates an "effective index" (the program index minus the dimension's lower bounds) as well as performs a bounds check. This means that the bounds check can't be eliminated, or hoisted (assuming any optimization was done on this form), as the ARR_INDEX node is still required to create the effective index value.

In addition, not shown here, System.Array has a number of methods used on multi-dimensional arrays, that are not well optimized, e.g., are not all inlined or treated as invariant. Specifically, Rank, GetLowerBound, GetUpperBound, GetLength.

Also, loop cloning does not currently implement support for cloning loops with MD indexing expressions. It's interesting to note that MD arrays are simpler in one way compared to jagged arrays: the bounds for each dimension is invariant on an array object, whereas for jagged arrays, the array bounds for "child" arrays might change. That is, for a[i][j], an array bounds check on a[i].Length must be done for every varying i in a loop before accessing a[i][j]. However, for a[i,j] it may be possible to hoist the a[i,] dimension check.

For example, in the above example, cloning could generate a single cloned "slow path" loop with all the cloning condition checks (essentially, bounds checks) outside the outer loop. In pseudo-code, this would be (ignoring non-zero lower bounds):

if ((n <= a.GetLength(0)) && (n <= a.GetLength(1)) && (n <= a.GetLength(2))) {
    for (int i = 0; i < n; i++) {
        for (int j = 0; j < n; j++) {
            for (int k = 0; k < n; k++) {
                a[i,j,k] = b[i,j,k]; // no bounds check in loop
            }
        }
    }
} else {
    for (int i = 0; i < n; i++) {
        for (int j = 0; j < n; j++) {
            for (int k = 0; k < n; k++) {
                a[i,j,k] = b[i,j,k]; // ALL bounds check
            }
        }
    }
}

Note that the slow path might be slower than optimal if only some cloning conditions fail, but code expansion would be less, and we assume that the slow path is rarely if ever run.

Currently, with jagged arrays, loop cloning might clone at every level of the loop nest, which creates far too much code expansion for nested loops.

Thus, to improve code generation:

  1. MD array expansion should happen earlier, either in morph, or possibly as a separate phase after all the loop optimization phases. We don't want the loop optimizer (or any phase, for that matter) to do pattern matching on these complex lowered forms to reconstruct an understanding of the operation.
  2. Split the ARR_INDEX node into one that does a bounds check, and one that computes the effective index. To avoid duplicating the effective index calculation (needed by both parts), this might require a temp and comma, e.g.,
comma(
  tmp1 = dim-effective-index(array, dim, index),
     // tmp1 = index - array.GetLowerBound(dim)
     // if the lower bound is known to be zero, this is just `tmp1 = index`. In that case, if `index` is simple
     // (lcl_var/lcl_field/constant), we don't need tmp1 or this comma.
  comma(
      dim-bounds-check(array, tmp1),
          // compare tmp1 against array.GetLength(dim). If we know it's in range, it can be removed. It can be CSE'ed/hoisted.
      tmp1 // result is tmp1, the effective index
  )
)
  1. (1) and (2) should allow CSE and hoisting to kick in on common and loop-invariant sub-expressions.
  2. Loop cloning should implement cloning of loops with MD array accesses, to eliminate unnecessary bounds checks. It should leverage the invariance of all loop dimension bounds to reduce code expansion.
  3. Consider marking MD array helpers as AggressiveInlining to improve their optimization (as prototyped here).

You can see a description of the internal data layout format in the comments for the RawArrayData type and the CORINFO_Array type.

category:cq
theme:md-arrays
skill-level:expert
cost:large

Metadata

Metadata

Assignees

Labels

area-CodeGen-coreclrCLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMIoptimizationtenet-performancePerformance related issue

Type

No type

Projects

No projects

Milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions