# Local heap optimizations on Arm64
1. When the space allocated for the local heap does not need to be zeroed (for sizes up to 64 bytes), do not emit the zeroing sequence. Instead, probe the stack and adjust the stack pointer:
```diff
- stp xzr, xzr, [sp,#-16]!
- stp xzr, xzr, [sp,#-16]!
- stp xzr, xzr, [sp,#-16]!
- stp xzr, xzr, [sp,#-16]!
+ ldr wzr, [sp],#-64
```
2. For sizes smaller than `PAGE_SIZE`, use `ldr wzr, [sp], #-amount`, which probes at `[sp]` and allocates the space at the same time. This saves one instruction for such local heap allocations:
```diff
- ldr wzr, [sp]
- sub sp, sp, #208
+ ldr wzr, [sp],#-208
```
Use `ldp tmpReg, xzr, [sp], #-amount` when the offset is not encodable by the post-index variant of `ldr`:
```diff
- ldr wzr, [sp]
- sub sp, sp, #512
+ ldp x0, xzr, [sp],#-512
```
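The choice between the two forms comes down to immediate encodability. The following Python sketch models that decision; it is not the JIT's code. The function name, the 4096-byte `PAGE_SIZE`, the 16-byte alignment check, and the fallback for amounts above 512 bytes are assumptions made for illustration. The immediate ranges are those of the A64 encodings: post-index `ldr` takes a signed 9-bit byte offset (-256..255), and 64-bit post-index `ldp` takes a signed 7-bit offset scaled by 8 (-512..504).

```python
# Hypothetical model of the probe-and-allocate selection; not the JIT's code.
PAGE_SIZE = 4096  # assumed value; the real page size is target-dependent


def probe_and_allocate(amount, tmp_reg="x0"):
    """Return instructions that probe [sp] and lower sp by `amount` bytes."""
    if amount <= 0 or amount % 16 != 0:
        # Assumption: allocation sizes are already rounded up to 16 bytes.
        raise ValueError("amount must be a positive multiple of 16")
    if amount >= PAGE_SIZE:
        raise ValueError("this sketch only covers sizes below one page")

    # LDR (immediate, post-index): signed 9-bit byte offset, -256..255.
    if amount <= 256:
        return [f"ldr wzr, [sp],#-{amount}"]

    # LDP (64-bit, post-index): signed 7-bit offset scaled by 8, -512..504.
    if amount <= 512:
        return [f"ldp {tmp_reg}, xzr, [sp],#-{amount}"]

    # Assumed fallback: keep the original two-instruction form.
    return ["ldr wzr, [sp]", f"sub sp, sp, #{amount}"]


if __name__ == "__main__":
    for size in (64, 208, 512, 1024):
        print(size, probe_and_allocate(size))
```

With these assumptions, 64 and 208 select the single `ldr`, 512 selects the `ldp` form, and 1024 falls back to the separate probe and `sub`.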
3. Allow non-loop zeroing (i.e. unrolled sequence) for sizes up to 128 bytes (i.e. up to `LCLHEAP_UNROLL_LIMIT`). This frees up two internal integer registers for such cases:
```diff
- mov w11, #128
- ;; bbWeight=0.50 PerfScore 0.25
-G_M44913_IG19: ; gcrefRegs=00F9 {x0 x3 x4 x5 x6 x7}, byrefRegs=0000 {}, byref, isz
stp xzr, xzr, [sp,#-16]!
- subs x11, x11, #16
- bne G_M44913_IG19
+ stp xzr, xzr, [sp,#-112]!
+ stp xzr, xzr, [sp,#16]
+ stp xzr, xzr, [sp,#32]
+ stp xzr, xzr, [sp,#48]
+ stp xzr, xzr, [sp,#64]
+ stp xzr, xzr, [sp,#80]
+ stp xzr, xzr, [sp,#96]
```
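The shape of the unrolled sequence can be reproduced with a small generator. This is a Python sketch inferred from the 128-byte diff above, not the JIT's emitter: the function name is hypothetical, and the behavior for sizes other than those shown in the examples (for instance 16 or 32 bytes) is an assumption.

```python
# Hypothetical generator for the unrolled zeroing sequence; a sketch only.
LCLHEAP_UNROLL_LIMIT = 128  # bytes, the limit named in the text


def unrolled_zeroing(size):
    """Return the stp sequence that allocates and zeroes `size` bytes."""
    if size <= 0 or size % 16 != 0 or size > LCLHEAP_UNROLL_LIMIT:
        raise ValueError("size must be a multiple of 16 within the unroll limit")

    # The first store zeroes the 16 bytes directly below the incoming sp.
    seq = ["stp xzr, xzr, [sp,#-16]!"]
    remaining = size - 16
    if remaining == 0:
        return seq

    # One pre-indexed store moves sp to its final value and zeroes the
    # lowest 16 bytes of the allocation.
    seq.append(f"stp xzr, xzr, [sp,#-{remaining}]!")

    # The rest is filled with positive offsets from the final sp.
    for offset in range(16, remaining, 16):
        seq.append(f"stp xzr, xzr, [sp,#{offset}]")
    return seq


if __name__ == "__main__":
    print("\n".join(unrolled_zeroing(128)))
```

For 128 bytes this yields eight `stp` instructions and uses no counter register, which is where the freed internal registers come from.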
4. Do zeroing in ascending order of the effective address:
```diff
- mov w7, #96
-G_M49279_IG13:
stp xzr, xzr, [sp,#-16]!
- subs x7, x7, #16
- bne G_M49279_IG13
+ stp xzr, xzr, [sp,#-80]!
+ stp xzr, xzr, [sp,#16]
+ stp xzr, xzr, [sp,#32]
+ stp xzr, xzr, [sp,#48]
+ stp xzr, xzr, [sp,#64]
```
In the example, the zeroing is done at addresses `[initialSp-16]`, `[initialSp-96]`, `[initialSp-80]`, `[initialSp-64]`, `[initialSp-48]`, `[initialSp-32]`, in that order. The idea is to allow the CPU to detect the sequential `memset`-to-`0` pattern and switch into write streaming mode.
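The store order can be checked by tracking `sp` through the sequence. The Python sketch below is a toy interpreter written for this note (the function name and the regular expression are illustrative, and it only understands `stp xzr, xzr` with an `sp` base). It reports each store's address relative to the initial `sp` for both the old loop and the new 96-byte sequence.

```python
import re

# Toy interpreter for "stp xzr, xzr, [sp,#imm]" with optional pre-index "!".
_STP = re.compile(r"stp xzr, xzr, \[sp,#(-?\d+)\](!?)")


def stored_addresses(instructions, initial_sp=0):
    """Return each store's address relative to the initial sp, in order."""
    sp = initial_sp
    addresses = []
    for ins in instructions:
        match = _STP.fullmatch(ins.strip())
        if match is None:
            raise ValueError(f"unsupported instruction: {ins!r}")
        imm = int(match.group(1))
        if match.group(2):
            # Pre-index: sp is updated first, then the store goes to [sp].
            sp += imm
            addresses.append(sp - initial_sp)
        else:
            # Plain offset: sp is unchanged, the store goes to [sp + imm].
            addresses.append(sp + imm - initial_sp)
    return addresses


# Old code: the loop body executed six times for 96 bytes.
old_loop = ["stp xzr, xzr, [sp,#-16]!"] * 6

# New code: the unrolled sequence from the diff.
new_unrolled = [
    "stp xzr, xzr, [sp,#-16]!",
    "stp xzr, xzr, [sp,#-80]!",
    "stp xzr, xzr, [sp,#16]",
    "stp xzr, xzr, [sp,#32]",
    "stp xzr, xzr, [sp,#48]",
    "stp xzr, xzr, [sp,#64]",
]

if __name__ == "__main__":
    print(stored_addresses(old_loop))
    print(stored_addresses(new_unrolled))
```

The old loop writes at strictly descending addresses, while the new sequence is ascending after the first store.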