Description
Orchard CMS is one of the most complicated benchmarks in our test suite (way more complicated than the TE benchmarks) so, presumably, it's closer to real-world workloads. Currently, ARM64 is twice as slow as x64 (comparable HW) on it, while for the TE benchmarks the same ARM64 hardware is typically 1.2-1.5x faster than that x64 AMD machine.
The perf trace is pointing to JIT_NewS_MP_FastPortable (sorted by 'self', aka exclusive, time).
The annotated asm for it:
Or (output of objdump -d):
000000000033fbf0 <_Z24JIT_NewS_MP_FastPortableP21CORINFO_CLASS_STRUCT_>:
33fbf0: a9bf7bfd stp x29, x30, [sp, #-16]!
33fbf4: 910003fd mov x29, sp
33fbf8: aa0003e8 mov x8, x0
33fbfc: 90001920 adrp x0, 663000 <_ZTV12EEJitManager+0x50>
33fc00: f947dc01 ldr x1, [x0, #4024]
33fc04: 913ee000 add x0, x0, #0xfb8
33fc08: d63f0020 blr x1
33fc0c: d53bd049 mrs x9, tpidr_el0
33fc10: b940050a ldr w10, [x8, #4]
33fc14: f8606929 ldr x9, [x9, x0]
33fc18: a945ad20 ldp x0, x11, [x9, #88]
33fc1c: cb00016b sub x11, x11, x0
33fc20: eb0a017f cmp x11, x10
33fc24: 54000082 b.cs 33fc34 <_Z24JIT_NewS_MP_FastPortableP21CORINFO_CLASS_STRUCT_+0x44> // b.hs, b.nlast
33fc28: aa0803e0 mov x0, x8
33fc2c: a8c17bfd ldp x29, x30, [sp], #16
33fc30: 14000006 b 33fc48 <_Z7JIT_NewP21CORINFO_CLASS_STRUCT_>
33fc34: 8b0a000a add x10, x0, x10
33fc38: f9002d2a str x10, [x9, #88]
33fc3c: f9000008 str x8, [x0]
33fc40: a8c17bfd ldp x29, x30, [sp], #16
33fc44: d65f03c0 ret
If I read this correctly, we spend most of the time reading data from TLS (to get the current allocator?). I'd expect to see the time spent inside the fallback call to the GC, but that's not the case (if the trace is accurate). Another hint from the general trace is that we don't really spend a lot of time in the GC itself.
PS: what is the blr x1 call? I presume it is part of GetThread().
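For readers following the listing, here is a rough C++ sketch of what the fast path appears to do, reconstructed only from the disassembly above. It is not the runtime's actual source: AllocContext, MethodTable, t_ctx, FastAlloc and SlowAlloc are hypothetical names, and the field layout is simplified. The thread_local read stands in for the adrp/ldr/add/blr plus mrs tpidr_el0 sequence, which looks like a TLS access (my reading of the listing, not something the trace confirms).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical stand-ins for runtime types; names and layout are illustrative only.
struct AllocContext {
    std::uint8_t* alloc_ptr;    // next free byte (the value at [x9, #88] in the listing)
    std::uint8_t* alloc_limit;  // end of the thread-local buffer ([x9, #96])
};

struct MethodTable {
    std::uint32_t flags;
    std::uint32_t base_size;    // read by `ldr w10, [x8, #4]`
};

// Per-thread allocation context; reading it is the TLS access in question.
static thread_local AllocContext* t_ctx = nullptr;

// Stand-in for the slow path (the tail call to JIT_New); it only counts calls.
static int g_slow_calls = 0;
void* SlowAlloc(MethodTable* mt) {
    (void)mt;
    ++g_slow_calls;
    return nullptr;
}

void* FastAlloc(MethodTable* mt) {
    AllocContext* ctx = t_ctx;               // TLS read
    assert(ctx != nullptr);
    std::size_t size = mt->base_size;
    std::uint8_t* p = ctx->alloc_ptr;
    std::size_t remaining = static_cast<std::size_t>(ctx->alloc_limit - p);
    if (remaining < size) {
        return SlowAlloc(mt);                // not enough room: fall back
    }
    ctx->alloc_ptr = p + size;               // bump the pointer
    std::memcpy(p, &mt, sizeof mt);          // store the type pointer in the object header
    return p;
}
```

If this reading is right, everything after the TLS read is a couple of loads, a compare and two stores, which would be consistent with the time showing up around the TLS access rather than in the fallback.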