JIT_NewS_MP_FastPortable is slow for Orchard CMS benchmark on Linux-arm64

Orchard CMS is one of the most complicated benchmarks in our test suite (way more complicated than the TE benchmarks), so presumably it's closer to real-world workloads. Currently, ARM64 is twice as slow as x64 (comparable HW) on it, while for the TE benchmarks the same ARM64 hardware is typically 1.2-1.5x faster than that x64 AMD machine.
The `perf` trace is pointing to `JIT_NewS_MP_FastPortable`: 
![image](https://github.com/dotnet/runtime/assets/523221/8683ce64-56dd-441e-8d40-5833fcef5496)

(sorted by 'self', aka exclusive, time).

The annotated asm for it:
![image](https://github.com/dotnet/runtime/assets/523221/6927650b-ba32-45cd-a35b-1330b5223252)
Or (output of `objdump -d`):
```asm
000000000033fbf0 <_Z24JIT_NewS_MP_FastPortableP21CORINFO_CLASS_STRUCT_>:
  33fbf0:	a9bf7bfd 	stp	x29, x30, [sp, #-16]!
  33fbf4:	910003fd 	mov	x29, sp
  33fbf8:	aa0003e8 	mov	x8, x0
  33fbfc:	90001920 	adrp	x0, 663000 <_ZTV12EEJitManager+0x50>
  33fc00:	f947dc01 	ldr	x1, [x0, #4024]
  33fc04:	913ee000 	add	x0, x0, #0xfb8
  33fc08:	d63f0020 	blr	x1
  33fc0c:	d53bd049 	mrs	x9, tpidr_el0
  33fc10:	b940050a 	ldr	w10, [x8, #4]
  33fc14:	f8606929 	ldr	x9, [x9, x0]
  33fc18:	a945ad20 	ldp	x0, x11, [x9, #88]
  33fc1c:	cb00016b 	sub	x11, x11, x0
  33fc20:	eb0a017f 	cmp	x11, x10
  33fc24:	54000082 	b.cs	33fc34 <_Z24JIT_NewS_MP_FastPortableP21CORINFO_CLASS_STRUCT_+0x44>  // b.hs, b.nlast
  33fc28:	aa0803e0 	mov	x0, x8
  33fc2c:	a8c17bfd 	ldp	x29, x30, [sp], #16
  33fc30:	14000006 	b	33fc48 <_Z7JIT_NewP21CORINFO_CLASS_STRUCT_>
  33fc34:	8b0a000a 	add	x10, x0, x10
  33fc38:	f9002d2a 	str	x10, [x9, #88]
  33fc3c:	f9000008 	str	x8, [x0]
  33fc40:	a8c17bfd 	ldp	x29, x30, [sp], #16
  33fc44:	d65f03c0 	ret
```

If I read this correctly, we spend most of the time reading data from TLS ([to get the current allocator](https://github.com/dotnet/runtime/blob/6aca07ae5e939f3f957fbe7ea94ca1cce6035024/src/coreclr/vm/jithelpers.cpp#L2437)?). I'd expect to see the time spent inside the fallback call to the GC, but that is not the case (if the trace is accurate). Another hint from the general trace is that we don't really spend a lot of time in the GC itself.
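For reference, the fast path in the disassembly above appears to boil down to a thread-local bump-pointer allocation. Below is a minimal C++ sketch of that shape, not the actual runtime code: `TypeDesc`, `AllocContext`, `t_alloc_context`, `SlowAlloc` and `FastAlloc` are hypothetical names, and the field layout is only loosely modeled on the offsets seen in the asm.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <new>

// Hypothetical type descriptor: the asm reads a 32-bit size at offset 4.
struct TypeDesc {
    std::uint32_t flags;
    std::uint32_t base_size;  // object size in bytes
};

// Hypothetical per-thread allocation context (bump pointer + limit).
struct AllocContext {
    std::uint8_t* alloc_ptr;
    std::uint8_t* alloc_limit;
};

// The only thread-local state touched on the fast path.
static thread_local AllocContext t_alloc_context{nullptr, nullptr};

// Counts slow-path calls so the sketch can be observed from outside.
static int g_slow_path_calls = 0;

// Stand-in for the fallback (the tail call to JIT_New in the asm).
void* SlowAlloc(const TypeDesc* type) {
    ++g_slow_path_calls;
    void* mem = ::operator new(type->base_size);
    std::memcpy(mem, &type, sizeof(type));  // first word = type pointer
    return mem;
}

void* FastAlloc(const TypeDesc* type) {
    assert(type != nullptr && type->base_size >= sizeof(type));
    AllocContext* ctx = &t_alloc_context;   // the TLS access
    std::uint8_t* result = ctx->alloc_ptr;
    std::size_t size = type->base_size;
    std::size_t remaining =
        static_cast<std::size_t>(ctx->alloc_limit - result);
    if (remaining < size) {
        return SlowAlloc(type);             // not enough room: fall back
    }
    ctx->alloc_ptr = result + size;         // bump the pointer
    std::memcpy(result, &type, sizeof(type));  // store the type pointer
    return result;
}
```

Apart from the TLS lookup, the fast path is two loads, a compare and two stores, which is why the TLS access stands out if the sampled addresses are accurate.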

PS: what is the `blr x1` call? I presume it is part of `GetThread()`.
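One possible reading (an assumption, not verified against this build): the `adrp` / `ldr x1` / `add x0` / `blr x1` sequence followed by `mrs x9, tpidr_el0` looks like the AArch64 TLS-descriptor access pattern that compilers emit for a `thread_local` variable in position-independent code, where the indirect call returns the variable's offset from the thread pointer in `x0`. The sketch below uses two made-up variables to compare the default TLS model with the GCC/Clang `tls_model("initial-exec")` attribute; compiling it with `-O2 -fPIC` and disassembling both functions should show whether the indirect call appears only in the first one.

```cpp
#include <cstdint>

// Default model: in a shared library built with -fPIC, the compiler
// typically cannot assume a fixed offset from the thread pointer, so
// the access may go through a TLS descriptor (an indirect call).
thread_local std::uint64_t t_counter_default = 0;

// Hypothetical comparison variable: requests the initial-exec model,
// where the offset is loaded from the GOT without a call.
// The attribute is a GCC/Clang extension.
thread_local std::uint64_t t_counter_initial_exec
    __attribute__((tls_model("initial-exec"))) = 0;

// Each function touches exactly one thread-local, so their
// disassembly can be compared side by side.
std::uint64_t BumpDefault() { return ++t_counter_default; }

std::uint64_t BumpInitialExec() { return ++t_counter_initial_exec; }
```

Both functions behave identically at the source level; only the generated TLS access sequence is expected to differ.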

cc @mangod9 @kunalspathak 

JIT_NewS_MP_FastPortable is slow for Orchard CMS benchmark on Linux-arm64 #99552