
Write barrier optimizations for ARM64 Windows #22003

Merged

Conversation

adityamandaleeka
Member

This change is a step towards unifying the ARM64 write barrier logic between Windows and Unix. It brings over some of the changes that were done for Unix in #12334, such as using a literal pool to hold the heap location/geometry information used in the barriers.
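
For context, the literal pool is a small data block emitted adjacent to the barrier code so the heap constants can be loaded PC-relative and re-patched whenever the GC's heap ranges change. A minimal sketch in armasm syntax follows; the label names are modeled on the Unix version from #12334 and are illustrative, not the exact layout from this change.

; Literal pool next to the barrier code; values are patched at runtime
; whenever the GC updates its heap ranges. (Label names are illustrative.)
wbs_begin
wbs_card_table      DCQ 0
wbs_ephemeral_low   DCQ 0
wbs_ephemeral_high  DCQ 0
wbs_lowest_address  DCQ 0
wbs_highest_address DCQ 0

; A barrier then loads the current values PC-relative:
    adr     x17, wbs_card_table
    ldr     x17, [x17]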

Parts of the code have been tweaked in pursuit of performance gains.

  • The cmp+ccmp pair has been replaced with two discrete cmp/branches to detangle it and allow for a faster exit if the lower bound check fails.
  • The ephemeral bounds are now being loaded simultaneously with an ldp rather than separately prior to each compare.
  • The shift for indexing into the card table has been separated out into its own instruction. (A sketch combining these tweaks follows the list.)
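
Roughly, the resulting hot path looks like the sketch below. This is illustrative only, not the exact code from this change; the register convention (x14 holds the destination address, x15 the reference being stored) is borrowed from the existing barriers, and the wbs_* labels are assumed from the Unix literal pool.

    ; one ldp loads both ephemeral bounds instead of two separate loads
    adr     x17, wbs_ephemeral_low
    ldp     x12, x13, [x17]
    ; two discrete cmp/branch pairs instead of cmp+ccmp
    cmp     x15, x12
    blo     Exit                    ; below ephemeral_low: exit immediately
    cmp     x15, x13
    bhs     Exit                    ; at/above ephemeral_high: exit
    ; card table update, with the shift split into its own instruction
    adr     x17, wbs_card_table
    ldr     x17, [x17]
    lsr     x12, x14, #11           ; card index = dest >> 11 (shift assumed
                                    ; from the Unix barrier)
    ldrb    w16, [x17, x12]
    cmp     w16, #0xFF
    beq     Exit                    ; card already dirty
    mov     w16, #0xFF
    strb    w16, [x17, x12]
Exit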

Sampling a write-barrier-heavy test after these changes shows a ~7-12% decrease in the time spent in the barrier, relative to the current optimized version in use on Unix today.

Once this is in, I plan to port the deltas over to Unix so that the barriers will be in sync. I left the CLR write watch and manually managed card bundle code alone on Windows for now, since it's not enabled yet, but I'll likely experiment with those in the near future and check in the remaining pieces after doing so.

Member

@davidwrighton davidwrighton left a comment


This looks correct to me. Nice perf wins discovered too.

@AndyAyersMS
Member

The cmp+ccmp pair has been replaced with two discrete cmp/branches to detangle it and allow for a faster exit if the lower bound check fails.

It looks like you changed from two compare/branches to two compares and one branch, not the other way round...

@adityamandaleeka
Member Author

@AndyAyersMS Sorry for the confusion. My comment was in reference to the ways my change diverges from the optimizations done on ARM64 Unix. With that in mind, the cmp+ccmp in JIT_WriteBarrier has been replaced with the two discrete cmps/branches.

The checked write barrier, where the cmp+ccmp pattern was also used, has not been modified in this way. It will be interesting to get data on whether the address tends to be within the bounds of the heap, but until then I'll leave it alone.
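
For reference, the retained cmp+ccmp form in the checked barrier folds the two heap-bounds compares into a single branch, which is a win when the address is usually in range. A sketch of that shape (labels and registers illustrative, with x14 as the destination address):

    ; is the destination inside the GC heap at all?
    adr     x17, wbs_lowest_address
    ldp     x12, x13, [x17]         ; heap lowest/highest addresses
    cmp     x14, x12
    ccmp    x14, x13, #0x2, hs      ; if dest >= lowest, compare to highest;
                                    ; otherwise force C set so "hs" holds
    bhs     NotInHeap               ; out of range either way: plain store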


@sdmaclea sdmaclea left a comment


LGTM. Thanks!

@adityamandaleeka
Member Author

@dotnet-bot test Windows_NT arm64 Cross Checked Innerloop Build and Test

Member

@janvorli janvorli left a comment


LGTM, thank you!

@@ -366,57 +466,58 @@ NotInHeap
 ; if ([x14] == x15) goto end
 ldr x13, [x14]
 cmp x13, x15
-beq shadowupdateend
+beq ShadowUpdateEnd
Member


A nit: can you please fix the alignment of the label?

@adityamandaleeka
Member Author

@dotnet-bot test Windows_NT arm64 Cross Checked Innerloop Build and Test

@adityamandaleeka adityamandaleeka merged commit 9fb7676 into dotnet:master Jan 24, 2019
picenka21 pushed a commit to picenka21/runtime that referenced this pull request Feb 18, 2022
…rrier_updates_arm64

Write barrier optimizations for ARM64 Windows

Commit migrated from dotnet/coreclr@9fb7676