[macOS] net6-macos performance is 26x slower than the equivalent xamarin.mac #8890

Closed
jeromelaban opened this issue May 26, 2022 · 12 comments
Labels
area/performance 📈 Categorizes an issue or PR as relevant to performance difficulty/challenging 🤯 Categorizes an issue for which the difficulty level is reachable with internals understanding kind/bug Something isn't working platform/macos 🍏 Categorizes an issue or PR as relevant to the macOS platform

Comments

@jeromelaban
Member

Current behavior

Run this xamarin.mac sample, then this sample for net6.0-macos.

You'll notice that the net6-macos performance is significantly lower than xamarin.mac performance.

Expected behavior

The performance is at least equal for both targets.

How to reproduce it (as minimally and precisely as possible)

No response

Workaround

No response

Works on UWP/WinUI

No response

Environment

No response

NuGet package version(s)

4.3.6

Affected platforms

macOS

IDE

No response

IDE version

No response

Relevant plugins

No response

Anything else we need to know?

No response

@jeromelaban jeromelaban added kind/bug Something isn't working triage/untriaged Indicates an issue requires triaging or verification difficulty/tbd Categorizes an issue for which the difficulty level needs to be defined. area/performance 📈 Categorizes an issue or PR as relevant to performance and removed triage/untriaged Indicates an issue requires triaging or verification labels May 26, 2022
@spouliot
Contributor

I ran both samples, full screen (HD), with release builds

| macOS 12.3.1 / arm64 | Build | Change | Reuse | Grid |
|---|---|---|---|---|
| Xamarin.Mac Legacy | 853.22 | 336635.53 | 909.64 | 5356.02 |
| net6.0-macos | 667.86 | 82385.28 | [1] | [1] |

That makes XM/legacy between 1.28 times (Build) and 4.09 times (Change) faster than net6.0-macos.
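The ratios quoted above can be re-derived from the table. This is a minimal Python sketch, assuming the Dope scores are throughput-style numbers where higher is better (as the "faster" wording implies). The dictionary and function names are illustrative only.

```python
# Dope scores copied from the table above (higher is assumed to be better).
SCORES = {
    "Xamarin.Mac Legacy": {"Build": 853.22, "Change": 336635.53},
    "net6.0-macos": {"Build": 667.86, "Change": 82385.28},
}


def ratio(numerator: str, denominator: str, test: str) -> float:
    """Return how many times higher the `numerator` score is for `test`."""
    return SCORES[numerator][test] / SCORES[denominator][test]


for test in ("Build", "Change"):
    r = ratio("Xamarin.Mac Legacy", "net6.0-macos", test)
    print(f"{test}: XM legacy scores {r:.2f}x the net6.0-macos number")
```

Reuse and Grid are left out because there are no net6.0-macos numbers to divide by.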

[1] Code is currently commented inside the repo

I don't get anything as dramatic as 26x, but I might be doing some things differently/incorrectly (or that number is about a subset of the benchmark sample). I'm also doing this on an M1-based Mac, but running Intel binaries, while slower, does not change the ratios significantly.

Unlike iOS there's no way to disable ObjC exceptions marshalling on CoreCLR 😢

I'll profile the sample to see if there are other places where the slowdown occurs...

@spouliot
Contributor

spouliot commented May 27, 2022

Access to NSView.Subviews was slowed down (inside Xamarin) by calling native code too often for a constant (instead of caching its constant value).

Screen Shot 2022-05-27 at 1 07 28 PM

That's more than 20 seconds out of the 2:30m benchmark run. Complete data attached to xamarin/xamarin-macios#15145

Fix xamarin/xamarin-macios#15146

Note: the same is also true of UIView.Subviews and other NSArray collections. That fix should benefit XI, XM and net6 performance. IOW it might not change the ratio by a lot - but all numbers should be better (compared to other platforms).
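To illustrate the kind of fix described above, here is a minimal Python sketch of caching a constant that is expensive to fetch. It is not the xamarin-macios code: `NativeSide`, `CachedReader`, and the call counter are hypothetical stand-ins for a native call whose result never changes.

```python
class NativeSide:
    """Hypothetical stand-in for an expensive managed-to-native call."""

    def __init__(self) -> None:
        self.calls = 0  # how many times the "native" side was hit

    def fetch_constant(self) -> int:
        self.calls += 1
        return 42  # the value never changes, so it is safe to cache


class CachedReader:
    """Fetches the constant once, then serves it from a field."""

    def __init__(self, native: NativeSide) -> None:
        self._native = native
        self._value = None

    def constant(self) -> int:
        if self._value is None:
            self._value = self._native.fetch_constant()
        return self._value


# Uncached: every access crosses into "native" code.
uncached = NativeSide()
for _ in range(1000):
    uncached.fetch_constant()

# Cached: only the first access crosses over.
cached_native = NativeSide()
reader = CachedReader(cached_native)
for _ in range(1000):
    reader.constant()

print(uncached.calls, cached_native.calls)
```

The saving only matters when the lookup sits on a hot path, which is what the profile suggests for collection accessors like Subviews.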

Dopes on Xamarin.Mac (legacy) comparison

| macOS 12.3.1 / arm64 | Build | Change | Reuse | Grid |
|---|---|---|---|---|
| Xamarin.Mac (before) | 853.22 | 336635.53 | 909.64 | 5356.02 |
| Xamarin.Mac (after) | 1389.93 | 413811.25 | 1635.81 | 5402.95 |
| Speedup | 1.63x | 1.23x | 1.80x | 1.01x |

note: Legacy was used since it's easier/faster to update and measure (and test two more cases)

@spouliot
Contributor

The same data shows another 6.25 seconds (out of the 2:30m benchmark run) is being spent inside IsUserType (inside Xamarin's runtime).

Screen Shot 2022-05-28 at 11 32 58 AM

Fix xamarin/xamarin-macios#15149

Just like the previous fix/PR this is not specific to macOS or CoreCLR and will benefit XI, XM and net6 performance. Still the performance ratio might vary a bit (since native calls are reduced and more costly on CoreCLR).

Dopes on Xamarin.Mac (legacy) comparison

| macOS 12.3.1 / arm64 | Build | Change | Reuse | Grid |
|---|---|---|---|---|
| Xamarin.Mac (before) | 853.22 | 336635.53 | 909.64 | 5356.02 |
| Xamarin.Mac (after) | 881.60 | 381309.84 | 943.66 | 6543.45 |
| Speedup | 1.03x | 1.13x | 1.04x | 1.22x |

note: the above numbers do not include the perf enhancement of the previous fix

@jeromelaban
Member Author

Very nice findings!! This could explain some of the remaining differences we found when running net6.0-ios.

@spouliot
Contributor

The current dotnet project file (.csproj) does not link/trim on macOS release builds, while it's specified for XM legacy.

diff --git a/src/dopes/Uno-dotnet6/DopeTestUno/DopeTestUno.Mobile/DopeTestUno.Mobile.csproj b/src/dopes/Uno-dotnet6/DopeTestUno/DopeTestUno.Mobile/DopeTestUno.Mobile.csproj
index a396d11..49ba4a7 100644
--- a/src/dopes/Uno-dotnet6/DopeTestUno/DopeTestUno.Mobile/DopeTestUno.Mobile.csproj
+++ b/src/dopes/Uno-dotnet6/DopeTestUno/DopeTestUno.Mobile/DopeTestUno.Mobile.csproj
@@ -66,6 +66,8 @@
                </When>
                <When Condition="'$(TargetFramework)'=='net6.0-macos'">
                        <PropertyGroup>
+                               <RuntimeIdentifier>osx-arm64</RuntimeIdentifier>
+                               <TrimMode Condition="'$(Configuration)'=='Release'">link</TrimMode>
                        </PropertyGroup>
                </When>
        </Choose>

Besides the usual trimming benefits, the Xamarin linker does several binding optimizations (fewer for macOS since it allows JITing), like removing non-required code branches (that won't have to execute at runtime).

| macOS 12.3.1 / arm64 | Build | Change | Reuse | Grid |
|---|---|---|---|---|
| Xamarin.Mac Legacy | 853.22 | 336635.53 | 909.64 | 5356.02 |
| net6.0-macos | 667.86 | 82385.28 | [1] | [1] |
| net6.0-macos (link) | 730.99 | 100016.73 | [1] | [1] |

[1] Code is currently commented inside the repo

Once things are merged I'll update the numbers and start looking at the native/unmanaged parts (even if I strongly suspect the lack of option to disable ObjC exceptions marshalling on CoreCLR to be the main culprit).

@rolfbjarne

note: Legacy was used since it's easier/faster to update and measure (and test two more cases)

One thought: you could enable exception marshalling on Legacy and see how much that slows it down. Then you could also enable the pinvoke wrapper code (xamarin/xamarin-macios#14961) and see how much speed you gain back.

@spouliot
Contributor

Updated numbers with the compounded effect of several fixes, in particular those three

Speedups with an updated Xamarin Mac SDK

| macOS 12.3.1 / arm64 | Build | Change | Reuse | Grid |
|---|---|---|---|---|
| Xamarin.Mac Legacy (original) | 853.22 | 336635.53 | 909.64 | 5356.02 |
| Xamarin.Mac Legacy (updated) | 1594.97 | 465188.25 | 1903.63 | 6746.49 |
| Speedup | 1.87x | 1.38x | 2.09x | 1.26x |

Impact of ObjC exception marshalling

Next this compares numbers when (the updated XM) has ObjC exception marshalling enabled (--marshal-objectivec-exceptions:throwmanagedexception). This will ease comparison with dotnet numbers (where this setting is the default and cannot be disabled).

| macOS 12.3.1 / arm64 | Build | Change | Reuse | Grid |
|---|---|---|---|---|
| Xamarin.Mac Legacy (updated) | 1594.97 | 465188.25 | 1903.63 | 6746.49 |
| Xamarin.Mac Legacy (w/exc marsh) | 1380.25 | 102693.28 | 1724.54 | 3956.16 |
| Impact (slowdown) | 0.87x | 0.22x | 0.91x | 0.59x |

In theory [1] we should see a similar ratio between XM legacy and the dotnet numbers.

The original numbers (at the top of the issue) showed

| macOS 12.3.1 / arm64 | Build | Change | Reuse | Grid |
|---|---|---|---|---|
| Xamarin.Mac Legacy | 853.22 | 336635.53 | 909.64 | 5356.02 |
| net6.0-macos | 667.86 | 82385.28 | [1] | [1] |
| Impact (slowdown) | 0.78x | 0.24x | N/A | N/A |

which are close but hint at something more in case of the Build numbers [2]. Comparison with the updated numbers to be done shortly...

[1] that the slowdown is caused only by the ObjC exception marshalling
[2] the object creation code differs quite a bit between mono and coreclr
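The comparison above can be written out explicitly. This is a small Python sketch using only the Build and Change numbers from the two tables. The variable names and the idea of printing the difference between the two slowdowns are illustrative framing, not part of the original measurements.

```python
# Scores copied from the tables above (Build and Change only).
xm_updated = {"Build": 1594.97, "Change": 465188.25}
xm_exc_marshal = {"Build": 1380.25, "Change": 102693.28}
xm_original = {"Build": 853.22, "Change": 336635.53}
net6_original = {"Build": 667.86, "Change": 82385.28}


def slowdown(slow: dict, fast: dict) -> dict:
    """Per-test ratio slow/fast; values below 1.0 mean a slowdown."""
    return {test: slow[test] / fast[test] for test in fast}


# Cost of turning ObjC exception marshalling on, measured on XM legacy.
marshalling_cost = slowdown(xm_exc_marshal, xm_updated)
# Original gap between net6.0-macos and XM legacy.
net6_gap = slowdown(net6_original, xm_original)

for test in ("Build", "Change"):
    diff = marshalling_cost[test] - net6_gap[test]
    print(
        f"{test}: marshalling {marshalling_cost[test]:.2f}x, "
        f"net6 vs XM {net6_gap[test]:.2f}x, difference {diff:+.2f}"
    )
```

If marshalling were the only cause, the two ratios would match for each test. The Build ratios are further apart than the Change ratios, which is what footnote [2] hints at.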

@spouliot
Contributor

Comparing with ObjC exception marshalling enabled

This is XM (legacy) with ObjC exception marshalling (enabled) and net6.0-macos (where it's always enabled). A ratio of 1.0 means identical performance, more than 1.0 means increased performance (for dotnet) and less than 1.0 means decreased performance (for dotnet).

| macOS 12.3.1 / arm64 | Build | Change | Reuse | Grid |
|---|---|---|---|---|
| Xamarin.Mac Legacy (w/exc marsh) | 1380.25 | 102693.28 | 1724.54 | 3956.16 |
| net-6.0 (updated) | 1234.81 | 112575.12 | [1] | [1] |
| Impact (slowdown) | 0.89x | 1.10x | N/A | N/A |

[1] Code is currently commented inside the repo, comparison is not possible

@spouliot
Contributor

spouliot commented Jun 2, 2022

@rolfbjarne It's not yet possible to enable the pinvoke wrapper code to compare the performance. Filed as xamarin/xamarin-macios#15190

@spouliot
Contributor

spouliot commented Jun 4, 2022

Comparing p/invoke wrappers enabled

I had to hack around an issue (in quite a hackish way) but I was able to get the numbers

| macOS 12.3.1 / arm64 | Build | Change | Reuse | Grid |
|---|---|---|---|---|
| Xamarin.Mac Legacy (updated) | 1594.97 | 465188.25 | 1903.63 | 6746.49 |
| net-6.0 (w/pinvoke wrappers) | 1460.71 | 376612.26 | [1] | [1] |
| Impact (slowdown) | 0.92x | 0.81x | N/A | N/A |

So the wrappers are helping a lot: the net6.0-macos numbers are the best yet (see previous numbers), but not quite what XM legacy was able to achieve.

We can also compare the ratios (other numbers can't be compared due to other fixes that were applied) with the original numbers

| macOS 12.3.1 / arm64 | Build | Change | Reuse | Grid |
|---|---|---|---|---|
| Xamarin.Mac Legacy | 853.22 | 336635.53 | 909.64 | 5356.02 |
| net6.0-macos | 667.86 | 82385.28 | [1] | [1] |
| Impact (slowdown) | 0.78x | 0.24x | N/A | N/A |

The performance gap between the mono and coreclr runtimes has narrowed considerably but still exists.

[1] Code is currently commented inside the repo, comparison is not possible

spouliot added a commit to spouliot/performance that referenced this issue Jun 6, 2022
The Dope's Xamarin.Mac legacy project already has this setting.
In order to compare apples-to-apples it's best to share the same
settings between Xamarin.Mac legacy and net6.0-macos.

Enabling the linker/trimmer does more than just removing unused code.
It also optimizes the bindings, removing code paths that are not used
inside the app, e.g. removing 32bits code paths on 64bits builds.

It's a good, performance enhancing, setting for release builds so it
makes sense to have it set inside samples - leading by example :)

ref: unoplatform/uno#8890 (comment)
@spouliot
Contributor

spouliot commented Jun 9, 2022

Comparing p/invoke wrappers enabled (updated)

Updates

  • using xamarin-macios\main to include the fix for #15190 --require-pinvoke-wrappers:true (instead of my previous hack)
    • that includes a few other commits - but none were related/targeted to performance.
  • using macOS 12.4 (instead of 12.3.1) as it's required for Xcode 14 betas (not used for the benchmarks)

| macOS 12.4 / arm64 | Build | Change | Reuse | Grid |
|---|---|---|---|---|
| Xamarin.Mac Legacy (454e16a) | 1574.60 | 440982.29 | 1950.91 | 6602.72 |
| net-6.0 (w/pinvoke wrappers) | 1467.18 | 385994.96 | [1] | [1] |
| Impact (slowdown) | 0.93x | 0.88x | N/A | N/A |

So net6.0-macos numbers are between 7% and 12% slower using CoreCLR.

Pre-Analysis Notes

  • Dope

    • Executes 4 (legacy) or 2 (net6.0-macos) sequential tests. Only the same two tests were compared.
    • Numbers are averages. During execution I could see huge drops in performance, and sometimes the screen literally stopped updating for a moment.
    • The app startup times are not measured by the tests.
  • Xamarin Runtime Fixes

    • Have cleared up the flamegraphs a lot so they are now much easier to understand / diagnose.
    • Have shrunk the ratios between Mono (legacy) and CoreCLR (net6.0-macos).

Data Analysis

  • Dope spent a lot of time calling native code (mostly thru the ObjC bridge and some p/invokes).

    • Many apps won't be making as many calls to native code so their performance profile will be different.
    • From Uno's perspective, decreasing the number of native calls (or finding cheaper alternatives) would help performance.
  • The first 60 seconds are spent in the Build test. It's allocation intensive and triggers the GC several times.

    • Instruments shows that the GC pauses are shorter [2] on Mono (compared to CoreCLR). This gives a better average score to Mono.
    • It also explains why reducing the allocations has shrunk the performance ratio between the two runtimes.
  • The second 60 seconds are spent inside the Changes test. It's dominated by drawInRect:options:context and related CoreGraphics code for fonts. According to Instruments:

    • 98.7% (net6.0-macos) of the time is spent measuring/drawing. This matches the data from the managed (dotnet trace) side. The next 1.1% is spent finalizing objects.
    • 99.3% (mono) of the time is spent measuring/drawing; next comes finalizing objects at 0.5%.
    • In both cases there are no GC spikes (likely due to far fewer allocations from this test)

Conclusion

  • This benchmark is really about native code calls/transitions first, second would be GC / finalization. If the CoreCLR JIT is better/faster then it does not have the chance to prove it with this benchmark. YMMV

  • The Xamarin runtime fixes have been advantaging CoreCLR, even if they were runtime agnostic (and benefit both), since they reduced the pressure on the GC.

  • Reducing the number of memory allocations (inside Uno) would reduce the number of finalizations and GC collections. That should result in more stable performance and a better average (for both runtimes). That would likely favor CoreCLR numbers but still benefit both runtimes.

  • Latest flamegraphs are much more reliable in showing where the time is spent, inside Uno, and where optimizations would yield interesting results. Even if this was done for macOS a lot of the code (and fixes) should help iOS performance.

  • There's nothing as brutal as #14835 here. Maybe because it was partially fixed, more likely because it affects net6 + mono (and not net6 + coreclr as used for macOS).

  • Removing subviews does trigger ConformsToProtocol and this causes the Xamarin runtime to do reflection (if the dynamic registrar is present, which is the default for macOS). This looks similar to [net6/Xamarin.iOS] NSObject.ConformsToProtocol is very costly and invoked very often xamarin/xamarin-macios#14065 but the condition that triggers this, inside Dope, might be different (but the solution, caching, might benefit all cases).

  • With the use of --require-pinvoke-wrappers:true (on CoreCLR) the trampolines are almost absent (as in no self time) from the data

[1] Code is currently commented inside the repo, comparison is not possible
[2] shorter, not necessarily faster/better since the outcome can be different
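The caching idea mentioned for ConformsToProtocol can be sketched in a few lines. This Python example is only an analogy under stated assumptions: `ConformanceChecker`, its `_slow_conforms` method (a stand-in for a reflection-based check), and the (type name, protocol name) cache key are all hypothetical, and this is not how the Xamarin runtime is implemented.

```python
class ConformanceChecker:
    """Memoizes a costly per-(type, protocol) lookup."""

    def __init__(self, declared: dict) -> None:
        self._declared = declared  # type name -> set of protocol names
        self._cache = {}
        self.slow_lookups = 0  # how often the costly path actually ran

    def _slow_conforms(self, type_name: str, protocol: str) -> bool:
        # Hypothetical stand-in for a reflection-based check.
        self.slow_lookups += 1
        return protocol in self._declared.get(type_name, ())

    def conforms(self, type_name: str, protocol: str) -> bool:
        key = (type_name, protocol)
        if key not in self._cache:
            self._cache[key] = self._slow_conforms(type_name, protocol)
        return self._cache[key]


checker = ConformanceChecker({"MyView": {"NSCoding"}})

# 500 repeated queries for the same pair hit the slow path only once.
positive = [checker.conforms("MyView", "NSCoding") for _ in range(500)]
# Negative answers are cached too.
negative = [checker.conforms("MyView", "NSCopying") for _ in range(500)]

print(all(positive), any(negative), checker.slow_lookups)
```

Caching negative results matters as much as positive ones here, since a view that does not conform would otherwise pay for the lookup on every removal.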

@MartinZikmund MartinZikmund added platform/macos 🍏 Categorizes an issue or PR as relevant to the macOS platform difficulty/challenging 🤯 Categorizes an issue for which the difficulty level is reachable with internals understanding and removed difficulty/tbd Categorizes an issue for which the difficulty level needs to be defined. labels Jul 27, 2023
@jeromelaban
Member Author

Closing for improvements done in the runtime (xamarin/xamarin-macios#15145 (comment))
