Skip to content

Comments

[WIP][Perf] Optimize allocations in the layout engine#34155

Draft
simonrozsival wants to merge 11 commits intonet11.0from
dev/simonrozsival/layout-perf-research
Draft

[WIP][Perf] Optimize allocations in the layout engine#34155
simonrozsival wants to merge 11 commits intonet11.0from
dev/simonrozsival/layout-perf-research

Conversation

@simonrozsival
Copy link
Member

@simonrozsival simonrozsival commented Feb 20, 2026

Fixes #34154

Description

This PR eliminates all managed heap allocations in the Core layout engine (Grid, Flex) during steady-state measure+arrange passes. The changes target the hot path that runs on every layout cycle — in a typical app with scrolling lists or animated layouts, this path executes thousands of times per second.

Why this matters

Every allocation in the layout hot path contributes to GC pressure. On mobile devices (Android/iOS), Gen0 collections during layout can cause frame drops. By eliminating allocations entirely from the Core layout engine, we remove GC as a variable in layout performance.

What changed

GridLayoutManager.cs (largest change)

1. GridStructure class → struct

The GridStructure class was allocated on every Measure() call. Converting it to a struct and storing it as a field on GridLayoutManager eliminates this allocation. Because structs are value types, method calls on struct fields operate directly on the field (no defensive copies) — the field is intentionally non-readonly.

Uses _hasGridStructure bool instead of nullable (can't use Nullable<GridStructure> because .Value would copy the entire struct).

2. Cell class → struct

Each Cell tracked a child's grid position and measurement constraints. Converting to a struct eliminates per-child allocations. Methods that mutate cells now use ref Cell parameters to avoid copy-mutation bugs.

3. Definition class → struct

Each Definition tracked a row/column's size and grid length. Converting to a struct and adding readonly modifiers to pure getters. Fixed a pre-existing copy-mutation bug in EnsureSizeLimit where var def = defs[n]; def.Size = newSize; silently mutated a copy — now correctly uses defs[n].Size = newSize;.

4. SpanKey record → readonly struct with IEquatable

The SpanKey was a record (reference type with heap allocation). Converted to a readonly struct implementing IEquatable<SpanKey> to eliminate allocations when used as Dictionary keys. Implements proper GetHashCode() using HashCode.Combine (with netstandard fallback).

5. Span class eliminated

The Span class bundled a SpanKey with a Requested double. Eliminated the class entirely — TrackSpan now takes individual parameters and the dictionary stores Dictionary<SpanKey, double> directly.

6. ArrayPool for all arrays

All four arrays in GridStructure now use ArrayPool<T>.Shared:

  • _childrenToLayOut (IView[]): cleared on return to avoid holding references
  • _cells (Cell[]): struct array, no clearing needed
  • _rows (Definition[]): struct array, no clearing needed
  • _columns (Definition[]): struct array, no clearing needed

Rented arrays may be larger than requested — actual counts tracked via _childCount, _rowCount, _columnCount. All array loops use these counts instead of .Length.

ReturnArrays() is called at the start of Measure() before creating a new GridStructure. The new structure is constructed into a local first, then the old arrays are returned and the field is swapped — this ensures exception safety if the constructor throws.

7. Dictionary reuse for span tracking

Dictionary<SpanKey, double>? _spansDictionary field on GridLayoutManager is passed into GridStructure via constructor. On subsequent calls, .Clear() reuses the dictionary instead of allocating a new one. Lazy initialization (_spans ??= new()) still works for grids with no spanning children.

8. foreach → for in ArrangeChildren

foreach on IGridLayout (interface dispatch) boxes the List<IView>.Enumerator struct — verified by benchmark: 1.56 KB/op with foreach vs 0 B with indexed for. Internal array loops (_cells, _rows, _columns) also use for with count because ArrayPool-rented arrays are oversized.

Flex.cs

9. SelfSizing float[] elimination

SelfSizingDelegate was called with float[] size = new float[2] { w, h } — allocating a 2-element array per child per layout pass. Replaced with ref float width, ref float height parameters, eliminating the allocation entirely.

10. InlineArray(4) for Frame buffer

Item.Frame was float[] Frame { get; } = new float[4] — each Item allocated a 4-element float array. Replaced with [InlineArray(4)] struct FrameBuffer that stores the 4 floats inline in the Item. Conditional on NET8_0_OR_GREATER with float[] fallback for netstandard.

11. ArrayPool for ordered_indices and lines

  • ordered_indices: ArrayPool<int>.Shared.Rent(item.Count) in flex_layout.init, returned in cleanup()
  • lines (flex wrap lines): ArrayPool<flex_layout_line>.Shared.Rent(newCapacity) with manual copy+return for growth. Changed growth strategy from Array.Resize(+1) (linear, N allocations for N lines) to doubling (logarithmic).
  • cleanup() wrapped in try/finally to ensure arrays are always returned, even if layout throws.

InvalidationEventArgs.cs

12. Static cached instances

Added InvalidationEventArgs.GetCached(InvalidationTrigger) that returns static singletons per trigger value. Replaced new InvalidationEventArgs(trigger) in VisualElement, Page, and legacy Layout. These fire on every measure invalidation, which happens frequently during layout.

What we tried and didn't work (across all engines)

  • NSubstitute-based benchmarking for allocation measurement: NSubstitute mocks add 40–200% allocation noise that completely obscures real optimization gains. Mock indexer calls (_grid[n]) allocate tracking objects per invocation. Solution: Created LayoutAllocBenchmarker with lightweight hand-written fakes implementing IGridLayout/IStackLayout directly.

  • Optimizing remaining Flex Controls-layer allocations: The remaining ~848 B (12 children) ≈ 71 B per child per pass. Traced through the call chain: FlexLayoutManager.ArrangeChildrenchild.Arrange(frame)VisualElement.ArrangeOverrideUpdateBoundsComponents → sets X, Y, Width, Height via BindableObject.SetValue(property, doubleValue). Each SetValue boxes the double argument. This is fundamental to how BindableProperty works — fixing it requires generic BindableProperty<T> (tracked in [Perf] Eliminate value-type boxing in BindableObject.SetValue #34080).

  • Stack/FlexLayoutManager foreach→for and Count/Spacing caching: Benchmarks confirmed these had zero allocation impact — the compiler already optimizes foreach on concrete types, and Stack was already allocation-free. These changes were reverted to minimize maintenance overhead.

  • Grid _cells[] foreach→for: The C# compiler already optimizes foreach on arrays to indexed access — no enumerator boxing.

New Benchmarks

LayoutAllocBenchmarker

Lightweight fake objects (no NSubstitute) for true allocation measurement. Includes FakeView, FakeGridLayout, FakeStackLayout, FakeRowDefinition, FakeColumnDefinition. Benchmarks Grid, VStack, HStack, and Flex Core engine with [Params] for ChildCount (12, 60) and UseSpans (true, false).

LayoutHotPathBenchmarker

Uses NSubstitute for Grid/Stack and real Controls objects (FlexLayout + Border children) for Flex. Measures the full Controls-layer stack including VisualElement.Measure/Arrange.

InvalidationBenchmarker

Measures InvalidationEventArgs dispatch allocation (before/after static caching).

Benchmark Results

Config: Apple M1 Max · .NET 11.0 (Arm64 RyuJIT AdvSIMD) · BenchmarkDotNet ShortRun (3 iterations, 1 launch, 3 warmup) · Release build

Each benchmark invocation = 1 Measure + 1 Arrange pass. BDN handles iteration for statistical accuracy. Allocation numbers are deterministic (exact).

LayoutAllocBenchmarker — Core layer, lightweight fake objects

This benchmark uses hand-written fake IView/IGridLayout/IStackLayout implementations (no NSubstitute) to measure true layout engine allocations without mock infrastructure noise.

Baseline = origin/net11.0 with identical benchmark code copied over.

Grid (1× Measure + 1× Arrange per invocation)

Scenario Baseline Mean Optimized Mean Baseline Alloc Optimized Alloc
12 children, no spans 1,226 ns 1,434 ns 1,752 B 0 B
12 children, with spans 1,444 ns 1,353 ns 2,536 B 0 B
60 children, no spans 14,975 ns 13,879 ns 6,264 B 0 B
60 children, with spans 20,894 ns 18,744 ns 9,336 B 0 B
Raw BenchmarkDotNet output — baseline (net11.0)
| Method             | ChildCount | UseSpans | Mean          | Error         | StdDev      | Gen0   | Gen1   | Allocated |
|------------------- |----------- |--------- |--------------:|--------------:|------------:|-------:|-------:|----------:|
| GridMeasureArrange | 12         | False    |  1,225.67 ns  |  1,221.80 ns  |   66.971 ns | 0.2785 | 0.0019 |   1,752 B |
| GridMeasureArrange | 12         | True     |  1,443.53 ns  |  1,458.17 ns  |   79.927 ns | 0.4005 | 0.0038 |   2,536 B |
| GridMeasureArrange | 60         | False    | 14,975.27 ns  | 16,949.26 ns  |  929.046 ns | 0.9766 | 0.0305 |   6,264 B |
| GridMeasureArrange | 60         | True     | 20,893.90 ns  | 50,406.05 ns  | 2,762.925 ns | 1.4648 | 0.0610 |   9,336 B |
Raw BenchmarkDotNet output — optimized (this PR)
| Method             | ChildCount | UseSpans | Mean          | Error         | StdDev      | Allocated |
|------------------- |----------- |--------- |--------------:|--------------:|------------:|----------:|
| GridMeasureArrange | 12         | False    |  1,434.14 ns  |  5,630.49 ns  |  308.626 ns |       0 B |
| GridMeasureArrange | 12         | True     |  1,352.96 ns  |  2,120.01 ns  |  116.205 ns |       0 B |
| GridMeasureArrange | 60         | False    | 13,878.96 ns  | 17,710.17 ns  |  970.754 ns |       0 B |
| GridMeasureArrange | 60         | True     | 18,743.82 ns  | 50,700.86 ns  | 2,779.085 ns |       0 B |

(Gen0/Gen1/Gen2 all zero — omitted)

Flex Core engine (1× Layout per invocation, no Controls layer)

Scenario Baseline Mean Optimized Mean Baseline Alloc Optimized Alloc
12 children 853 ns 864 ns 224 B 0 B
60 children 6,661 ns 5,908 ns 832 B 0 B
Raw BenchmarkDotNet output — baseline (net11.0)
| Method                 | ChildCount | Mean       | Error       | StdDev    | Gen0   | Allocated |
|----------------------- |----------- |-----------:|------------:|----------:|-------:|----------:|
| FlexCoreMeasureArrange | 12         |    853 ns  |    163 ns   |    9.0 ns | 0.0353 |     224 B |
| FlexCoreMeasureArrange | 60         |  6,661 ns  | 10,747 ns   |  589.1 ns | 0.1297 |     832 B |
Raw BenchmarkDotNet output — optimized (this PR)
| Method                 | ChildCount | Mean       | Error       | StdDev    | Allocated |
|----------------------- |----------- |-----------:|------------:|----------:|----------:|
| FlexCoreMeasureArrange | 12         |    864 ns  |  1,177 ns   |   64.5 ns |       0 B |
| FlexCoreMeasureArrange | 60         |  5,908 ns  |  3,846 ns   |  210.8 ns |       0 B |

Stack (1× Measure + 1× Arrange per invocation)

Stack layout was already allocation-free. No changes to Stack code in this PR.

Scenario Baseline Mean Optimized Mean Alloc
VStack 12 children 77 ns 71 ns 0 B → 0 B
HStack 12 children 64 ns 61 ns 0 B → 0 B
VStack 60 children 285 ns 277 ns 0 B → 0 B
HStack 60 children 292 ns 274 ns 0 B → 0 B

LayoutHotPathBenchmarker — Flex end-to-end with real Controls objects

This benchmark uses real FlexLayout + Border children (Controls layer) to measure allocations through the full stack including VisualElement.Measure/Arrange.

Scenario Baseline Mean Optimized Mean Baseline Alloc Optimized Alloc Δ Alloc
Flex Wrap 12ch 7,297 ns 7,532 ns 1,843 B 848 B −54%
Flex NoWrap 12ch 7,352 ns 6,161 ns 1,761 B 848 B −52%
Flex Wrap 60ch 55,259 ns 49,450 ns 8,366 B 3,920 B −53%
Flex NoWrap 60ch 55,294 ns 42,876 ns 8,284 B 3,920 B −53%

The remaining ~848 B (12 children) ≈ 71 B per child, traced to VisualElement.UpdateBoundsComponents boxing doubles into BindableObject.SetValue for X/Y/Width/Height. This will be fixed by generic BindableProperty<T> (#34080).

Raw BenchmarkDotNet output — baseline (net11.0)
| Method                   | ChildCount | Mean      | Error      | StdDev     | Gen0   | Allocated |
|------------------------- |----------- |----------:|-----------:|-----------:|-------:|----------:|
| FlexMeasureArrangeWrap   | 12         |  7.162 us |   5.313 us |  0.2912 us | 0.2899 |  1,843 B  |
| FlexMeasureArrangeNoWrap | 12         |  7.548 us |  10.564 us |  0.5790 us | 0.2747 |  1,761 B  |
| FlexMeasureArrangeWrap   | 60         | 50.820 us | 109.091 us |  5.9797 us | 1.2817 |  8,366 B  |
| FlexMeasureArrangeNoWrap | 60         | 50.036 us | 124.319 us |  6.8144 us | 1.2207 |  8,284 B  |
Raw BenchmarkDotNet output — optimized (this PR)
| Method                   | ChildCount | Mean      | Error     | StdDev    | Gen0   | Allocated |
|------------------------- |----------- |----------:|----------:|----------:|-------:|----------:|
| FlexMeasureArrangeWrap   | 12         |  7.673 us | 20.961 us | 1.1489 us | 0.1297 |     848 B |
| FlexMeasureArrangeNoWrap | 12         |  6.377 us |  4.410 us | 0.2417 us | 0.1297 |     848 B |
| FlexMeasureArrangeWrap   | 60         | 58.989 us | 91.210 us | 4.9995 us | 0.6104 |   3,920 B |
| FlexMeasureArrangeNoWrap | 60         | 44.702 us | 29.324 us | 1.6074 us | 0.6104 |   3,920 B |

Note on GridLayoutManagerBenchMarker (existing, NSubstitute-based)

The existing GridLayoutManagerBenchMarker uses NSubstitute mocks. After struct conversions, each _grid[n] indexer call on a mocked IGridLayout allocates NSubstitute tracking objects — this is a benchmark artifact, not a real regression. The LayoutAllocBenchmarker with lightweight fakes confirms the optimizations work correctly with real objects.

Test Status

All 441 existing tests pass (394 Core layout + 47 Controls layout).

simonrozsival and others added 6 commits February 20, 2026 14:41
… reuse

- Convert Cell, Definition, and GridStructure from class to struct
- Use ArrayPool for IView[], Cell[], and Definition[] arrays
- Track actual counts (_childCount, _rowCount, _columnCount) for rented arrays
- Add int defsCount parameter to all static methods operating on Definition[]
- Reuse Dictionary<SpanKey, double> across measure passes (Clear instead of new)
- Convert SpanKey to IEquatable<SpanKey> with HashCode.Combine
- Convert foreach to indexed for loops in ArrangeChildren
- Add lazy Dictionary initialization for no-span grids
- Result: Grid layout achieves 0 B managed allocations (was 87-457 KB)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Replace float[] size = {w, h} in SelfSizing with two local variables
- Add [InlineArray(4)] FrameBuffer for Flex.Item.Frame (NET8_0_OR_GREATER)
- Use ArrayPool for ordered_indices and lines arrays in flex_layout
- Convert lines array growth from Array.Resize(+1) to doubling strategy
- Convert foreach to indexed for loops in FlexLayoutManager
- Change frame index fields from uint to int (InlineArray requirement)
- Result: Core Flex engine achieves 0 B managed allocations

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Cache ILayout.Count and Spacing in local variables
- Convert foreach to indexed for loops in Measure/ArrangeChildren
- Cache childCount in StackLayoutManager.UsesExpansion
- Result: Stack layout maintains 0 B allocations with real objects

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Add InvalidationEventArgs.GetCached(trigger) for cached instances
- Replace new InvalidationEventArgs(trigger) in VisualElement, Page, Layout
- Result: 0 B per invalidation dispatch

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- LayoutAllocBenchmarker: lightweight fakes for TRUE allocation measurement
- LayoutHotPathBenchmarker: hot-path benchmarks with NSubstitute/Controls objects
- InvalidationBenchmarker: invalidation event dispatch benchmark
- initial-analysis.md: detailed performance analysis with results

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The remaining Flex Controls-layer allocations (64 B/child/pass) are caused
by BindableProperty.SetValue boxing doubles for X/Y/Width/Height in
VisualElement.UpdateBoundsComponents. Generic BindableProperty<T> will
eliminate this overhead.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@simonrozsival simonrozsival added perf/general The issue affects performance (runtime speed, memory usage, startup time, etc.) (sub: perf) copilot labels Feb 20, 2026
The analysis content has been incorporated into the PR description
and issue #34154.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@simonrozsival simonrozsival changed the title [Perf] Optimize allocations in the layout engine [WIP][Perf] Optimize allocations in the layout engine Feb 20, 2026
@bcaceiro
Copy link

Let's not wait for .net11 for these awesome changes! 👯

Revert foreach→for and Count/Spacing caching in Stack and
FlexLayoutManager — benchmarks confirm these had zero allocation
impact (compiler already optimizes foreach on concrete types, and
Stack already used indexed for loops).

Remove manual LoopCount loops from benchmarks — let BenchmarkDotNet
handle iteration for proper statistical analysis. Per-operation
numbers are now directly readable.

Grid ArrangeChildren retains the foreach→for change because foreach
on IGridLayout (interface) boxes the List<IView>.Enumerator struct
(verified: 1.56 KB/op with foreach vs 0 B with for).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Grid: construct GridStructure into local first, then return old
  arrays and swap — prevents dangling state if constructor throws
- Flex: wrap layout_item body in try/finally to ensure ArrayPool
  arrays are always returned via cleanup() even on exceptions
- Grid: add comment documenting ArrayPool lifecycle and IDisposable
  consideration for GridLayoutManager

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Copy link
Contributor

@pictos pictos left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will be a huge win!

// Since no rows are specified, we'll create an implied row 0
return Implied();
_rows = ArrayPool<Definition>.Shared.Rent(1);
_rows[0] = new Definition(GridLength.Star);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For this scenario, wouldn’t it be better to have a cached Definition[] oneRow = [new Definition(GridLength.Star) ? The smallest array returned by ArrayPool is of length 16, so this is maybe an overhead (?)

same for InitializeColumns

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the suggestion, that makes a lot of sense.

For grids without explicit row/column definitions, cache the single-element
Definition[] arrays on the GridLayoutManager instead of renting them from
ArrayPool. The smallest pool bucket is 16 elements, so caching a reusable
Definition[1] avoids unnecessary pool overhead for the common case.

Suggested-by: Pictos
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@simonrozsival
Copy link
Member Author

Pushed 3bc89bd — caches the implied arrays on instead of renting from . When no explicit row/column definitions are specified, the cached array is reused across measure/arrange calls. Still 0 B allocated in benchmarks.

Extend the caching pattern from implied-only to all row/column arrays.
ArrayPool<Definition> is now completely eliminated — the manager owns
exact-sized Definition[] arrays that are reused across layout passes.
This avoids ArrayPool's minimum bucket of 16 elements, which was
wasteful for typical grids with 1-6 rows/columns.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

copilot perf/general The issue affects performance (runtime speed, memory usage, startup time, etc.) (sub: perf)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants