Skip to content

[opt] Update alloc_refs on the stack to have stack memory kind. #31423

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed

Conversation

zoecarver
Copy link
Contributor

Update MemAccessUtils to classify class allocs on the stack as "AccessedStorage::Kind::Stack" instead of "::Class". This allows class element access markers to be promoted to have static access enforcement.

Right now we're really bad at optimizing classes. I've made a few patches recently to address some of the issues I've seen in the optimizer around class optimization. They're mostly similar patches (or at least addressing similar problems) and over time they seem to have gotten more and more general. I think this may finally be the right patch.

Fixing this edge case around access enforcement is (hopefully, we'll see what the benchmarks say) a big performance improvement for a relatively small patch. And, more importantly, this feels like the "correct" way to address the problem (rather than storing class members in tuples =P).

@zoecarver
Copy link
Contributor Author

@swift-ci please benchmark

1 similar comment
@zoecarver
Copy link
Contributor Author

@swift-ci please benchmark

@zoecarver
Copy link
Contributor Author

Does anyone know why the benchmark failed? It looks like it didn't actually run any tests and everything compiled OK. I did a clean build/test on two presets locally and couldn't reproduce the error. I'll keep investigating.

@zoecarver
Copy link
Contributor Author

@swift-ci please benchmark

@zoecarver zoecarver force-pushed the opt/static-access-marker branch from 48097a3 to aa1dd01 Compare May 1, 2020 04:41
@@ -92,7 +92,6 @@ AccessedStorage::AccessedStorage(SILValue base, Kind kind) {
value = base;
break;
case Stack:
assert(isa<AllocStackInst>(base));
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd like to keep the assertion here but, it would require duplicating the code in visitClassAccess and I don't think it's worth it just for the assertion. WDYT?

@zoecarver
Copy link
Contributor Author

@swift-ci please benchmark

1 similar comment
@zoecarver
Copy link
Contributor Author

@swift-ci please benchmark

@swift-ci
Copy link
Contributor

swift-ci commented May 1, 2020

Performance: -O

Regression OLD NEW DELTA RATIO
FlattenListFlatMap 4183 6276 +50.0% 0.67x (?)
RecursiveOwnedParameter 55 65 +18.2% 0.85x (?)
FlattenListLoop 2865 3322 +16.0% 0.86x (?)
ProtocolDispatch 196 217 +10.7% 0.90x (?)
PointerArithmetics 21700 23900 +10.1% 0.91x (?)
DataAppendBytesMedium 2980 3220 +8.1% 0.93x (?)
DropFirstArrayLazy 13 14 +7.7% 0.93x (?)

Code size: -O

Regression OLD NEW DELTA RATIO
RecursiveOwnedParameter.o 2270 2318 +2.1% 0.98x
ObjectAllocation.o 4571 4667 +2.1% 0.98x
NopDeinit.o 4007 4071 +1.6% 0.98x

Performance: -Osize

Improvement OLD NEW DELTA RATIO
ClassArrayGetter2 1260 80 -93.7% 15.75x
MapReduceClass2 148 29 -80.4% 5.10x
MapReduceLazyCollectionShort 67 36 -46.3% 1.86x (?)
MapReduceClassShort2 268 145 -45.9% 1.85x (?)
MapReduceString 48 42 -12.5% 1.14x (?)

Code size: -Osize

Regression OLD NEW DELTA RATIO
DictTest4.o 14040 14184 +1.0% 0.99x
 
Improvement OLD NEW DELTA RATIO
DictionaryBridge.o 3595 3491 -2.9% 1.03x
LinkedList.o 2545 2473 -2.8% 1.03x
ArrayOfGenericRef.o 9246 9038 -2.2% 1.02x
ArrayOfRef.o 9587 9379 -2.2% 1.02x
OpaqueConsumingUsers.o 2630 2582 -1.8% 1.02x
RGBHistogram.o 20112 19808 -1.5% 1.02x

Performance: -Onone

Regression OLD NEW DELTA RATIO
ObjectiveCBridgeStubNSDateRefAccess 3822 4498 +17.7% 0.85x (?)
DataAppendDataSmallToSmall 2860 3080 +7.7% 0.93x (?)

Code size: -swiftlibs

How to read the data The tables contain differences in performance which are larger than 8% and differences in code size which are larger than 1%.

If you see any unexpected regressions, you should consider fixing the
regressions before you merge the PR.

Noise: Sometimes the performance results (not code size!) contain false
alarms. Unexpected regressions which are marked with '(?)' are probably noise.
If you see regressions which you cannot explain you can try to run the
benchmarks again. If regressions still show up, please consult with the
performance team (@eeckstein).

Hardware Overview
  Model Name: Mac mini
  Model Identifier: Macmini8,1
  Processor Name: 6-Core Intel Core i7
  Processor Speed: 3.2 GHz
  Number of Processors: 1
  Total Number of Cores: 6
  L2 Cache (per Core): 256 KB
  L3 Cache: 12 MB
  Memory: 64 GB

@zoecarver
Copy link
Contributor Author

I think the benchmarks will be improved after #31134 lands.

@zoecarver
Copy link
Contributor Author

Now that #31134 is committed, I'm going to run the benchmarks again.

@zoecarver
Copy link
Contributor Author

@swift-ci please benchmark

@swift-ci
Copy link
Contributor

swift-ci commented May 8, 2020

Performance: -O

Regression OLD NEW DELTA RATIO
FlattenListFlatMap 7942 9875 +24.3% 0.80x (?)
FlattenListLoop 4570 5291 +15.8% 0.86x (?)
ObjectiveCBridgeStubToNSDate2 550 620 +12.7% 0.89x (?)
DataAppendDataSmallToSmall 4180 4660 +11.5% 0.90x (?)
 
Improvement OLD NEW DELTA RATIO
ObserverForwarderStruct 1030 905 -12.1% 1.14x (?)
String.data.LargeUnicode 136 121 -11.0% 1.12x (?)
DictionarySubscriptDefaultMutationOfObjects 1680 1500 -10.7% 1.12x (?)
ObserverUnappliedMethod 1300 1180 -9.2% 1.10x (?)
String.data.Medium 121 112 -7.4% 1.08x (?)

Code size: -O

Regression OLD NEW DELTA RATIO
RecursiveOwnedParameter.o 2270 2318 +2.1% 0.98x
ObjectAllocation.o 4571 4667 +2.1% 0.98x
NopDeinit.o 4007 4071 +1.6% 0.98x

Performance: -Osize

Improvement OLD NEW DELTA RATIO
ClassArrayGetter2 1820 130 -92.9% 14.00x
MapReduceClass2 211 52 -75.4% 4.06x
MapReduceClassShort2 374 223 -40.4% 1.68x
ObjectiveCBridgeStubToNSDate2 630 550 -12.7% 1.15x (?)
DictionarySubscriptDefaultMutationOfObjects 1720 1540 -10.5% 1.12x (?)
FlattenListFlatMap 7611 6927 -9.0% 1.10x (?)
SortAdjacentIntPyramids 1695 1575 -7.1% 1.08x

Code size: -Osize

Regression OLD NEW DELTA RATIO
DictTest4.o 14040 14184 +1.0% 0.99x
 
Improvement OLD NEW DELTA RATIO
DictionaryBridge.o 3595 3491 -2.9% 1.03x
LinkedList.o 2545 2473 -2.8% 1.03x
ArrayOfGenericRef.o 9246 9038 -2.2% 1.02x
ArrayOfRef.o 9587 9379 -2.2% 1.02x
OpaqueConsumingUsers.o 2630 2582 -1.8% 1.02x
RGBHistogram.o 20112 19808 -1.5% 1.02x

Performance: -Onone

Code size: -swiftlibs

How to read the data The tables contain differences in performance which are larger than 8% and differences in code size which are larger than 1%.

If you see any unexpected regressions, you should consider fixing the
regressions before you merge the PR.

Noise: Sometimes the performance results (not code size!) contain false
alarms. Unexpected regressions which are marked with '(?)' are probably noise.
If you see regressions which you cannot explain you can try to run the
benchmarks again. If regressions still show up, please consult with the
performance team (@eeckstein).

Hardware Overview
  Model Name: Mac Pro
  Model Identifier: MacPro6,1
  Processor Name: 12-Core Intel Xeon E5
  Processor Speed: 2.7 GHz
  Number of Processors: 1
  Total Number of Cores: 12
  L2 Cache (per Core): 256 KB
  L3 Cache: 30 MB
  Memory: 64 GB

@gottesmm gottesmm requested a review from atrick May 8, 2020 23:32
@zoecarver
Copy link
Contributor Author

@atrick friendly ping :)

Copy link
Contributor

@atrick atrick left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm afraid this isn't nearly as simple as changing the fundamental abstraction AccessedStorage. That abstraction models the language rules so can't change its meaning.

It's very easy to write incorrect access enforcement optimizations that give you great-looking performance wins. No tests will fail, and no programs will crash. We won't find out that exclusivity isn't being enforced until a whole bunch of code has shipped without enforcement.

It looks like you have several really nice test cases where we want to improve AccessEnforcementOpts. So for each one, we need to make an argument for why the optimizer should be able to recognize it, decide which of the AccessEnforcementOpts should handle it, construct a logical proof for an algorithm that could handle that case while being generally correct. Finally, we can add information to the analyses that feed the optimization to provide extra info.

For example, removeLocalNonNestedAccess should demote access markers to [static] for any locally allocated objects that don't escape. The pass does not currently use EscapeAnalysis, but it could.

Knowing that the object was allocated on the stack will probably not be useful for access enforcement, but knowing that a reference does not escape is very useful.

@@ -534,6 +534,10 @@ static void addClosureSpecializePassPipeline(SILPassPipelinePlan &P) {
static void addLowLevelPassPipeline(SILPassPipelinePlan &P) {
P.startPipeline("LowLevel");

P.addAccessEnforcementOpts();
P.addAccessEnforcementWMO();
P.addAccessMarkerElimination();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Generally, when adding very expensive passes to the pipeline, it's good to document why they need to be run at that point so we can improve the pipeline later. What other passes are they dependent on that requires them to be rerun?

With AccessMarkerElimination, ideally we would fix whatever optimization pass can't handle access markers rather than running the elimination pass since a short term goal is supporting access markers throughout the pipeline.

while (isa<StructElementAddrInst>(operand) ||
isa<RefElementAddrInst>(operand) ||
isa<TupleElementAddrInst>(operand) || isa<LoadInst>(operand))
operand = operand.getDefiningInstruction()->getOperand(0);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure what's happening here with peeking through operands. The UseDefChainVisitor needs to visit each instruction on the "access path" between the address of the memory access and the address of the access to formal storage.

operand = operand.getDefiningInstruction()->getOperand(0);

if (isa<AllocStackInst>(operand))
return asImpl().visitBase(field, AccessedStorage::Stack);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If an access is to a class property, it doesn't matter where the reference to the class is stored. Once you reach the base of the formal access, you've found the AccessStorage. This corresponds to language-level constructs: global variable, class property, local variable.

return asImpl().visitBase(field, AccessedStorage::Stack);
if (auto allocRef = dyn_cast<AllocRefInst>(operand)) {
if (allocRef->isAllocatingStack())
return asImpl().visitBase(field, AccessedStorage::Stack);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Class properties have specific semantics at the language level and specific structural properties in SIL. Whether the object is allocated on the heap or stack is irrelevant to any of this. If we have an optimization that can somehow benefit from knowing that an object is stack-allocated, then we can teach that optimization about stack-allocated objects. But we can never pretend that a class property is a local variable.

@zoecarver
Copy link
Contributor Author

zoecarver commented Jul 12, 2020

@atrick thanks for the review. There's a lot of really great information here, thanks for sharing it :)

I'm going to close this PR because, as you said, the entire premise relies on an incorrect idea: stack allocations and local variables are synonymous. I've created #32844 which is the start of a more correct implementation of this same optimization.

Update MemAccessUtils to clasify class allocs on the stack as "AccessedStorage::Kind::Stack" instead of "::Class". This allows class element access markers to be promoted to have static access enforcement.
@zoecarver zoecarver reopened this Jul 15, 2020
@zoecarver zoecarver force-pushed the opt/static-access-marker branch from aa1dd01 to da23b35 Compare July 15, 2020 03:53
@zoecarver
Copy link
Contributor Author

@swift-ci please benchmark

1 similar comment
@zoecarver
Copy link
Contributor Author

@swift-ci please benchmark

@zoecarver
Copy link
Contributor Author

I'm re-testing after removing AccessMarkerElimination to see if that was the reason for the performance improvements.

@swift-ci
Copy link
Contributor

Performance: -O

Improvement OLD NEW DELTA RATIO
RecursiveOwnedParameter 66 54 -18.2% 1.22x
ObjectiveCBridgeStubToNSDate2 360 330 -8.3% 1.09x (?)
RemoveWhereMoveInts 12 11 -8.3% 1.09x (?)
ObserverUnappliedMethod 700 650 -7.1% 1.08x (?)
ObjectiveCBridgeStubToNSDateRef 2320 2160 -6.9% 1.07x (?)

Code size: -O

Regression OLD NEW DELTA RATIO
RecursiveOwnedParameter.o 2150 2198 +2.2% 0.98x
ObjectAllocation.o 4451 4547 +2.2% 0.98x

Performance: -Osize

Regression OLD NEW DELTA RATIO
FlattenListLoop 928 1019 +9.8% 0.91x (?)
 
Improvement OLD NEW DELTA RATIO
FlattenListFlatMap 4354 3678 -15.5% 1.18x (?)
Dictionary4 225 207 -8.0% 1.09x (?)

Code size: -Osize

Regression OLD NEW DELTA RATIO
ObjectAllocation.o 4298 4378 +1.9% 0.98x
DictTest4.o 13600 13760 +1.2% 0.99x
 
Improvement OLD NEW DELTA RATIO
LinkedList.o 2257 2200 -2.5% 1.03x
ArrayOfGenericRef.o 8304 8096 -2.5% 1.03x
ArrayOfRef.o 8581 8389 -2.2% 1.02x
OpaqueConsumingUsers.o 2351 2303 -2.0% 1.02x
RGBHistogram.o 19552 19232 -1.6% 1.02x
DictionaryBridge.o 3309 3269 -1.2% 1.01x

Performance: -Onone

Code size: -swiftlibs

How to read the data The tables contain differences in performance which are larger than 8% and differences in code size which are larger than 1%.

If you see any unexpected regressions, you should consider fixing the
regressions before you merge the PR.

Noise: Sometimes the performance results (not code size!) contain false
alarms. Unexpected regressions which are marked with '(?)' are probably noise.
If you see regressions which you cannot explain you can try to run the
benchmarks again. If regressions still show up, please consult with the
performance team (@eeckstein).

Hardware Overview
  Model Name: Mac mini
  Model Identifier: Macmini8,1
  Processor Name: 6-Core Intel Core i7
  Processor Speed: 3.2 GHz
  Number of Processors: 1
  Total Number of Cores: 6
  L2 Cache (per Core): 256 KB
  L3 Cache: 12 MB
  Memory: 64 GB

@zoecarver zoecarver closed this Jul 15, 2020
@zoecarver
Copy link
Contributor Author

The interesting thing shown above is that putting AccessMarkerElimination in LowLevel,Function seems to be the cause of the performance improvements.

@atrick
Copy link
Contributor

atrick commented Jul 15, 2020

@zoecarver yeah, that's pretty clearly a problem of some pass not handling access markers. That should just be fixed independently.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants