Memory planning #3
mlir/include/mlir/Dialect/MemRef/Transforms/MergeAllocTickBased.h
using TraceCollectorFunc = std::function<FailureOr<MemoryTraceScopes>(
    Operation *, const BufferViewFlowAnalysis &,
    const MergeAllocationOptions &)>;
using MemoryPlannerFunc = std::function<FailureOr<MemorySchedule>(
    Operation *, const LifetimeTrace &, const MergeAllocationOptions &)>;
using MemoryMergeMutatorFunc = std::function<LogicalResult(
    Operation *toplevel, Operation *scope, const MemorySchedule &,
    const MergeAllocationOptions &)>;
Since we're proposing a general framework for merge-alloc, why not consider turning these three function pointers into equivalent C++ classes with the necessary interface, so that each customized implementation can implement its own interface?
Then there would be no such framework, as each implementation would handle all the details internally. I split them into 3 functions to decouple the three stages and facilitate code reuse.
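To make the decoupling argument concrete, here is a minimal sketch of three independently replaceable stages driven by one shared function. All `Toy*` names are hypothetical stand-ins for illustration, not the types or signatures from the PR; the stages operate on plain buffer sizes rather than MLIR operations.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <optional>
#include <string>
#include <vector>

// Hypothetical stand-ins for the PR's trace and schedule types.
struct ToyTrace {
  std::vector<std::size_t> bufferSizes;
};
struct ToySchedule {
  std::vector<std::size_t> offsets;
  std::size_t totalSize;
};

// One std::function per stage, mirroring the three-callback split.
using ToyTracer =
    std::function<std::optional<ToyTrace>(const std::vector<std::size_t> &)>;
using ToyPlanner = std::function<std::optional<ToySchedule>(const ToyTrace &)>;
using ToyMutator = std::function<bool(const ToySchedule &, std::string &)>;

struct ToyOptions {
  ToyTracer tracer;
  ToyPlanner planner;
  ToyMutator mutator;
};

// Shared driver: runs the stages in order and stops at the first failure.
bool runToyMergeAlloc(const ToyOptions &opt,
                      const std::vector<std::size_t> &allocs,
                      std::string &log) {
  std::optional<ToyTrace> trace = opt.tracer(allocs);
  if (!trace)
    return false;
  std::optional<ToySchedule> schedule = opt.planner(*trace);
  if (!schedule)
    return false;
  return opt.mutator(*schedule, log);
}

// Toy default stages; any one of them can be swapped without touching the rest.
ToyOptions makeDefaultToyOptions() {
  ToyOptions opt;
  opt.tracer = [](const std::vector<std::size_t> &allocs)
      -> std::optional<ToyTrace> { return ToyTrace{allocs}; };
  opt.planner = [](const ToyTrace &t) -> std::optional<ToySchedule> {
    // Naive planner: place buffers back to back with no reuse.
    ToySchedule s{{}, 0};
    for (std::size_t size : t.bufferSizes) {
      s.offsets.push_back(s.totalSize);
      s.totalSize += size;
    }
    return s;
  };
  opt.mutator = [](const ToySchedule &s, std::string &log) {
    log = "merged " + std::to_string(s.offsets.size()) + " buffers into " +
          std::to_string(s.totalSize) + " bytes";
    return true;
  };
  return opt;
}
```

A custom planner can then be plugged in by assigning only `opt.planner`, while the tracer and mutator are reused unchanged.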
mlir/docs/MemrefMergeAlloc.md
func.func @mlp(%x: tensor<128x128xf32>, %y: tensor<128x128xf32>) -> tensor<128x128xf32> {
  %a0 = tensor.empty() : tensor<128x128xf32>
  %a = linalg.matmul ins(%x, %y: tensor<128x128xf32>, tensor<128x128xf32>) outs(%a0: tensor<128x128xf32>) -> tensor<128x128xf32>
  %b0 = tensor.empty() : tensor<128x128xf32>
  %b = linalg.matmul ins(%a, %y: tensor<128x128xf32>, tensor<128x128xf32>) outs(%b0: tensor<128x128xf32>) -> tensor<128x128xf32>
  %c0 = tensor.empty() : tensor<128x128xf32>
  %c = linalg.matmul ins(%b, %y: tensor<128x128xf32>, tensor<128x128xf32>) outs(%c0: tensor<128x128xf32>) -> tensor<128x128xf32>
  %d0 = tensor.empty() : tensor<128x128xf32>
  %d = linalg.matmul ins(%c, %y: tensor<128x128xf32>, tensor<128x128xf32>) outs(%d0: tensor<128x128xf32>) -> tensor<128x128xf32>
  return %d : tensor<128x128xf32>
}
Maybe we can update this example with different buffer sizes for the tensors (like `a0` and `c0`, etc.), or even a more complex scenario such as control flow, to better demonstrate the benefit of this proposal?
I would like a simple example to show the core idea of the pass: memory reuse, merged allocation, and replacement by `memref.view`.
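As a rough illustration of the memory-reuse idea, the sketch below assigns offsets inside one merged allocation to buffers with known lifetimes, using a simple first-fit rule. The `ToyBuffer` type, the tick values and the first-fit strategy are assumptions made for this example; they are not the planner implemented in the PR.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical buffer record: size in bytes plus an inclusive lifetime
// [firstTick, lastTick]; offset is filled in by the planner.
struct ToyBuffer {
  std::size_t size;
  int firstTick;
  int lastTick;
  std::size_t offset;
};

// First-fit placement: each buffer takes the lowest offset that does not
// collide with any already placed buffer whose lifetime overlaps its own.
// Returns the size of the single merged allocation.
std::size_t planToyOffsets(std::vector<ToyBuffer> &buffers) {
  std::size_t total = 0;
  for (std::size_t i = 0; i < buffers.size(); ++i) {
    // Collect earlier buffers that are live at the same time.
    std::vector<const ToyBuffer *> live;
    for (std::size_t j = 0; j < i; ++j) {
      bool overlaps = buffers[j].firstTick <= buffers[i].lastTick &&
                      buffers[i].firstTick <= buffers[j].lastTick;
      if (overlaps)
        live.push_back(&buffers[j]);
    }
    std::sort(live.begin(), live.end(),
              [](const ToyBuffer *a, const ToyBuffer *b) {
                return a->offset < b->offset;
              });
    // Scan the gaps between live buffers from low to high addresses.
    std::size_t candidate = 0;
    for (const ToyBuffer *other : live) {
      if (candidate + buffers[i].size <= other->offset)
        break; // fits in the gap before this live buffer
      candidate = std::max(candidate, other->offset + other->size);
    }
    buffers[i].offset = candidate;
    total = std::max(total, candidate + buffers[i].size);
  }
  return total;
}
```

With three 64 KiB buffers (the byte size of a `128x128xf32` value) whose assumed lifetimes are `[0,1]`, `[1,2]` and `[2,3]`, the first and third buffers share offset 0, so the merged allocation needs two slots instead of three. Each original buffer would then become a view into the merged allocation at its assigned offset.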
mlir/docs/MemrefMergeAlloc.md
4. Use a "static-memory-planner" to handle the linear timeline

Limitations of Tick-based merge-alloc:
* only contiguous, static shaped and identical layout memrefs are considered.
It seems these are common limitations of the framework, not only of this default implementation?
In theory, we don't need to limit that. If an implementation can magically schedule dynamically shaped tensors (e.g. by considering the additional restrictions on the shapes), it is possible to schedule the buffers.
mlir/docs/MemrefMergeAlloc.md
```mlir
func.func @mlp(...) { // <---- alloc scope 1
  scf.for(...) { // <---- NOT an alloc scope!
    // allocation inside will be merge to alloc scope 1 above
```
What's the rewrite order of nested alloc scopes? How do you handle the new buffer that is hoisted out of `scf.for`?
The hoisting is "one-shot": it is decided by `getAllocationScope`. There will be no second run of hoisting.
mlir/docs/MemrefMergeAlloc.md
in cache. In the default configuration (when `no-consider-locality` option is
not specified to the merge-alloc pass), static memory planner considers both
cache-locality and the degree of matching of allocation size and the chunk size
for each free memory chunks, with a simple cost-model. With
"cost-model" might be too heavy a term? Use "heuristic rule" instead?
Do we need to rename the pass option as well?
How about `locality-and-size` and `size-only`?
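The difference between the two proposed modes can be sketched with a toy free-chunk selector. The scoring below (wasted bytes plus a per-tick staleness penalty) and the constant `kBytesPerTick` are purely hypothetical and only illustrate how a size-only rule and a locality-aware rule can disagree; the heuristic in the PR may be quite different.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical free chunk: its size and the tick at which it was last freed.
struct ToyChunk {
  std::size_t size;
  int lastFreedTick;
};

// Returns the index of the chosen chunk, or -1 if none is large enough.
// Lower score wins. Size-only mode scores by wasted bytes; the
// locality-aware mode also penalizes chunks that were freed long ago,
// on the assumption that recently used memory is more likely in cache.
int pickToyChunk(const std::vector<ToyChunk> &freeChunks, std::size_t request,
                 int now, bool considerLocality) {
  const long long kBytesPerTick = 128; // hypothetical weight, not from the PR
  int best = -1;
  long long bestScore = 0;
  for (std::size_t i = 0; i < freeChunks.size(); ++i) {
    const ToyChunk &chunk = freeChunks[i];
    if (chunk.size < request)
      continue; // too small to hold the request
    long long score = static_cast<long long>(chunk.size - request);
    if (considerLocality)
      score += kBytesPerTick * (now - chunk.lastFreedTick);
    if (best < 0 || score < bestScore) {
      best = static_cast<int>(i);
      bestScore = score;
    }
  }
  return best;
}
```

For a 256-byte request at tick 10 with a 1024-byte chunk freed at tick 9 and a 256-byte chunk freed at tick 1, the size-only rule picks the exact-fit chunk, while the locality-aware rule prefers the recently freed one.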
opt.tracer = memref::TickCollecter();
opt.planner = memref::tickBasedPlanMemory;
opt.mutator = memref::MergeAllocDefaultMutator();
Does it make sense to move the default implementation into a sub-namespace of `memref`?
I guess it is enough to put it in the `memref` namespace? Maybe we can let the community decide?
    Operation *op) const {
  auto parent = op;
  for (;;) {
    parent = parent->getParentWithTrait<OpTrait::AutomaticAllocationScope>();
Can this correctly handle something like the below:
scf.forall(...)
  ...
  scf.for(...)
    scf.if(...)
      %buf = memref.alloc() :
Yes, `scf.if` is not an `AutomaticAllocationScope`, so the pass will find the `scf.forall` instead.
We might need special handling as well for an allocation that happens inside `scf.if`, as it might not always be true that we need to allocate this buffer, right?
Yes, I agree. `scf.if` is a `RegionBranchOpInterface`, which we also take care of. :)
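The scope lookup discussed in this thread can be pictured with a toy operation tree. The sketch assumes, based on the "NOT an alloc scope" note in the quoted doc, that the loop keeps climbing when the nearest ancestor with the scope trait is a loop such as `scf.for`. `ToyOp`, its flags and both helper functions are hypothetical and do not reproduce the MLIR API.

```cpp
#include <cassert>
#include <string>

// Hypothetical operation node.
struct ToyOp {
  std::string name;
  const ToyOp *parent;
  bool hasScopeTrait;  // stand-in for OpTrait::AutomaticAllocationScope
  bool skippedAsScope; // assumption: e.g. scf.for has the trait but is skipped
};

// Toy analogue of getParentWithTrait: nearest strict ancestor with the trait.
const ToyOp *nearestWithTrait(const ToyOp *op) {
  for (const ToyOp *p = op->parent; p != nullptr; p = p->parent)
    if (p->hasScopeTrait)
      return p;
  return nullptr;
}

// Climb until an ancestor with the trait is found that is not skipped.
// Returns nullptr if no such ancestor exists.
const ToyOp *toyGetAllocationScope(const ToyOp *op) {
  const ToyOp *parent = op;
  for (;;) {
    parent = nearestWithTrait(parent);
    if (parent == nullptr || !parent->skippedAsScope)
      return parent;
  }
}
```

For the nesting in the question (`scf.forall` containing `scf.for` containing `scf.if` containing the allocation), this walk passes over `scf.if` because it lacks the trait, skips `scf.for`, and stops at `scf.forall`, matching the answer given above.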
In cases (like the ones added in the tests) where the condition of a masked load or store is a splat but not a constant (that is, a masked operation is being used to implement patterns like "load if the current lane is in-bounds, otherwise return 0"), optimize the 'scalarized' code to perform an aligned vector load/store if the splat constant is true. Additionally, take a few steps to preserve aliasing information and names when nothing is scalarized while I'm here. As motivation, some LLVM IR users will generate masked load/store in cases that map to this kind of predicated operation (where either the vector is loaded/stored or it isn't) in order to take advantage of hardware primitives, but on AMDGPU, where we don't have a masked load or store, this pass would scalarize a load or store that was intended to be - and can be - vectorized while also introducing expensive branches. Fixes llvm#104520 Pre-commit tests at llvm#104527
…104636) Summary: These can be enabled
We need to mask the SRL result to 8 bits before ORing in the SLL. This is needed in case bits 23:16 of the input aren't zero. They will have been shifted into bits 15:8. We don't need to AND the result with 0xffff. It's ok if the upper 16 bits of the register are garbage. Fixes llvm#103035.
… during strictness checks (llvm#104594) Fixes llvm#104570
This reverts commit 43ffe2e. Reason: buildbot breakage starting at https://lab.llvm.org/buildbot/#/builders/85/builds/1102 I manually bisected and found that clang crashed with 43ffe2e but not the immediately preceding commit (3319049)
…01282) Part of llvm#101129 This patch adds support for attaching comments to enums for HTML in clang-doc. It changes the enum generation to use table tags, whereas previously we were using lists, which is more in line with what other doc generators are doing. It also gives clang-doc the ability to show user-specified enum values
…vm#104620) In this test `@initializer()` can access globals outside of the module, but Asan does nothing to detect that.
Summary: Currently we have some sema checks to make sure users don't apply kernel-only attributes to non-kernel functions. However, this currently did not correctly check for bare NVPTX / AMDGPU kernel attributes, making it impossible to use them at all w/o CUDA enabled. This patch fixes that by checking for the calling convention / attributes directly.
This patch implements sandboxir::SwitchInst mirroring llvm::SwitchInst.
…lvm#104626) All expansions end with replacing the previous intrinsic with the new expansion and erasing the old one. By moving this operation to the caller, these expansion functions can be called in more contexts and a small amount of duplicated code is consolidated. Pre-req for llvm#88056
…mbols and improve tests
…2078) Address llvm#101550 by adding OwnLineWithBrace option for RequiresClausePosition. This permits placing a following '{' on the same line as the requires clause. Thus, instead of: ``` bool Foo () requires(true) { return true; } ``` we have: ``` bool Foo () requires(true) { return true; } ``` If the function body is empty, we'll get: ``` bool Foo () requires(true) {} ``` I attempted to get a line break between the open and close braces, but failed. Perhaps that's fine -- it's rare and only happens in the empty body case.
We don't have feature implications on any other Zvk extensions and we have error messages in RISCVISAInfo if Zve or V is not enabled. I'm working on testing and refactoring in that code so I'd like to make it consistent.
…llvm#104621) Those modules still can have global constructors and access globals in other modules which are not initialized yet.
This commit adds NVPTX codegen support for brkpt instruction (https://docs.nvidia.com/cuda/parallel-thread-execution/#miscellaneous-instructions-brkpt) with test under CodeGen/NVPTX/brkpt.ll
Part of an effort to make getConstant stricter about implicit truncation when converting uint64_t to APInt.
… globals" (llvm#104665) Reverts llvm#104621 Too many bots are broken, see llvm#104621.
…terializations` (llvm#104630) There was a typo in the code path that removes unnecessary materializations. Before: Update `opResult` (result of an op different from `user`) in mapping and remove `user`. ``` replaceMaterialization(rewriterImpl, opResult, inputOperands, inverseMapping); necessaryMaterializations.remove(materializationOps.lookup(user)); ``` After: Update `user->getResults()` in mapping and remove `user`. ``` replaceMaterialization(rewriterImpl, user->getResults(), inputOperands, inverseMapping); necessaryMaterializations.remove(materializationOps.lookup(user)); ```
…lvm#104395) Prevent operand folding from inlining constants into pseudo scalar transcendental f16 instructions. However still allow literal constants.
Use the nuw attribute of GEPs to prove that pointers do not alias, in cases matching the following: + + + | BaseOffset | +<nuw> Indices | ---------------->|-------------------->| |-->V2Size | |-------> V1Size LHS RHS If the difference between pointers is Offset +<nuw> Indices then we know that the addition does not wrap the pointer index type (add nuw) and the constant Offset is a lower bound on the distance between the pointers. We can then prove NoAlias via Offset u>= V2Size.
…102613) After decomposition of OpenMP compound constructs and assignment of applicable clauses to each leaf construct, composite constructs are then combined again into a single element in the construct queue. This helped later lowering stages easily identify composite constructs. However, as a result of the re-composition stage, the same list of clauses is used to produce all MLIR operations corresponding to each leaf of the original composite construct. This undoes existing logic introducing implicit clauses and deciding to which leaf construct(s) each clause applies. This patch removes construct re-composition logic and updates Flang lowering to be able to identify composite constructs from a list of leaf constructs. As a result, the right set of clauses is produced for each operation representing a leaf of a composite construct. PR stack: - llvm#102612 - llvm#102613
…vm#104595) This new interface is supposed to capture the core functionality of DLTI: querying for values at keys. As such this new interface unifies the ability to query DLTI attributes in a single method: query(). All existing DLTI interfaces exposing their own query methods now 1) extend this new interface and 2) provide a default implementation for `query()`. As DLTIQueryInterface::query() returns an attribute, it naturally enables recursive queries on nested DLTI attrs. A utility function, `dlti::query()`, implements the logic for nested lookups. A new `#dlti.map` attribute is introduced to capture the most generic form of a finite DLTI-mapping. One of the benefits is that it allows for more easily encoding hierarchical information that is suitably queryable, i.e. by means of nested attributes. In line with the above, `transform.dlti.query` is modified so as to take an arbitrary number of keys and to perform a nested lookup using the above utility function.
…atalyst (llvm#104872) Mac Catalyst is the iOS platform, but it builds against the macOS SDK and so it needs to be checking the macOS SDK version instead of the iOS one. Add tests against a greater-than SDK version just to make sure this works beyond the initially supporting SDKs.
…lvm#102300)" This reverts commit b432afc. Reverted due to linker failures in expensive-checks.
…#104805) This extends SimplifyCFG hoisting to also hoist instructions with commuted operands, for example a+b on one side and b+a on the other side. This should address the issue mentioned in: llvm#91185 (comment)
Avoids implicit sint_to_fp which wasn't occurring on strict fp codegen Fixes llvm#104848
…cy zero (llvm#102915) A long time ago (back in 2009) there was a commit 52d4d82 that changed the scheduler to not dirty height/depth when adding or removing SUnit predecessors when the latency on the edge was zero. That commit message is claiming that the depth or height isn't affected when the latency is zero. As a matter of fact, the depth/height can change even with a zero latency on the edge. If for example adding a new SUnit A, with zero latency, but as a predecessor to a SUnit B, then both height of A and depth of B should be marked as dirty. If for example B has a greater height than A, then the height of A needs to be adjusted even if the latency is zero. I think this has been wrong for many years. Downstream we have had commit 52d4d82 reverted since back in 2016. There is no motivating lit test for 52d4d82 (only an incomplete C level reproducer in llvm#3613). After commit 13d04fa there finally appeared an upstream lit test that shows that we get better code if marking height/depth as dirty (llvm/test/CodeGen/AArch64/abds.ll).
…lvm#104781) E.g.: https://godbolt.org/z/G8zK5svjK Based on Evgenii's work.
This change does two kinds of splits: - Splits each target into a different file. Some targets are left in the same files, such as riscv32/64 and x86/_64 as these tests and lists are very similar. - Splits up the very long 'note:' lines which contain a list of CPUs, using `CHECK-SAME`. There was a note about this not being possible before, but with `{{^}}`, this is now possible -- I have verified that this does the right thing if a single CPU anywhere in the list is left out. These tests had become quite annoying to change when adding a CPU, and I believe this change makes these easier to maintain, and should cut down on conflicts in these files (or at least makes conflicts easier to resolve). I apologise in advance for downstream conflicts, but hopefully that's a small amount of short term pain, in return for fewer conflicts in future.
Small PR to add additional getters for LLVMContextRef in the C API.
…lvm#104775) Another upstreaming of C API extensions we have in Julia/LLVM.jl. Although [we went](JuliaLLVM/LLVM.jl#431) with a string-based API there, here I'm proposing something that's similar to existing metadata/attribute APIs: - explicit functions to map syncscope names to IDs, and back - `LLVM*SyncScope` versions of builder APIs that already take a `SingleThread` argument: atomic rmw, atomic xchg, fence - `LLVMGetAtomicSyncScopeID` and `LLVMSetAtomicSyncScopeID` for other atomic instructions - testing through `llvm-c-test`'s `--echo` functionality
Add a hint to use the no-verify-fixpoint option.
These are annoying to update, and are redundant since the tests in clang/test/Driver/print-enabled-extensions/ were added.
) Patterns were previously added to allow the following reductions - fminimum(abs(a), abs(b)) -> famin(a, b) - fmaximum(abs(a), abs(b)) -> famax(a, b) - llvm#103027 It was suggested by @davemgreen that the following reductions are also possible - fminnum[nnan](abs(a), abs(b)) -> famin(a, b) - fmaxnum[nnan](abs(a), abs(b)) -> famax(a, b) ('nnan' documentation: https://llvm.org/docs/LangRef.html#fast-math-flags) The 'no NaNs' flag allows optimisations to assume that neither argument is a NaN, and so the differing NaN propagation semantics of llvm.maxnum/llvm.minnum and FAMAX/FAMIN can be ignored in this reduction. (llvm.maxnum/llvm.minnum: https://llvm.org/docs/LangRef.html#llvm-minnum-intrinsic) - Changes to LLVM - lib/target/AArch64/AArch64InstrInfo.td - add 'fminnm_nnan' and 'fmaxnm_nnan'; patfrags on fminnm/fmaxnm that are predicated on the intrinsic call having the 'nnan' flag. - add AArch64famin and AArch64famax patfrags, containing the new and existing reductions. - test/CodeGen/AArch64/aarch64-neon-faminmax.ll - add positive and negative tests for the new reduction, based on the presence of 'nnan' in the IR intrinsic call.
This patch moves utilities from `offload/plugins-nextgen/amdgpu/utils/UtilitiesRTL.h` to `llvm/Frontend/Offloading/Utility.h` to be reused by other projects. Concretely the following changes were made: - Rename `KernelMetaDataTy` to `AMDGPUKernelMetaData`. - Remove unused fields `KernelObject`, `KernelSegmentSize`, `ExplicitArgumentCount` and `ImplicitArgumentCount` from `AMDGPUKernelMetaData`. - Return the produced error if `ELFObj.sections()` failed instead of using `cantFail`. - Added `AGPRCount` field to `AMDGPUKernelMetaData`. - Added a default invalid value to all the fields in `AMDGPUKernelMetaData`.
…vm#104692) Inline asm operands could contain any kind of relocation, so remove the checks. Fixes llvm#103493
…e_map (llvm#104918) This test is already disabled for Windows because of symlinks. Disable it for cross build on Windows host too.
This change looks for instruction sequences that store a 64-bit constant whose upper and lower 32-bit halves are identical, usually consisting of several 'MOV' and at most one 'ORR'. If found, load only the lower 32-bit constant and store it to both halves using the 'STP' instruction. For example:
renamable $x8 = MOVZXi 49370, 0
renamable $x8 = MOVKXi $x8, 320, 16
renamable $x8 = ORRXrs $x8, $x8, 32
STRXui killed renamable $x8, killed renamable $x0, 0
becomes
$w8 = MOVZWi 49370, 0
$w8 = MOVKWi $w8, 320, 16
STPWi killed renamable $w8, killed renamable $w8, killed renamable $x0, 0
related issue : llvm#51483
…104523) Compilers and language runtimes often use helper functions that are fundamentally uninteresting when debugging anything but the compiler/runtime itself. This patch introduces a user-extensible mechanism that allows for these frames to be hidden from backtraces and automatically skipped over when navigating the stack with `up` and `down`. This does not affect the numbering of frames, so `f <N>` will still provide access to the hidden frames. The `bt` output will also print a hint that frames have been hidden. My primary motivation for this feature is to hide thunks in the Swift programming language, but I'm including an example recognizer for `std::function::operator()` that I wished for myself many times while debugging LLDB. rdar://126629381 Example output. (Yes, my proof-of-concept recognizer could hide even more frames if we had a method that returned the function name without the return type or I used something that isn't based off regex, but it's really only meant as an example). 
before: ``` (lldb) thread backtrace --filtered=false * thread #1, queue = 'com.apple.main-thread', stop reason = breakpoint 1.1 * frame #0: 0x0000000100001f04 a.out`foo(x=1, y=1) at main.cpp:4:10 frame #1: 0x0000000100003a00 a.out`decltype(std::declval<int (*&)(int, int)>()(std::declval<int>(), std::declval<int>())) std::__1::__invoke[abi:se200000]<int (*&)(int, int), int, int>(__f=0x000000016fdff280, __args=0x000000016fdff224, __args=0x000000016fdff220) at invoke.h:149:25 frame #2: 0x000000010000399c a.out`int std::__1::__invoke_void_return_wrapper<int, false>::__call[abi:se200000]<int (*&)(int, int), int, int>(__args=0x000000016fdff280, __args=0x000000016fdff224, __args=0x000000016fdff220) at invoke.h:216:12 frame #3: 0x0000000100003968 a.out`std::__1::__function::__alloc_func<int (*)(int, int), std::__1::allocator<int (*)(int, int)>, int (int, int)>::operator()[abi:se200000](this=0x000000016fdff280, __arg=0x000000016fdff224, __arg=0x000000016fdff220) at function.h:171:12 frame #4: 0x00000001000026bc a.out`std::__1::__function::__func<int (*)(int, int), std::__1::allocator<int (*)(int, int)>, int (int, int)>::operator()(this=0x000000016fdff278, __arg=0x000000016fdff224, __arg=0x000000016fdff220) at function.h:313:10 frame llvm#5: 0x0000000100003c38 a.out`std::__1::__function::__value_func<int (int, int)>::operator()[abi:se200000](this=0x000000016fdff278, __args=0x000000016fdff224, __args=0x000000016fdff220) const at function.h:430:12 frame llvm#6: 0x0000000100002038 a.out`std::__1::function<int (int, int)>::operator()(this= Function = foo(int, int) , __arg=1, __arg=1) const at function.h:989:10 frame llvm#7: 0x0000000100001f64 a.out`main(argc=1, argv=0x000000016fdff4f8) at main.cpp:9:10 frame llvm#8: 0x0000000183cdf154 dyld`start + 2476 (lldb) ``` after ``` (lldb) bt * thread #1, queue = 'com.apple.main-thread', stop reason = breakpoint 1.1 * frame #0: 0x0000000100001f04 a.out`foo(x=1, y=1) at main.cpp:4:10 frame #1: 0x0000000100003a00 
a.out`decltype(std::declval<int (*&)(int, int)>()(std::declval<int>(), std::declval<int>())) std::__1::__invoke[abi:se200000]<int (*&)(int, int), int, int>(__f=0x000000016fdff280, __args=0x000000016fdff224, __args=0x000000016fdff220) at invoke.h:149:25 frame #2: 0x000000010000399c a.out`int std::__1::__invoke_void_return_wrapper<int, false>::__call[abi:se200000]<int (*&)(int, int), int, int>(__args=0x000000016fdff280, __args=0x000000016fdff224, __args=0x000000016fdff220) at invoke.h:216:12 frame llvm#6: 0x0000000100002038 a.out`std::__1::function<int (int, int)>::operator()(this= Function = foo(int, int) , __arg=1, __arg=1) const at function.h:989:10 frame llvm#7: 0x0000000100001f64 a.out`main(argc=1, argv=0x000000016fdff4f8) at main.cpp:9:10 frame llvm#8: 0x0000000183cdf154 dyld`start + 2476 Note: Some frames were hidden by frame recognizers ```
`JITDylibSearchOrderResolver` local variable can be destroyed before completion of all callbacks. Capture it together with `Deps` in `OnEmitted` callback. Original error: ``` ==2035==ERROR: AddressSanitizer: stack-use-after-return on address 0x7bebfa155b70 at pc 0x7ff2a9a88b4a bp 0x7bec08d51980 sp 0x7bec08d51978 READ of size 8 at 0x7bebfa155b70 thread T87 (tf_xla-cpu-llvm) #0 0x7ff2a9a88b49 in operator() llvm/lib/ExecutionEngine/Orc/RTDyldObjectLinkingLayer.cpp:55:58 #1 0x7ff2a9a88b49 in __invoke<(lambda at llvm/lib/ExecutionEngine/Orc/RTDyldObjectLinkingLayer.cpp:55:9) &, const llvm::DenseMap<llvm::orc::JITDylib *, llvm::DenseSet<llvm::orc::SymbolStringPtr, llvm::DenseMapInfo<llvm::orc::SymbolStringPtr, void> >, llvm::DenseMapInfo<llvm::orc::JITDylib *, void>, llvm::detail::DenseMapPair<llvm::orc::JITDylib *, llvm::DenseSet<llvm::orc::SymbolStringPtr, llvm::DenseMapInfo<llvm::orc::SymbolStringPtr, void> > > > &> libcxx/include/__type_traits/invoke.h:149:25 #2 0x7ff2a9a88b49 in __call<(lambda at llvm/lib/ExecutionEngine/Orc/RTDyldObjectLinkingLayer.cpp:55:9) &, const llvm::DenseMap<llvm::orc::JITDylib *, llvm::DenseSet<llvm::orc::SymbolStringPtr, llvm::DenseMapInfo<llvm::orc::SymbolStringPtr, void> >, llvm::DenseMapInfo<llvm::orc::JITDylib *, void>, llvm::detail::DenseMapPair<llvm::orc::JITDylib *, llvm::DenseSet<llvm::orc::SymbolStringPtr, llvm::DenseMapInfo<llvm::orc::SymbolStringPtr, void> > > > &> libcxx/include/__type_traits/invoke.h:224:5 #3 0x7ff2a9a88b49 in operator() libcxx/include/__functional/function.h:210:12 #4 0x7ff2a9a88b49 in void std::__u::__function::__policy_invoker<void (llvm::DenseMap<llvm::orc::JITDylib*, llvm::DenseSet<llvm::orc::SymbolStringPtr, ```
Static destructor can race with calls to notify and trigger tsan warning. ``` WARNING: ThreadSanitizer: data race (pid=5787) Write of size 1 at 0x55bec9df8de8 by thread T23: #0 pthread_mutex_destroy [third_party/llvm/llvm-project/compiler-rt/lib/tsan/rtl/tsan_interceptors_posix.cpp:1344](third_party/llvm/llvm-project/compiler-rt/lib/tsan/rtl/tsan_interceptors_posix.cpp?l=1344&cl=669089572):3 (be1eb158bb70fc9cf7be2db70407e512890e5c6e20720cd88c69d7d9c26ea531_0200d5f71908+0x1b12affb) (BuildId: ff25ace8b17d9863348bb1759c47246c) #1 __libcpp_recursive_mutex_destroy [third_party/crosstool/v18/stable/src/libcxx/include/__thread/support/pthread.h:91](third_party/crosstool/v18/stable/src/libcxx/include/__thread/support/pthread.h?l=91&cl=669089572):10 (be1eb158bb70fc9cf7be2db70407e512890e5c6e20720cd88c69d7d9c26ea531_0200d5f71908+0x4523d4e9) (BuildId: ff25ace8b17d9863348bb1759c47246c) #2 std::__tsan::recursive_mutex::~recursive_mutex() [third_party/crosstool/v18/stable/src/libcxx/src/mutex.cpp:52](third_party/crosstool/v18/stable/src/libcxx/src/mutex.cpp?l=52&cl=669089572):11 (be1eb158bb70fc9cf7be2db70407e512890e5c6e20720cd88c69d7d9c26ea531_0200d5f71908+0x4523d4e9) #3 ~SmartMutex [third_party/llvm/llvm-project/llvm/include/llvm/Support/Mutex.h:28](third_party/llvm/llvm-project/llvm/include/llvm/Support/Mutex.h?l=28&cl=669089572):11 (be1eb158bb70fc9cf7be2db70407e512890e5c6e20720cd88c69d7d9c26ea531_0200d5f71908+0x2bcaedfe) (BuildId: ff25ace8b17d9863348bb1759c47246c) #4 (anonymous namespace)::PerfJITEventListener::~PerfJITEventListener() [third_party/llvm/llvm-project/llvm/lib/ExecutionEngine/PerfJITEvents/PerfJITEventListener.cpp:65](third_party/llvm/llvm-project/llvm/lib/ExecutionEngine/PerfJITEvents/PerfJITEventListener.cpp?l=65&cl=669089572):3 (be1eb158bb70fc9cf7be2db70407e512890e5c6e20720cd88c69d7d9c26ea531_0200d5f71908+0x2bcaedfe) llvm#5 cxa_at_exit_callback_installed_at(void*) 
[third_party/llvm/llvm-project/compiler-rt/lib/tsan/rtl/tsan_interceptors_posix.cpp:437](third_party/llvm/llvm-project/compiler-rt/lib/tsan/rtl/tsan_interceptors_posix.cpp?l=437&cl=669089572):3 (be1eb158bb70fc9cf7be2db70407e512890e5c6e20720cd88c69d7d9c26ea531_0200d5f71908+0x1b172cb9) (BuildId: ff25ace8b17d9863348bb1759c47246c) llvm#6 llvm::JITEventListener::createPerfJITEventListener() [third_party/llvm/llvm-project/llvm/lib/ExecutionEngine/PerfJITEvents/PerfJITEventListener.cpp:496](third_party/llvm/llvm-project/llvm/lib/ExecutionEngine/PerfJITEvents/PerfJITEventListener.cpp?l=496&cl=669089572):3 (be1eb158bb70fc9cf7be2db70407e512890e5c6e20720cd88c69d7d9c26ea531_0200d5f71908+0x2bcad8f5) (BuildId: ff25ace8b17d9863348bb1759c47246c) ``` ``` Previous atomic read of size 1 at 0x55bec9df8de8 by thread T192 (mutexes: write M0, write M1): #0 pthread_mutex_unlock [third_party/llvm/llvm-project/compiler-rt/lib/tsan/rtl/tsan_interceptors_posix.cpp:1387](third_party/llvm/llvm-project/compiler-rt/lib/tsan/rtl/tsan_interceptors_posix.cpp?l=1387&cl=669089572):3 (be1eb158bb70fc9cf7be2db70407e512890e5c6e20720cd88c69d7d9c26ea531_0200d5f71908+0x1b12b6bb) (BuildId: ff25ace8b17d9863348bb1759c47246c) #1 __libcpp_recursive_mutex_unlock [third_party/crosstool/v18/stable/src/libcxx/include/__thread/support/pthread.h:87](third_party/crosstool/v18/stable/src/libcxx/include/__thread/support/pthread.h?l=87&cl=669089572):10 (be1eb158bb70fc9cf7be2db70407e512890e5c6e20720cd88c69d7d9c26ea531_0200d5f71908+0x4523d589) (BuildId: ff25ace8b17d9863348bb1759c47246c) #2 std::__tsan::recursive_mutex::unlock() [third_party/crosstool/v18/stable/src/libcxx/src/mutex.cpp:64](third_party/crosstool/v18/stable/src/libcxx/src/mutex.cpp?l=64&cl=669089572):11 (be1eb158bb70fc9cf7be2db70407e512890e5c6e20720cd88c69d7d9c26ea531_0200d5f71908+0x4523d589) #3 unlock [third_party/llvm/llvm-project/llvm/include/llvm/Support/Mutex.h:47](third_party/llvm/llvm-project/llvm/include/llvm/Support/Mutex.h?l=47&cl=669089572):16 
(be1eb158bb70fc9cf7be2db70407e512890e5c6e20720cd88c69d7d9c26ea531_0200d5f71908+0x2bcaf968) (BuildId: ff25ace8b17d9863348bb1759c47246c) #4 ~lock_guard [third_party/crosstool/v18/stable/src/libcxx/include/__mutex/lock_guard.h:39](third_party/crosstool/v18/stable/src/libcxx/include/__mutex/lock_guard.h?l=39&cl=669089572):101 (be1eb158bb70fc9cf7be2db70407e512890e5c6e20720cd88c69d7d9c26ea531_0200d5f71908+0x2bcaf968) llvm#5 (anonymous namespace)::PerfJITEventListener::notifyObjectLoaded(unsigned long, llvm::object::ObjectFile const&, llvm::RuntimeDyld::LoadedObjectInfo const&) [third_party/llvm/llvm-project/llvm/lib/ExecutionEngine/PerfJITEvents/PerfJITEventListener.cpp:290](https://cs.corp.google.com/piper///depot/google3/third_party/llvm/llvm-project/llvm/lib/ExecutionEngine/PerfJITEvents/PerfJITEventListener.cpp?l=290&cl=669089572):1 (be1eb158bb70fc9cf7be2db70407e512890e5c6e20720cd88c69d7d9c26ea531_0200d5f71908+0x2bcaf968) llvm#6 llvm::orc::RTDyldObjectLinkingLayer::onObjEmit(llvm::orc::MaterializationResponsibility&, llvm::object::OwningBinary<llvm::object::ObjectFile>, std::__tsan::unique_ptr<llvm::RuntimeDyld::MemoryManager, std::__tsan::default_delete<llvm::RuntimeDyld::MemoryManager>>, std::__tsan::unique_ptr<llvm::RuntimeDyld::LoadedObjectInfo, std::__tsan::default_delete<llvm::RuntimeDyld::LoadedObjectInfo>>, std::__tsan::unique_ptr<llvm::DenseMap<llvm::orc::JITDylib*, llvm::DenseSet<llvm::orc::SymbolStringPtr, llvm::DenseMapInfo<llvm::orc::SymbolStringPtr, void>>, llvm::DenseMapInfo<llvm::orc::JITDylib*, void>, llvm::detail::DenseMapPair<llvm::orc::JITDylib*, llvm::DenseSet<llvm::orc::SymbolStringPtr, llvm::DenseMapInfo<llvm::orc::SymbolStringPtr, void>>>>, std::__tsan::default_delete<llvm::DenseMap<llvm::orc::JITDylib*, llvm::DenseSet<llvm::orc::SymbolStringPtr, llvm::DenseMapInfo<llvm::orc::SymbolStringPtr, void>>, llvm::DenseMapInfo<llvm::orc::JITDylib*, void>, llvm::detail::DenseMapPair<llvm::orc::JITDylib*, llvm::DenseSet<llvm::orc::SymbolStringPtr, 
llvm::DenseMapInfo<llvm::orc::SymbolStringPtr, void>>>>>>, llvm::Error) [third_party/llvm/llvm-project/llvm/lib/ExecutionEngine/Orc/RTDyldObjectLinkingLayer.cpp:386](https://cs.corp.google.com/piper///depot/google3/third_party/llvm/llvm-project/llvm/lib/ExecutionEngine/Orc/RTDyldObjectLinkingLayer.cpp?l=386&cl=669089572):10 (be1eb158bb70fc9cf7be2db70407e512890e5c6e20720cd88c69d7d9c26ea531_0200d5f71908+0x2bc404a8) (BuildId: ff25ace8b17d9863348bb1759c47246c) ```
The documentation part of the PR is WIP.
@ciyongch @ZhennanQin I cannot add you to the reviewers. Please help to review anyway. :)