Skip to content

Memory planning #3

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 2,245 commits into
base: main
Choose a base branch
from
Open

Memory planning #3

wants to merge 2,245 commits into from

Conversation

Menooker
Copy link
Owner

The document part of PR is WIP

@ciyongch @ZhennanQin I cannot add you to the reviewers. Please help to review anyway. :)

@Menooker
Copy link
Owner Author

Comment on lines 113 to 120
using TraceCollectorFunc = std::function<FailureOr<MemoryTraceScopes>(
Operation *, const BufferViewFlowAnalysis &,
const MergeAllocationOptions &)>;
using MemoryPlannerFunc = std::function<FailureOr<MemorySchedule>(
Operation *, const LifetimeTrace &, const MergeAllocationOptions &)>;
using MemoryMergeMutatorFunc = std::function<LogicalResult(
Operation *toplevel, Operation *scope, const MemorySchedule &,
const MergeAllocationOptions &)>;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since we're proposing an general fwk for merge-alloc, why not considering make these three func pointer to the
equivalent C++ class with necessary interface, then all the customized implementation can implement their own interface?

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Then there will be no such framework, as different implementations will handle all details internally. I split them into 3 functions to decouple the three stages and and facilitate code reuse.

Comment on lines 23 to 33
func.func @mlp(%x: tensor<128x128xf32>, %y: tensor<128x128xf32>) -> tensor<128x128xf32> {
%a0 = tensor.empty() : tensor<128x128xf32>
%a = linalg.matmul ins(%x, %y: tensor<128x128xf32>, tensor<128x128xf32>) outs(%a0: tensor<128x128xf32>) -> tensor<128x128xf32>
%b0 = tensor.empty() : tensor<128x128xf32>
%b = linalg.matmul ins(%a, %y: tensor<128x128xf32>, tensor<128x128xf32>) outs(%b0: tensor<128x128xf32>) -> tensor<128x128xf32>
%c0 = tensor.empty() : tensor<128x128xf32>
%c = linalg.matmul ins(%b, %y: tensor<128x128xf32>, tensor<128x128xf32>) outs(%c0: tensor<128x128xf32>) -> tensor<128x128xf32>
%d0 = tensor.empty() : tensor<128x128xf32>
%d = linalg.matmul ins(%c, %y: tensor<128x128xf32>, tensor<128x128xf32>) outs(%d0: tensor<128x128xf32>) -> tensor<128x128xf32>
return %d : tensor<128x128xf32>
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we can update this example with different buffer size for the tensors (like a0 and c0, etc) or even complex scenario like control flow for better demonstrating the benefit of this proposal?

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would like a simple example to show the core idea of the pass: memory reuse, merged alloc, replace by memref.view

4. Use a "static-memory-planner" to handle the linear timeline

Limitations of Tick-based merge-alloc:
* only contiguous, static shaped and identical layout memrefs are considered.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems these are the common limitation for this fwk, not only this default implementation?

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In theory, we don't need limit that. If an implementation can magically schedule the dynamic shaped tensor (e.g. by considering the additionaly restrictions of the shapes), it is possible to schedule the buffers.

```mlir
func.func @mlp(...) { // <---- alloc scope 1
scf.for(...) { // <---- NOT an alloc scope!
// allocation inside will be merge to alloc scope 1 above
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the rewrite order of nested alloc scope? How do you handle the new buffer which is hoisted out from scf.for?

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The hoisting is "one-shot" - it is decided by the getAllocationScope. There will be no second run of hoisting.

in cache. In the default configuration (when `no-consider-locality` option is
not specified to the merge-alloc pass), static memory planner considers both
cache-locality and the degree of matching of allocation size and the chunk size
for each free memory chunks, with a simple cost-model. With
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"cost-model" might be too heavy? use "heuristic rule" instead?

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we need to rename the pass option as well?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about locality-and-size and size-only?

Comment on lines 64 to 66
opt.tracer = memref::TickCollecter();
opt.planner = memref::tickBasedPlanMemory;
opt.mutator = memref::MergeAllocDefaultMutator();
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does it make sense to move the default implementation into a sub namespace of memref?

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess it is enough to put it in memref namespace? Maybe we can let the community decide?

Operation *op) const {
auto parent = op;
for (;;) {
parent = parent->getParentWithTrait<OpTrait::AutomaticAllocationScope>();
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can this correctly handle sth like below:

scf.forall(...)
  ...
  scf.for(...)
     scf.if(...)
		%buf = memref.alloc() : 

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, scf.if is not a AutomaticAllocationScope, the pass will find the scf.forall instead

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We might need special handle as well for the allocation happened inside scf.if? as it might not always true that we need to allocate this buffer, right?

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I agree. scf.if is RegionBranchOpInterface which we also take care of. :)

krzysz00 and others added 26 commits August 16, 2024 16:24
In cases (like the ones added in the tests) where the condition of a
masked load or store is a splat but not a constant (that is, a masked
operation is being used to implement patterns like "load if the current
lane is in-bounds, otherwise return 0"), optimize the 'scalarized' code
to perform an aligned vector load/store if the splat constant is true.

Additionally, take a few steps to preserve aliasing information and
names when nothing is scalarized while I'm here.

As motivation, some LLVM IR users will genatate masked load/store in
cases that map to this kind of predicated operation (where either the
vector is loaded/stored or it isn't) in order to take advantage of
hardware primitives, but on AMDGPU, where we don't have a masked load or
store, this pass would scalarize a load or store that was intended to be
- and can be - vectorized while also introducing expensive branches.

Fixes llvm#104520

Pre-commit tests at llvm#104527
We need to mask the SRL result to 8 bits before ORing in the SLL. This
is needed in case bits 23:16 of the input aren't zero. They will have
been shifted into bits 15:8.

We don't need to AND the result with 0xffff. It's ok if the upper 16
bits of the register are garbage.

Fixes llvm#103035.
This reverts commit 43ffe2e.

Reason: buildbot breakage starting at https://lab.llvm.org/buildbot/#/builders/85/builds/1102

I manually bisected and found that clang crashed with 43ffe2e but not the immediately preceding commit (3319049)
…01282)

Part of llvm#101129

This patch adds support for attaching comments to enums for HTML in
clang-doc. It changes the enum generation to table tags where as
perviously we're using lists which is more in line with what other doc
generators are doing. It also gives clang-doc the ability to show user
specified enum values
…vm#104620)

In this test `@initializer()` can access globals
outside of the module, but Asan does nothing to
detect that.
Summary:
Currently we have some sema checks to make sure users don't apply
kernel-only attributes to non-kernel functions. However, this currently
did not correctly check for bare NVPTX / AMDGPU kernel attributes,
making it impossible to use them at all w/o CUDA enabled. This patch
fixes that by checking for the calling convention / attributes directly.
This patch implements sandboxir::SwitchInst mirroring llvm::SwitchInst.
…lvm#104626)

All expansions end with replacing the previous inrinsic with the new
expansion and erasing the old one. By moving this operation to the
caller, these expansion functions can be called in more contexts and a
small amount of duplicated code is consolidated.

Pre-req for llvm#88056
…2078)

Address llvm#101550 by adding OwnLineWithBrace option for RequiresClausePosition. This permits placing a following '{' on the same line as the requires clause.

Thus, instead of:
```
bool Foo ()
  requires(true)
{
  return true;
}
```

we have:
```
bool Foo ()
  requires(true) {
  return true;
}
```

If the function body is empty, we'll get:
```
bool Foo ()
  requires(true) {}
```

I attempted to get a line break between the open and close braces, but
failed. Perhaps that's fine -- it's rare and only happens in the empty
body case.
We don't have feature implications on any other Zvk extensions and
we have error messages in RISCVISAInfo if Zve or V is not enabled.
I'm working on testing and refactoring in that code so I'd like to
make it consistent.
…llvm#104621)

Those modules still can have global constructors and access
globals in other modules which are not initialized yet.
This commit adds NVPTX codegen support for brkpt instruction
(https://docs.nvidia.com/cuda/parallel-thread-execution/#miscellaneous-instructions-brkpt)
with test under CodeGen/NVPTX/brkpt.ll
Part of an effort to make getConstant stricter about implicit truncation
when converting uint64_t to APInt.
…terializations` (llvm#104630)

There was a typo in the code path that removes unnecessary
materializations.

Before: Update `opResult` (result of an op different from `user`) in
mapping and remove `user`.
```
replaceMaterialization(rewriterImpl, opResult, inputOperands,
                       inverseMapping);
necessaryMaterializations.remove(materializationOps.lookup(user));
```

After: Update `user->getResults()` in mapping and remove `user`.
```
replaceMaterialization(rewriterImpl, user->getResults(), inputOperands,
                       inverseMapping);
necessaryMaterializations.remove(materializationOps.lookup(user));
```
…lvm#104395)

Prevent operand folding from inlining constants into pseudo scalar
transcendental f16 instructions.
However still allow literal constants.
hazzlim and others added 27 commits August 20, 2024 11:00
Use the nuw attribute of GEPs to prove that pointers do not alias, in
cases matching the following:

   +                +                     +
   | BaseOffset     |   +<nuw> Indices    |
   ---------------->|-------------------->|
   |-->V2Size       |                     |-------> V1Size
  LHS                                    RHS

If the difference between pointers is Offset +<nuw> Indices then we know
that the addition does not wrap the pointer index type (add nuw) and the
constant Offset is a lower bound on the distance between the pointers. We
can then prove NoAlias via Offset u>= V2Size.
…102613)

After decomposition of OpenMP compound constructs and assignment of
applicable clauses to each leaf construct, composite constructs are then
combined again into a single element in the construct queue. This helped
later lowering stages easily identify composite constructs.

However, as a result of the re-composition stage, the same list of
clauses is used to produce all MLIR operations corresponding to each
leaf of the original composite construct. This undoes existing logic
introducing implicit clauses and deciding to which leaf construct(s)
each clause applies.

This patch removes construct re-composition logic and updates Flang
lowering to be able to identify composite constructs from a list of leaf
constructs. As a result, the right set of clauses is produced for each
operation representing a leaf of a composite construct.

PR stack:
- llvm#102612
- llvm#102613
…vm#104595)

This new interface is supposed to capture the core functionality of
DLTI: querying for values at keys. As such this new interface unifies
the ability to query DLTI attributes in a single method: query(). All
existing DLTI interfaces exposing their own query methods now 1) now
extend this new interface and 2) provide a default implementation for
`query()`.

As DLTIQueryInterface::query() returns an attribute, it naturally
enables recursive queries on nested DLTI attrs. A utility function,
`dlti::query()`, implements the logic for nested lookups.

A new `#dlti.map` attribute is introduced to capture the most generic
form of a finite DLTI-mapping. One of the benefits is that it allows for
more easily encoding hierachical information that is suitably queryable,
i.e. by means of nested attributes.

In line with the above, `transform.dlti.query` is modified so as to take
an arbitrary number of keys and to perform a nested lookup using the
above utility function.
…atalyst (llvm#104872)

Mac Catalyst is the iOS platform, but it builds against the macOS SDK
and so it needs to be checking the macOS SDK version instead of the iOS
one. Add tests against a greater-than SDK version just to make sure this
works beyond the initially supporting SDKs.
…lvm#102300)"

This reverts commit b432afc.

Reverted due to linker failures in expensive-checks.
…#104805)

This extends SimplifyCFG hoisting to also hoist instructions with
commuted operands, for example a+b on one side and b+a on the other
side.

This should address the issue mentioned in:
llvm#91185 (comment)
Avoids implicit sint_to_fp which wasn't occurring on strict fp codegen

Fixes llvm#104848
…cy zero (llvm#102915)

A long time ago (back in 2009) there was a commit 52d4d82
that changed the scheduler to not dirty height/depth when adding or
removing SUnit predecessors when the latency on the edge was zero. That
commit message is claiming that the depth or height isn't affected when
the latency is zero.

As a matter of fact, the depth/height can change even with a zero
latency on the edge. If for example adding a new SUnit A, with zero
latency, but as a predecessor to a SUnit B, then both height of A and
depth of B should be marked as dirty. If for example B has a greater
height than A, then the height of A needs to be adjusted even if the
latency is zero.

I think this has been wrong for many years. Downstream we have had
commit 52d4d82 reverted since back in 2016. There is no
motivating lit test for 52d4d82 (only an incomplete C level
reproducer in llvm#3613).

After commit 13d04fa there finally appeared an upstream
lit test that shows that we get better code if marking height/depth as
dirty (llvm/test/CodeGen/AArch64/abds.ll).
This change does two kinds of splits:
- Splits each target into a different file. Some targets are left in the
same files, such as riscv32/64 and x86/_64 as these tests and lists are
very similar.
- Splits up the very long 'note:' lines which contain a list of CPUs,
using `CHECK-SAME`. There was a note about this not being possible
before, but with `{{^}}`, this is now possible -- I have
verified that this does the right thing if a single CPU anywhere in the
list is left out.

These tests had become quite annoying to change when adding a CPU, and I
believe this change makes these easier to maintain, and should cut down
on conflicts in these files (or at least makes conflicts easier to
resolve).

I apologise in advance for downstream conflicts, but hopefully that's a
small amount of short term pain, in return for fewer conflicts in
future.
Small PR to add additional getters for LLVMContextRef in the C API.
…lvm#104775)

Another upstreaming of C API extensions we have in Julia/LLVM.jl.
Although [we went](JuliaLLVM/LLVM.jl#431) with a
string-based API there, here I'm proposing something that's similar to
existing metadata/attribute APIs:
- explicit functions to map syncscope names to IDs, and back
- `LLVM*SyncScope` versions of builder APIs that already take a
`SingleThread` argument: atomic rmw, atomic xchg, fence
- `LLVMGetAtomicSyncScopeID` and `LLVMSetAtomicSyncScopeID` for other
atomic instructions
- testing through `llvm-c-test`'s `--echo` functionality
Add a hint to use the no-verify-fixpoint option.
These are annoying to update, and are redundant since the tests in
clang/test/Driver/print-enabled-extensions/ were added.
)

Patterns were previously added to allow the following reductions
- fminimum(abs(a), abs(b)) -> famin(a, b)
- fmaximum(abs(a), abs(b)) -> famax(a, b)
- llvm#103027
 
It was suggested by @davemgreen that the following reductions are also
possible
- fminnum[nnan](abs(a), abs(b)) -> famin(a, b)
- fmaxnum[nnan](abs(a), abs(b)) -> famax(a, b)

('nnan' documenatation:
https://llvm.org/docs/LangRef.html#fast-math-flags)

The 'no NaNs' flag allows optimisations to assume that neither argument
is a NaN, and so the differing NaN propagation semantics of
llvm.maxnum/llvm.minnum and FAMAX/FAMIN can be ignored in this
reduction.
(llvm.maxnum/llvm.minnum:
https://llvm.org/docs/LangRef.html#llvm-minnum-intrinsic)

- Changes to LLVM
	- lib/target/AArch64/AArch64InstrInfo.td
- add 'fminnm_nnan' and 'fmaxnm_nnan'; patfrags on fminnm/fmaxnm that
are predicated on the instrinsic call having the 'nnan' flag.
- add AArch64famin and AArch64famax patfrags, containing the new and
existing reductions.
	- test/CodeGen/AArch64/aarch64-neon-faminmax.ll
- add positive and negative tests for the new reduction, based on the
presence of 'nnan' in the IR intrinsic call.
This patch moves utilities from
`offload/plugins-nextgen/amdgpu/utils/UtilitiesRTL.h` to
`llvm/Frontend/Offloading/Utility.h` to be reused by
other projects.

Concretely the following changes were made:
- Rename `KernelMetaDataTy` to `AMDGPUKernelMetaData`.
- Remove unused fields `KernelObject`, `KernelSegmentSize`,
`ExplicitArgumentCount` and `ImplicitArgumentCount` from
`AMDGPUKernelMetaData`.
- Return the produced error if `ELFObj.sections()` failed instead of
using `cantFail`.
- Added `AGPRCount` field to `AMDGPUKernelMetaData`.
- Added a default invalid value to all the fields in
`AMDGPUKernelMetaData`.
…vm#104692)

Inline asm operands could contain any kind of relocation, so remove the
checks.

Fixes llvm#103493
…e_map (llvm#104918)

This test is already disabled for Windows because of symlinks. Disable
it for cross build on Windows host too.
This change looks for instructions of storing symmetric constants
instruction 32-bit units. usually consisting of several 'MOV' and
one or less 'ORR'.

If found, load only the lower 32-bit constant and change it to copy
and save to the upper 32-bit using the 'STP' instruction.

For example:
  renamable $x8 = MOVZXi 49370, 0
  renamable $x8 = MOVKXi $x8, 320, 16
  renamable $x8 = ORRXrs $x8, $x8, 32
  STRXui killed renamable $x8, killed renamable $x0, 0
becomes
  $w8 = MOVZWi 49370, 0
  $w8 = MOVKWi $w8, 320, 16
STPWi killed renamable $w8, killed renamable $w8, killed renamable $x0,
0


related issue : llvm#51483
Menooker pushed a commit that referenced this pull request Sep 3, 2024
…104523)

Compilers and language runtimes often use helper functions that are
fundamentally uninteresting when debugging anything but the
compiler/runtime itself. This patch introduces a user-extensible
mechanism that allows for these frames to be hidden from backtraces and
automatically skipped over when navigating the stack with `up` and
`down`.

This does not affect the numbering of frames, so `f <N>` will still
provide access to the hidden frames. The `bt` output will also print a
hint that frames have been hidden.

My primary motivation for this feature is to hide thunks in the Swift
programming language, but I'm including an example recognizer for
`std::function::operator()` that I wished for myself many times while
debugging LLDB.

rdar://126629381


Example output. (Yes, my proof-of-concept recognizer could hide even
more frames if we had a method that returned the function name without
the return type or I used something that isn't based off regex, but it's
really only meant as an example).

before:
```
(lldb) thread backtrace --filtered=false
* thread #1, queue = 'com.apple.main-thread', stop reason = breakpoint 1.1
  * frame #0: 0x0000000100001f04 a.out`foo(x=1, y=1) at main.cpp:4:10
    frame #1: 0x0000000100003a00 a.out`decltype(std::declval<int (*&)(int, int)>()(std::declval<int>(), std::declval<int>())) std::__1::__invoke[abi:se200000]<int (*&)(int, int), int, int>(__f=0x000000016fdff280, __args=0x000000016fdff224, __args=0x000000016fdff220) at invoke.h:149:25
    frame #2: 0x000000010000399c a.out`int std::__1::__invoke_void_return_wrapper<int, false>::__call[abi:se200000]<int (*&)(int, int), int, int>(__args=0x000000016fdff280, __args=0x000000016fdff224, __args=0x000000016fdff220) at invoke.h:216:12
    frame #3: 0x0000000100003968 a.out`std::__1::__function::__alloc_func<int (*)(int, int), std::__1::allocator<int (*)(int, int)>, int (int, int)>::operator()[abi:se200000](this=0x000000016fdff280, __arg=0x000000016fdff224, __arg=0x000000016fdff220) at function.h:171:12
    frame #4: 0x00000001000026bc a.out`std::__1::__function::__func<int (*)(int, int), std::__1::allocator<int (*)(int, int)>, int (int, int)>::operator()(this=0x000000016fdff278, __arg=0x000000016fdff224, __arg=0x000000016fdff220) at function.h:313:10
    frame llvm#5: 0x0000000100003c38 a.out`std::__1::__function::__value_func<int (int, int)>::operator()[abi:se200000](this=0x000000016fdff278, __args=0x000000016fdff224, __args=0x000000016fdff220) const at function.h:430:12
    frame llvm#6: 0x0000000100002038 a.out`std::__1::function<int (int, int)>::operator()(this= Function = foo(int, int) , __arg=1, __arg=1) const at function.h:989:10
    frame llvm#7: 0x0000000100001f64 a.out`main(argc=1, argv=0x000000016fdff4f8) at main.cpp:9:10
    frame llvm#8: 0x0000000183cdf154 dyld`start + 2476
(lldb) 
```

after

```
(lldb) bt
* thread #1, queue = 'com.apple.main-thread', stop reason = breakpoint 1.1
  * frame #0: 0x0000000100001f04 a.out`foo(x=1, y=1) at main.cpp:4:10
    frame #1: 0x0000000100003a00 a.out`decltype(std::declval<int (*&)(int, int)>()(std::declval<int>(), std::declval<int>())) std::__1::__invoke[abi:se200000]<int (*&)(int, int), int, int>(__f=0x000000016fdff280, __args=0x000000016fdff224, __args=0x000000016fdff220) at invoke.h:149:25
    frame #2: 0x000000010000399c a.out`int std::__1::__invoke_void_return_wrapper<int, false>::__call[abi:se200000]<int (*&)(int, int), int, int>(__args=0x000000016fdff280, __args=0x000000016fdff224, __args=0x000000016fdff220) at invoke.h:216:12
    frame llvm#6: 0x0000000100002038 a.out`std::__1::function<int (int, int)>::operator()(this= Function = foo(int, int) , __arg=1, __arg=1) const at function.h:989:10
    frame llvm#7: 0x0000000100001f64 a.out`main(argc=1, argv=0x000000016fdff4f8) at main.cpp:9:10
    frame llvm#8: 0x0000000183cdf154 dyld`start + 2476
Note: Some frames were hidden by frame recognizers
```
Menooker pushed a commit that referenced this pull request Sep 3, 2024
`JITDylibSearchOrderResolver` local variable can be destroyed before
completion of all callbacks. Capture it together with `Deps` in
`OnEmitted` callback.

Original error:

```
==2035==ERROR: AddressSanitizer: stack-use-after-return on address 0x7bebfa155b70 at pc 0x7ff2a9a88b4a bp 0x7bec08d51980 sp 0x7bec08d51978
READ of size 8 at 0x7bebfa155b70 thread T87 (tf_xla-cpu-llvm)
    #0 0x7ff2a9a88b49 in operator() llvm/lib/ExecutionEngine/Orc/RTDyldObjectLinkingLayer.cpp:55:58
    #1 0x7ff2a9a88b49 in __invoke<(lambda at llvm/lib/ExecutionEngine/Orc/RTDyldObjectLinkingLayer.cpp:55:9) &, const llvm::DenseMap<llvm::orc::JITDylib *, llvm::DenseSet<llvm::orc::SymbolStringPtr, llvm::DenseMapInfo<llvm::orc::SymbolStringPtr, void> >, llvm::DenseMapInfo<llvm::orc::JITDylib *, void>, llvm::detail::DenseMapPair<llvm::orc::JITDylib *, llvm::DenseSet<llvm::orc::SymbolStringPtr, llvm::DenseMapInfo<llvm::orc::SymbolStringPtr, void> > > > &> libcxx/include/__type_traits/invoke.h:149:25
    #2 0x7ff2a9a88b49 in __call<(lambda at llvm/lib/ExecutionEngine/Orc/RTDyldObjectLinkingLayer.cpp:55:9) &, const llvm::DenseMap<llvm::orc::JITDylib *, llvm::DenseSet<llvm::orc::SymbolStringPtr, llvm::DenseMapInfo<llvm::orc::SymbolStringPtr, void> >, llvm::DenseMapInfo<llvm::orc::JITDylib *, void>, llvm::detail::DenseMapPair<llvm::orc::JITDylib *, llvm::DenseSet<llvm::orc::SymbolStringPtr, llvm::DenseMapInfo<llvm::orc::SymbolStringPtr, void> > > > &> libcxx/include/__type_traits/invoke.h:224:5
    #3 0x7ff2a9a88b49 in operator() libcxx/include/__functional/function.h:210:12
    #4 0x7ff2a9a88b49 in void std::__u::__function::__policy_invoker<void (llvm::DenseMap<llvm::orc::JITDylib*, llvm::DenseSet<llvm::orc::SymbolStringPtr,
```
Menooker pushed a commit that referenced this pull request Sep 3, 2024
Static destructor can race with calls to notify and trigger tsan
warning.

```
WARNING: ThreadSanitizer: data race (pid=5787)
  Write of size 1 at 0x55bec9df8de8 by thread T23:
    #0 pthread_mutex_destroy [third_party/llvm/llvm-project/compiler-rt/lib/tsan/rtl/tsan_interceptors_posix.cpp:1344](third_party/llvm/llvm-project/compiler-rt/lib/tsan/rtl/tsan_interceptors_posix.cpp?l=1344&cl=669089572):3 (be1eb158bb70fc9cf7be2db70407e512890e5c6e20720cd88c69d7d9c26ea531_0200d5f71908+0x1b12affb) (BuildId: ff25ace8b17d9863348bb1759c47246c)
    #1 __libcpp_recursive_mutex_destroy [third_party/crosstool/v18/stable/src/libcxx/include/__thread/support/pthread.h:91](third_party/crosstool/v18/stable/src/libcxx/include/__thread/support/pthread.h?l=91&cl=669089572):10 (be1eb158bb70fc9cf7be2db70407e512890e5c6e20720cd88c69d7d9c26ea531_0200d5f71908+0x4523d4e9) (BuildId: ff25ace8b17d9863348bb1759c47246c)
    #2 std::__tsan::recursive_mutex::~recursive_mutex() [third_party/crosstool/v18/stable/src/libcxx/src/mutex.cpp:52](third_party/crosstool/v18/stable/src/libcxx/src/mutex.cpp?l=52&cl=669089572):11 (be1eb158bb70fc9cf7be2db70407e512890e5c6e20720cd88c69d7d9c26ea531_0200d5f71908+0x4523d4e9)
    #3 ~SmartMutex [third_party/llvm/llvm-project/llvm/include/llvm/Support/Mutex.h:28](third_party/llvm/llvm-project/llvm/include/llvm/Support/Mutex.h?l=28&cl=669089572):11 (be1eb158bb70fc9cf7be2db70407e512890e5c6e20720cd88c69d7d9c26ea531_0200d5f71908+0x2bcaedfe) (BuildId: ff25ace8b17d9863348bb1759c47246c)
    #4 (anonymous namespace)::PerfJITEventListener::~PerfJITEventListener() [third_party/llvm/llvm-project/llvm/lib/ExecutionEngine/PerfJITEvents/PerfJITEventListener.cpp:65](third_party/llvm/llvm-project/llvm/lib/ExecutionEngine/PerfJITEvents/PerfJITEventListener.cpp?l=65&cl=669089572):3 (be1eb158bb70fc9cf7be2db70407e512890e5c6e20720cd88c69d7d9c26ea531_0200d5f71908+0x2bcaedfe)
    llvm#5 cxa_at_exit_callback_installed_at(void*) [third_party/llvm/llvm-project/compiler-rt/lib/tsan/rtl/tsan_interceptors_posix.cpp:437](third_party/llvm/llvm-project/compiler-rt/lib/tsan/rtl/tsan_interceptors_posix.cpp?l=437&cl=669089572):3 (be1eb158bb70fc9cf7be2db70407e512890e5c6e20720cd88c69d7d9c26ea531_0200d5f71908+0x1b172cb9) (BuildId: ff25ace8b17d9863348bb1759c47246c)
    llvm#6 llvm::JITEventListener::createPerfJITEventListener() [third_party/llvm/llvm-project/llvm/lib/ExecutionEngine/PerfJITEvents/PerfJITEventListener.cpp:496](third_party/llvm/llvm-project/llvm/lib/ExecutionEngine/PerfJITEvents/PerfJITEventListener.cpp?l=496&cl=669089572):3 (be1eb158bb70fc9cf7be2db70407e512890e5c6e20720cd88c69d7d9c26ea531_0200d5f71908+0x2bcad8f5) (BuildId: ff25ace8b17d9863348bb1759c47246c)
```
```
Previous atomic read of size 1 at 0x55bec9df8de8 by thread T192 (mutexes: write M0, write M1):
    #0 pthread_mutex_unlock [third_party/llvm/llvm-project/compiler-rt/lib/tsan/rtl/tsan_interceptors_posix.cpp:1387](third_party/llvm/llvm-project/compiler-rt/lib/tsan/rtl/tsan_interceptors_posix.cpp?l=1387&cl=669089572):3 (be1eb158bb70fc9cf7be2db70407e512890e5c6e20720cd88c69d7d9c26ea531_0200d5f71908+0x1b12b6bb) (BuildId: ff25ace8b17d9863348bb1759c47246c)
    #1 __libcpp_recursive_mutex_unlock [third_party/crosstool/v18/stable/src/libcxx/include/__thread/support/pthread.h:87](third_party/crosstool/v18/stable/src/libcxx/include/__thread/support/pthread.h?l=87&cl=669089572):10 (be1eb158bb70fc9cf7be2db70407e512890e5c6e20720cd88c69d7d9c26ea531_0200d5f71908+0x4523d589) (BuildId: ff25ace8b17d9863348bb1759c47246c)
    #2 std::__tsan::recursive_mutex::unlock() [third_party/crosstool/v18/stable/src/libcxx/src/mutex.cpp:64](third_party/crosstool/v18/stable/src/libcxx/src/mutex.cpp?l=64&cl=669089572):11 (be1eb158bb70fc9cf7be2db70407e512890e5c6e20720cd88c69d7d9c26ea531_0200d5f71908+0x4523d589)
    #3 unlock [third_party/llvm/llvm-project/llvm/include/llvm/Support/Mutex.h:47](third_party/llvm/llvm-project/llvm/include/llvm/Support/Mutex.h?l=47&cl=669089572):16 (be1eb158bb70fc9cf7be2db70407e512890e5c6e20720cd88c69d7d9c26ea531_0200d5f71908+0x2bcaf968) (BuildId: ff25ace8b17d9863348bb1759c47246c)
    #4 ~lock_guard [third_party/crosstool/v18/stable/src/libcxx/include/__mutex/lock_guard.h:39](third_party/crosstool/v18/stable/src/libcxx/include/__mutex/lock_guard.h?l=39&cl=669089572):101 (be1eb158bb70fc9cf7be2db70407e512890e5c6e20720cd88c69d7d9c26ea531_0200d5f71908+0x2bcaf968)
    llvm#5 (anonymous namespace)::PerfJITEventListener::notifyObjectLoaded(unsigned long, llvm::object::ObjectFile const&, llvm::RuntimeDyld::LoadedObjectInfo const&) [third_party/llvm/llvm-project/llvm/lib/ExecutionEngine/PerfJITEvents/PerfJITEventListener.cpp:290](https://cs.corp.google.com/piper///depot/google3/third_party/llvm/llvm-project/llvm/lib/ExecutionEngine/PerfJITEvents/PerfJITEventListener.cpp?l=290&cl=669089572):1 (be1eb158bb70fc9cf7be2db70407e512890e5c6e20720cd88c69d7d9c26ea531_0200d5f71908+0x2bcaf968)
    llvm#6 llvm::orc::RTDyldObjectLinkingLayer::onObjEmit(llvm::orc::MaterializationResponsibility&, llvm::object::OwningBinary<llvm::object::ObjectFile>, std::__tsan::unique_ptr<llvm::RuntimeDyld::MemoryManager, std::__tsan::default_delete<llvm::RuntimeDyld::MemoryManager>>, std::__tsan::unique_ptr<llvm::RuntimeDyld::LoadedObjectInfo, std::__tsan::default_delete<llvm::RuntimeDyld::LoadedObjectInfo>>, std::__tsan::unique_ptr<llvm::DenseMap<llvm::orc::JITDylib*, llvm::DenseSet<llvm::orc::SymbolStringPtr, llvm::DenseMapInfo<llvm::orc::SymbolStringPtr, void>>, llvm::DenseMapInfo<llvm::orc::JITDylib*, void>, llvm::detail::DenseMapPair<llvm::orc::JITDylib*, llvm::DenseSet<llvm::orc::SymbolStringPtr, llvm::DenseMapInfo<llvm::orc::SymbolStringPtr, void>>>>, std::__tsan::default_delete<llvm::DenseMap<llvm::orc::JITDylib*, llvm::DenseSet<llvm::orc::SymbolStringPtr, llvm::DenseMapInfo<llvm::orc::SymbolStringPtr, void>>, llvm::DenseMapInfo<llvm::orc::JITDylib*, void>, llvm::detail::DenseMapPair<llvm::orc::JITDylib*, llvm::DenseSet<llvm::orc::SymbolStringPtr, llvm::DenseMapInfo<llvm::orc::SymbolStringPtr, void>>>>>>, llvm::Error) [third_party/llvm/llvm-project/llvm/lib/ExecutionEngine/Orc/RTDyldObjectLinkingLayer.cpp:386](https://cs.corp.google.com/piper///depot/google3/third_party/llvm/llvm-project/llvm/lib/ExecutionEngine/Orc/RTDyldObjectLinkingLayer.cpp?l=386&cl=669089572):10 (be1eb158bb70fc9cf7be2db70407e512890e5c6e20720cd88c69d7d9c26ea531_0200d5f71908+0x2bc404a8) (BuildId: ff25ace8b17d9863348bb1759c47246c)
```
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.