[SYCL] Implement hierarchical parallelism API. #1
LGTM.
This is the first part of the SYCL hierarchical parallelism implementation.
It implements the main related APIs:
- h_item class
- group::parallel_for_work_item functions
- handler::parallel_for_work_group functions

It is able to run workloads which use these APIs but do not contain data or code with group-visible side effects between the work group and work item scopes.

Signed-off-by: Konstantin S Bobrovsky <konstantin.s.bobrovsky@intel.com>
That looks good.
A few questions/comments nevertheless.
@@ -358,7 +353,7 @@ class accessor :
 template <int Dims = Dimensions>
 accessor(
 buffer<DataT, 1> &BufferRef,
-enable_if_t<(!IsPlaceH && (IsGlobalBuf || IsConstantBuf)) && Dims == 0,
+detail::enable_if_t<(!IsPlaceH && (IsGlobalBuf || IsConstantBuf)) && Dims == 0,
 handler> &CommandGroupHandler)
Adding `detail::` changed most of the indentation on the next line(s).
Also look at the other places below.
OK will apply clang-format.
@@ -398,7 +393,7 @@ class accessor :
 #endif

 template <int Dims = Dimensions,
-typename = enable_if_t<
+typename = detail::enable_if_t<
 (!IsPlaceH && (IsGlobalBuf || IsConstantBuf)) && (Dims > 0)>>
For example here...
setNDRangeLeftover<Dims_>();
template <int Dims_>
A comment will be appreciated. :-)
ok
@@ -72,14 +87,42 @@ class NDRDescT {
GlobalSize[I] = ExecutionRange.get_global_range()[I];
LocalSize[I] = ExecutionRange.get_local_range()[I];
GlobalOffset[I] = ExecutionRange.get_offset()[I];
NumWorkGroups[I] = 0;
When I read the code, I wonder what `NumWorkGroups` is here...
The NumWorkGroups field already has a comment. Do you also want a comment here?
I guess it is enough then.
void call(const NDRDescT &NDRDesc) override {
// adjust ND range for serial host:
NDRDescT R1;
bool Adjust = false;
No idea about what it is for...
OK, will add more comments
h_item<dimensions> hItem =
detail::Builder::createHItem<dimensions>(globalItem, localItem);

// iterate over flexible range with work group size stride; each item
Do we need a stride here? I thought the OpenCL range model was more about iterating by block.
Yes we do, to preserve semantics of the flexible range.
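To illustrate the strided iteration discussed here, a minimal host-only sketch follows. It is not the code from this PR: `visitFlexibleRange`, `LocalSize`, and `FlexRange` are hypothetical names, and the 1-D model simply assumes that each physical work item starts at its local id and advances by the work-group size, so a flexible range larger than the work group is still fully covered.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical 1-D model: a work group of LocalSize physical work items
// covers a flexible (logical) range of FlexRange elements. Each physical
// item starts at its local id and advances by the work-group size.
std::vector<std::vector<std::size_t>>
visitFlexibleRange(std::size_t LocalSize, std::size_t FlexRange) {
  assert(LocalSize > 0 && "work-group size must be positive");
  std::vector<std::vector<std::size_t>> Visited(LocalSize);
  for (std::size_t LocalId = 0; LocalId < LocalSize; ++LocalId)
    for (std::size_t I = LocalId; I < FlexRange; I += LocalSize)
      Visited[LocalId].push_back(I); // logical index handled by this item
  return Visited;
}
```

With 4 physical work items and a flexible range of 10, item 0 would handle logical indices 0, 4 and 8, while item 3 would handle 3 and 7. Iterating by contiguous blocks instead would assign different logical indices to each physical item.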
});
});
#endif // __SYCL_DEVICE_ONLY__
detail::workGroupBarrier();
Do we need yet-another-barrier at the end? Just one either at the beginning or at the end?
Yes.
There can be WG-scope code after the parallel_for_work_item (syntactically or dynamically, e.g. via a loop) which reads work-group local data written within this PFWI. I'll add a comment.
kernel_parallel_for_work_group<NameT, KernelType, Dims>(KernelFunc);
#else
MNDRDesc.setNumWorkGroups(NumWorkGroups);
StoreLambda<NameT, KernelType, Dims>(std::move(KernelFunc));
Perhaps `std::forward` and `KernelType && KernelFunc` above, or something like that, for perfect forwarding?
Right. I just followed the pattern set by other invocation APIs here. Should we file an issue here and fix all at once?
Yes, it is probably a more global issue and in the meantime you can skip the `std::move`.
The compiler can figure out this `std::move` by itself anyway since it is the last use of the variable passed by copy, I think.
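For reference, the two parameter-passing patterns discussed in this thread can be compared with a small standalone sketch. Everything here is hypothetical and unrelated to the actual SYCL handler code: `CopyCounter` is a toy functor that counts copies and moves, `storeByValue` models the existing by-value plus `std::move` pattern, and `storeForwarding` models the suggested forwarding-reference plus `std::forward` pattern.

```cpp
#include <cassert>
#include <type_traits>
#include <utility>

// Hypothetical functor that records how often it was copied or moved.
struct CopyCounter {
  int Copies = 0;
  int Moves = 0;
  CopyCounter() = default;
  CopyCounter(const CopyCounter &O) : Copies(O.Copies + 1), Moves(O.Moves) {}
  CopyCounter(CopyCounter &&O) noexcept
      : Copies(O.Copies), Moves(O.Moves + 1) {}
};

// Existing pattern: take the kernel by value, then move it into storage.
template <typename KernelType> KernelType storeByValue(KernelType KernelFunc) {
  return KernelType(std::move(KernelFunc));
}

// Suggested pattern: forwarding reference plus std::forward.
template <typename KernelType>
std::decay_t<KernelType> storeForwarding(KernelType &&KernelFunc) {
  return std::decay_t<KernelType>(std::forward<KernelType>(KernelFunc));
}

// Usage: for an lvalue kernel, both patterns copy once; the by-value
// pattern additionally moves the parameter into storage.
inline void compareStoragePatterns() {
  CopyCounter K;
  CopyCounter A = storeByValue(K);
  CopyCounter B = storeForwarding(K);
  assert(A.Copies == 1 && A.Moves >= 1);
  assert(B.Copies == 1);
}
```

In this toy model the difference is a single extra move, which is consistent with treating the change as a later, project-wide cleanup rather than something to fix in this PR.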
#else
MNDRDesc.setNumWorkGroups(NumWorkGroups);
MSyclKernel = detail::getSyclObjImpl(std::move(SyclKernel));
StoreLambda<NameT, KernelType, Dims>(std::move(KernelFunc));
Anyway, I have the feeling that these `std::move` calls are premature optimization for now; we can look at them later, with the big picture.
for (int I = Dims_; I < 3; ++I) {
GlobalSize[I] = 1;
LocalSize[I] = LocalSize[0] ? 1 : 0;
GlobalOffset[I] = 0;
NumWorkGroups[I] = 0;
I wonder how the compiler will optimize all these redundant allocations in the case of Dims = 1 or 2, instead of having exactly the right size...
Right, there may be some small inefficiency, but it is host-side only and we "buy" independence from the template parameter, as @romanovvlad commented in the original review. Do you think we need to file an issue on this?
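The trade-off mentioned here can be pictured with a simplified sketch. `NDRDescSketch` is a hypothetical stand-in, not the real `NDRDescT`: it always stores three entries per array, so the type itself does not depend on the number of dimensions, and the unused dimensions are padded with the same values as the loop in the diff above.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Hypothetical, simplified stand-in for the descriptor discussed above.
// The arrays always have three entries, independent of Dims.
struct NDRDescSketch {
  std::array<std::size_t, 3> GlobalSize{};
  std::array<std::size_t, 3> LocalSize{};
  std::array<std::size_t, 3> GlobalOffset{};
  std::size_t Dims = 0;

  template <std::size_t N>
  void set(const std::array<std::size_t, N> &Global,
           const std::array<std::size_t, N> &Local) {
    static_assert(N >= 1 && N <= 3, "1 to 3 dimensions supported");
    for (std::size_t I = 0; I < N; ++I) {
      GlobalSize[I] = Global[I];
      LocalSize[I] = Local[I];
      GlobalOffset[I] = 0;
    }
    // Pad the unused dimensions with neutral values.
    for (std::size_t I = N; I < 3; ++I) {
      GlobalSize[I] = 1;
      LocalSize[I] = LocalSize[0] ? 1 : 0;
      GlobalOffset[I] = 0;
    }
    Dims = N;
  }
};

// Usage: a 1-D range is stored as 64 x 1 x 1 with local size 8 x 1 x 1.
inline void exampleOneDimensionalRange() {
  NDRDescSketch D;
  std::array<std::size_t, 1> Global{{64}}, Local{{8}};
  D.set(Global, Local);
  assert(D.GlobalSize[1] == 1 && D.GlobalSize[2] == 1);
  assert(D.LocalSize[1] == 1 && D.LocalSize[2] == 1);
}
```

The cost is a few redundant stores on the host for 1-D and 2-D ranges; the benefit is that code consuming the descriptor does not need to be templated on the dimension count.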
Since the enabling fixes for the hierarchical parallelism API have been merged, I am moving the review back to the public repo per @bader's suggestion. Sorry for the inconvenience.
OK, moving somewhere else again then...
Syntax:
  asm [volatile] goto ( AssemblerTemplate : : InputOperands : Clobbers : GotoLabels)
  https://gcc.gnu.org/onlinedocs/gcc/Extended-Asm.html

The new LLVM IR instruction for inline asm goto is "callbr", instead of the "call" used for plain inline asm.

For:
  asm goto("testl %0, %0; jne %l1;" :: "r"(cond)::label_true, loop);
IR:
  callbr void asm sideeffect "testl $0, $0; jne ${1:l};", "r,X,X,~{dirflag},~{fpsr},~{flags}"(i32 %0, i8* blockaddress(@foo, %label_true), i8* blockaddress(@foo, %loop)) #1
      to label %asm.fallthrough [label %label_true, label %loop], !srcloc !3
  asm.fallthrough:

The compiler needs to generate:
1> a dummy constraint 'X' for each label.
2> a unique fallthrough label for each asm goto stmt, "asm.fallthrough%number".

Diagnostics:
1> duplicate asm operand names are used in output, input and label.
2> goto out of scope.

llvm-svn: 362045
Introduction
============

This patch adds initial support for bpf program compile once and run everywhere (CO-RE).

The main motivation is for bpf programs which depend on kernel headers that may vary between different kernel versions. The initial discussion can be found at https://lwn.net/Articles/773198/.

Currently, a bpf program accesses kernel internal data structures through the bpf_probe_read() helper. The idea is to capture the kernel data structure to be accessed through bpf_probe_read() and relocate the accesses on different kernel versions.

On each host, right before bpf program load, the bpfloader will look at the types of the native linux through vmlinux BTF, calculate the proper access offset and patch the instruction.

To accommodate this, three intrinsic functions
   preserve_{array,union,struct}_access_index
are introduced which in clang will preserve the base pointer, struct/union/array access_index and struct/union debuginfo type information. Later, a bpf IR pass can reconstruct the whole gep access chains without looking at gep itself.

This patch did the following:
  . An IR pass is added to convert preserve_*_access_index to a global variable whose name encodes the getelementptr access pattern. The global variable has metadata attached to describe the corresponding struct/union debuginfo type.
  . A SimplifyPatchable MachineInstruction pass is added to remove unnecessary loads.
  . The BTF output pass is enhanced to generate relocation records located in the .BTF.ext section.

Typical CO-RE also needs support for global variables which can be assigned different values on different hosts. For example, the kernel version can be used to guard different versions of code. This patch added the support for patchable externals as well.

Example
=======

The following is an example.
  struct pt_regs {
    long arg1;
    long arg2;
  };
  struct sk_buff {
    int i;
    struct net_device *dev;
  };

  #define _(x) (__builtin_preserve_access_index(x))
  static int (*bpf_probe_read)(void *dst, int size, const void *unsafe_ptr) =
      (void *) 4;
  extern __attribute__((section(".BPF.patchable_externs"))) unsigned __kernel_version;
  int bpf_prog(struct pt_regs *ctx) {
    struct net_device *dev = 0;

    // ctx->arg* does not need bpf_probe_read
    if (__kernel_version >= 41608)
      bpf_probe_read(&dev, sizeof(dev), _(&((struct sk_buff *)ctx->arg1)->dev));
    else
      bpf_probe_read(&dev, sizeof(dev), _(&((struct sk_buff *)ctx->arg2)->dev));
    return dev != 0;
  }

In the above, we want to translate the third argument of bpf_probe_read() as relocations.

  -bash-4.4$ clang -target bpf -O2 -g -S trace.c

The compiler will generate two new subsections in .BTF.ext, OffsetReloc and ExternReloc. OffsetReloc is to record the structure member offset operations, and ExternReloc is to record the external globals where only u8, u16, u32 and u64 are supported.

   BPFOffsetReloc Size
   struct SecOffsetReloc for ELF section #1
   A number of struct BPFOffsetReloc for ELF section #1
   struct SecOffsetReloc for ELF section #2
   A number of struct BPFOffsetReloc for ELF section #2
   ...
   BPFExternReloc Size
   struct SecExternReloc for ELF section #1
   A number of struct BPFExternReloc for ELF section #1
   struct SecExternReloc for ELF section #2
   A number of struct BPFExternReloc for ELF section #2

  struct BPFOffsetReloc {
    uint32_t InsnOffset;    ///< Byte offset in this section
    uint32_t TypeID;        ///< TypeID for the relocation
    uint32_t OffsetNameOff; ///< The string to traverse types
  };

  struct BPFExternReloc {
    uint32_t InsnOffset;    ///< Byte offset in this section
    uint32_t ExternNameOff; ///< The string for external variable
  };

Note that only externs with attribute section ".BPF.patchable_externs" are considered for Extern Reloc which will be patched by bpf loader right before the load.
For the above test case, two offset records and one extern record will be generated:

  OffsetReloc records:
        .long   .Ltmp12                 # Insn Offset
        .long   7                       # TypeId
        .long   242                     # Type Decode String
        .long   .Ltmp18                 # Insn Offset
        .long   7                       # TypeId
        .long   242                     # Type Decode String

  ExternReloc record:
        .long   .Ltmp5                  # Insn Offset
        .long   165                     # External Variable

  In string table:
        .ascii  "0:1"                   # string offset=242
        .ascii  "__kernel_version"      # string offset=165

The default member offset can be calculated as the 2nd member offset (0 representing the 1st member) of struct "sk_buff".

The asm code:
    .Ltmp5:
    .Ltmp6:
            r2 = 0
            r3 = 41608
    .Ltmp7:
    .Ltmp8:
            .loc    1 18 9 is_stmt 0        # t.c:18:9
    .Ltmp9:
            if r3 > r2 goto LBB0_2
    .Ltmp10:
    .Ltmp11:
            .loc    1 0 9                   # t.c:0:9
    .Ltmp12:
            r2 = 8
    .Ltmp13:
            .loc    1 19 66 is_stmt 1       # t.c:19:66
    .Ltmp14:
    .Ltmp15:
            r3 = *(u64 *)(r1 + 0)
            goto LBB0_3
    .Ltmp16:
    .Ltmp17:
    LBB0_2:
            .loc    1 0 66 is_stmt 0        # t.c:0:66
    .Ltmp18:
            r2 = 8
            .loc    1 21 66 is_stmt 1       # t.c:21:66
    .Ltmp19:
            r3 = *(u64 *)(r1 + 8)
    .Ltmp20:
    .Ltmp21:
    LBB0_3:
            .loc    1 0 66 is_stmt 0        # t.c:0:66
            r3 += r2
            r1 = r10
    .Ltmp22:
    .Ltmp23:
    .Ltmp24:
            r1 += -8
            r2 = 8
            call 4

For instructions .Ltmp12 and .Ltmp18, "r2 = 8", the number 8 is the structure offset based on the current BTF. The loader needs to adjust it if it changes on the host.

For instruction .Ltmp5, "r2 = 0", the external variable got a default value 0; the loader needs to supply an appropriate value for the particular host.
Compiling to generate object code and disassemble:
   0000000000000000 bpf_prog:
           0: b7 02 00 00 00 00 00 00         r2 = 0
           1: 7b 2a f8 ff 00 00 00 00         *(u64 *)(r10 - 8) = r2
           2: b7 02 00 00 00 00 00 00         r2 = 0
           3: b7 03 00 00 88 a2 00 00         r3 = 41608
           4: 2d 23 03 00 00 00 00 00         if r3 > r2 goto +3 <LBB0_2>
           5: b7 02 00 00 08 00 00 00         r2 = 8
           6: 79 13 00 00 00 00 00 00         r3 = *(u64 *)(r1 + 0)
           7: 05 00 02 00 00 00 00 00         goto +2 <LBB0_3>

    0000000000000040 LBB0_2:
           8: b7 02 00 00 08 00 00 00         r2 = 8
           9: 79 13 08 00 00 00 00 00         r3 = *(u64 *)(r1 + 8)

    0000000000000050 LBB0_3:
          10: 0f 23 00 00 00 00 00 00         r3 += r2
          11: bf a1 00 00 00 00 00 00         r1 = r10
          12: 07 01 00 00 f8 ff ff ff         r1 += -8
          13: b7 02 00 00 08 00 00 00         r2 = 8
          14: 85 00 00 00 04 00 00 00         call 4

Instructions #2, #5 and #8 need relocation resolutions from the loader.

Signed-off-by: Yonghong Song <yhs@fb.com>

Differential Revision: https://reviews.llvm.org/D61524

llvm-svn: 365503
…t binding

This fixes a failing testcase on Fedora 30 x86_64 (regression Fedora 29->30):

PASS:
./bin/lldb ./lldb-test-build.noindex/functionalities/unwind/noreturn/TestNoreturnUnwind.test_dwarf/a.out -o 'settings set symbols.enable-external-lookup false' -o r -o bt -o quit
  * frame #0: 0x00007ffff7aa6e75 libc.so.6`__GI_raise + 325
    frame #1: 0x00007ffff7a91895 libc.so.6`__GI_abort + 295
    frame #2: 0x0000000000401140 a.out`func_c at main.c:12:2
    frame #3: 0x000000000040113a a.out`func_b at main.c:18:2
    frame #4: 0x0000000000401134 a.out`func_a at main.c:26:2
    frame #5: 0x000000000040112e a.out`main(argc=<unavailable>, argv=<unavailable>) at main.c:32:2
    frame #6: 0x00007ffff7a92f33 libc.so.6`__libc_start_main + 243
    frame #7: 0x000000000040106e a.out`_start + 46

vs.

FAIL - unrecognized abort() function:
./bin/lldb ./lldb-test-build.noindex/functionalities/unwind/noreturn/TestNoreturnUnwind.test_dwarf/a.out -o 'settings set symbols.enable-external-lookup false' -o r -o bt -o quit
  * frame #0: 0x00007ffff7aa6e75 libc.so.6`.annobin_raise.c + 325
    frame #1: 0x00007ffff7a91895 libc.so.6`.annobin_loadmsgcat.c_end.unlikely + 295
    frame #2: 0x0000000000401140 a.out`func_c at main.c:12:2
    frame #3: 0x000000000040113a a.out`func_b at main.c:18:2
    frame #4: 0x0000000000401134 a.out`func_a at main.c:26:2
    frame #5: 0x000000000040112e a.out`main(argc=<unavailable>, argv=<unavailable>) at main.c:32:2
    frame #6: 0x00007ffff7a92f33 libc.so.6`.annobin_libc_start.c + 243
    frame #7: 0x000000000040106e a.out`.annobin_init.c.hot + 46

The extra ELF symbols are there due to Annobin (I did not investigate why this problem happened specifically since F-30 and not since F-28).

It is due to:

Symbol table '.dynsym' contains 2361 entries:
Value            Size Type   Bind   Vis     Name
0000000000022769    5 FUNC   LOCAL  DEFAULT _nl_load_domain.cold
000000000002276e    0 NOTYPE LOCAL  HIDDEN  .annobin_abort.c.unlikely
...
000000000002276e    0 NOTYPE LOCAL  HIDDEN  .annobin_loadmsgcat.c_end.unlikely
...
000000000002276e    0 NOTYPE LOCAL  HIDDEN  .annobin_textdomain.c_end.unlikely
000000000002276e  548 FUNC   GLOBAL DEFAULT abort
000000000002276e  548 FUNC   GLOBAL DEFAULT abort@@GLIBC_2.2.5
000000000002276e  548 FUNC   LOCAL  DEFAULT __GI_abort
0000000000022992    0 NOTYPE LOCAL  HIDDEN  .annobin_abort.c_end.unlikely

GDB has some more complicated preferences between overlapping and/or address-sharing symbols; so far I have made the simplest fix for this case here.

Differential revision: https://reviews.llvm.org/D63540
TSan spuriously reports for any OpenMP application a race on the initialization of a runtime internal mutex:

```
Atomic read of size 1 at 0x7b6800005940 by thread T4:
  #0 pthread_mutex_lock <null> (a.out+0x43f39e)
  #1 __kmp_resume_64 <null> (libomp.so.5+0x84db4)

Previous write of size 1 at 0x7b6800005940 by thread T7:
  #0 pthread_mutex_init <null> (a.out+0x424793)
  #1 __kmp_suspend_initialize_thread <null> (libomp.so.5+0x8422e)
```

According to @AndreyChurbanov this is a false positive report, as the control flow of the runtime guarantees the ordering of the mutex initialization and the lock:
https://software.intel.com/en-us/forums/intel-open-source-openmp-runtime-library/topic/530363

To suppress this report, I suggest the use of TSAN_OPTIONS='ignore_uninstrumented_modules=1'. With this patch, a runtime warning is provided in case an OpenMP application is built with Tsan and executed without this Tsan-option.

Reviewed By: jdoerfert

Differential Revision: https://reviews.llvm.org/D70412
The test is currently failing on some systems with ASAN enabled due to:

```
==22898==ERROR: AddressSanitizer: heap-buffer-overflow on address 0x603000003da4 at pc 0x00010951c33d bp 0x7ffee6709e00 sp 0x7ffee67095c0
READ of size 5 at 0x603000003da4 thread T0
    #0 0x10951c33c in wrap_memmove+0x16c (libclang_rt.asan_osx_dynamic.dylib:x86_64+0x1833c)
    #1 0x7fff4a327f57 in CFDataReplaceBytes+0x1ba (CoreFoundation:x86_64+0x13f57)
    #2 0x7fff4a415a44 in __CFDataInit+0x2db (CoreFoundation:x86_64+0x101a44)
    #3 0x1094f8490 in main main.m:424
    #4 0x7fff77482084 in start+0x0 (libdyld.dylib:x86_64+0x17084)

0x603000003da4 is located 0 bytes to the right of 20-byte region [0x603000003d90,0x603000003da4)
allocated by thread T0 here:
    #0 0x109547c02 in wrap_calloc+0xa2 (libclang_rt.asan_osx_dynamic.dylib:x86_64+0x43c02)
    #1 0x7fff763ad3ef in class_createInstance+0x52 (libobjc.A.dylib:x86_64+0x73ef)
    #2 0x7fff4c6b2d73 in NSAllocateObject+0x12 (Foundation:x86_64+0x1d73)
    #3 0x7fff4c6b5e5f in -[_NSPlaceholderData initWithBytes:length:copy:deallocator:]+0x40 (Foundation:x86_64+0x4e5f)
    #4 0x7fff4c6d4cf1 in -[NSData(NSData) initWithBytes:length:]+0x24 (Foundation:x86_64+0x23cf1)
    #5 0x1094f8245 in main main.m:404
    #6 0x7fff77482084 in start+0x0 (libdyld.dylib:x86_64+0x17084)
```

The reason is that we create a string "HELLO" but get the size wrong (it's 5 bytes instead of 4). Later on we read the buffer and pretend it is 5 bytes long, causing an OOB read which ASAN detects.

In general this test probably needs some cleanup as it produces on macOS 10.15 around 100 compiler warnings which isn't great, but let's first get the bot green.
Is this still relevant? Should we close this PR?
I'll close this. But this is my private clone, so it should not matter, right?
CONFLICT (content): Merge conflict in clang/lib/Sema/SemaChecking.cpp
CONFLICT (content): Merge conflict in clang/include/clang/Basic/DiagnosticDriverKinds.td
This patch re-introduces the fix in the commit llvm/llvm-project@66b0cebf7f736 by @yrnkrn

> In DwarfEHPrepare, after all passes are run, RewindFunction may be a dangling
> pointer to a dead function. To make sure it's valid, doFinalization nullptrs
> RewindFunction just like the constructor and so it will be found on next run.
>
> llvm-svn: 217737

It seems that the fix was not migrated to `DwarfEHPrepareLegacyPass`. This patch also updates `llvm/test/CodeGen/X86/dwarf-eh-prepare.ll` to include `-run-twice` to exercise the cleanup. Without this patch `llvm-lit -v llvm/test/CodeGen/X86/dwarf-eh-prepare.ll` fails with

```
-- Testing: 1 tests, 1 workers --
FAIL: LLVM :: CodeGen/X86/dwarf-eh-prepare.ll (1 of 1)
******************** TEST 'LLVM :: CodeGen/X86/dwarf-eh-prepare.ll' FAILED ********************
Script:
--
: 'RUN: at line 1'; /home/arakaki/build/llvm-project/main/bin/opt -mtriple=x86_64-linux-gnu -dwarfehprepare -simplifycfg-require-and-preserve-domtree=1 -run-twice < /home/arakaki/repos/watch/llvm-project/llvm/test/CodeGen/X86/dwarf-eh-prepare.ll -S | /home/arakaki/build/llvm-project/main/bin/FileCheck /home/arakaki/repos/watch/llvm-project/llvm/test/CodeGen/X86/dwarf-eh-prepare.ll
--
Exit Code: 2

Command Output (stderr):
--
Referencing function in another module!
  call void @_Unwind_Resume(i8* %ehptr) #1
; ModuleID = '<stdin>'
void (i8*)* @_Unwind_Resume
; ModuleID = '<stdin>'
in function simple_cleanup_catch
LLVM ERROR: Broken function found, compilation aborted!
PLEASE submit a bug report to https://bugs.llvm.org/ and include the crash backtrace.
Stack dump:
0. Program arguments: /home/arakaki/build/llvm-project/main/bin/opt -mtriple=x86_64-linux-gnu -dwarfehprepare -simplifycfg-require-and-preserve-domtree=1 -run-twice -S
1. Running pass 'Function Pass Manager' on module '<stdin>'.
2. Running pass 'Module Verifier' on function '@simple_cleanup_catch'
 #0 0x000056121b570a2c llvm::sys::PrintStackTrace(llvm::raw_ostream&, int) /home/arakaki/repos/watch/llvm-project/llvm/lib/Support/Unix/Signals.inc:569:0
 #1 0x000056121b56eb64 llvm::sys::RunSignalHandlers() /home/arakaki/repos/watch/llvm-project/llvm/lib/Support/Signals.cpp:97:0
 #2 0x000056121b56f28e SignalHandler(int) /home/arakaki/repos/watch/llvm-project/llvm/lib/Support/Unix/Signals.inc:397:0
 #3 0x00007fc7e9b22980 __restore_rt (/lib/x86_64-linux-gnu/libpthread.so.0+0x12980)
 #4 0x00007fc7e87d3fb7 raise /build/glibc-S7xCS9/glibc-2.27/signal/../sysdeps/unix/sysv/linux/raise.c:51:0
 #5 0x00007fc7e87d5921 abort /build/glibc-S7xCS9/glibc-2.27/stdlib/abort.c:81:0
 #6 0x000056121b4e1386 llvm::raw_svector_ostream::raw_svector_ostream(llvm::SmallVectorImpl<char>&) /home/arakaki/repos/watch/llvm-project/llvm/include/llvm/Support/raw_ostream.h:674:0
 #7 0x000056121b4e1386 llvm::report_fatal_error(llvm::Twine const&, bool) /home/arakaki/repos/watch/llvm-project/llvm/lib/Support/ErrorHandling.cpp:114:0
 #8 0x000056121b4e1528 (/home/arakaki/build/llvm-project/main/bin/opt+0x29e3528)
 #9 0x000056121adfd03f llvm::raw_ostream::operator<<(llvm::StringRef) /home/arakaki/repos/watch/llvm-project/llvm/include/llvm/Support/raw_ostream.h:218:0
FileCheck error: '<stdin>' is empty.
FileCheck command line: /home/arakaki/build/llvm-project/main/bin/FileCheck /home/arakaki/repos/watch/llvm-project/llvm/test/CodeGen/X86/dwarf-eh-prepare.ll

--

********************
********************
Failed Tests (1):
  LLVM :: CodeGen/X86/dwarf-eh-prepare.ll

Testing Time: 0.22s
  Failed: 1
```

Reviewed By: loladiro

Differential Revision: https://reviews.llvm.org/D110979
When inserting a scalable subvector into a scalable vector through the stack, the index to store to needs to be scaled by vscale. Before this patch, that didn't yet happen, so it would generate the wrong offset, thus storing a subvector to the incorrect address and overwriting the wrong lanes.

For some insert:
  nxv8f16 insert_subvector(nxv8f16 %vec, nxv2f16 %subvec, i64 2)

The offset was not scaled by vscale:
  orr     x8, x8, #0x4
  st1h    { z0.h }, p0, [sp]
  st1h    { z1.d }, p1, [x8]
  ld1h    { z0.h }, p0/z, [sp]

And is changed to:
  mov     x8, sp
  st1h    { z0.h }, p0, [sp]
  st1h    { z1.d }, p1, [x8, #1, mul vl]
  ld1h    { z0.h }, p0/z, [sp]

Differential Revision: https://reviews.llvm.org/D111633
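The arithmetic behind this fix can be written down as a small sketch. The function names below are hypothetical and this is not the SelectionDAG code; `VScale` is only known at run time on real hardware and is passed as a plain parameter here for illustration. The sketch assumes the insert index counts elements at vscale == 1, as in the `insert_subvector` example above.

```cpp
#include <cassert>
#include <cstdint>

// Byte offset of a scalable subvector inserted at element index Idx.
// Idx is expressed in elements at vscale == 1, so the runtime offset has
// to be multiplied by VScale.
constexpr std::uint64_t scalableInsertOffsetBytes(std::uint64_t Idx,
                                                  std::uint64_t EltBytes,
                                                  std::uint64_t VScale) {
  return Idx * VScale * EltBytes;
}

// The unscaled computation, which only agrees when VScale == 1.
constexpr std::uint64_t unscaledOffsetBytes(std::uint64_t Idx,
                                            std::uint64_t EltBytes) {
  return Idx * EltBytes;
}

// Usage: index 2 with 2-byte f16 elements.
inline void exampleOffsets() {
  assert(unscaledOffsetBytes(2, 2) == 4);          // the fixed "#0x4"
  assert(scalableInsertOffsetBytes(2, 2, 1) == 4); // agrees at vscale 1
  assert(scalableInsertOffsetBytes(2, 2, 2) == 8); // diverges at vscale 2
}
```

For any vscale greater than 1, the unscaled offset points into the middle of the lanes that belong to the first part of the vector, which matches the "overwriting the wrong lanes" symptom described in the commit message.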
PPC64 bot failed with the following error. The buildbot output is not particularly useful, but looking at other similar tests, it seems that there is something broken in free stacks on PPC64. Use the same hack as other tests use to expect an additional stray frame. /home/buildbots/ppc64le-clang-lnt-test/clang-ppc64le-lnt/llvm/compiler-rt/test/tsan/free_race3.c:28:11: error: CHECK: expected string not found in input // CHECK: Previous write of size 4 at {{.*}} by thread T1{{.*}}: ^ <stdin>:13:9: note: scanning from here #1 main /home/buildbots/ppc64le-clang-lnt-test/clang-ppc64le-lnt/llvm/compiler-rt/test/tsan/free_race3.c:17:3 (free_race3.c.tmp+0x1012fab8) ^ <stdin>:17:2: note: possible intended match here ThreadSanitizer: reported 1 warnings ^ Input file: <stdin> Check file: /home/buildbots/ppc64le-clang-lnt-test/clang-ppc64le-lnt/llvm/compiler-rt/test/tsan/free_race3.c -dump-input=help explains the following input dump. Input was: <<<<<< . . . 8: Previous write of size 4 at 0x7ffff4d01ab0 by thread T1: 9: #0 Thread /home/buildbots/ppc64le-clang-lnt-test/clang-ppc64le-lnt/llvm/compiler-rt/test/tsan/free_race3.c:8:10 (free_race3.c.tmp+0x1012f9dc) 10: 11: Thread T1 (tid=3222898, finished) created by main thread at: 12: #0 pthread_create /home/buildbots/ppc64le-clang-lnt-test/clang-ppc64le-lnt/llvm/compiler-rt/lib/tsan/rtl/tsan_interceptors_posix.cpp:1001:3 (free_race3.c.tmp+0x100b9040) 13: #1 main /home/buildbots/ppc64le-clang-lnt-test/clang-ppc64le-lnt/llvm/compiler-rt/test/tsan/free_race3.c:17:3 (free_race3.c.tmp+0x1012fab8) check:28'0 X~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ error: no match found 14: check:28'0 ~ 15: SUMMARY: ThreadSanitizer: data race /home/buildbots/ppc64le-clang-lnt-test/clang-ppc64le-lnt/llvm/compiler-rt/test/tsan/free_race3.c:19:3 in main check:28'0 
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 16: ================== check:28'0 ~~~~~~~~~~~~~~~~~~~ 17: ThreadSanitizer: reported 1 warnings check:28'0 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ check:28'1 ? possible intended match >>>>>> Reviewed By: melver Differential Revision: https://reviews.llvm.org/D112444
…turn to external addr part)

Previously we had an issue with artificial LBRs whose source is a return. Recall that "an internal code(A) can return to external address, then from the external address call a new internal code(B), making an artificial branch that looks like a return from A to B can confuse the unwinder". We just ignored the LBRs after this artificial LBR, which can miss some samples. This change aims at fixing this by correctly unwinding them instead of ignoring them.

Some typical scenarios covered by this change:

1) Multiple sequential callbacks happen in external address, e.g.

```
[ext, call, foo] [foo, return, ext] [ext, call, bar]
```

The unwinder should avoid having foo return from bar. A wrong call stack looks like [foo, bar].

2) The call stack before and after an external call should be correctly unwound.

```
{call stack1} {call stack2}
[foo, call, ext] [ext, call, bar] [bar, return, ext] [ext, return, foo ]
```

Call stack 1 should be the same as call stack 2. Neither should be truncated.

3) The call stack should be truncated after a call into external code since we can't do inlining with external code.

```
[foo, call, ext] [ext, call, bar] [bar, call, baz] [baz, return, bar ] [bar, return, ext]
```

The call stack of code in baz should not include foo.

### Implementation:

We leverage artificial frames to fix #2 and #3: when we get a return artificial LBR, we push an extra artificial frame onto the stack. When we pop a frame, we check whether the parent is an artificial frame to pop (fix #2). Therefore, a call/return artificial LBR is handled just the same as a regular LBR, which keeps the call stack.

While recording context on the trie, the artificial frame is used as a tag indicating that we should truncate the call stack (fix #3).

To differentiate #1 and #2, we leverage `getCallAddrFromFrameAddr`. Normally the target of the return should be the next inst of a call inst and `getCallAddrFromFrameAddr` will return the address of the call inst. Otherwise, `getCallAddrFromFrameAddr` will return 0, which is the case of #1.

Reviewed By: hoy, wenlei

Differential Revision: https://reviews.llvm.org/D115550
…ce characters in lookup names when parsing the ctu index file

This error was found when analyzing MySQL with CTU enabled.

When there are space characters in the lookup name, the current delimiter searching strategy will make the file path wrongly parsed. And when two lookup names have the same prefix before their first space characters, a 'multiple definitions' error will be wrongly reported.

e.g. The lookup names for the two lambda exprs in the test case are `c:@s@G@F@G#@sa@F@operator int (*)(char)#1` and `c:@s@G@F@G#@sa@F@operator bool (*)(char)#1` respectively. And their prefixes are both `c:@s@G@F@G#@sa@F@operator` when using the first space character as the delimiter.

The problem is solved by adding a length for the lookup name, making the index items in the format of `USR-Length:USR File-Path`.

Reviewed By: steakhal

Differential Revision: https://reviews.llvm.org/D102669
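A length-prefixed line in the `USR-Length:USR File-Path` format can be parsed without searching for a space delimiter at all. The following is a minimal sketch under that assumption; `parseIndexLine` is a hypothetical helper, not the function used in clang's CrossTranslationUnit code, and the USR and path in the usage example are made up.

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <string>
#include <utility>

// Parses one index line of the form "<Length>:<USR> <File-Path>".
// Returns {USR, FilePath}, or std::nullopt if the line is malformed.
std::optional<std::pair<std::string, std::string>>
parseIndexLine(const std::string &Line) {
  const std::size_t Colon = Line.find(':');
  if (Colon == std::string::npos || Colon == 0)
    return std::nullopt;
  std::size_t Length = 0;
  for (std::size_t I = 0; I < Colon; ++I) {
    if (Line[I] < '0' || Line[I] > '9')
      return std::nullopt; // the prefix must be a decimal number
    Length = Length * 10 + static_cast<std::size_t>(Line[I] - '0');
  }
  const std::size_t UsrBegin = Colon + 1;
  // The USR must be followed by at least a delimiter character.
  if (Length == 0 || Length >= Line.size() - UsrBegin)
    return std::nullopt;
  const std::size_t Delim = UsrBegin + Length;
  if (Line[Delim] != ' ' || Delim + 1 >= Line.size())
    return std::nullopt; // missing delimiter or empty path
  return std::make_pair(Line.substr(UsrBegin, Length),
                        Line.substr(Delim + 1));
}

// Usage: a USR containing spaces survives because its length is explicit.
inline void exampleUsage() {
  const std::string Usr = "c:@F@operator int (*)(char)#1"; // made-up USR
  const auto Parsed =
      parseIndexLine(std::to_string(Usr.size()) + ":" + Usr + " dir/a.cpp");
  assert(Parsed && Parsed->first == Usr && Parsed->second == "dir/a.cpp");
}
```

Because the USR is read by length rather than up to the first space, two lookup names that share a prefix before their first space no longer collide.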
…he parser"

This reverts commit b0e8667.

ASAN/UBSAN bot is broken with this trace:

[ RUN ] FlatAffineConstraintsTest.FindSampleTest
llvm-project/mlir/include/mlir/Support/MathExtras.h:27:15: runtime error: signed integer overflow: 1229996100002 * 809999700000 cannot be represented in type 'long'
    #0 0x7f63ace960e4 in mlir::ceilDiv(long, long) llvm-project/mlir/include/mlir/Support/MathExtras.h:27:15
    #1 0x7f63ace8587e in ceil llvm-project/mlir/include/mlir/Analysis/Presburger/Fraction.h:57:42
    #2 0x7f63ace8587e in operator* llvm-project/llvm/include/llvm/ADT/STLExtras.h:347:42
    #3 0x7f63ace8587e in uninitialized_copy<llvm::mapped_iterator<mlir::Fraction *, long (*)(mlir::Fraction), long>, long *> include/c++/v1/__memory/uninitialized_algorithms.h:36:62
    #4 0x7f63ace8587e in uninitialized_copy<llvm::mapped_iterator<mlir::Fraction *, long (*)(mlir::Fraction), long>, long *> llvm-project/llvm/include/llvm/ADT/SmallVector.h:490:5
    #5 0x7f63ace8587e in append<llvm::mapped_iterator<mlir::Fraction *, long (*)(mlir::Fraction), long>, void> llvm-project/llvm/include/llvm/ADT/SmallVector.h:662:5
    #6 0x7f63ace8587e in SmallVector<llvm::mapped_iterator<mlir::Fraction *, long (*)(mlir::Fraction), long> > llvm-project/llvm/include/llvm/ADT/SmallVector.h:1204:11
    #7 0x7f63ace8587e in mlir::FlatAffineConstraints::findIntegerSample() const llvm-project/mlir/lib/Analysis/AffineStructures.cpp:1171:27
    #8 0x7f63ae95a84d in mlir::checkSample(bool, mlir::FlatAffineConstraints const&, mlir::TestFunction) llvm-project/mlir/unittests/Analysis/AffineStructuresTest.cpp:37:23
    #9 0x7f63ae957545 in mlir::FlatAffineConstraintsTest_FindSampleTest_Test::TestBody() llvm-project/mlir/unittests/Analysis/AffineStructuresTest.cpp:222:3
…se of OpenMP task construct

Currently variables appearing inside the shared clause of an OpenMP task construct are not visible inside the lldb debugger.

After the current patch, lldb is able to show the variable:

```
* thread #1, name = 'a.out', stop reason = breakpoint 1.1
    frame #0: 0x0000000000400934 a.out`.omp_task_entry. [inlined] .omp_outlined.(.global_tid.=0, .part_id.=0x000000000071f0d0, .privates.=0x000000000071f0e8, .copy_fn.=(a.out`.omp_task_privates_map. at testshared.cxx:8), .task_t.=0x000000000071f0c0, __context=0x000000000071f0f0) at testshared.cxx:10:34
   7      else {
   8    #pragma omp task shared(svar) firstprivate(n)
   9        {
-> 10         printf("Task svar = %d\n", svar);
   11         printf("Task n = %d\n", n);
   12         svar = fib(n - 1);
   13       }
(lldb) p svar
(int) $0 = 9
```

Reviewed By: djtodoro

Differential Revision: https://reviews.llvm.org/D115510
We experienced some deadlocks when logging multithreaded builds (e.g. `make -j16`) with `scan-build`'s intercept-build tool. ``` (gdb) bt #0 0x00007f2bb3aff110 in __lll_lock_wait () from /lib/x86_64-linux-gnu/libpthread.so.0 #1 0x00007f2bb3af70a3 in pthread_mutex_lock () from /lib/x86_64-linux-gnu/libpthread.so.0 #2 0x00007f2bb3d152e4 in ?? () #3 0x00007ffcc5f0cc80 in ?? () #4 0x00007f2bb3d2bf5b in ?? () from /lib64/ld-linux-x86-64.so.2 #5 0x00007f2bb3b5da27 in ?? () from /lib/x86_64-linux-gnu/libc.so.6 #6 0x00007f2bb3b5dbe0 in exit () from /lib/x86_64-linux-gnu/libc.so.6 #7 0x00007f2bb3d144ee in ?? () #8 0x746e692f706d742f in ?? () #9 0x692d747065637265 in ?? () #10 0x2f653631326b3034 in ?? () #11 0x646d632e35353532 in ?? () #12 0x0000000000000000 in ?? () ``` I think gcc's exit call caused the injected `libear.so` to be unloaded by `ld`, which in turn called `void on_unload() __attribute__((destructor))`. That tried to acquire an already locked mutex, which had been left locked by the `bear_report_call()` call: it probably encountered some error and returned early without unlocking the mutex. All of this is speculation, since from the backtrace I could not verify whether frames 2 and 3 in fact correspond to the `libear.so` module. But I think it's a fairly safe bet. So, hereby I'm releasing the held mutex on *all paths*, even if some failure happens. PS: I would use lock_guards, but it's C. Reviewed-by: NoQ Differential Revision: https://reviews.llvm.org/D118439
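The "release on all paths" idea can be sketched as a single-exit function. This is only an illustration under assumptions: `report_call`, `report_mutex`, and the `simulate_failure` parameter are hypothetical names, not taken from `libear`, and the real fix is in C rather than C++.

```cpp
#include <cassert>
#include <cstdio>
#include <mutex>

// Hypothetical stand-in for the mutex guarding the report file.
static std::mutex report_mutex;

// Returns 0 on success and -1 on failure. Every path, including the early
// failure path, funnels through the single unlock at the end, so the mutex
// can never be left locked.
int report_call(bool simulate_failure) {
  int result = -1;
  report_mutex.lock();
  do {
    if (simulate_failure)
      break; // early exit still reaches the unlock below
    std::puts("report written");
    result = 0;
  } while (false);
  report_mutex.unlock();
  return result;
}
```

A failed call followed by a successful one does not block, which is the property the patch restores. In C++ code a `std::lock_guard` would give the same guarantee with less ceremony, as the commit message notes.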
llvm.insertvalue and llvm.extractvalue need an LLVM primitive type for the indexing operands. While upstreaming the TargetRewrite pass, the type was changed from i32 to index without knowing this restriction. This patch changes the indexing types used by the two ops created in this pass back to i32. The error you will receive when lowering to LLVM IR with the current code is the following: ``` 'llvm.insertvalue' op operand #1 must be primitive LLVM type, but got 'index' ``` Reviewed By: jeanPerier, schweitz Differential Revision: https://reviews.llvm.org/D119253
There is a clangd crash at `__memcmp_avx2_movbe`. A short problem description is below. The method `HeaderIncludes::addExistingInclude` stores `Include` objects by reference in 2 places: `ExistingIncludes` (primary storage) and `IncludesByPriority` (pointer to the object's location in `ExistingIncludes`). `ExistingIncludes` is a map whose value is a `SmallVector`. A new element is inserted by `push_back`. The operation might resize the vector. As a result, pointers stored in `IncludesByPriority` might become invalid. Typical stack trace ``` frame #0: 0x00007f11460dcd94 libc.so.6`__memcmp_avx2_movbe + 308 frame #1: 0x00000000004782b8 clangd`llvm::StringRef::compareMemory(Lhs=" \"t2.h\"", Rhs="", Length=6) at StringRef.h:76:22 frame #2: 0x0000000000701253 clangd`llvm::StringRef::compare(this=0x00007f10de7d8610, RHS=(Data = "", Length = 7166742329480737377)) const at StringRef.h:206:34 * frame #3: 0x00000000007603ab clangd`llvm::operator<(llvm::StringRef, llvm::StringRef)(LHS=(Data = "\"t2.h\"", Length = 6), RHS=(Data = "", Length = 7166742329480737377)) at StringRef.h:907:23 frame #4: 0x0000000002d0ad9f clangd`clang::tooling::HeaderIncludes::insert(this=0x00007f10de7fb1a0, IncludeName=(Data = "t2.h\"", Length = 4), IsAngled=false) const at HeaderIncludes.cpp:365:22 frame #5: 0x00000000012ebfdd clangd`clang::clangd::IncludeInserter::insert(this=0x00007f10de7fb148, VerbatimHeader=(Data = "\"t2.h\"", Length = 6)) const at Headers.cpp:262:70 ``` A unit test for the crash was created (`HeaderIncludesTest.RepeatedIncludes`). The proposed solution is to use std::list instead of llvm::SmallVector. Test Plan ``` ./tools/clang/unittests/Tooling/ToolingTests --gtest_filter=HeaderIncludesTest.RepeatedIncludes ``` Reviewed By: sammccall Differential Revision: https://reviews.llvm.org/D118755
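Why `std::list` fixes the dangling pointers can be shown with a small model of the two-level storage. This is a sketch under assumptions: `IncludeIndex`, `add`, and `withPriority` are hypothetical names, and the standard containers stand in for the LLVM ones used by `HeaderIncludes`.

```cpp
#include <cassert>
#include <list>
#include <map>
#include <string>
#include <vector>

// Hypothetical, simplified stand-in for the Include record.
struct Include {
  std::string Name;
};

class IncludeIndex {
public:
  // Stores the record in primary storage and keeps a pointer to it in a
  // secondary index. std::list never relocates existing elements on
  // push_back, so pointers handed out earlier stay valid.
  void add(const std::string &Name, int Priority) {
    std::list<Include> &Bucket = Existing[Name];
    Bucket.push_back(Include{Name});
    ByPriority[Priority].push_back(&Bucket.back());
  }

  // Returns the pointers recorded for one priority (empty if none).
  const std::vector<const Include *> &withPriority(int Priority) const {
    static const std::vector<const Include *> Empty;
    auto It = ByPriority.find(Priority);
    return It == ByPriority.end() ? Empty : It->second;
  }

private:
  std::map<std::string, std::list<Include>> Existing;     // primary storage
  std::map<int, std::vector<const Include *>> ByPriority; // pointers into it
};
```

With a vector-like bucket, the repeated `add` calls for the same header name would eventually reallocate the bucket and leave the stored pointers dangling, which is the situation the stack trace shows.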
A LUI instruction with flag RISCVII::MO_HI is usually used in conjunction with ADDI, and the two jointly complete the address computation. To keep the cost evaluation of the address computation together, the LUI should not be regarded as a cheap move on its own, which is consistent with ADDI. In this test case, it improves the unrolled-loop code, in which the rematerialization of the array's base address misses MachineCSE because of Heuristics #1 in isProfitableToCSE. Reviewed By: asb, frasercrmck Differential Revision: https://reviews.llvm.org/D118216
This patch fixes a data race in IOHandlerProcessSTDIO. The race happens between the main thread and the event handling thread. The main thread is running the IOHandler (IOHandlerProcessSTDIO::Run()) when an event comes in that makes us pop the process IO handler, which involves cancelling the IOHandler (IOHandlerProcessSTDIO::Cancel). The latter calls SetIsDone(true), which modifies m_is_done. At the same time, we have the main thread reading the variable through GetIsDone(). This patch avoids the race by using a mutex to synchronize the two threads. On the event thread, in the IOHandlerProcessSTDIO::Cancel method, we obtain the lock before changing the value of m_is_done. On the main thread, in IOHandlerProcessSTDIO::Run(), we obtain the lock before reading the value of m_is_done. Additionally, we delay calling SetIsDone until after the loop exits, to avoid a potential race between the two writes. Write of size 1 at 0x00010b66bb68 by thread T7 (mutexes: write M2862, write M718324145051843688): #0 lldb_private::IOHandler::SetIsDone(bool) IOHandler.h:90 (liblldb.15.0.0git.dylib:arm64+0x971d84) #1 IOHandlerProcessSTDIO::Cancel() Process.cpp:4382 (liblldb.15.0.0git.dylib:arm64+0x5ddfec) #2 lldb_private::Debugger::PopIOHandler(std::__1::shared_ptr<lldb_private::IOHandler> const&) Debugger.cpp:1156 (liblldb.15.0.0git.dylib:arm64+0x3cb2a8) #3 lldb_private::Debugger::RemoveIOHandler(std::__1::shared_ptr<lldb_private::IOHandler> const&) Debugger.cpp:1063 (liblldb.15.0.0git.dylib:arm64+0x3cbd2c) #4 lldb_private::Process::PopProcessIOHandler() Process.cpp:4487 (liblldb.15.0.0git.dylib:arm64+0x5c583c) #5 lldb_private::Debugger::HandleProcessEvent(std::__1::shared_ptr<lldb_private::Event> const&) Debugger.cpp:1549 (liblldb.15.0.0git.dylib:arm64+0x3ceabc) #6 lldb_private::Debugger::DefaultEventHandler() Debugger.cpp:1622 (liblldb.15.0.0git.dylib:arm64+0x3cf2c0) #7 std::__1::__function::__func<lldb_private::Debugger::StartEventHandlerThread()::$_2, std::__1::allocator<lldb_private::Debugger::StartEventHandlerThread()::$_2>, void* ()>::operator()() function.h:352 (liblldb.15.0.0git.dylib:arm64+0x3d1bd8) #8 lldb_private::HostNativeThreadBase::ThreadCreateTrampoline(void*) HostNativeThreadBase.cpp:62 (liblldb.15.0.0git.dylib:arm64+0x4c71ac) #9 lldb_private::HostThreadMacOSX::ThreadCreateTrampoline(void*) HostThreadMacOSX.mm:18 (liblldb.15.0.0git.dylib:arm64+0x29ef544) Previous read of size 1 at 0x00010b66bb68 by main thread: #0 lldb_private::IOHandler::GetIsDone() IOHandler.h:92 (liblldb.15.0.0git.dylib:arm64+0x971db8) #1 IOHandlerProcessSTDIO::Run() Process.cpp:4339 (liblldb.15.0.0git.dylib:arm64+0x5ddc7c) #2 lldb_private::Debugger::RunIOHandlers() Debugger.cpp:982 (liblldb.15.0.0git.dylib:arm64+0x3cb48c) #3 lldb_private::CommandInterpreter::RunCommandInterpreter(lldb_private::CommandInterpreterRunOptions&) CommandInterpreter.cpp:3298 (liblldb.15.0.0git.dylib:arm64+0x506478) #4 lldb::SBDebugger::RunCommandInterpreter(bool, bool) SBDebugger.cpp:1166 (liblldb.15.0.0git.dylib:arm64+0x53604) #5 Driver::MainLoop() Driver.cpp:634 (lldb:arm64+0x100006294) #6 main Driver.cpp:853 (lldb:arm64+0x100007344) Differential revision: https://reviews.llvm.org/D120762
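The synchronization described above, where the reader and the writer take the same mutex around `m_is_done`, can be sketched as follows. The `Handler` class is a hypothetical, reduced model and not lldb's `IOHandlerProcessSTDIO`; only the locking pattern is meant to carry over.

```cpp
#include <cassert>
#include <mutex>
#include <thread>

// Hypothetical, simplified handler used only to illustrate the locking.
class Handler {
public:
  // Main thread: loops until cancelled, reading the flag under the lock.
  // Returns the number of loop iterations that ran before cancellation.
  int Run() {
    int iterations = 0;
    while (true) {
      {
        std::lock_guard<std::mutex> guard(m_mutex);
        if (m_is_done)
          break;
      }
      ++iterations;
      std::this_thread::yield(); // placeholder for the real I/O work
    }
    return iterations;
  }

  // Event thread: writes the flag under the same lock.
  void Cancel() {
    std::lock_guard<std::mutex> guard(m_mutex);
    m_is_done = true;
  }

  bool IsDone() {
    std::lock_guard<std::mutex> guard(m_mutex);
    return m_is_done;
  }

private:
  std::mutex m_mutex;
  bool m_is_done = false;
};
```

In real use `Cancel()` would be called from a second thread while `Run()` is looping. Because both sides hold `m_mutex` when touching the flag, ThreadSanitizer would no longer see an unsynchronized read/write pair.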
This adds the jump slot mapping for RISCV. This enables lldb to attach to a remote debug server. Although this doesn't enable debugging RISCV targets, it is sufficient to attach, which is a slight improvement. Tested with DebugServer2: ~~~ (lldb) gdb-remote localhost:1234 (lldb) Process 71438 stopped * thread #1, name = 'reduced', stop reason = signal SIGTRAP frame #0: 0x0000003ff7fe1b20 error: Process 71438 is currently being debugged, kill the process before connecting. (lldb) register read general: x0 = 0x0000003ff7fe1b20 x1 = 0x0000002ae00d3a50 x2 = 0x0000003ffffff3e0 x3 = 0x0000002ae01566e0 x4 = 0x0000003fe567c7b0 x5 = 0x0000000000001000 x6 = 0x0000002ae00604ec x7 = 0x00000000000003ff x8 = 0x0000003fffc22db0 x9 = 0x0000000000000000 x10 = 0x0000000000000000 x11 = 0x0000002ae603b1c0 x12 = 0x0000002ae6039350 x13 = 0x0000000000000000 x14 = 0x0000002ae6039350 x15 = 0x0000002ae6039350 x16 = 0x73642f74756f3d5f x17 = 0x00000000000000dd x18 = 0x0000002ae6038f08 x19 = 0x0000002ae603b1c0 x20 = 0x0000002b0f3d3f40 x21 = 0x0000003ff0b212d0 x22 = 0x0000002b0f3a2740 x23 = 0x0000002b0f3de3a0 x24 = 0x0000002b0f3d3f40 x25 = 0x0000002ad6929850 x26 = 0x0000000000000000 x27 = 0x0000002ad69297c0 x28 = 0x0000003fe578b364 x29 = 0x000000000000002f x30 = 0x0000000000000000 x31 = 0x0000002ae602401a pc = 0x0000003ff7fe1b20 ft0 = 0 ft1 = 0 ft2 = 0 ft3 = 0 ft4 = 0 ft5 = 0 ft6 = 0 ft7 = 0 fs0 = 0 fs1 = 0 fa0 = 0 fa1 = 0 fa2 = 0 fa3 = 0 fa4 = 0 fa5 = 0 fa6 = 0 fa7 = 9.10304232197721e-313 fs2 = 0 fs3 = 1.35805727667792e-312 fs4 = 1.35589259164679e-312 fs5 = 1.35805727659887e-312 fs6 = 9.10304232355822e-313 fs7 = 0 fs8 = 9.10304233027751e-313 fs9 = 0 fs10 = 9.10304232948701e-313 fs11 = 1.35588724164707e-312 ft8 = 0 ft9 = 9.1372158616833e-313 ft10 = 9.13720376537528e-313 ft11 = 1.356808717416e-312 3 registers were unavailable. (lldb) disassemble error: Failed to disassemble memory at 0x3ff7fe1b2 ~~~
Add support to inspect the ELF headers for RISCV targets to determine if RVC or RVE are enabled and the floating point support to enable. As per the RISCV specification, d implies f, q implies d implies f, which gives us the cascading effect that is used to enable the features when setting up the disassembler. With this change, it is now possible to attach the debugger to a remote process and be able to disassemble the instruction stream. ~~~ $ bin/lldb tmp/reduced (lldb) target create "reduced" Current executable set to '/tmp/reduced' (riscv64). (lldb) gdb-remote localhost:1234 (lldb) Process 5737 stopped * thread #1, name = 'reduced', stop reason = signal SIGTRAP frame #0: 0x0000003ff7fe1b20 -> 0x3ff7fe1b20: mv a0, sp 0x3ff7fe1b22: jal 1936 0x3ff7fe1b26: mv s0, a0 0x3ff7fe1b28: auipc a0, 27 ~~~
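The cascading rule (q implies d implies f) can be sketched as a fall-through switch over the float ABI bits of `e_flags`. The flag values are the ones defined by the RISC-V ELF psABI; the function name `FeaturesFromFlags` and the returned feature strings are assumptions for illustration and do not reproduce lldb's actual code.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// e_flags bits as defined by the RISC-V ELF psABI.
constexpr std::uint32_t EF_RISCV_RVC = 0x0001;
constexpr std::uint32_t EF_RISCV_FLOAT_ABI_MASK = 0x0006;
constexpr std::uint32_t EF_RISCV_FLOAT_ABI_SINGLE = 0x0002;
constexpr std::uint32_t EF_RISCV_FLOAT_ABI_DOUBLE = 0x0004;
constexpr std::uint32_t EF_RISCV_FLOAT_ABI_QUAD = 0x0006;
constexpr std::uint32_t EF_RISCV_RVE = 0x0008;

// Hypothetical helper: derives a feature list from the ELF header flags.
std::vector<std::string> FeaturesFromFlags(std::uint32_t e_flags) {
  std::vector<std::string> features;
  if (e_flags & EF_RISCV_RVC)
    features.push_back("c");
  if (e_flags & EF_RISCV_RVE)
    features.push_back("e");
  // Each wider float ABI falls through to the narrower ones, so that
  // q enables d, and d enables f.
  switch (e_flags & EF_RISCV_FLOAT_ABI_MASK) {
  case EF_RISCV_FLOAT_ABI_QUAD:
    features.push_back("q");
    // fall through
  case EF_RISCV_FLOAT_ABI_DOUBLE:
    features.push_back("d");
    // fall through
  case EF_RISCV_FLOAT_ABI_SINGLE:
    features.push_back("f");
    break;
  default: // soft-float ABI: no floating point features
    break;
  }
  return features;
}
```

For example, a header with RVC and the double-float ABI (`0x0005`) yields `c`, `d`, and `f`.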
…ce characters in lookup names when parsing the ctu index file This error was found when analyzing MySQL with CTU enabled. When there are space characters in the lookup name, the current delimiter searching strategy will make the file path wrongly parsed. And when two lookup names have the same prefix before their first space characters, a 'multiple definitions' error will be wrongly reported. e.g. The lookup names for the two lambda exprs in the test case are `c:@s@G@F@G#@sa@F@operator int (*)(char)#1` and `c:@s@G@F@G#@sa@F@operator bool (*)(char)#1` respectively. And their prefixes are both `c:@s@G@F@G#@sa@F@operator` when using the first space character as the delimiter. Solving the problem by adding a length for the lookup name, making the index items in the format of `<USR-Length>:<USR File> <Path>`. --- In the test case of this patch, we found that it will trigger a "triple mismatch" warning when using `clang -cc1` to analyze the source file with CTU using the on-demand-parsing strategy in Darwin systems. And this problem is also encountered in D75665, which is the patch introducing the on-demand parsing strategy. We temporarily bypass this problem by using the loading-ast-file strategy. Refer to the [discourse topic](https://discourse.llvm.org/t/60762) for more details. Differential Revision: https://reviews.llvm.org/D102669
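A length-prefixed index line can be parsed without searching for a space delimiter, which is what makes lookup names with spaces safe. The sketch below is an assumption-laden illustration: `ParseIndexLine` is a hypothetical name, and the real parser in clang may differ in error handling.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical parser for one "<length>:<lookup name> <path>" line.
// Returns true and fills Name and Path on success.
bool ParseIndexLine(const std::string &Line, std::string &Name,
                    std::string &Path) {
  const std::size_t Colon = Line.find(':');
  if (Colon == std::string::npos || Colon == 0)
    return false;

  // Parse the decimal length that precedes the first colon.
  std::size_t Length = 0;
  for (std::size_t I = 0; I < Colon; ++I) {
    const char C = Line[I];
    if (C < '0' || C > '9')
      return false;
    Length = Length * 10 + static_cast<std::size_t>(C - '0');
    if (Length > Line.size())
      return false; // cannot be valid, and avoids overflow on long input
  }

  // The name is taken by length, so spaces inside it are harmless.
  const std::size_t NameBegin = Colon + 1;
  if (Length == 0 || Line.size() < NameBegin + Length + 2)
    return false; // need the name, one space, and a non-empty path
  if (Line[NameBegin + Length] != ' ')
    return false;

  Name = Line.substr(NameBegin, Length);
  Path = Line.substr(NameBegin + Length + 1);
  return true;
}
```

With this format, the two lambda lookup names from the commit message are read back in full instead of being cut at `operator`, so they no longer collide.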
I'm adding two new classes that can be used to measure the duration of long tasks at the process and thread level, e.g. decoding, fetching data from lldb-server, etc. In this first patch, I'm using them to measure the time it takes to decode each thread, which is printed out with the `dump info` command. In a later patch I'll start adding process-level tasks and I might move these classes to the upper Trace level, instead of having them in the intel-pt plugin. I might need to do that anyway in the future when we have to measure HTR. For now, I want to keep the impact of this change minimal. With it, I was able to generate the following info of a very big trace: ``` (lldb) thread trace dump info Trace technology: intel-pt thread #1: tid = 616081 Total number of instructions: 9729366 Memory usage: Raw trace size: 1024 KiB Total approximate memory usage (excluding raw trace): 123517.34 KiB Average memory usage per instruction (excluding raw trace): 13.00 bytes Timing: Decoding instructions: 1.62s Errors: Number of TSC decoding errors: 0 ``` As seen above, it took 1.62 seconds to decode 9.7M instructions. This is great news, as we don't need to do any optimization work in this area. Differential Revision: https://reviews.llvm.org/D123357
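A task timer of this kind can be sketched as a small class that wraps a callable and accumulates the elapsed time per task name. The class name `TaskTimer` and its methods are assumptions made for this illustration; the classes added by the patch may look different.

```cpp
#include <cassert>
#include <chrono>
#include <map>
#include <string>

// Hypothetical sketch of a per-task timer.
class TaskTimer {
public:
  // Runs Task and adds its wall-clock duration to the entry for Name.
  template <typename Callable>
  void TimeTask(const std::string &Name, Callable Task) {
    const auto Start = std::chrono::steady_clock::now();
    Task();
    const auto End = std::chrono::steady_clock::now();
    Durations[Name] +=
        std::chrono::duration_cast<std::chrono::milliseconds>(End - Start);
  }

  // Accumulated durations, keyed by task name.
  const std::map<std::string, std::chrono::milliseconds> &
  GetDurations() const {
    return Durations;
  }

private:
  std::map<std::string, std::chrono::milliseconds> Durations;
};
```

A dump command could then iterate over `GetDurations()` to print lines such as the "Decoding instructions" entry in the sample output.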
Detected on many lld tests with -fsanitize-memory-use-after-dtor. Also https://lab.llvm.org/buildbot/#/builders/sanitizer-x86_64-linux-fast after D122869 will report a lot of them. Threads may outlive static variables. Even if ~__thread_specific_ptr() does nothing, lifetime of members ends with ~ and accessing the value is UB https://eel.is/c++draft/basic.life#1 ``` ==9214==WARNING: MemorySanitizer: use-of-uninitialized-value #0 0x557e1cec4539 in __libcpp_tls_set ../include/c++/v1/__threading_support:428:12 #1 0x557e1cec4539 in set_pointer ../include/c++/v1/thread:196:5 #2 0x557e1cec4539 in void* std::__msan::__thread_proxy< std::__msan::tuple<...>, llvm::parallel::detail::(anonymous namespace)::ThreadPoolExecutor::ThreadPoolExecutor(llvm::ThreadPoolStrategy)::'lambda'()::operator()() const::'lambda'()> >(void*) ../include/c++/v1/thread:285:27 Memory was marked as uninitialized #0 0x557e10a0759d in __sanitizer_dtor_callback compiler-rt/lib/msan/msan_interceptors.cpp:940:5 #1 0x557e1d8c478d in std::__msan::__thread_specific_ptr<std::__msan::__thread_struct>::~__thread_specific_ptr() libcxx/include/thread:188:1 #2 0x557e10a07dc0 in MSanCxaAtExitWrapper(void*) compiler-rt/lib/msan/msan_interceptors.cpp:1151:3 ``` The test needs D123979 or -fsanitize-memory-param-retval enabled by default. Reviewed By: ldionne, #libc Differential Revision: https://reviews.llvm.org/D122864
A trace might contain events traced during the target's execution. For example, a thread might be paused for some period of time due to context switches or breakpoints, which actually force a context switch. Not only that, a trace might be paused because the CPU decides to trace only a specific part of the target, like the address filtering provided by intel pt, which will cause pause events. Besides this case, other kinds of events might exist. This patch adds the method `TraceCursor::GetEvents()` that returns the list of events that happened right before the instruction being pointed at by the cursor. Some refactors were done to make this change simpler. Besides this new API, the instruction dumper now supports the -e flag which shows pause events, like in the following example, where pauses happened due to breakpoints. ``` thread #1: tid = 2717361 a.out`main + 20 at main.cpp:27:20 0: 0x00000000004023d9 leaq -0x1200(%rbp), %rax [paused] 1: 0x00000000004023e0 movq %rax, %rdi [paused] 2: 0x00000000004023e3 callq 0x403a62 ; std::vector<int, std::allocator<int> >::vector at stl_vector.h:391:7 a.out`std::vector<int, std::allocator<int> >::vector() at stl_vector.h:391:7 3: 0x0000000000403a62 pushq %rbp 4: 0x0000000000403a63 movq %rsp, %rbp ``` The `dump info` command has also been updated and now it shows the number of instructions that have associated events. Differential Revision: https://reviews.llvm.org/D123982
…ified offset and its parents or children with spcified depth." This reverts commit a3b7cb0. symbol-offset.test fails under MSAN: [ 1] ; RUN: llvm-pdbutil yaml2pdb %p/Inputs/symbol-offset.yaml --pdb=%t.pdb [FAIL] llvm-pdbutil yaml2pdb <REDACTED>/llvm/test/tools/llvm-pdbutil/Inputs/symbol-offset.yaml --pdb=<REDACTED>/tmp/symbol-offset.test/symbol-offset.test.tmp.pdb ==9283==WARNING: MemorySanitizer: use-of-uninitialized-value #0 0x55f975e5eb91 in __libcpp_tls_set <REDACTED>/include/c++/v1/__threading_support:428:12 #1 0x55f975e5eb91 in set_pointer <REDACTED>/include/c++/v1/thread:196:5 #2 0x55f975e5eb91 in void* std::__msan::__thread_proxy<std::__msan::tuple<std::__msan::unique_ptr<std::__msan::__thread_struct, std::__msan::default_delete<std::__msan::__thread_struct> >, llvm::parallel::detail::(anonymous namespace)::ThreadPoolExecutor::ThreadPoolExecutor(llvm::ThreadPoolStrategy)::'lambda'()::operator()() const::'lambda'()> >(void*) <REDACTED>/include/c++/v1/thread:285:27 #3 0x7f74a1e55b54 in start_thread (<REDACTED>/libpthread.so.0+0xbb54) (BuildId: 64752de50ebd1a108f4b3f8d0d7e1a13) #4 0x7f74a1dc9f7e in clone (<REDACTED>/libc.so.6+0x13cf7e) (BuildId: 7cfed7708e5ab7fcb286b373de21ee76)
- Decouple TSCs from trace items - Turn TSCs into events just like CPUs. The new name is HW clock tick, which could be reused by other vendors. - Add a GetWallTime that returns the wall time that the trace plug-in can infer for each trace item. - For intel pt, we are doing the following interpolation: if an instruction takes less than 1 TSC, we use that duration, otherwise, we assume the instruction took 1 TSC. This helps us avoid having to handle context switches, changes to kernel, idle times, decoding errors, etc. We are just trying to show some approximation and not the real data. For the real data, TSCs are the way to go. Besides that, we are making sure that no two trace items will give the same interpolation value. Finally, we are using as time 0 the time at which tracing started. Sample output: ``` (lldb) r Process 750047 launched: '/home/wallace/a.out' (x86_64) Process 750047 stopped * thread #1, name = 'a.out', stop reason = breakpoint 1.1 frame #0: 0x0000000000402479 a.out`main at main.cpp:29:20 26 }; 27 28 int main() { -> 29 std::vector<int> vvv; 30 for (int i = 0; i < 100; i++) 31 vvv.push_back(i); 32 (lldb) process trace start -s 64kb -t --per-cpu (lldb) b 60 Breakpoint 2: where = a.out`main + 1689 at main.cpp:60:23, address = 0x0000000000402afe (lldb) c Process 750047 resuming Process 750047 stopped * thread #1, name = 'a.out', stop reason = breakpoint 2.1 frame #0: 0x0000000000402afe a.out`main at main.cpp:60:23 57 map<int, int> m; 58 m[3] = 4; 59 -> 60 map<string, string> m2; 61 m2["5"] = "6"; 62 63 std::vector<std::string> vs = {"2", "3"}; (lldb) thread trace dump instructions -t -f -e thread #1: tid = 750047 0: [379567.000 ns] (event) HW clock tick [48599428476224707] 1: [379569.000 ns] (event) CPU core changed [new CPU=2] 2: [390487.000 ns] (event) HW clock tick [48599428476246495] 3: [1602508.000 ns] (event) HW clock tick [48599428478664855] 4: [1662745.000 ns] (event) HW clock tick [48599428478785046] libc.so.6`malloc 5: [1662746.995 ns]
0x00007ffff7176660 endbr64 6: [1662748.991 ns] 0x00007ffff7176664 movq 0x32387d(%rip), %rax ; + 408 7: [1662750.986 ns] 0x00007ffff717666b pushq %r12 8: [1662752.981 ns] 0x00007ffff717666d pushq %rbp 9: [1662754.977 ns] 0x00007ffff717666e pushq %rbx 10: [1662756.972 ns] 0x00007ffff717666f movq (%rax), %rax 11: [1662758.967 ns] 0x00007ffff7176672 testq %rax, %rax 12: [1662760.963 ns] 0x00007ffff7176675 jne 0x9c7e0 ; <+384> 13: [1662762.958 ns] 0x00007ffff717667b leaq 0x17(%rdi), %rax 14: [1662764.953 ns] 0x00007ffff717667f cmpq $0x1f, %rax 15: [1662766.949 ns] 0x00007ffff7176683 ja 0x9c730 ; <+208> 16: [1662768.944 ns] 0x00007ffff7176730 andq $-0x10, %rax 17: [1662770.939 ns] 0x00007ffff7176734 cmpq $-0x41, %rax 18: [1662772.935 ns] 0x00007ffff7176738 seta %dl 19: [1662774.930 ns] 0x00007ffff717673b jmp 0x9c690 ; <+48> 20: [1662776.925 ns] 0x00007ffff7176690 cmpq %rdi, %rax 21: [1662778.921 ns] 0x00007ffff7176693 jb 0x9c7b0 ; <+336> 22: [1662780.916 ns] 0x00007ffff7176699 testb %dl, %dl 23: [1662782.911 ns] 0x00007ffff717669b jne 0x9c7b0 ; <+336> 24: [1662784.906 ns] 0x00007ffff71766a1 movq 0x3236c0(%rip), %r12 ; + 24 (lldb) thread trace dump instructions -t -f -e -J -c 4 [ { "id": 0, "timestamp_ns": "379567.000000", "event": "HW clock tick", "hwClock": 48599428476224707 }, { "id": 1, "timestamp_ns": "379569.000000", "event": "CPU core changed", "cpuId": 2 }, { "id": 2, "timestamp_ns": "390487.000000", "event": "HW clock tick", "hwClock": 48599428476246495 }, { "id": 3, "timestamp_ns": "1602508.000000", "event": "HW clock tick", "hwClock": 48599428478664855 }, { "id": 4, "timestamp_ns": "1662745.000000", "event": "HW clock tick", "hwClock": 48599428478785046 }, { "id": 5, "timestamp_ns": "1662746.995324", "loadAddress": "0x7ffff7176660", "module": "libc.so.6", "symbol": "malloc", "mnemonic": "endbr64" }, { "id": 6, "timestamp_ns": "1662748.990648", "loadAddress": "0x7ffff7176664", "module": "libc.so.6", "symbol": "malloc", "mnemonic": "movq" }, { "id": 7, 
"timestamp_ns": "1662750.985972", "loadAddress": "0x7ffff717666b", "module": "libc.so.6", "symbol": "malloc", "mnemonic": "pushq" }, { "id": 8, "timestamp_ns": "1662752.981296", "loadAddress": "0x7ffff717666d", "module": "libc.so.6", "symbol": "malloc", "mnemonic": "pushq" } ] ``` Differential Revision: https://reviews.llvm.org/D130054
Refactor the string conversion of the `lldb::InstructionControlFlowKind` enum out of `Instruction::Dump` to enable reuse of this logic by the JSON TraceDumper (to be implemented in separate diff). Will coordinate the landing of this change with D130320 since there will be a minor merge conflict between these changes. Test Plan: Run unittests ``` > ninja check-lldb [4/5] Running lldb unit test suite Testing Time: 10.13s Passed: 1084 ``` Verify '-k' flag's output ``` (lldb) thread trace dump instructions -k thread #1: tid = 1375377 libstdc++.so.6`std::ostream::flush() + 43 7048: 0x00007ffff7b54dab return retq 7047: 0x00007ffff7b54daa other popq %rbx 7046: 0x00007ffff7b54da7 other movq %rbx, %rax 7045: 0x00007ffff7b54da5 cond jump je 0x11adb0 ; <+48> 7044: 0x00007ffff7b54da2 other cmpl $-0x1, %eax libc.so.6`_IO_fflush + 249 7043: 0x00007ffff7161729 return retq 7042: 0x00007ffff7161728 other popq %rbp 7041: 0x00007ffff7161727 other popq %rbx 7040: 0x00007ffff7161725 other movl %edx, %eax 7039: 0x00007ffff7161721 other addq $0x8, %rsp 7038: 0x00007ffff7161709 cond jump je 0x87721 ; <+241> 7037: 0x00007ffff7161707 other decl (%rsi) 7036: 0x00007ffff71616fe cond jump je 0x87707 ; <+215> 7035: 0x00007ffff71616f7 other cmpl $0x0, 0x33de92(%rip) ; __libc_multiple_threads 7034: 0x00007ffff71616ef other movq $0x0, 0x8(%rsi) 7033: 0x00007ffff71616ed cond jump jne 0x87721 ; <+241> 7032: 0x00007ffff71616e9 other subl $0x1, 0x4(%rsi) 7031: 0x00007ffff71616e2 other movq 0x88(%rbx), %rsi 7030: 0x00007ffff71616e0 cond jump jne 0x87721 ; <+241> 7029: 0x00007ffff71616da other testl $0x8000, (%rbx) ; imm = 0x8000 ``` Differential Revision: https://reviews.llvm.org/D130580
This reverts commit 5fb4134. This patch is causing crashes when building llvm-test-suite when optimizing for CPUs with AVX512. Reproducer crashing with llc: target datalayout = "e-m:o-p270:32:32-p271:32:32-p272:64:64-i64:64-f80:128-n8:16:32:64-S128" target triple = "x86_64-apple-macosx" define i32 @test(<32 x i32> %0) #0 { entry: %1 = mul <32 x i32> %0, <i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1> %2 = tail call i32 @llvm.vector.reduce.add.v32i32(<32 x i32> %1) ret i32 %2 } ; Function Attrs: nocallback nofree nosync nounwind readnone willreturn declare i32 @llvm.vector.reduce.add.v32i32(<32 x i32>) #1 attributes #0 = { "min-legal-vector-width"="0" "target-cpu"="skylake-avx512" } attributes #1 = { nocallback nofree nosync nounwind readnone willreturn }
This diff uncovers an ASAN leak in getOrCreateJumpTable: ``` Indirect leak of 264 byte(s) in 1 object(s) allocated from: #1 0x4f6e48c in llvm::bolt::BinaryContext::getOrCreateJumpTable ... ``` The removal of an assertion needs to be accompanied by proper deallocation of a `JumpTable` object for which `analyzeJumpTable` was unsuccessful. This reverts commit 52cd00c.
AArch64InstrInfo::optimizePTestInstr attempts to remove a PTEST of a predicate-generating operation that identically sets flags (implicitly). When the PTEST and the predicate-generating operation use the same mask, the PTEST is currently removed. This is incorrect since it doesn't consider element size. PTEST operates on 8-bit predicates, but for instructions like compare that also support 16/32/64-bit predicates, the implicit PTEST performed by the instruction will consider fewer lanes for these element sizes and could set different first or last active flags. For example, consider the following instruction sequence ptrue p0.b ; P0=1111-1111-1111-1111 index z0.s, #0, #1 ; Z0=<0,1,2,3> index z1.s, #1, #1 ; Z1=<1,2,3,4> cmphi p1.s, p0/z, z1.s, z0.s ; P1=0001-0001-0001-0001 ; ^ last active ptest p0, p1.b ; P1=0001-0001-0001-0001 ; ^ last active where the compare generates a canonical all active 32-bit predicate (equivalent to 'ptrue p1.s, all'). The implicit PTEST sets the last active flag, whereas the PTEST instruction with the same mask doesn't. This patch restricts the optimization to instructions operating on 8-bit predicates. One caveat: the optimization is safe regardless of element size for the any-active condition; this will be addressed in a later patch. Reviewed By: bsmith Differential Revision: https://reviews.llvm.org/D137716
Verify three cases of G_UNMERGE_VALUES separately: 1. Splitting a vector into subvectors (the converse of G_CONCAT_VECTORS). 2. Splitting a vector into its elements (the converse of G_BUILD_VECTOR). 3. Splitting a scalar into smaller scalars (the converse of G_MERGE_VALUES). Previously #1 allowed strange combinations like this: %1:_(<2 x s16>),%2:_(<2 x s16>) = G_UNMERGE_VALUES %0(<2 x s32>) This has been tightened up to check that the source and destination element types match, and some MIR test cases updated accordingly. Differential Revision: https://reviews.llvm.org/D111132
…-seh.mm (NFC)" This reverts commit 01023bf. The extended test now triggers undefined behavior: ``` /b/sanitizer-aarch64-linux-bootstrap-ubsan/build/llvm-project/llvm/lib/Transforms/ObjCARC/ObjCARCOpts.cpp:577:41: runtime error: load of value 180, which is not a valid value for type 'bool' #0 0xaaaae3333a30 in hasCFGChanged /b/sanitizer-aarch64-linux-bootstrap-ubsan/build/llvm-project/llvm/lib/Transforms/ObjCARC/ObjCARCOpts.cpp:577:41 #1 0xaaaae3333a30 in llvm::ObjCARCOptPass::run(llvm::Function&, llvm::AnalysisManager<llvm::Function>&) /b/sanitizer-aarch64-linux-bootstrap-ubsan/build/llvm-project/llvm/lib/Transforms/ObjCARC/ObjCARCOpts.cpp:2494:26 ... ```
Casting a pointer to a suitably large integral type by reinterpret-cast should result in the same value as by using the `__builtin_bit_cast()`. The compiler exploits this: https://godbolt.org/z/zMP3sG683 However, the analyzer does not bind the same symbolic value to these expressions, resulting in weird situations, such as failing equality checks and even results in crashes: https://godbolt.org/z/oeMP7cj8q Previously, in the `RegionStoreManager::getBinding()` even if `T` was non-null, we replaced it with `TVR->getValueType()` in case the `MR` was `TypedValueRegion`. It doesn't make much sense to auto-detect the type if the type is already given. By not doing the auto-detection, we would just do the right thing and perform the load by that type. This means that we will cast the value to that type. So, in this patch, I'm proposing to do auto-detection only if the type was null. Here is a snippet of code, annotated by the previous and new dump values. `LocAsInteger` should wrap the `SymRegion`, since we want to load the address as if it was an integer. In none of the following cases should type auto-detection be triggered, hence we should eventually reach an `evalCast()` to lazily cast the loaded value into that type. 
```lang=C++ void LValueToRValueBitCast_dumps(void *p, char (*array)[8]) { clang_analyzer_dump(p); // remained: &SymRegion{reg_$0<void * p>} clang_analyzer_dump(array); // remained: {{&SymRegion{reg_$1<char (*)[8] array>} clang_analyzer_dump((unsigned long)p); // remained: {{&SymRegion{reg_$0<void * p>} [as 64 bit integer]}} clang_analyzer_dump(__builtin_bit_cast(unsigned long, p)); <--------- change #1 // previously: {{&SymRegion{reg_$0<void * p>}}} // now: {{&SymRegion{reg_$0<void * p>} [as 64 bit integer]}} clang_analyzer_dump((unsigned long)array); // remained: {{&SymRegion{reg_$1<char (*)[8] array>} [as 64 bit integer]}} clang_analyzer_dump(__builtin_bit_cast(unsigned long, array)); <--------- change #2 // previously: {{&SymRegion{reg_$1<char (*)[8] array>}}} // now: {{&SymRegion{reg_$1<char (*)[8] array>} [as 64 bit integer]}} } ``` Reviewed By: xazax.hun Differential Revision: https://reviews.llvm.org/D136603
The Assignment Tracking debug-info feature is outlined in this RFC: https://discourse.llvm.org/t/rfc-assignment-tracking-a-better-way-of-specifying-variable-locations-in-ir Add initial revision of assignment tracking analysis pass --------------------------------------------------------- This patch squashes five individually reviewed patches into one: #1 https://reviews.llvm.org/D136320 #2 https://reviews.llvm.org/D136321 #3 https://reviews.llvm.org/D136325 #4 https://reviews.llvm.org/D136331 #5 https://reviews.llvm.org/D136335 Patch #1 introduces 2 new files: AssignmentTrackingAnalysis.h and .cpp. The two subsequent patches modify those files only. Patch #4 plumbs the analysis into SelectionDAG, and patch #5 is a collection of tests for the analysis as a whole. The analysis was broken up into smaller chunks for review purposes, but for the most part the tests were written using the whole analysis. It would be possible to break up the tests for patches #1 through #3 for the purpose of landing the patches separately. However, most of them would require an update for each patch. In addition, patch #4, which connects the analysis to SelectionDAG, is required by all of the tests. If there is build-bot trouble, we might try a different landing sequence. Analysis problem and goal ------------------------- Variable values can be stored in memory, or available as SSA values, or both. Using the Assignment Tracking metadata, it's not possible to determine a variable location just by looking at a debug intrinsic in isolation. Instructions without any metadata can change the location of a variable. The meaning of dbg.assign intrinsics changes depending on whether there are linked instructions, and where they are relative to those instructions. So we need to analyse the IR and convert the embedded information into a form that SelectionDAG can consume to produce debug variable locations in MIR.
The solution is a dataflow analysis which, aiming to maximise the memory location coverage for variables, outputs a mapping of instruction positions to variable location definitions. API usage --------- The analysis is named `AssignmentTrackingAnalysis`. It is added as a required pass for SelectionDAGISel when assignment tracking is enabled. The results of the analysis are exposed via `getResults` using the returned `const FunctionVarLocs *`'s const methods: const VarLocInfo *single_locs_begin() const; const VarLocInfo *single_locs_end() const; const VarLocInfo *locs_begin(const Instruction *Before) const; const VarLocInfo *locs_end(const Instruction *Before) const; void print(raw_ostream &OS, const Function &Fn) const; Debug intrinsics can be ignored after running the analysis. Instead, variable location definitions that occur between an instruction `Inst` and its predecessor (or block start) can be found by looping over the range: locs_begin(Inst), locs_end(Inst) Similarly, variables with a memory location that is valid for their lifetime can be iterated over using the range: single_locs_begin(), single_locs_end() Further detail -------------- For an explanation of the dataflow implementation and the integration with SelectionDAG, please see the reviews linked at the top of this commit message. Reviewed By: jmorse
This is the first part of the SYCL hierarchical parallelism implementation. It
implements the main related APIs:
- h_item class
- group::parallel_for_work_item functions
- handler::parallel_for_work_group functions

It is able to run workloads which use these APIs but do not contain data
or code with group-visible side effects between the work group and work
item scopes.
This is the main part of the previous PR intel#221, which was split into parts.
Signed-off-by: Konstantin S Bobrovsky <konstantin.s.bobrovsky@intel.com>