Skip to content

Add check inversion for x and xor x, -1 #1

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed

Conversation

zjaffal
Copy link
Owner

@zjaffal zjaffal commented Jun 23, 2024

following llvm#94915 (comment) we can move the check from there inside CheckInversion and use the code for foldOrOfInversions to do the simplification

tbaederr and others added 30 commits June 20, 2024 16:34
Use a series of ops in that case, getting us to the right declaration
field.
This is a three instruction expansion, and does not depend on zba, so
most of the test changes are in base RV32/64I configurations.

With zba, this gets immediates such as 14, 28, 30, 56, 60, 62.. which
aren't covered by our other expansions.
… in documentation (NFC) (llvm#95003)

The checker was renamed at some time ago but the documentation was not
updated. The section is now just moved and renamed. The documentation is
still very simple and needs improvement.
…elease (llvm#91862)

### Context

We have a longstanding performance issue on Windows, where to this day,
the default heap allocator is still lockfull. With the number of cores
increasing, building and using LLVM with the default Windows heap
allocator is sub-optimal. Notably, the ThinLTO link times with LLD are
extremely long, and increase proportionally with the number of cores in
the machine.

In
llvm@a6a37a2,
I introduced the ability build LLVM with several popular lock-free
allocators. Downstream users however have to build their own toolchain
with this option, and building an optimal toolchain is a bit tedious and
long. Additionally, LLVM is now integrated into Visual Studio, which
AFAIK re-distributes the vanilla LLVM binaries/installer. The point
being that many users are impacted and might not be aware of this
problem, or are unable to build a more optimal version of the toolchain.

The symptom before this PR is that most of the CPU time goes to the
kernel (darker blue) when linking with ThinLTO:


![16c_ryzen9_windows_heap](https://github.com/llvm/llvm-project/assets/37383324/86c3f6b9-6028-4c1a-ba60-a2fa3876fba7)

With this PR, most time is spent in user space (light blue):


![16c_ryzen9_rpmalloc](https://github.com/llvm/llvm-project/assets/37383324/646b88f3-5b6d-485d-a2e4-15b520bdaf5b)

On higher core count machines, before this PR, the CPU usage becomes
pretty much flat because of contention:

<img width="549" alt="VM_176_windows_heap"
src="https://github.com/llvm/llvm-project/assets/37383324/f27d5800-ee02-496d-a4e7-88177e0727f0">


With this PR, similarily most CPU time is now used:

<img width="549" alt="VM_176_with_rpmalloc"
src="https://github.com/llvm/llvm-project/assets/37383324/7d4785dd-94a7-4f06-9b16-aaa4e2e505c8">

### Changes in this PR

The avenue I've taken here is to vendor/re-licence rpmalloc in-tree, and
use it when building the Windows 64-bit release. Given the permissive
rpmalloc licence, prior discussions with the LLVM foundation and
@lattner suggested this vendoring. Rpmalloc's author (@mjansson) kindly
agreed to ~~donate~~ re-licence the rpmalloc code in LLVM (please do
correct me if I misinterpreted our past communications).

I've chosen rpmalloc because it's small and gives the best value
overall. The source code is only 4 .c files. Rpmalloc is statically
replacing the weak CRT alloc symbols at link time, and has no dynamic
patching like mimalloc. As an alternative, there were several
unsuccessfull attempts made by Russell Gallop to use SCUDO in the past,
please see thread in https://reviews.llvm.org/D86694. If later someone
comes up with a PR of similar performance that uses SCUDO, we could then
delete this vendored rpmalloc folder.

I've added a new cmake flag `LLVM_ENABLE_RPMALLOC` which essentialy sets
`LLVM_INTEGRATED_CRT_ALLOC` to the in-tree rpmalloc source.

### Performance

The most obvious test is profling a ThinLTO linking step with LLD. I've
used a Clang compilation as a testbed, ie.
```
set OPTS=/GS- /D_ITERATOR_DEBUG_LEVEL=0 -Xclang -O3 -fstrict-aliasing -march=native -flto=thin -fwhole-program-vtables -fuse-ld=lld
cmake -G Ninja %ROOT%/llvm -DCMAKE_BUILD_TYPE=Release -DLLVM_ENABLE_ASSERTIONS=TRUE -DLLVM_ENABLE_PROJECTS="clang" -DLLVM_ENABLE_PDB=ON -DLLVM_OPTIMIZED_TABLEGEN=ON -DCMAKE_C_COMPILER=clang-cl.exe -DCMAKE_CXX_COMPILER=clang-cl.exe -DCMAKE_LINKER=lld-link.exe -DLLVM_ENABLE_LLD=ON -DCMAKE_CXX_FLAGS="%OPTS%" -DCMAKE_C_FLAGS="%OPTS%" -DLLVM_ENABLE_LTO=THIN
```
I've profiled the linking step with no LTO cache, with Powershell, such
as:
```
Measure-Command { lld-link /nologo @CMakeFiles\clang.rsp /out:bin\clang.exe /implib:lib\clang.lib /pdb:bin\clang.pdb /version:0.0 /machine:x64 /STACK:10000000 /DEBUG /OPT:REF /OPT:ICF /INCREMENTAL:NO /subsystem:console /MANIFEST:EMBED,ID=1 }`
```

Timings:

| Machine | Allocator | Time to link |
|--------|--------|--------|
| 16c/32t AMD Ryzen 9 5950X | Windows Heap | 10 min 38 sec |
|  | **Rpmalloc** | **4 min 11 sec** |
| 32c/64t AMD Ryzen Threadripper PRO 3975WX | Windows Heap | 23 min 29
sec |
|  | **Rpmalloc** | **2 min 11 sec** |
|  | **Rpmalloc + /threads:64** | **1 min 50 sec** |
| 176 vCPU (2 socket) Intel Xeon Platinium 8481C (fixed clock 2.7 GHz) |
Windows Heap | 43 min 40 sec |
|  | **Rpmalloc** | **1 min 45 sec** |

This also improves the overall performance when building with clang-cl.
I've profiled a regular compilation of clang itself, ie:
```
set OPTS=/GS- /D_ITERATOR_DEBUG_LEVEL=0 /arch:AVX -fuse-ld=lld
cmake -G Ninja %ROOT%/llvm -DCMAKE_BUILD_TYPE=Release -DLLVM_ENABLE_ASSERTIONS=TRUE -DLLVM_ENABLE_PROJECTS="clang;lld" -DLLVM_ENABLE_PDB=ON -DLLVM_OPTIMIZED_TABLEGEN=ON -DCMAKE_C_COMPILER=clang-cl.exe -DCMAKE_CXX_COMPILER=clang-cl.exe -DCMAKE_LINKER=lld-link.exe -DLLVM_ENABLE_LLD=ON -DCMAKE_CXX_FLAGS="%OPTS%" -DCMAKE_C_FLAGS="%OPTS%"
```
This saves approx. 30 sec when building on the Threadripper PRO 3975WX:
```
(default Windows Heap)
C:\src\git\llvm-project>hyperfine -r 5 -p "make_llvm.bat stage1_test2" "ninja clang -C stage1_test2"
Benchmark 1: ninja clang -C stage1_test2
  Time (mean ± σ):     392.716 s ±  3.830 s    [User: 17734.025 s, System: 1078.674 s]
  Range (min … max):   390.127 s … 399.449 s    5 runs

(rpmalloc)
C:\src\git\llvm-project>hyperfine -r 5 -p "make_llvm.bat stage1_test2" "ninja clang -C stage1_test2"
Benchmark 1: ninja clang -C stage1_test2
  Time (mean ± σ):     360.824 s ±  1.162 s    [User: 15148.637 s, System: 905.175 s]
  Range (min … max):   359.208 s … 362.288 s    5 runs
```
…bindings (llvm#96150)

The MLIR C and Python Bindings expose various methods from
`mlir::OpPrintingFlags` . This PR adds a binding for the `skipRegions`
method, which allows to skip the printing of Regions when printing Ops.
It also exposes this option as parameter in the python `get_asm` and
`print` methods
Note: our baremetal arm configuration compiles this as
`--target=arm-none-eabi`, so this code is built in -marm mode. It could be
smaller with `--target=armv7-none-eabi -mthumb`. The assembler is valid ARMv5,
or THUMB2, but not THUMB(1).
The use of SmallVector here saves 4.7% of heap allocations during the
compilation of ConvertExpr.cpp.ii, a preprocessed version of
flang/lib/Lower/ConvertExpr.cpp.
Add DwarfRegAlias for VSRPair as it shares dwarfRegNum with the VR
registers.
... so move it out of the `implied_features` list, and into the
`DefaultExts` list.
SmallPtrSet.h and TimeProfiler.h are unused. CommandLine.h is only
needed for the UseNewDbgInfoFormat declare, which can be moved to the
places that need it.
…ngth vectors. (llvm#96081)

This also fixes the case where an SVE div is incorrectly to be assumed
available in non-streaming mode with SME.
llvm#91100)

### Background

Doxygen's `\par` command
([link](https://www.doxygen.nl/manual/commands.html#cmdpar)) has an
optional argument, which denotes the header of the paragraph started by
a given `\par` command.

In short, the paragraph command can be used with a heading, or without
one. The code block below shows both forms and how the current version
of LLVM/Clang parses this code:
```
$ cat test.cpp
/// \par User defined paragraph:
/// Contents of the paragraph.
///
/// \par
/// New paragraph under the same heading.
///
/// \par
/// A second paragraph.
class A {};

$ clang++ -cc1 -ast-dump -fcolor-diagnostics -std=c++20 test.cpp
`-CXXRecordDecl 0x1530f3a78 <test.cpp:11:1, col:10> col:7 class A definition
  |-FullComment 0x1530fea38 <line:2:4, line:9:23>
  | |-ParagraphComment 0x1530fe7e0 <line:2:4>
  | | `-TextComment 0x1530fe7b8 <col:4> Text=" "
  | |-BlockCommandComment 0x1530fe800 <col:5, line:3:30> Name="par"
  | | `-ParagraphComment 0x1530fe878 <line:2:9, line:3:30>
  | |   |-TextComment 0x1530fe828 <line:2:9, col:32> Text=" User defined paragraph:"
  | |   `-TextComment 0x1530fe848 <line:3:4, col:30> Text=" Contents of the paragraph."
  | |-ParagraphComment 0x1530fe8c0 <line:5:4>
  | | `-TextComment 0x1530fe898 <col:4> Text=" "
  | |-BlockCommandComment 0x1530fe8e0 <col:5, line:6:41> Name="par"
  | | `-ParagraphComment 0x1530fe930 <col:4, col:41>
  | |   `-TextComment 0x1530fe908 <col:4, col:41> Text=" New paragraph under the same heading."
  | |-ParagraphComment 0x1530fe978 <line:8:4>
  | | `-TextComment 0x1530fe950 <col:4> Text=" "
  | `-BlockCommandComment 0x1530fe998 <col:5, line:9:23> Name="par"
  |   `-ParagraphComment 0x1530fe9e8 <col:4, col:23>
  |     `-TextComment 0x1530fe9c0 <col:4, col:23> Text=" A second paragraph."
  `-CXXRecordDecl 0x1530f3bb0 <line:11:1, col:7> col:7 implicit class A
```

As we can see above, the optional paragraph heading (`"User defined
paragraph"`) is not an argument of the `\par` `BlockCommandComment`, but
instead a child `TextComment`.

For documentation generators like [hdoc](https://hdoc.io/), it would be
ideal if we could parse Doxygen documentation comments with these
semantics in mind. Currently that's not possible.

### Change

This change parses `\par` command according to how Doxygen parses them,
making an optional header available as a an argument if it is present.
In addition:

- AST unit tests are defined to test this functionality when an argument
is present, isn't present, with additional spacing, etc.
- TableGen is updated with an `IsParCommand` to support this
functionality
- `lit` tests are updated where needed
Since commit 8d468c1, the header
`openmp_wrappers/complex` is hidden behind `openmp_wrappers/complex.h`
due to a bug in CMake[^1], so is not actually installed.

To test the issue, you can ask `ninja` to generate the file on your
build:

```
$ ninja lib/clang/19/include/openmp_wrappers/complex.h
[199/199] Copying clang's openmp_wrappers/complex.h...
$ ninja lib/clang/19/include/openmp_wrappers/complex
ninja: error: unknown target 'lib/clang/19/include/openmp_wrappers/complex', did you mean 'lib/clang/19/include/openmp_wrappers/complex.h'? 
```

Re-ordering the entries workarounds the issue. The other option is to
revert the cited commit, but I'm not sure which approach is preferred.

CC @etcwilde @jdoerfert 

[^1]: [Here](https://gitlab.kitware.com/cmake/cmake/-/issues/26058) is
the CMake report on the issue.
Add additional test with salarized store which caused crashes with
earlier versions of llvm#92555.
This reverts commit 6f538f6.

Extra tests for crashes discovered when building Chromium have been
added in fb86cb7, 3be7312.

Original message:
This adds a new interface to compute the cost of recipes, VPBasicBlocks,
VPRegionBlocks and VPlan, initially falling back to the legacy cost model
for all recipes. Follow-up patches will gradually migrate recipes to
compute their own costs step-by-step.

It also adds getBestPlan function to LVP which computes the cost of all
VPlans and picks the most profitable one together with the most
profitable VF.

The VPlan selected by the VPlan cost model is executed and there is an
assert to catch cases where the VPlan cost model and the legacy cost
model disagree. Even though I checked a number of different build
configurations on AArch64 and X86, there may be some differences
that have been missed.

Additional discussions and context can be found in @arcbbb's
llvm#67647 and
llvm#67934 which is an earlier
version of the current PR.

PR: llvm#92555
The main goal of this PR (and subsequent PRs), is to add more tests with
scalable vectors to:
  * vector-transfer-collapse-inner-most-dims.mlir

There's quite a few cases to consider, hence this is split into multiple
PRs. In this PR, the very first test for `vector.transfer_write` is
complemented with all the possible combinations:
  * scalable (rather than fixed) unit trailing dim,
  * dynamic (rather than static) trailing dim in the source memref.

To this end, the following tests:
  * `@leading_scalable_dimension_transfer_write`
    `@trailing_scalable_one_dim_transfer_write`

are replaced with:
  * `@drop_two_inner_most_dim_scalable_inner_dim` and
    `@negative_scalable_unit_dim`,

respectively. In addition:
  * "_for_transfer_write" is removed from function names (to reduce
    noise).

In addition, to maintain consistency between the tests for `xfer_read`
and `xfer_write`, 2 negative tests for `xfer_read` are also renamed.
This is to follow the suggestion made during the review of this PR.

Extra comments in "VectorTransforms.cpp" are added to better
document the limitations related to scalable vectors and which tests
added here excercise.

This is a follow-up for: llvm#94490 and llvm#94604

NOTE: This PR is limited to tests for `vector.transfer_write`.
This test occasionally fails on two of the busiest CI bots (asan and
matrix), and we can't reproduce it locally. This leads to the
hypothesis that the test is timing out (in the sense of the number of
"join attempts" performed by this test's driver).

This commit doubles the number of iterations performed and also does
an NFC refactor of the main test loop so that it can be more easily
understood.
… contexts (llvm#95963)"

This reverts commit dadf960.

The commit caused `TestEarlyProcessLaunch.py` to fail on the
macOS bots.
This is the paper that added the 'restrict' keyword. Clang is
conforming to the letter of the standard's requirements, so it would be
defensible for us to claim full support instead. However, LLVM does not
currently support the optimization semantics with restricted local
variables or data members, only with restricted pointers declared in
function parameters. So we're only claiming partial support because we
don't yet take full advantage of what the feature allows.
This change by itself has no measurable effect on the LLDB
testsuite. I'm making it in preparation for threading through more
errors in the Swift language plugin.
This is de facto an NFC change for Objective-C but will benefit the
Swift language plugin.
This pass is not used in any pipeline, barely used in tests and not
really useful, so drop it. The only place where we "repeat" passes is
devirt repetition, and that is done using a separate pass.
…nit (llvm#95580)

This checks if the layout of `std::initializer_list` is something Clang
can handle much earlier and deduplicates the checks in
CodeGen/CGExprAgg.cpp and AST/ExprConstant.cpp

Also now diagnose `union initializer_list` (Fixes llvm#95495), bit-field for
the size (Fixes a crash that would happen during codegen if it were
unnamed), base classes (that wouldn't be initialized) and polymorphic
classes (whose vtable pointer wouldn't be initialized).
…lvm#96013)

Make LanguageRuntime::GetTypeBitSize return an optional. This should be
NFC, though the ObjCLanguageRuntime implementation is (possibly) more
defensive against returning 0.

I'm not sure if it's possible for both `m_ivar.size` and `m_ivar.offset`
to be zero. Previously, we'd return 0 and cache it, only to discard it
the next time when finding it in the cache, and recomputing it again.
The new code will avoid putting it in the cache in the first place.
Michael137 and others added 29 commits June 22, 2024 17:07
This formatter doesn't currently provide much value. It only formats
`SourceLocation` and `QualType`. The only formatting it does for
`QualType` is call `getAsString()` on it.

The main motivator for the removal however is that the formatter
implementation can be very slow (since it uses the expression evaluator
in non-trivial ways).

Not infrequently do we get reports about LLDB being slow when debugging
Clang, and it turns out the user was loading `ClangDataFormat.py` in
their `.lldbinit` by default.

We should eventually develop proper formatters for Clang data-types, but
these are currently not ready. So this patch removes them in the
meantime to avoid users shooting themselves in the foot, and giving the
wrong impression of these being reference implementations.
Fold `mul (uitofp i1 X), Y` to `select i1 X, Y, 0.0` when the `mul` is
`nnan` and `nsz`

Proof: https://alive2.llvm.org/ce/z/_stiPm
We're ultimately expected to return an APValue simply pointing to
the CallExpr, not any useful value. Do that by creating a global
variable for the call.
…k. NFC

This was added by 507efbc
([MC] Fold A-B when A is a pending label or A/B are separated by a
MCFillFragment) to account for pending labels and is now unneeded after
the removal of pending labels (7500646).
The checks when building a thunk to decide if an arg needed to be cast
to/from an integer or redirected via a pointer didn't match how arg
types were changed in `canonicalizeThunkType`, this caused LLVM to ICE
when using vector types as args due to incorrect types in a call
instruction.

Instead of duplicating these checks, we should check if the arg type
differs between x64 and AArch64 and then cast or redirect as
appropriate.
…llvm#96396)

Reapply 4a7bf42
which was reverted in 34d44eb

Not sure why there are tests elsewhere in clang that rely on the output
of clang-format, but they were wrong
…at_provider (llvm#95704)

The original implementation of HelperFunctions::consumeHexStyle always
sets Style when it returns true, but this is difficult for a compiler
to understand since it requires seeing that Str starts
with either an "x" or an "X" when starts_with_insensitive("x")
return true.
In particular, g++ 12 warns that HS may be used uninitialized
in the format_provider::format caller.

Change HelperFunctions::consumeHexStyle to return an optional
HexPrintStyle and to make the fact that Str necessarily starts
with an "X" when all other cases do not apply more explicit.
This helps both the compiler and the human reader of the code.

Co-authored-by: Sven Verdoolaege <sven.verdoolaege@gmail.com>
llvm#95197 and 7500646 eliminated all raw
`new MCXXXFragment`. We can now place fragments in a bump allocator.
In addition, remove the dead `Kind == FragmentType(~0)` condition.

~CodeViewContext may call `StrTabFragment->destroy()` and need to be
reset before `FragmentAllocator.Reset()`.
Tested by llvm/test/MC/COFF/cv-compiler-info.ll using asan.

Pull Request: llvm#96402
https://reviews.llvm.org/D67249 added content hash (see
-fvalidate-ast-input-files-content) using llvm::hash_code (size_t).
The hash value is 32-bit on 32-bit systems, which was unintentional.

Fix llvm#96379: llvm#96136 switched the hash function to xxh3_64bit but did not
update the ContentHash type, leading to mismatch between ASTReader and
ASTWriter.
This change is part of this proposal:
https://discourse.llvm.org/t/rfc-all-the-math-intrinsics/78294

This is part 3 of 4 PRs. It sets the ground work for using the
intrinsics in HLSL.

Add HLSL frontend apis for `acos`, `asin`, `atan`, `cosh`, `sinh`, and
`tanh`
llvm#70079
llvm#70080
llvm#70081
llvm#70083
llvm#70084
llvm#95966
…n-constants

If f(Y) simplifies to Y, replace with Y. This requires Y to be
non-undef.

Closes llvm#94719
Follow-up to 05ba5c0. uint32_t is
preferred over const MCExpr * in the section stack uses because it
should only be evaluated once. Change the paramter type to match.
Functions that have the `nvvm.kernel` attribute should have 0 results.
The `gpu.func` op lowering accounts for memref arguments/results (both
"normal" and bare-pointer supported), but the `gpu.return` op lowering
did not. The lowering produced invalid IR that did not verify.

This commit uses the same lowering strategy as for `func.return` in the
`gpu.return` lowering. (The C++ implementation is copied. We may want to
share some code between `func` and `gpu` lowerings in the future.)
Define subtarget features for atomic fmin/fmax support.

The flat/global support is a real messe. We had float/double support at
the beginning in gfx6 and gfx7. gfx8 removed these. gfx10 reintroduced them.
gfx11 removed the f64 versions again.

gfx9 partially reintroduced them, in gfx90a and gfx940 but only for f64.
These have been replaced with atomicrmw fadd
@zjaffal zjaffal closed this Jun 23, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.