Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Big compile time regression for Hexagon Linux kernel build after commit 90ba33099c #80185

Open
nathanchance opened this issue Jan 31, 2024 · 10 comments

Comments

@nathanchance
Copy link
Member

After 90ba330, I see a very noticeable build time regression when building drivers/net/wireless/ralink/rt2x00/rt2800lib.c from the Linux kernel for ARCH=hexagon at -O2. I am still teasing out a concise reproducer but my reduction is taking quite long so I am just posting what I have so far to raise visibility on this. I'll post a potentially more concise reproducer once my reduction finishes.

With a base command of clang -fno-short-enums --target=hexagon-linux-musl -O2 -c -o /dev/null rt2800lib.i:

Benchmark 1: clang version 19.0.0git (https://github.com/llvm/llvm-project 98509c7f9792c79b05a41b95c24607f6dd489c5a) @ -O0
  Time (abs ≡):         3.024 s               [User: 2.924 s, System: 0.100 s]

Benchmark 2: clang version 19.0.0git (https://github.com/llvm/llvm-project 98509c7f9792c79b05a41b95c24607f6dd489c5a) @ -O2
  Time (abs ≡):        42.393 s               [User: 42.293 s, System: 0.100 s]

Benchmark 3: clang version 19.0.0git (https://github.com/llvm/llvm-project 90ba33099cbb17e7c159e9ebc5a512037db99d6d) @ -O0
  Time (abs ≡):         2.646 s               [User: 2.586 s, System: 0.060 s]

Benchmark 4: clang version 19.0.0git (https://github.com/llvm/llvm-project 90ba33099cbb17e7c159e9ebc5a512037db99d6d) @ -O2
  Time (abs ≡):        2113.743 s               [User: 2113.603 s, System: 0.110 s]

Summary
  clang version 19.0.0git (https://github.com/llvm/llvm-project 90ba33099cbb17e7c159e9ebc5a512037db99d6d) @ -O0 ran
    1.14 times faster than clang version 19.0.0git (https://github.com/llvm/llvm-project 98509c7f9792c79b05a41b95c24607f6dd489c5a) @ -O0
   16.02 times faster than clang version 19.0.0git (https://github.com/llvm/llvm-project 98509c7f9792c79b05a41b95c24607f6dd489c5a) @ -O2
  798.75 times faster than clang version 19.0.0git (https://github.com/llvm/llvm-project 90ba33099cbb17e7c159e9ebc5a512037db99d6d) @ -O2

rt2800lib.i.txt

This appears related to the compile time ffs() macros this driver has implemented, based on the massive amount of __builtin_choose_expr() calls there are, as FIELD32 expands to compile_ffs32, which ultimately expands all of the other compile_ffs macros down to compile_ffs2... Perhaps the kernel should not be doing this (and that's reflected in the original baseline of 42s) but I think a 60x compile time regression should not happen either. I have not seen this level of regression in any of my other builds for other architectures, so perhaps this is related to something specific to Hexagon?

cc @nikic @androm3da @SundeepKushwaha @SidManning

Bisect log
# bad: [795090739cf3b295be750dfba0af2ba993e60cdd] [RISCV] Fix a bug accidentally introduced in e9311f9
# good: [4c3de45ecf9eea6b4ad850a042706f7865a2aab2] [LoongArch][test] Add tests reporting error if lsx/lasx feature is missing when lsx/lasx builtins are called (#79250)
git bisect start '795090739cf3b295be750dfba0af2ba993e60cdd' '4c3de45ecf9eea6b4ad850a042706f7865a2aab2'
# bad: [8d43dad9b86ad0f72100b6f75450f2982f2663b9] [clang][bazel] Fix BUILD after 4a582845597e97d245e8ffdc14281f922b835e56.
git bisect bad 8d43dad9b86ad0f72100b6f75450f2982f2663b9
# good: [9dddb3d5f3bf323b7b7f8281bb848731f69fddfa] [AST] Mark the fallthrough coreturn statement implicit. (#77465)
git bisect good 9dddb3d5f3bf323b7b7f8281bb848731f69fddfa
# good: [98509c7f9792c79b05a41b95c24607f6dd489c5a] [AArch64] Add vec3 tests with different load/store alignments.
git bisect good 98509c7f9792c79b05a41b95c24607f6dd489c5a
# bad: [aaa93ce7323332d8290b8f563d4d71689c1094c5] compiler-rt: Fix FLOAT16 feature detection
git bisect bad aaa93ce7323332d8290b8f563d4d71689c1094c5
# bad: [03a9f07e189db792b001c4001981d6e2da880221] [libc++][NFC] Fix leftover && in comment
git bisect bad 03a9f07e189db792b001c4001981d6e2da880221
# bad: [3d91d9613e294b242d853039209b40a0cb7853f2] [ConstraintElim] Make sure min/max intrinsic results are not poison.
git bisect bad 3d91d9613e294b242d853039209b40a0cb7853f2
# bad: [90ba33099cbb17e7c159e9ebc5a512037db99d6d] [InstCombine] Canonicalize constant GEPs to i8 source element type (#68882)
git bisect bad 90ba33099cbb17e7c159e9ebc5a512037db99d6d
# first bad commit: [90ba33099cbb17e7c159e9ebc5a512037db99d6d] [InstCombine] Canonicalize constant GEPs to i8 source element type (#68882)
@SundeepKushwaha
Copy link
Contributor

@nathanchance, can you please share the reduced test case? We can investigate if there is anything in Hexagon backend side which might be causing this regression.

@nathanchance
Copy link
Member Author

@nathanchance, can you please share the reduced test case? We can investigate if there is anything in Hexagon backend side which might be causing this regression.

@SundeepKushwaha Sure thing, it was:

rt2800lib.i.txt

from the original message but that definitely could have been more noticeable on my part. It has some pretty lengthy macro expansions so I didn't want to copy and paste it into the thread.

@nikic
Copy link
Contributor

nikic commented Feb 1, 2024

Optimized IR for the test case: https://gist.github.com/nikic/4061005b5d69328811c3593f55e59c49

Running llc under perf shows that all the time is spent in RDFGraph, which I believe is a Hexagon specific thing.

@SundeepKushwaha
Copy link
Contributor

Thanks @nikic and @nathanchance.

@iajbar, can you please investigate?

@iajbar
Copy link
Contributor

iajbar commented Feb 6, 2024

We are working on upstreaming the fix very soon. Thanks @nikic and @nathanchance.

@quic-asaravan
Copy link
Contributor

#81071 - the fix is merged @nikic @nathanchance

@nathanchance
Copy link
Member Author

Thanks, my build times agree.

@llvmbot
Copy link
Collaborator

llvmbot commented Feb 13, 2024

@llvm/issue-subscribers-backend-hexagon

Author: Nathan Chancellor (nathanchance)

After https://github.com/llvm/llvm-project/commit/90ba33099cbb17e7c159e9ebc5a512037db99d6d, I see a very noticeable build time regression when building [`drivers/net/wireless/ralink/rt2x00/rt2800lib.c`](https://elixir.bootlin.com/linux/v6.8-rc2/source/drivers/net/wireless/ralink/rt2x00/rt2800lib.c) from the Linux kernel for `ARCH=hexagon` at `-O2`. I am still teasing out a concise reproducer but my reduction is taking quite long so I am just posting what I have so far to raise visibility on this. I'll post a potentially more concise reproducer once my reduction finishes.

With a base command of clang -fno-short-enums --target=hexagon-linux-musl -O2 -c -o /dev/null rt2800lib.i:

Benchmark 1: clang version 19.0.0git (https://github.com/llvm/llvm-project 98509c7f9792c79b05a41b95c24607f6dd489c5a) @ -O0
  Time (abs ≡):         3.024 s               [User: 2.924 s, System: 0.100 s]

Benchmark 2: clang version 19.0.0git (https://github.com/llvm/llvm-project 98509c7f9792c79b05a41b95c24607f6dd489c5a) @ -O2
  Time (abs ≡):        42.393 s               [User: 42.293 s, System: 0.100 s]

Benchmark 3: clang version 19.0.0git (https://github.com/llvm/llvm-project 90ba33099cbb17e7c159e9ebc5a512037db99d6d) @ -O0
  Time (abs ≡):         2.646 s               [User: 2.586 s, System: 0.060 s]

Benchmark 4: clang version 19.0.0git (https://github.com/llvm/llvm-project 90ba33099cbb17e7c159e9ebc5a512037db99d6d) @ -O2
  Time (abs ≡):        2113.743 s               [User: 2113.603 s, System: 0.110 s]

Summary
  clang version 19.0.0git (https://github.com/llvm/llvm-project 90ba33099cbb17e7c159e9ebc5a512037db99d6d) @ -O0 ran
    1.14 times faster than clang version 19.0.0git (https://github.com/llvm/llvm-project 98509c7f9792c79b05a41b95c24607f6dd489c5a) @ -O0
   16.02 times faster than clang version 19.0.0git (https://github.com/llvm/llvm-project 98509c7f9792c79b05a41b95c24607f6dd489c5a) @ -O2
  798.75 times faster than clang version 19.0.0git (https://github.com/llvm/llvm-project 90ba33099cbb17e7c159e9ebc5a512037db99d6d) @ -O2

rt2800lib.i.txt

This appears related to the compile time ffs() macros this driver has implemented, based on the massive amount of __builtin_choose_expr() calls there are, as FIELD32 expands to compile_ffs32, which ultimately expands all of the other compile_ffs macros down to compile_ffs2... Perhaps the kernel should not be doing this (and that's reflected in the original baseline of 42s) but I think a 60x compile time regression should not happen either. I have not seen this level of regression in any of my other builds for other architectures, so perhaps this is related to something specific to Hexagon?

cc @nikic @androm3da @SundeepKushwaha @SidManning

<details>
<summary>Bisect log</summary>

# bad: [795090739cf3b295be750dfba0af2ba993e60cdd] [RISCV] Fix a bug accidentally introduced in e9311f9
# good: [4c3de45ecf9eea6b4ad850a042706f7865a2aab2] [LoongArch][test] Add tests reporting error if lsx/lasx feature is missing when lsx/lasx builtins are called (#<!-- -->79250)
git bisect start '795090739cf3b295be750dfba0af2ba993e60cdd' '4c3de45ecf9eea6b4ad850a042706f7865a2aab2'
# bad: [8d43dad9b86ad0f72100b6f75450f2982f2663b9] [clang][bazel] Fix BUILD after 4a582845597e97d245e8ffdc14281f922b835e56.
git bisect bad 8d43dad9b86ad0f72100b6f75450f2982f2663b9
# good: [9dddb3d5f3bf323b7b7f8281bb848731f69fddfa] [AST] Mark the fallthrough coreturn statement implicit. (#<!-- -->77465)
git bisect good 9dddb3d5f3bf323b7b7f8281bb848731f69fddfa
# good: [98509c7f9792c79b05a41b95c24607f6dd489c5a] [AArch64] Add vec3 tests with different load/store alignments.
git bisect good 98509c7f9792c79b05a41b95c24607f6dd489c5a
# bad: [aaa93ce7323332d8290b8f563d4d71689c1094c5] compiler-rt: Fix FLOAT16 feature detection
git bisect bad aaa93ce7323332d8290b8f563d4d71689c1094c5
# bad: [03a9f07e189db792b001c4001981d6e2da880221] [libc++][NFC] Fix leftover &amp;&amp; in comment
git bisect bad 03a9f07e189db792b001c4001981d6e2da880221
# bad: [3d91d9613e294b242d853039209b40a0cb7853f2] [ConstraintElim] Make sure min/max intrinsic results are not poison.
git bisect bad 3d91d9613e294b242d853039209b40a0cb7853f2
# bad: [90ba33099cbb17e7c159e9ebc5a512037db99d6d] [InstCombine] Canonicalize constant GEPs to i8 source element type (#<!-- -->68882)
git bisect bad 90ba33099cbb17e7c159e9ebc5a512037db99d6d
# first bad commit: [90ba33099cbb17e7c159e9ebc5a512037db99d6d] [InstCombine] Canonicalize constant GEPs to i8 source element type (#<!-- -->68882)

</details>

@nathanchance
Copy link
Member Author

Unfortunately, after commit 1a78f8cb5daa ("fortify: Allow KUnit test to build without FORTIFY") in the Linux kernel, which allows another test file to be built for Hexagon where it previously was not and was merged around the same time as the above fix, I am noticing a very similar pathological compile time impact of 90ba330, which is not cured by ac05771. If I should open a separate issue for this and reclose this one, I am more than happy to.

Benchmark 1: Clang @ 98509c7f9792 ("[AArch64] Add vec3 tests with different load/store alignments.")
  Time (abs ≡):         5.923 s               [User: 5.793 s, System: 0.130 s]

Benchmark 2: Clang @ 90ba33099cbb ("[InstCombine] Canonicalize constant GEPs to i8 source element type (#68882)")
  Time (abs ≡):        742.763 s               [User: 742.616 s, System: 0.130 s]

Benchmark 3: Clang @ ac0577177f05 ("[HEXAGON] Add basic block limit for RDF optimizations (#81071)")
  Time (abs ≡):        748.402 s               [User: 748.265 s, System: 0.120 s]

Benchmark 4: Clang @ 2ceec68e1630 ("[ValueTypes] Rename FlagVT to Glue in ValueTypes.td. NFC")
  Time (abs ≡):        747.341 s               [User: 747.182 s, System: 0.140 s]

Summary
  Clang @ 98509c7f9792 ("[AArch64] Add vec3 tests with different load/store alignments.") ran
  125.41 times faster than Clang @ 90ba33099cbb ("[InstCombine] Canonicalize constant GEPs to i8 source element type (#68882)")
  126.18 times faster than Clang @ 2ceec68e1630 ("[ValueTypes] Rename FlagVT to Glue in ValueTypes.td. NFC")
  126.36 times faster than Clang @ ac0577177f05 ("[HEXAGON] Add basic block limit for RDF optimizations (#81071)")

While this is a test file that has a lot of macro expansions for concise test case writing, I think over ten minutes might be a little excessive. Unfortunately, in my quest to come up with a concise reproducer, I have ended up creating something larger than the original test case :) so I have just gone ahead and attached the original preprocessed file from the kernel tree. It can be built with

clang --target=hexagon-linux -O2 -ffixed-r19 -c -o /dev/null fortify_unit.i

to demonstrate the behavior.

fortify_kunit.i.txt

@nathanchance nathanchance reopened this Jun 24, 2024
@iajbar
Copy link
Contributor

iajbar commented Jun 26, 2024

Thank you Nathan. We are working on debugging it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

7 participants