[MC] Compiler performance regression in Clang 19 with -mbranches-within-32B-boundaries #107754

Open
vient opened this issue Sep 8, 2024 · 9 comments
Labels
LTO Link time optimization (regular/full LTO or ThinLTO)

Comments

@vient
Member

vient commented Sep 8, 2024

I'm building the same code with Clang 18 and Clang 19 and noticed that the build times of some targets are disproportionately affected by switching to the new compiler: in general Clang 19 is 5-10% slower, but an LTO build of one particular target slowed down by 2.5x.

I tried --time-trace but don't know what to make of it, other than that OptModule got some long tails in Clang 19. The first worker under the main thread builds the same module in both images, so the two can be compared directly: OptModule time increased from 1m20s to 5m24s, roughly 4x.
image

913.621213 Total OptModule
856.716409 Total OptFunction
856.192565 Total RunPass
556.340514 Total PassManager<Function>
512.635569 Total ModuleInlinerWrapperPass
510.885891 Total ModuleToPostOrderCGSCCPassAdaptor
509.09462 Total DevirtSCCRepeatedPass
507.621024 Total PassManager<LazyCallGraph::SCC, CGSCCAnalysisManager, LazyCallGraph &, CGSCCUpdateResult &>
434.932548 Total CGSCCToFunctionPassAdaptor
142.495075 Total ExecuteLinker
142.421367 Total Link
141.506523 Total LTO
132.923099 Total InstCombinePass
124.003487 Total ModuleToFunctionPassAdaptor

image

3237.53794 Total OptModule
845.04484 Total OptFunction
844.38391 Total RunPass
552.922664 Total PassManager<Function>
497.867448 Total ModuleInlinerWrapperPass
495.840083 Total ModuleToPostOrderCGSCCPassAdaptor
493.816647 Total DevirtSCCRepeatedPass
492.195245 Total PassManager<LazyCallGraph::SCC, CGSCCAnalysisManager, LazyCallGraph &, CGSCCUpdateResult &>
417.747014 Total CGSCCToFunctionPassAdaptor
385.505297 Total ExecuteLinker
385.437975 Total Link
384.301031 Total LTO
141.092082 Total InstCombinePass
137.907089 Total ModuleToFunctionPassAdaptor
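
To see which buckets actually moved, the two listings can be compared entry by entry. The following is a minimal sketch, assuming the first listing is the Clang 18 build and the second is the Clang 19 build (the issue does not label them explicitly); only a few entries are copied, the unit is whatever the listings print, and `slowdown_ratios` is a name made up for this example.

```python
# Totals copied from the two listings above.
# Assumption: the first listing is Clang 18, the second is Clang 19.
clang18 = {
    "OptModule": 913.621213,
    "OptFunction": 856.716409,
    "ExecuteLinker": 142.495075,
    "LTO": 141.506523,
    "InstCombinePass": 132.923099,
}
clang19 = {
    "OptModule": 3237.53794,
    "OptFunction": 845.04484,
    "ExecuteLinker": 385.505297,
    "LTO": 384.301031,
    "InstCombinePass": 141.092082,
}


def slowdown_ratios(before, after):
    """Return {name: after / before} for entries present in both listings."""
    return {name: after[name] / before[name] for name in before if name in after}


ratios = slowdown_ratios(clang18, clang19)
# Print the entries that grew the most first.
for name, ratio in sorted(ratios.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:16s} {ratio:.2f}x")
```

Under that assumption, OptModule and the link/LTO totals grow by roughly 3.5x and 2.7x, while OptFunction and InstCombinePass stay close to 1x, which suggests the extra time is not spent in the optimization passes themselves.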

A perf trace and manually breaking in gdb show that a lot of time is spent around

llvm::MCAssembler::layout() ()
llvm::MCObjectStreamer::finishImpl() ()
llvm::MCELFStreamer::finishImpl() ()
llvm::AsmPrinter::doFinalization(llvm::Module&) ()
llvm::FPPassManager::doFinalization(llvm::Module&) ()
llvm::legacy::PassManagerImpl::run(llvm::Module&) ()

and also in llvm::MCExpr::evaluateAsRelocatableImpl. My current build is stripped, though; I'll come back with trace results with debug symbols later.

@github-actions github-actions bot added the clang [Clang issues not falling into any other category] and lld labels Sep 8, 2024
@EugeneZelenko EugeneZelenko added the LTO [Link time optimization (regular/full LTO or ThinLTO)] label and removed the clang [Clang issues not falling into any other category] and lld labels Sep 8, 2024
@vient
Member Author

vient commented Sep 8, 2024

@MaskRay you have recent commits in evaluateAsRelocatable. Do you have an idea which changes in LLVM 19 could cause such a regression?

@vient
Member Author

vient commented Sep 9, 2024

Top functions
image
The machine code (MC) part became a lot slower in LLVM 19; there are no MC functions near the top in LLVM 18.

I don't know why perf does not show inlined functions, so here are the hottest instructions of the first three functions:

llvm::ELFObjectWriter::isSymbolRefDifferenceFullyResolvedImpl(llvm::MCAssembler const&, llvm::MCSymbol const&, llvm::MCFragment const&, bool, bool) const at llvm/lib/MC/ELFObjectWriter.cpp:1447:29
 (inlined by) llvm::MCObjectWriter::isSymbolRefDifferenceFullyResolved(llvm::MCAssembler const&, llvm::MCSymbolRefExpr const*, llvm::MCSymbolRefExpr const*, bool) const at llvm/lib/MC/MCObjectWriter.cpp:45:10
 (inlined by) AttemptToFoldSymbolOffsetDifference(llvm::MCAssembler const*, llvm::DenseMap<llvm::MCSection const*, unsigned long, llvm::DenseMapInfo<llvm::MCSection const*, void>, llvm::detail::DenseMapPair<llvm::MCSection const*, unsigned long>> const*, bool, llvm::MCSymbolRefExpr const*&, llvm::MCSymbolRefExpr const*&, long&) at llvm/lib/MC/MCExpr.cpp:601:25
 (inlined by) evaluateSymbolicAdd(llvm::MCAssembler const*, llvm::DenseMap<llvm::MCSection const*, unsigned long, llvm::DenseMapInfo<llvm::MCSection const*, void>, llvm::detail::DenseMapPair<llvm::MCSection const*, unsigned long>> const*, bool, llvm::MCValue const&, llvm::MCValue const&, llvm::MCValue&) at llvm/lib/MC/MCExpr.cpp:768:5
 (inlined by) llvm::MCExpr::evaluateAsRelocatableImpl(llvm::MCValue&, llvm::MCAssembler const*, llvm::MCFixup const*, llvm::DenseMap<llvm::MCSection const*, unsigned long, llvm::DenseMapInfo<llvm::MCSection const*, void>, llvm::detail::DenseMapPair<llvm::MCSection const*, unsigned long>> const*, bool) const at llvm/lib/MC/MCExpr.cpp:950:16

llvm::MCExpr::evaluateAsRelocatableImpl(llvm::MCValue&, llvm::MCAssembler const*, llvm::MCFixup const*, llvm::DenseMap<llvm::MCSection const*, unsigned long, llvm::DenseMapInfo<llvm::MCSection const*, void>, llvm::detail::DenseMapPair<llvm::MCSection const*, unsigned long>> const*, bool) const at llvm/lib/MC/MCExpr.cpp:819:3

evaluateSymbolicAdd(llvm::MCAssembler const*, llvm::DenseMap<llvm::MCSection const*, unsigned long, llvm::DenseMapInfo<llvm::MCSection const*, void>, llvm::detail::DenseMapPair<llvm::MCSection const*, unsigned long>> const*, bool, llvm::MCValue const&, llvm::MCValue const&, llvm::MCValue&) at llvm/lib/MC/MCExpr.cpp:755:7
 (inlined by) llvm::MCExpr::evaluateAsRelocatableImpl(llvm::MCValue&, llvm::MCAssembler const*, llvm::MCFixup const*, llvm::DenseMap<llvm::MCSection const*, unsigned long, llvm::DenseMapInfo<llvm::MCSection const*, void>, llvm::detail::DenseMapPair<llvm::MCSection const*, unsigned long>> const*, bool) const at llvm/lib/MC/MCExpr.cpp:950:16



llvm::MCAssembler::relaxFragment(llvm::MCFragment&) at llvm/lib/MC/MCAssembler.cpp:1285:3
 (inlined by) llvm::MCAssembler::layoutOnce() at llvm/lib/MC/MCAssembler.cpp:1315:11
 (inlined by) llvm::MCAssembler::layout() at llvm/lib/MC/MCAssembler.cpp:941:10

llvm::MCAssembler::relaxBoundaryAlign(llvm::MCBoundaryAlignFragment&) at llvm/lib/MC/MCAssembler.cpp:1189:8
 (inlined by) llvm::MCAssembler::relaxFragment(llvm::MCFragment&) at llvm/lib/MC/MCAssembler.cpp:1299:12
 (inlined by) llvm::MCAssembler::layoutOnce() at llvm/lib/MC/MCAssembler.cpp:1315:11
 (inlined by) llvm::MCAssembler::layout() at llvm/lib/MC/MCAssembler.cpp:941:10

 llvm::SmallVectorBase<unsigned long>::size() const at llvm/include/llvm/ADT/SmallVector.h:92:32
 (inlined by) llvm::MCAssembler::computeFragmentSize(llvm::MCFragment const&) const at llvm/lib/MC/MCAssembler.cpp:0:0
 (inlined by) llvm::MCAssembler::relaxBoundaryAlign(llvm::MCBoundaryAlignFragment&) at llvm/lib/MC/MCAssembler.cpp:1195:20
 (inlined by) llvm::MCAssembler::relaxFragment(llvm::MCFragment&) at llvm/lib/MC/MCAssembler.cpp:1299:12
 (inlined by) llvm::MCAssembler::layoutOnce() at llvm/lib/MC/MCAssembler.cpp:1315:11
 (inlined by) llvm::MCAssembler::layout() at llvm/lib/MC/MCAssembler.cpp:941:10



llvm::MCAssembler::computeFragmentSize(llvm::MCFragment const&) const at llvm/lib/MC/MCAssembler.cpp:251:3
 (inlined by) llvm::MCAssembler::ensureValid(llvm::MCSection&) const at llvm/lib/MC/MCAssembler.cpp:447:15

llvm::MCAssembler::isBundlingEnabled() const at llvm/include/llvm/MC/MCAssembler.h:208:59
 (inlined by) llvm::MCAssembler::ensureValid(llvm::MCSection&) const at llvm/lib/MC/MCAssembler.cpp:443:9

llvm::MCBoundaryAlignFragment::getSize() const at llvm/include/llvm/MC/MCFragment.h:580:37
 (inlined by) llvm::MCAssembler::computeFragmentSize(llvm::MCFragment const&) const at llvm/lib/MC/MCAssembler.cpp:281:45
 (inlined by) llvm::MCAssembler::ensureValid(llvm::MCSection&) const at llvm/lib/MC/MCAssembler.cpp:447:15

@vient
Member Author

vient commented Sep 11, 2024

Don't know how I missed this post https://maskray.me/blog/2024-06-30-integrated-assembler-improvements-in-llvm-19
@aengelke do you know if this slowdown is expected? As I understand from the post, the code paths mentioned above are supposed to become faster in LLVM 19?

@aengelke
Contributor

Which architecture? Is this NaCl? (NaCl regressions might be caused by #94950, where I removed MCCompactEncodedInstFragment.) Other than on NaCl, this looks like a regression. MaskRay was working on the layout code.

@vient
Member Author

vient commented Sep 13, 2024

x86_64, not NaCl. I think I'm onto something: the difference went away when I removed these options:

-Wall
-Wextra
-Werror
-pedantic
-Wold-style-cast
-fvisibility=hidden
-fvisibility-inlines-hidden
-Wconversion
-Wsign-conversion
-Wunreachable-code
-Wno-missing-braces
-Wframe-larger-than=2500000
-ffile-prefix-map=/home/rlozko/git/twix=.
-fveclib=libmvec
-fdiagnostics-absolute-paths
-Wno-error=deprecated-declarations
-mbranches-within-32B-boundaries
-Wno-gnu-zero-variadic-macro-arguments
-Wno-enum-constexpr-conversion
-Wno-deprecated-declarations
-fcolor-diagnostics

I'll post later exactly which options affect this. The process is slow; each run takes 20-40 minutes :)
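
With builds this slow, bisecting the option list keeps the number of runs small. Below is a minimal sketch of that idea, assuming exactly one option is responsible and that the slowdown reproduces deterministically; `find_culprit` and the `is_slow` callback (which would run a real build with only the given options and time it) are hypothetical names, not part of any tool.

```python
def find_culprit(flags, is_slow):
    """Bisect `flags` to find the single flag whose presence makes the build slow.

    Assumes exactly one flag is responsible and that `is_slow(subset)`,
    a caller-supplied callback that builds with only `subset`, is deterministic.
    """
    candidates = list(flags)
    while len(candidates) > 1:
        half = candidates[: len(candidates) // 2]
        # If the first half alone reproduces the slowdown, the culprit is in it;
        # otherwise it must be in the remaining flags.
        candidates = half if is_slow(half) else candidates[len(half):]
    return candidates[0]


# Stand-in for a real 20-40 minute build: pretend one known flag is slow.
flags = ["-Wall", "-fveclib=libmvec", "-mbranches-within-32B-boundaries", "-fcolor-diagnostics"]
builds = []


def fake_is_slow(subset):
    builds.append(subset)  # record each simulated build
    return "-mbranches-within-32B-boundaries" in subset


print(find_culprit(flags, fake_is_slow), "found in", len(builds), "builds")
```

For the 21 options listed above this needs at most five builds instead of removing options one at a time.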

@vient
Member Author

vient commented Sep 13, 2024

Got it: the slowdown goes away when -mbranches-within-32B-boundaries is removed. In my case that speeds up linking by more than 2x. I can't find any recent commits related to this flag, but it sounds directly related to code layout.

@vient vient changed the title from "[LTO] [lld:ELF] Compiler performance regression in Clang 19" to "[MC] Performance regression in Clang 19 with -mbranches-within-32B-boundaries" Sep 13, 2024
@vient vient changed the title from "[MC] Performance regression in Clang 19 with -mbranches-within-32B-boundaries" to "[MC] Compiler performance regression in Clang 19 with -mbranches-within-32B-boundaries" Sep 13, 2024
@aengelke
Contributor

Thanks for investigating! This makes some sense: with this option, every instruction gets a new, separate fragment so that relaxations can be applied later. The code path isn't optimized because the option is rarely used. I'm not sure what's causing the regression compared to LLVM 18, though.
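
As a rough illustration of why layout gets more expensive with this option, here is a toy model of the per-branch check, assuming the rule is that a branch must neither cross nor end on a 32-byte boundary. The function names are invented for this sketch and it is not LLVM's implementation; it only shows that the padding before each branch depends on the branch's current offset, so it has to be recomputed whenever an earlier fragment changes size.

```python
BOUNDARY = 32  # boundary size implied by -mbranches-within-32B-boundaries


def needs_padding(start, size, boundary=BOUNDARY):
    """True if an instruction occupying [start, start + size) crosses a
    boundary or ends exactly on one."""
    end = start + size
    crosses = (start // boundary) != ((end - 1) // boundary)
    ends_on = (end % boundary) == 0
    return crosses or ends_on


def padding_for(start, size, boundary=BOUNDARY):
    """Bytes of padding to place before the instruction (0 if none is needed)."""
    if not needs_padding(start, size, boundary):
        return 0
    return (-start) % boundary  # distance from `start` to the next boundary


# A 4-byte branch at offset 30 would straddle the boundary at 32.
print(padding_for(30, 4))
# A 6-byte branch at offset 26 would end exactly on the boundary.
print(padding_for(26, 6))
```

Because every padded branch shifts everything after it, a layout pass in this model has to revisit all later offsets, which fits the relaxBoundaryAlign and computeFragmentSize frames in the profile above.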

@MaskRay
Member

MaskRay commented Sep 15, 2024

> Don't know how I missed this post maskray.me/blog/2024-06-30-integrated-assembler-improvements-in-llvm-19

The way we relax MCFragments might be related. It's possible that uncommon configurations like -mbranches-within-32B-boundaries have regressed while the normal code paths got faster. Complex expression evaluation, primarily used by the Linux kernel, imposes relaxation schemes we could apply (#100283). I believe it's challenging to ensure that every use case is fast. The current approach, which optimizes the normal code path and penalizes the uncommon -mbranches-within-32B-boundaries case, is likely the favorable trade-off.

@vient
Member Author

vient commented Sep 15, 2024

We use this option because some of our hosts are Skylake-based and some workloads are affected by the JCC erratum (I don't know why the other workloads are not). As a workaround, I've put -mbranches-within-32B-boundaries under if(ARCH MATCHES "^(skylake|cascadelake)"). Strangely, it turned out that the same workloads that benefit from this option on Skylake (~5% improvement) are negatively affected by it on other platforms (~2% slowdown).
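
For build setups driven from a script rather than CMake, the same gating could be sketched as follows. This is only an illustration: `extra_codegen_flags` is a hypothetical helper, and the only parts taken from the comment above are the pattern and the flag.

```python
import re

# Pattern taken from the CMake condition quoted above.
ERRATUM_ARCH = re.compile(r"^(skylake|cascadelake)")


def extra_codegen_flags(arch):
    """Return the flags to add for a target arch name (hypothetical helper)."""
    if ERRATUM_ARCH.search(arch):
        return ["-mbranches-within-32B-boundaries"]
    return []


print(extra_codegen_flags("skylake"))
print(extra_codegen_flags("znver3"))
```

Because the pattern is anchored only at the start, names such as skylake-avx512 also match.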

Overall, I can't say that this issue affects us in a serious way. If I understand correctly that you consider this a WONTFIX, the issue can be closed.
