Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

LTO support and C/CXX flags override hard to achieve #97

Open
profi200 opened this issue Feb 7, 2021 · 17 comments
Open

LTO support and C/CXX flags override hard to achieve #97

profi200 opened this issue Feb 7, 2021 · 17 comments
Assignees
Labels
Milestone

Comments

@profi200
Copy link

profi200 commented Feb 7, 2021

First of all thanks for making this so painless. It worked basically right out of the box even with a different arm-none-eabi toolchain.

LTO support
Linking is failing with these errors:

/tmp/ccbJpNID.s: Assembler messages:
/tmp/ccbJpNID.s:25: Error: selected processor does not support `sev' in ARM mode
/tmp/ccbJpNID.s:53: Error: selected processor does not support `blx r3' in ARM mode
/tmp/ccbJpNID.s:117: Error: selected processor does not support `blx r3' in ARM mode
/tmp/ccbJpNID.s:406: Error: selected processor does not support `bkpt #0' in ARM mode
/tmp/ccbJpNID.s:427: Error: selected processor does not support `bkpt #0' in ARM mode
/tmp/ccbJpNID.s:449: Error: selected processor does not support requested special purpose register -- `mrs lr,PRIMASK'
/tmp/ccbJpNID.s:452: Error: selected processor does not support `cpsid i' in ARM mode
/tmp/ccbJpNID.s:463: Error: selected processor does not support `dmb' in ARM mode
/tmp/ccbJpNID.s:501: Error: selected processor does not support `dmb' in ARM mode
/tmp/ccbJpNID.s:510: Error: selected processor does not support requested special purpose register -- `msr PRIMASK,lr'
/tmp/ccbJpNID.s:583: Error: selected processor does not support `blx r3' in ARM mode
/tmp/ccbJpNID.s:618: Error: selected processor does not support `blx r3' in ARM mode
/tmp/ccbJpNID.s:714: Error: selected processor does not support `blx r2' in ARM mode
/tmp/ccbJpNID.s:729: Error: selected processor does not support `blx r3' in ARM mode
lto-wrapper: fatal error: /opt/devkitpro/devkitARM/bin/arm-none-eabi-g++ returned 1 exit status
compilation terminated.
/opt/devkitpro/devkitARM/lib/gcc/arm-none-eabi/10.1.0/../../../../arm-none-eabi/bin/ld: error: lto-wrapper failed
collect2: error: ld returned 1 exit status
make[2]: *** [CMakeFiles/blink.dir/build.make:691: blink.elf] Fehler 1
make[2]: Verzeichnis „/home/blub/pico/blink/build“ wird verlassen
make[1]: *** [CMakeFiles/Makefile2:1494: CMakeFiles/blink.dir/all] Fehler 2
make[1]: Verzeichnis „/home/blub/pico/blink/build“ wird verlassen
make: *** [Makefile:103: all] Fehler 2

These errors are quite strange since it's usually the other way around with something not being supported in thumb mode. Have not digged too deep yet to find out what's causing these.

C/CXX flags
I tried to override the C and linker flags to have different optimization than default by doing

cmake .. -DCMAKE_C_FLAGS="-O2 -flto -fno-data-sections" -DCMAKE_EXE_LINKER_FLAGS="-O2 -flto"

The result is this order of flags to the compiler:
...-I/home/blub/pico/pico-sdk/src/rp2_common/pico_stdio_uart/include -O2 -flto -fno-data-sections -O3 -DNDEBUG -ffunction-sections -fdata-sections -o CMakeFiles/blink.dir/home/blub/pico/pico-sdk/src/rp2_common/pico_stdio_uart/stdio_uart.c.obj -c /home/blub/pico/pico-sdk/src/rp2_common/pico_stdio_uart/stdio_uart.c
Which means it's basically overriding my flags instead of the other way around. Somewhere it's including the override flags before the usual SDK given flags.

Would be nice if at least the C/CXX flags issue can be fixed. I understand if LTO is not happening for a while since it tends to uncover undefined behavior or bugs not seen without but in my experience it does help shrink down code size more.

Full build log: https://gist.github.com/profi200/718e20b9d3ebec87db60f78d0deb994b

@kilograham
Copy link
Contributor

if you use target_compile_definitions or target_compile_optionson your target/library you should get the right thing.

Note that -O2 is the default (for "Release" or "RelWithDebugInfo" CMAKE_BUILD_TYPE), -Og for "Debug", -Os for "MinSizeRel"

First of all thanks for making this so painless. It worked basically right out of the box even with a different arm-none-eabi toolchain.

Thanks, we have tried hard to make things "easy" for the beginner, but give the power user full control. This is quite hard to do (especially when we have both native and device compilation going on). So there are certainly things that are still a little hidden/hard coded

(Note if you search elsewhere you can also configure per source flags with set_source_file_properties.

-flto is on our list to try (but haven't done so yet for anything other than the bootrom). I suspect it will drop stuff on the floor at the moment. Plus we have been thinking with -gc-sections in mind, but that shouldn't make much odds.

Note things you can try to tweak your build config further (if you should so wish) other than just editing what is there.

  1. Sledgehammer: Make a new build type like cmake/preload/toolchains/pico_arm_gcc.cmake and set PICO_COMPILER
  2. clone pico_standard_link library, and either give it a new name and explicitly direct on your version instead (rather than via pico_stdlib), or perhaps more easily, define set(SKIP_PICO_STANDARD_LINK 1) at the top of your CMakeLists.txt then add your own pico_standard_link version after the project init.

Some of this stuff might be a little off the happy path for now, but we are keen to improve.

@kilograham kilograham changed the title LTO support and C/CXX flags override broken LTO support and C/CXX flags override hard to achieve Feb 8, 2021
@profi200
Copy link
Author

profi200 commented Feb 8, 2021

Are you sure? It does use -O3 here for release builds. Not sure if this is due to some cmake system default (i definitely didn't set that). Maybe that should be explicitly set in the SDK. And btw i have always been disabling -fdata-sections because it can have a negative impact. I stumbled upon these issues when i noticed even the blink example is 12 KiB in size which is a bit heavy for something that toggles a GPIO + delays.

I have barely any experience with cmake and rather stayed away from it when possible. I guess i need to dive deeper into what i can and can't set in CMakeLists.txt.

Yes, i know the pain. LTO can do insane optimizations that break non-standard conforming or code with undefined behavior in weird ways. Like moving global variable writes completely elsewhere when other functions depend on it.

edit:
Oh, and a little side note: Maybe the "-mcpu=cortex-m0plus" should be replaced by "-mtune=cortex-m0plus". gcc is starting to abandon -mcpu on a few platforms.

@kilograham kilograham added this to the 1.3.0 milestone May 31, 2021
@kripton
Copy link

kripton commented Jun 26, 2021

C/CXX flags
I tried to override the C and linker flags to have different optimization than default by doing

cmake .. -DCMAKE_C_FLAGS="-O2 -flto -fno-data-sections" -DCMAKE_EXE_LINKER_FLAGS="-O2 -flto"

The result is this order of flags to the compiler:
...-I/home/blub/pico/pico-sdk/src/rp2_common/pico_stdio_uart/include -O2 -flto -fno-data-sections -O3 -DNDEBUG -ffunction-sections -fdata-sections -o CMakeFiles/blink.dir/home/blub/pico/pico-sdk/src/rp2_common/pico_stdio_uart/stdio_uart.c.obj -c /home/blub/pico/pico-sdk/src/rp2_common/pico_stdio_uart/stdio_uart.c
Which means it's basically overriding my flags instead of the other way around. Somewhere it's including the override flags before the usual SDK given flags.

So if I add add_compile_options(-flto) in my project's CMakeLists.txt-file (between project and add_executable statements), it seems to be in effect. All compiler invocations work fine. However, during linking, I get this:

/tmp/cc1KABxt.s: Assembler messages:
/tmp/cc1KABxt.s:11061: Error: invalid offset, value too big (0x000041F4)

For very small projects (aka hello world) it seems to work fine. But for anything useful, it simply fails with the message above or similar.

gcc is:

kripton@momo ~/git/rp2040-compressiontest/build2 $ arm-none-eabi-gcc -v
Using built-in specs.
COLLECT_GCC=arm-none-eabi-gcc
COLLECT_LTO_WRAPPER=/usr/libexec/gcc/arm-none-eabi/11.1.0/lto-wrapper
Target: arm-none-eabi
Configured with: /var/tmp/portage/cross-arm-none-eabi/gcc-11.1.0-r1/work/gcc-11.1.0/configure --host=x86_64-pc-linux-gnu --target=arm-none-eabi --build=x86_64-pc-linux-gnu --prefix=/usr --bindir=/usr/x86_64-pc-linux-gnu/arm-none-eabi/gcc-bin/11.1.0 --includedir=/usr/lib/gcc/arm-none-eabi/11.1.0/include --datadir=/usr/share/gcc-data/arm-none-eabi/11.1.0 --mandir=/usr/share/gcc-data/arm-none-eabi/11.1.0/man --infodir=/usr/share/gcc-data/arm-none-eabi/11.1.0/info --with-gxx-include-dir=/usr/lib/gcc/arm-none-eabi/11.1.0/include/g++-v11 --with-python-dir=/share/gcc-data/arm-none-eabi/11.1.0/python --enable-languages=c,c++ --enable-obsolete --enable-secureplt --disable-werror --with-system-zlib --enable-nls --without-included-gettext --enable-checking=release --with-bugurl=https://bugs.gentoo.org/ --with-pkgversion='Gentoo 11.1.0-r1 p2' --disable-esp --enable-libstdcxx-time --with-build-config=bootstrap-lto --enable-poison-system-directories --disable-libstdcxx-time --with-sysroot=/usr/arm-none-eabi --disable-bootstrap --with-newlib --enable-multilib --disable-fixed-point --disable-libgomp --disable-libssp --disable-libada --disable-systemtap --enable-valgrind-annotations --disable-vtable-verify --disable-libvtv --with-zstd --enable-lto --with-isl --disable-isl-version-check --disable-libsanitizer --disable-default-pie --enable-default-ssp --with-cpu=cortex-m0plus
Thread model: single
Supported LTO compression algorithms: zlib zstd
gcc version 11.1.0 (Gentoo 11.1.0-r1 p2) 

binutils is v2.36.1

@kilograham
Copy link
Contributor

yeah, i think -flto is very free and easy with reordering of code... there are probably some assembly functions that expect to be proximate to each other. Sadly in some cases being the best way for LTO or for GC may not be the same thing, so we may have to special case.

@kilograham kilograham self-assigned this Jun 26, 2021
@kilograham
Copy link
Contributor

Note assigning myself, as we do plan to look at LTO in 1.3.0 release

@lurch
Copy link
Contributor

lurch commented Jun 28, 2021

Oh, and a little side note: Maybe the "-mcpu=cortex-m0plus" should be replaced by "-mtune=cortex-m0plus". gcc is starting to abandon -mcpu on a few platforms.

See #253 (comment) (and #260 )

@asumagic
Copy link

asumagic commented Aug 1, 2021

If it helps, disabling LTO for pico-sdk/src/rp2_common/hardware_irq/irq.c solves the "invalid offset, value too big" error in my case.

The linker however then spits out a ton of undefined references related to __wrap_ functions, which pointed me to these few bug reports, which may be relevant:

I tried using the gold linker, but it seems to dislike the linker scripts, and spits out some errors about a COPY section type: pico-sdk/src/rp2_common/pico_standard_link/memmap_default.ld:123:23: COPY section type is unsupported. According to the source, only NOLOAD is supported. I tried replacing the (COPY)s with (NOLOAD)s, which I assume is a bad idea, and... ended up triggering an internal error elsewhere anyway.

I tried using lld, but -fuse-ld=lld causes an error because on my system it looks for /usr/bin/arm-none-eabi-ld.lld. Symlinking /usr/bin/lld (which I didn't really expect to work anyway) fails with:

[build] arm-none-eabi-ld.lld: error: no memory region specified for section '.data'
[build] arm-none-eabi-ld.lld: error: no memory region specified for section '.bss'

... so I'm assuming figuring out what to do with the --wrap issue and fixing it would be the next step to get a proof of concept compilation with LTO.

... Good luck with this :)

@asumagic
Copy link

asumagic commented Aug 1, 2021

Looks like I got a bit further after all. In platform.h, I modified WRAPPER_FUNC to be defined as such:

#define WRAPPER_FUNC(x) __attribute__((used)) __wrap_ ## x

Now it successfully compiles and run.

Compared to having LTO activated for everything except the SDK (assuming I didn't mess up something), the generated .bin file is slightly larger, but I suspect this is may be due by a few things:

  • LTO inlining, which is not surprising
  • I have a function tagged with [[gnu::flatten]] somewhere (as it measurably improved performance), which may be inlining some SDK functions aggressively (SPI ones in particular).
  • The above __attribute__((used)) workaround may be causing some of the wrappers to not get eliminated even though they may be unused, including its callees.

I have not tested for performance, however.

(EDIT: as an interesting note, setting PICO_NO_GC_SECTIONS while LTO is active results in a significantly larger executable. I'm assuming that LTO cannot eliminate functions as much as the linker itself can. Does --gc-functions affect functions tagged with __attribute__((used))?)

@kilograham
Copy link
Contributor

Thanks; ugh i wasn't aware of the --wrap/LTO issue

@asumagic
Copy link

While everything seems to run fine with my workarounds applied and I have not encountered any breakage, I have realized something (which in retrospect is rather explicit in the reports I've linked to): gcc+bfd with LTO could inline some of the original, non-wrapped functions.

I have probably not encountered issues because function wraps that may have mattered were not inlined and the call site still referred to the symbol that got wrapped later on.
That may rather likely be because most wrapped functions are compiled in the newlib static archive (or whatever it is), which is not built with LTO on my platform... but it could be on another, from what I can tell.

This might be worth investigating further, because I'm not quite sure if that's happening for me here.

Interestingly, using x86-64 desktop linux compilers and a simple test case:

$ cat main.c 
#include "orig_decl.h"

int main()
{
        foo();
}
$ cat orig_decl.c
#include <stdio.h>

void foo()
{
        puts("foo()");
}
$ cat wrap.c
#include "orig_decl.h"

#include <stdio.h>

void __wrap_foo()
{
        puts("__wrap_foo()");
}

... only clang+lld properly emitted a call to __wrap_foo! gcc+bfd, gcc+gold, gcc+lld (doesn't seem to be a supported combination for LTO at all), clang+bfd, clang+gold all emitted a call directly to foo...

That being said, with clang thin LTO (-flto=thin), clang+lld AND clang+gold work. I assume this is a difference in the implementation of the LTO plugins for each of the linkers. I feel like that would be in a platform-independent manner, but this may be worth looking if the exact same behavior occurs when targeting armv6.

Rather disappointingly, in the above tests, __wrap_foo() never gets inlined into main (probably because wrapping happens after LTO is performed). foo() gets eliminated only with --gc-sections.

Unfortunately, I'm quite not sure what the best approach would be. Since clang support is in the same milestone as this, it might be worth trying to support clang LTO with lld only first if it turns out to be simpler to support - or use an approach different from --wrap, which here seems very much non-trivial, if possible at all.

(FWIW, LTO with any linker but bfd would be a significantly better developer experience due to that usecase being really slow due to lack of parallelization from what I can tell)

@niansa
Copy link

niansa commented Jan 19, 2024

This is such a simple fix, why after almost 3 years hasn't it been implemented properly yet?
I guess I could give it a go...

Edit: Though the retain attribute may be a better fit: https://gcc.gnu.org/onlinedocs/gcc/Common-Function-Attributes.html#index-retain-function-attribute

@asumagic
Copy link

I am not up to date, but unless pico-sdk stopped using wrap functions, I do not think the problem I described is gone. In particular, gcc+bfd LTO is insidiously broken.

@niansa
Copy link

niansa commented Jan 19, 2024

Adding that attribute to all wrap functions as a proof of concept fixed it for me.

@asumagic
Copy link

Do make sure that this solution does not cause the issue that I mentioned above (i.e. that inlining may inline the original function rather than the wrapper). Depending on the wrapped function, there may not be a noticeable side-effect for your usecase (it has been fine for me), but it would likely not be upstreamable.

@kripton
Copy link

kripton commented Apr 15, 2024

Adding that attribute to all wrap functions as a proof of concept fixed it for me.

do you have a fork of the pico-sdk where we could try that?

@asumagic
Copy link

asumagic commented Apr 15, 2024

Adding that attribute to all wrap functions as a proof of concept fixed it for me.

do you have a fork of the pico-sdk where we could try that?

This change would be a one-liner afaict, but beware the caveats I mentioned in my replies above.

7f494d0

@asumagic
Copy link

asumagic commented Aug 31, 2024

There was some activity in https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88643 and it appears that the wrapping issue should be resolved with ld as of binutils 2.43. My distro doesn't package it yet for arm-none-eabi, though, so I'll see if I can try this easily myself shortly.

EDIT: no dice, still getting undefined references to __wrap_printf and such. Although the __attribute__((used)) still works (but might still be broken), as long as you set PICO_STDIO_SHORT_CIRCUIT_CLIB_FUNCS=0 (unless patching the pico-sdk anyway).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

6 participants