Size optimizations

This page contains recommendations for optimizing the size of embedded software including its memory and code footprints.

These recommendations are subject to change as the C++ standard and compilers evolve, and as the authors continue to gain more knowledge and experience in this area. If you disagree with any of these recommendations, please discuss them with the Pigweed team, as we're always looking to improve the guide or correct any inaccuracies.

Compile Time Constant Expressions

The use of constexpr, and soon C++20's consteval, lets you evaluate functions and variables at compile time rather than only at run time. This often results not only in smaller binaries but also in more efficient, faster execution.

We highly encourage using this aspect of C++; however, there is one caveat: be careful about marking functions constexpr in APIs which cannot easily be changed in the future, unless you can prove that for all time and all platforms the computation can actually be done at compile time. This is because there is no "mutable" escape hatch for constexpr.
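
As a minimal sketch (all names here are illustrative), a constexpr function lets a value be computed entirely at compile time, so neither the function's code nor the computation needs to exist at run time:

#include <cstddef>

// Evaluated at compile time when used in a constant expression; the loop
// below contributes no instructions to the binary.
constexpr size_t BitsNeeded(size_t value) {
  size_t bits = 0;
  while (value > 0) {
    value >>= 1;
    bits += 1;
  }
  return bits;
}

inline constexpr size_t kTableBits = BitsNeeded(1023);  // Evaluates to 10.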

See the :doc:`embedded_cpp_guide` for more detail.

Templates

The compiler implements templates by generating a separate version of the function for each set of types it is instantiated with. This can increase code size significantly.

Be careful when instantiating non-trivial template functions with multiple types.

Consider splitting templated interfaces into multiple layers so that more of the implementation can be shared between different instantiations. A more advanced form of this is to share common logic internally through a sentinel default template argument, such as pw::Vector's size_t kMaxSize = vector_impl::kGeneric or pw::span's size_t Extent = dynamic_extent.
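
One common pattern, sketched below with hypothetical names, is to keep the non-trivial logic in a size-generic layer that is compiled once, while the templated layer only supplies storage:

#include <cstddef>

// Size-generic layer: the logic here is instantiated exactly once, no matter
// how many capacities are used.
class ByteBuffer {
 public:
  size_t size() const { return size_; }
  size_t capacity() const { return capacity_; }

  bool push_back(std::byte b) {
    if (size_ == capacity_) {
      return false;
    }
    data_[size_++] = b;
    return true;
  }

 protected:
  ByteBuffer(std::byte* data, size_t capacity)
      : data_(data), size_(0), capacity_(capacity) {}

 private:
  std::byte* data_;
  size_t size_;
  size_t capacity_;
};

// Thin templated layer: each kCapacity instantiation adds little more than
// the storage itself.
template <size_t kCapacity>
class ByteBufferStorage : public ByteBuffer {
 public:
  ByteBufferStorage() : ByteBuffer(storage_, kCapacity) {}

 private:
  std::byte storage_[kCapacity];
};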

Virtual Functions

Virtual functions provide for runtime polymorphism. Unless runtime polymorphism is required, virtual functions should be avoided. Virtual functions require a virtual table and a pointer to it in each instance, which increases RAM usage, and they require extra instructions at each call site. Virtual functions can also inhibit compiler optimizations, since the compiler may not be able to tell which functions will actually be invoked. This can prevent linker garbage collection, resulting in unused functions being linked into a binary.

When runtime polymorphism is required, virtual functions should be considered. C alternatives, such as a struct of function pointers, could be used instead, but these approaches may offer no performance advantage while sacrificing flexibility and ease of use.

Only use virtual functions when runtime polymorphism is needed. Lastly, try to avoid templated virtual interfaces, which can compound the cost by instantiating many virtual tables.

Devirtualization

When you do use virtual functions, keep devirtualization in mind. You can make it easier for the compiler and linker by declaring class definitions final to improve the odds; this can help significantly depending on your toolchain.
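
As a minimal, hypothetical example, marking the concrete class final tells the compiler that no further overrides can exist, so calls made through a UartOutput reference or pointer can be devirtualized into direct (and potentially inlined) calls:

class Output {
 public:
  virtual ~Output() = default;
  virtual void Write(const char* s) = 0;
};

class UartOutput final : public Output {
 public:
  void Write(const char* s) override {
    // Hardware access elided for brevity.
    (void)s;
  }
};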

If you're interested in more details, this is an interesting deep dive.

Initialization, Constructors, Finalizers, and Destructors

Constructors

Where possible, consider making your constructors constexpr to reduce their cost. This also makes global instances eligible for placement in .data or, if all zeros, in .bss.
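
A minimal sketch with hypothetical names: because the constructor below is constexpr and zero-initializes its members, the global instance is constant-initialized at compile time and can live in .bss, with no startup code emitted for it:

#include <cstdint>

class Counter {
 public:
  constexpr Counter() : count_(0) {}
  void Increment() { ++count_; }
  uint32_t count() const { return count_; }

 private:
  uint32_t count_;
};

// Constant-initialized to all zeros, so it is eligible for .bss placement.
Counter global_counter;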

Static Destructors And Finalizers

For many embedded projects, cleaning up after the program is not a requirement, meaning the exit functions including any finalizers registered through atexit, at_quick_exit, and static destructors can all be removed to reduce the size.

The exact mechanics for disabling static destructors depends on your toolchain.

See the Ignored Finalizer and Destructor Registration section below for further details regarding disabling registration of functions to be run at exit via atexit and at_quick_exit.

Clang

With modern versions of Clang you can simply use -fno-c++-static-destructors and you are done.

GCC with newlib-nano

With GCC this is more complicated. For example with GCC for ARM Cortex M devices using newlib-nano you are forced to tackle the problem in two stages.

First, there are the destructors for the static global objects. These can be placed in the .fini_array and .fini input sections through the use of the -fno-use-cxa-atexit GCC flag, assuming newlib-nano was configured with HAVE_INITFINI_ARRAY_SUPPORT. The two input sections can then be explicitly discarded in the linker script through the use of the special /DISCARD/ output section:

/DISCARD/ : {
  /* The finalizers are never invoked when the target shuts down and ergo
   * can be discarded. These include C++ global static destructors and C
   * designated finalizers. */
  *(.fini_array);
  *(.fini);
}

Second, there are the destructors for the scoped static objects, frequently referred to as Meyer's Singletons. With the Itanium ABI these use __cxa_atexit to register destruction on the fly. However, if -fno-use-cxa-atexit is used with GCC and newlib-nano these will appear as __tcf_ prefixed symbols, for example __tcf_0.

There's an interesting proposal (P1247R0) to add a [[no_destroy]] attribute to C++ which would be tempting to use here; alas, this is not an option yet. As mentioned in the proposal, one way to remove the destructors from these scoped statics is to wrap them in a templated wrapper which uses placement new.

#include <new>          // Placement new.
#include <type_traits>  // std::aligned_storage_t.
#include <utility>      // std::forward.

template <class T>
class NoDestroy {
 public:
  template <class... Ts>
  NoDestroy(Ts&&... ts) {
    new (&static_) T(std::forward<Ts>(ts)...);
  }

  T& get() { return reinterpret_cast<T&>(static_); }

 private:
  std::aligned_storage_t<sizeof(T), alignof(T)> static_;
};

This can then be used as follows to instantiate scoped statics where the destructor will never be invoked and ergo will not be linked in.

Foo& GetFoo() {
  static NoDestroy<Foo> foo(foo_args);
  return foo.get();
}

Strings

Tokenization

Instead of directly using strings and printf, consider using :ref:`module-pw_tokenizer` to replace strings and printf-style formatted strings with binary tokens during compilation. This can reduce the code size, memory usage, I/O traffic, and even CPU utilization by replacing snprintf calls with simple tokenization code.

Be careful when using string arguments with tokenization as these still result in a string in your binary which is appended to your token at run time.
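
As a hedged illustration of the basic mechanism (assuming pw_tokenizer's tokenize.h API), the macro below replaces a string literal with a 32-bit token at compile time; the literal is recorded for the token database rather than kept as a runtime string:

#include <cstdint>

#include "pw_tokenizer/tokenize.h"

// Only the 32-bit token is used at run time; the human-readable text is
// recovered off-device using the token database.
constexpr uint32_t kBootMessageToken = PW_TOKENIZE_STRING("Device booted");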

String Formatting

The printf family of formatted output functions in <cstdio> is quite expensive from a code size point of view, and these functions often rely on malloc. Where tokenization cannot be used, consider using :ref:`module-pw_string`'s utilities instead.

Removing all printf functions often saves more than 5KiB of code size on ARM Cortex M devices using newlib-nano.
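
Where a small amount of formatting is still required and tokenization is not an option, something like pw_string's StringBuilder avoids pulling in the full printf machinery; a hedged sketch, assuming the pw_string StringBuilder API:

#include "pw_string/string_builder.h"

void BuildStatusMessage() {
  // Fixed-capacity, stack-allocated builder: no heap and no <cstdio> usage
  // for simple concatenation and number formatting.
  pw::StringBuffer<32> sb;
  sb << "Temp: " << 25 << " C";
  // sb.c_str() or sb.view() can then be handed to the transport layer.
}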

Logging & Asserting

Using tokenized backends for logging and asserting, such as :ref:`module-pw_log_tokenized` coupled with :ref:`module-pw_assert_log`, can drastically reduce these costs. However, even with this approach there remains a per-call-site cost which can add up due to arguments and the included metadata.

Try to avoid string arguments and reduce unnecessary extra arguments where possible. Also consider adjusting log levels to compile out debug or even info logs as code stabilizes and matures.
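
For example, a module can raise its compile-time log level so that debug statements compile out entirely; a minimal sketch, assuming pw_log's standard configuration macros:

// Defined before including the log facade; statements below this level are
// removed at compile time and contribute no code, strings, or tokens.
#define PW_LOG_MODULE_NAME "MY_MODULE"
#define PW_LOG_LEVEL PW_LOG_LEVEL_INFO

#include "pw_log/log.h"

void ProcessSample() {
  PW_LOG_DEBUG("Compiled out at INFO level");
  PW_LOG_INFO("Still present in the binary");
}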

Future Plans

Going forward, Pigweed is evaluating extra configuration options, such as dropping log arguments for certain log levels and modules, to give users finer-grained control in trading off diagnostic value against size cost.

Threading and Synchronization Cost

Lighter-Weight Signaling Primitives

Consider using pw::sync::ThreadNotification instead of semaphores as they can be implemented using more efficient RTOS specific signaling primitives. For example on FreeRTOS they can be backed by direct task notifications which are more than 10x smaller than semaphores while also being faster.
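
A hedged sketch of the pattern, assuming pw::sync::ThreadNotification's acquire/release interface:

#include "pw_sync/thread_notification.h"

pw::sync::ThreadNotification data_ready;

void ConsumerThread() {
  data_ready.acquire();  // Blocks until another context signals.
  // ... process the data ...
}

void ProducerThread() {
  data_ready.release();  // Wakes the blocked consumer.
}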

Threads and their stack sizes

Although synchronous APIs are incredibly portable and often easier to reason about, it is easy to forget the large stack cost that comes with this design paradigm. We highly recommend watermarking your stacks to reduce wasted memory.

Our snapshot integrations for RTOSes such as :ref:`module-pw_thread_freertos` and :ref:`module-pw_thread_embos` come with built-in support for reporting stack watermarks for threads if enabled in the kernel.

In addition, consider using asynchronous design patterns such as Active Objects which can use :ref:`module-pw_work_queue` or similar asynchronous dispatch work queues to effectively permit the sharing of stack allocations.

Buffer Sizing

We'd be remiss not to mention the sizing of the various buffers that may exist in your application. Consider watermarking them with :ref:`module-pw_metric`. You may also be able to adjust their servicing interval and priority, but do not forget to take ingress burst sizes and scheduling jitter into account.

Standard C and C++ libraries

Toolchains are typically distributed with their preferred standard C library and standard C++ library of choice for the target platform.

Although you do not always have a choice in what standard C library and what standard C++ library is used or even how it's compiled, stay vigilant for common sources of bloat.

Assert

The standard C library provides the assert function or macro, which may be used internally even if your application does not invoke it directly. Although this can be disabled through NDEBUG, there typically is not a portable way to replace the assert(condition) implementation without configuring and recompiling your standard C library.

However, you can consider replacing the implementation at link time with a cheaper one. For example newlib-nano, which comes with the GNU Arm Embedded Toolchain, often has an expensive __assert_func implementation which uses fiprintf to print to stderr before invoking abort(). This can be replaced with a simple PW_CRASH invocation, which can save several kilobytes if fiprintf isn't used elsewhere.

One option to remove this bloat is to use --wrap at link time to replace these implementations. As an example in GN you could replace it with the following BUILD.gn file:

import("//build_overrides/pigweed.gni")

import("$dir_pw_build/target_types.gni")

# Wraps the function called by newlib's implementation of assert from assert.h.
#
# When using this, we suggest injecting :wrapped_newlib_assert via
# pw_build_LINK_DEPS.
config("wrap_newlib_assert") {
  ldflags = [ "-Wl,--wrap=__assert_func" ]
}

# Implements the function called by newlib's implementation of assert from
# assert.h, which invokes __assert_func unless NDEBUG is defined.
pw_source_set("wrapped_newlib_assert") {
  sources = [ "wrapped_newlib_assert.cc" ]
  deps = [
    "$dir_pw_assert:check",
    "$dir_pw_preprocessor",
  ]
}

And a wrapped_newlib_assert.cc source file implementing the wrapped assert function:

#include "pw_assert/check.h"
#include "pw_preprocessor/compiler.h"

// __assert_func is the function that newlib's assert macro (from <cassert>) invokes.
extern "C" PW_NO_RETURN void __wrap___assert_func(const char*,
                                                  int,
                                                  const char*,
                                                  const char*) {
  PW_CRASH("libc assert() failure");
}

Ignored Finalizer and Destructor Registration

Even if no cleanup is done during shutdown for your target, shutdown functions such as atexit, at_quick_exit, and __cxa_atexit can sometimes not be linked out. This may be due to vendor code or to the use of scoped statics, also known as Meyer's Singletons.

The registration of these destructors and finalizers may include locks, malloc, and more depending on your standard C library and its configuration.

One option to remove this bloat is to use --wrap at link time to replace these implementations with ones which do nothing. As an example in GN you could replace it with the following BUILD.gn file:

import("//build_overrides/pigweed.gni")

import("$dir_pw_build/target_types.gni")

config("wrap_atexit") {
  ldflags = [
    "-Wl,--wrap=atexit",
    "-Wl,--wrap=at_quick_exit",
    "-Wl,--wrap=__cxa_atexit",
  ]
}

# Implements atexit, at_quick_exit, and __cxa_atexit from stdlib.h with noop
# versions for targets which do not cleanup during exit and quick_exit.
#
# This removes any dependencies which may exist in your existing libc.
# Although this removes the ability for things such as Meyer's Singletons,
# i.e. non-global statics, to register destruction functions, it does not
# permit them to be garbage collected by the linker.
pw_source_set("wrapped_noop_atexit") {
  sources = [ "wrapped_noop_atexit.cc" ]
}

And a wrapped_noop_atexit.cc source file implementing the noop functions:

// These two are defined by <cstdlib>.
extern "C" int __wrap_atexit(void (*)(void)) { return 0; }
extern "C" int __wrap_at_quick_exit(void (*)(void)) { return 0; }

// This function is part of the Itanium C++ ABI; there is no standard header
// which provides it.
extern "C" int __wrap___cxa_atexit(void (*)(void*), void*, void*) { return 0; }

Unexpected Bloat in Disabled STL Exceptions

The GCC manual recommends using -fno-exceptions along with -fno-unwind-tables to disable exceptions and any associated overhead. This should replace all throw statements with calls to abort().

However, what we've noticed with GCC and libstdc++ is that there is a risk the STL will still throw exceptions when the application is compiled with -fno-exceptions, and there is no way for you to catch them. In theory this is not unsafe, because the unhandled exception will invoke abort() via std::terminate(). This can occur because libraries such as libstdc++.a may not have been compiled with -fno-exceptions even though your application is linked against them.

See this for more information.

Unfortunately there can be significant overhead surrounding these throw call sites in the std::__throw_* helper functions. Implementations such as std::__throw_out_of_range_fmt(const char*, ...), with their snprintf and ergo malloc dependencies, can very quickly add up to many kilobytes of unnecessary overhead.

One option to remove this bloat while also making sure that the exceptions will actually result in an effective abort() is to use --wrap at link time to replace these implementations with ones which simply call PW_CRASH.

As an example, in GN you could do this with the following BUILD.gn file; note that the mangled names must be used:

import("//build_overrides/pigweed.gni")

import("$dir_pw_build/target_types.gni")

# Wraps the std::__throw_* functions called by GNU ISO C++ Library regardless
# of whether "-fno-exceptions" is specified.
#
# When using this, we suggest injecting :wrapped_libstdc++_functexcept via
# pw_build_LINK_DEPS.
config("wrap_libstdc++_functexcept") {
  ldflags = [
    "-Wl,--wrap=_ZSt21__throw_bad_exceptionv",
    "-Wl,--wrap=_ZSt17__throw_bad_allocv",
    "-Wl,--wrap=_ZSt16__throw_bad_castv",
    "-Wl,--wrap=_ZSt18__throw_bad_typeidv",
    "-Wl,--wrap=_ZSt19__throw_logic_errorPKc",
    "-Wl,--wrap=_ZSt20__throw_domain_errorPKc",
    "-Wl,--wrap=_ZSt24__throw_invalid_argumentPKc",
    "-Wl,--wrap=_ZSt20__throw_length_errorPKc",
    "-Wl,--wrap=_ZSt20__throw_out_of_rangePKc",
    "-Wl,--wrap=_ZSt24__throw_out_of_range_fmtPKcz",
    "-Wl,--wrap=_ZSt21__throw_runtime_errorPKc",
    "-Wl,--wrap=_ZSt19__throw_range_errorPKc",
    "-Wl,--wrap=_ZSt22__throw_overflow_errorPKc",
    "-Wl,--wrap=_ZSt23__throw_underflow_errorPKc",
    "-Wl,--wrap=_ZSt19__throw_ios_failurePKc",
    "-Wl,--wrap=_ZSt19__throw_ios_failurePKci",
    "-Wl,--wrap=_ZSt20__throw_system_errori",
    "-Wl,--wrap=_ZSt20__throw_future_errori",
    "-Wl,--wrap=_ZSt25__throw_bad_function_callv",
  ]
}

# Implements the std::__throw_* functions called by GNU ISO C++ Library
# regardless of whether "-fno-exceptions" is specified with PW_CRASH.
pw_source_set("wrapped_libstdc++_functexcept") {
  sources = [ "wrapped_libstdc++_functexcept.cc" ]
  deps = [
    "$dir_pw_assert:check",
    "$dir_pw_preprocessor",
  ]
}

And a wrapped_libstdc++_functexcept.cc source file implementing each wrapped and mangled std::__throw_* function:

#include "pw_assert/check.h"
#include "pw_preprocessor/compiler.h"

// These are all wrapped implementations of the throw functions provided by
// libstdc++'s bits/functexcept.h which are not needed when "-fno-exceptions"
// is used.

// std::__throw_bad_exception(void)
extern "C" PW_NO_RETURN void __wrap__ZSt21__throw_bad_exceptionv() {
  PW_CRASH("std::throw_bad_exception");
}

// std::__throw_bad_alloc(void)
extern "C" PW_NO_RETURN void __wrap__ZSt17__throw_bad_allocv() {
  PW_CRASH("std::throw_bad_alloc");
}

// std::__throw_bad_cast(void)
extern "C" PW_NO_RETURN void __wrap__ZSt16__throw_bad_castv() {
  PW_CRASH("std::throw_bad_cast");
}

// std::__throw_bad_typeid(void)
extern "C" PW_NO_RETURN void __wrap__ZSt18__throw_bad_typeidv() {
  PW_CRASH("std::throw_bad_typeid");
}

// std::__throw_logic_error(const char*)
extern "C" PW_NO_RETURN void __wrap__ZSt19__throw_logic_errorPKc(const char*) {
  PW_CRASH("std::throw_logic_error");
}

// std::__throw_domain_error(const char*)
extern "C" PW_NO_RETURN void __wrap__ZSt20__throw_domain_errorPKc(const char*) {
  PW_CRASH("std::throw_domain_error");
}

// std::__throw_invalid_argument(const char*)
extern "C" PW_NO_RETURN void __wrap__ZSt24__throw_invalid_argumentPKc(
    const char*) {
  PW_CRASH("std::throw_invalid_argument");
}

// std::__throw_length_error(const char*)
extern "C" PW_NO_RETURN void __wrap__ZSt20__throw_length_errorPKc(const char*) {
  PW_CRASH("std::throw_length_error");
}

// std::__throw_out_of_range(const char*)
extern "C" PW_NO_RETURN void __wrap__ZSt20__throw_out_of_rangePKc(const char*) {
  PW_CRASH("std::throw_out_of_range");
}

// std::__throw_out_of_range_fmt(const char*, ...)
extern "C" PW_NO_RETURN void __wrap__ZSt24__throw_out_of_range_fmtPKcz(
    const char*, ...) {
  PW_CRASH("std::throw_out_of_range");
}

// std::__throw_runtime_error(const char*)
extern "C" PW_NO_RETURN void __wrap__ZSt21__throw_runtime_errorPKc(
    const char*) {
  PW_CRASH("std::throw_runtime_error");
}

// std::__throw_range_error(const char*)
extern "C" PW_NO_RETURN void __wrap__ZSt19__throw_range_errorPKc(const char*) {
  PW_CRASH("std::throw_range_error");
}

// std::__throw_overflow_error(const char*)
extern "C" PW_NO_RETURN void __wrap__ZSt22__throw_overflow_errorPKc(
    const char*) {
  PW_CRASH("std::throw_overflow_error");
}

// std::__throw_underflow_error(const char*)
extern "C" PW_NO_RETURN void __wrap__ZSt23__throw_underflow_errorPKc(
    const char*) {
  PW_CRASH("std::throw_underflow_error");
}

// std::__throw_ios_failure(const char*)
extern "C" PW_NO_RETURN void __wrap__ZSt19__throw_ios_failurePKc(const char*) {
  PW_CRASH("std::throw_ios_failure");
}

// std::__throw_ios_failure(const char*, int)
extern "C" PW_NO_RETURN void __wrap__ZSt19__throw_ios_failurePKci(const char*,
                                                                  int) {
  PW_CRASH("std::throw_ios_failure");
}

// std::__throw_system_error(int)
extern "C" PW_NO_RETURN void __wrap__ZSt20__throw_system_errori(int) {
  PW_CRASH("std::throw_system_error");
}

// std::__throw_future_error(int)
extern "C" PW_NO_RETURN void __wrap__ZSt20__throw_future_errori(int) {
  PW_CRASH("std::throw_future_error");
}

// std::__throw_bad_function_call(void)
extern "C" PW_NO_RETURN void __wrap__ZSt25__throw_bad_function_callv() {
  PW_CRASH("std::throw_bad_function_call");
}

Compiler and Linker Optimizations

Compiler Optimization Options

Don't forget to configure your compiler to optimize for size if needed. With Clang this is -Oz and with GCC this can be done via -Os. The GN toolchains provided through :ref:`module-pw_toolchain` which are optimized for size are suffixed with *_size_optimized.

Garbage collect function and data sections

By default the compiler places all functions from a translation unit within the same linker "section" (e.g. .text). With Clang and GCC you can use -ffunction-sections and -fdata-sections to give each function and data object a unique "section" (e.g. .text.do_foo_function). This permits you to pass --gc-sections to the linker to discard any sections which are never referenced.

To see what sections were garbage collected you can pass --print-gc-sections to the linker so it prints out what was removed.

The GN toolchains provided through :ref:`module-pw_toolchain` are configured to do this by default.

Function Inlining

Don't forget to expose trivial functions, such as member accessors, as inline definitions in the header. The compiler and linker can decide whether the function should actually be inlined based on your optimization settings, but this at least gives them the option. Note that LTO can inline functions which are not defined in headers.

In line with the Google style guide, we recommend considering this for simple functions of 10 lines or fewer.
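
For example, a trivial accessor defined in the header gives every translation unit the option to inline it, even without LTO:

class Sensor {
 public:
  // Defined inline in the header; the compiler can inline calls or keep an
  // out-of-line copy, whichever is better under the chosen optimization level.
  int last_reading() const { return last_reading_; }

 private:
  int last_reading_ = 0;
};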

Link Time Optimization (LTO)

Summary: LTO can decrease your binary size, but at a cost: LTO makes debugging harder, interacts poorly with linker scripts, and makes crash reports less informative. We advise enabling LTO only when absolutely necessary.

Link time optimization (LTO) moves some optimizations from the individual compile steps to the final link step, to enable optimizing across translation unit boundaries.

LTO can both increase performance and reduce binary size for embedded projects. This appears to be a strict improvement, and one might think enabling LTO at all times is the best approach. However, this is not the case; in practice, LTO is a trade-off.

LTO benefits

  • Reduces binary size - When compiling with size-shrinking flags like -Oz, some function call overhead can be eliminated, and code paths might be eliminated by the optimizer after inlining. This can include critical abstraction removal like devirtualization.
  • Improves performance - When code is inlined, the optimizer can better reduce the number of instructions. When code is smaller, the instruction cache has better hit ratio leading to better performance. In some cases, entire function calls are eliminated.

LTO costs

  • LTO interacts poorly with linker scripts - Production embedded projects often have complicated linker scripts to control the physical layout of code and data on the device. For example, you may want to put performance-critical audio codec functions into the fast tightly coupled memory (TCM) region. However, LTO can interact with linker script requirements in strange ways, such as inappropriately inlining code that was manually placed in one region into functions in another, leading to hard-to-understand bugs.
  • Debugging LTO binaries is harder - LTO increases the differences between the machine code and the source code. This makes stepping through source code in a debugger confusing, since the instruction pointer can hop around in confusing ways.
  • Crash reports for LTO binaries can be misleading - Just as with debugging, LTO'd binaries can produce confusing stacks in crash reports.
  • LTO significantly increases build times - The compilation model is different when LTO is enabled: individual translation unit compilations (.cc --> .o) now produce LLVM or GCC IR instead of native machine code, and machine code is only generated at the link phase. This makes the final link step take significantly longer. Since any source change results in a re-link, developer velocity suffers from the slow link time.

How to enable LTO

On GCC and Clang LTO is enabled by passing -flto to both the compiler and the linker. On GCC -fdevirtualize-at-ltrans enables more aggressive devirtualization.

Our recommendation

  • Disable LTO unless absolutely necessary; e.g. due to lack of space.
  • When enabling LTO, carefully and thoroughly test the resulting binary.
  • Check that crash reports are still useful under LTO for your product.

Disabling Scoped Static Initialization Locks

C++11 requires that scoped static objects be initialized in a thread-safe manner, which means that scoped statics, i.e. Meyer's Singletons, must be thread-safe. Unfortunately this is rarely the case on embedded targets. For example, with GCC on an ARM Cortex M device, if you test this you will discover that the program instead crashes, as reentrant initialization is detected through the use of guard variables.

With GCC and Clang, -fno-threadsafe-statics can be used to remove the global lock, which often does not work on embedded targets anyway. Note that this leaves the guard variables in place, which ensure that reentrant initialization continues to crash.

Be careful with this option if you are relying on thread-safe initialization of statics and the global locks are functional on your target.

Triaging Unexpectedly Linked In Functions

Lastly, as a tip: if you cannot figure out why a function is being linked in, you can consider:

  • Using --wrap with the linker to remove the implementation, resulting in a link failure which typically calls out which calling function can no longer be linked.
  • With GCC, you can use -fcallgraph-info to visualize or otherwise inspect the callgraph to figure out who is calling what.
  • Sometimes symbolizing the address can reveal what a function is for. For example, if you are using newlib-nano along with -fno-use-cxa-atexit, scoped static destructors are prefixed __tcf_*. To figure out which object these destructor functions are associated with, you can use llvm-symbolizer or addr2line, which will often print out the related object's name.

Sorting input sections by alignment

Linker scripts often contain input section wildcard patterns to specify which input sections should be placed in each output section. For example, say a linker script contains a sections command like the following:

.text : { *(.init*) *(.text*) }

By default, the GCC and Clang linkers will place symbols matched by each wildcard pattern in the order they are seen at link-time. The linker will insert padding bytes as necessary to satisfy the alignment requirements of each symbol.

The GCC and Clang linkers allow you to first sort the symbols matched by each wildcard pattern by alignment with the SORT_BY_ALIGNMENT keyword, which can reduce the number of padding bytes required and save memory. This can be used to enable alignment sorting on a per-pattern basis like so:

.text : { *(SORT_BY_ALIGNMENT(.init*)) *(SORT_BY_ALIGNMENT(.text*)) }

This keyword can be applied globally to all wildcard matches in your linker script by passing the --sort-section=alignment option to the linker.

See the ld manual for more information.