v0.6.0

Released by @metascroy on April 24, 2025 (commit e67ef3b)

We're excited to announce the release of ExecuTorch 0.6! This release builds upon the foundation established in the 0.5 release (January 30, 2025) with significant improvements across various components of the framework. ExecuTorch continues to be a powerful solution for on-device AI, powering experiences across a wide range of platforms.

Highlights

  • Improved Usability and Stability: Significant usability improvements and stability fixes that resolve many of the issues users hit when installing and using ExecuTorch. In particular, XNNPACK is enabled for all pip wheel builds, and CoreML is enabled for all macOS pip wheel builds.
  • Added Windows Support (Experimental): Native Windows support for development and deployment
  • Ready-made packages for iOS and Android: Mobile developers can use ExecuTorch in Xcode and Android Studio without having to build from source
  • Native Objective-C and Swift APIs for Apple platforms: Mobile developers can simply import ExecuTorch in their Swift code rather than depending on the C++ APIs

API Changes

  • ET_LOG_MSG_AND_RETURN_IF_FALSE and ET_LOG_AND_RETURN_IF_FALSE are now deprecated; use ET_CHECK_OR_RETURN_FALSE with a descriptive message instead. (#8451)
  • ExecuTorch now has access to c10::irange from PyTorch. (#8572)
  • Non-internal APIs in elementwise_util.h are now deprecated. (#9621)
  • Added BroadcastIndexesRange utility to simplify looping over Tensor elements with broadcasting. (#8864)
  • executorch::runtime::etensor::optional is now an alias for std::optional. (#9068)
  • On Android, the classes org.pytorch.executorch.{LlamaModule, LlamaCallback} have moved to org.pytorch.executorch.extension.llm.{LlmModule, LlmCallback} (#9478)

Build

  • pip wheel
    • Reduced wheel installation to a single step: install_executorch.sh
      • install_executorch now automatically manages requirement installation, wheel building, and wheel installation (#7708)
      • XNNPACK is enabled by default for all pip wheel builds (#9773); see the smoke-test sketch at the end of this section
      • CoreML is enabled by default for all pip wheel builds on macOS (#9483)
    • Wheel building supports more configurations
      • Support editable mode via ./install_executorch.sh --editable (#8722)
      • pip wheel builds can be configured using the CMAKE_ARGS environment variable (#9583)
  • CMake
    • Removed the need to pre-install flatc by building it from source for the host machine (#9077)
    • Removed forced “no-exception” and “no-rtti” builds (#7746)
    • Removed the need to set CMAKE_PREFIX_PATH (#8474)
    • Increased safety and correctness by building with address and undefined behavior sanitizers in CI (#9397)
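
With XNNPACK now in every wheel, one way to sanity-check a fresh pip install is to lower a tiny model end to end. A minimal sketch (the model and output file name are illustrative, not from the release notes):

```python
import torch
from executorch.exir import to_edge_transform_and_lower
from executorch.backends.xnnpack.partition.xnnpack_partitioner import XnnpackPartitioner

# A throwaway model, just to exercise the export-and-lower path.
model = torch.nn.Sequential(torch.nn.Linear(8, 8), torch.nn.ReLU()).eval()
example_inputs = (torch.randn(1, 8),)

exported = torch.export.export(model, example_inputs)
et_program = to_edge_transform_and_lower(
    exported, partitioner=[XnnpackPartitioner()]
).to_executorch()

with open("smoke_test.pte", "wb") as f:
    f.write(et_program.buffer)
```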

Backend Delegates

Arm

  • Added support for more models
    • TOSA: mv3, LSTM, Conformer, LeNet, vit, w2l, ic3, ic4, resnet18 and resnet50
    • Ethos-U55: mv3, LSTM, Conformer and LeNet
    • Ethos-U85: mv3, LSTM, Conformer, LeNet, w2l and ic4
  • Updated op support:
    • Added scalar_tensor, ceil, floor, clamp, rsub, full_like, hardswish, hardsigmoid
    • Added logical and, or, xor, not
    • Added amax, max, amin, min, abs
    • Added support for comparison operators
    • Added support for arange
    • Added support for transpose and matmul on Ethos-U55
    • Added all ops not supported on Ethos-U55 to the support check
  • Enabled dim_order
  • Added quantize/dequantize folding support
  • Updated documentation
  • Devtools updates
    • Added support for visualizing graphs
    • Added devtools ETDump and BundleIO support to the examples
    • Added a report explaining why TosaPartitioner skipped partitioning nodes
  • Version dependency updates:
    • Vela updated to use versions on GitLab
    • Moved to TOSA v0.80.1
    • Moved ethos-u-core-driver to 25.02

CoreML

  • CoreML is supported out of the box in the pip package on macOS (#9483)
  • Added support for the new to_edge_transform_and_lower API (#8505); see the sketch after this list
  • Added dynamic shape support for CPU/GPU compute units (#9094)
  • Documentation improvements
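
A minimal sketch of the new flow, lowering a toy model to CoreML in one call. The partitioner import path is an assumption based on the backend's layout, so check it against your install:

```python
import torch
from executorch.exir import to_edge_transform_and_lower
# Import path is an assumption; the class name comes from the CoreML backend.
from executorch.backends.apple.coreml.partition import CoreMLPartitioner

class TinyModel(torch.nn.Module):
    def forward(self, x):
        return torch.nn.functional.relu(x) + 1.0

model = TinyModel().eval()
example_inputs = (torch.randn(1, 8),)

# For the new dynamic shape support on CPU/GPU compute units, the export
# step can pin a symbolic dimension instead of a static one, e.g.:
#   batch = torch.export.Dim("batch", min=1, max=8)
#   exported = torch.export.export(
#       model, example_inputs, dynamic_shapes={"x": {0: batch}})
exported = torch.export.export(model, example_inputs)

# Transform and lower in one call, delegating supported subgraphs to CoreML.
et_program = to_edge_transform_and_lower(
    exported, partitioner=[CoreMLPartitioner()]
).to_executorch()
```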

Qualcomm

  • LLM
    • Added support for hybrid mode with batch prefill and weight sharing for LLMs on HTP
    • Improved prefill latency by 2.3x with the new RoPE implementation
    • Unified the llama2/llama3 paths and improved accuracy
    • Added AR-N support, which both improves prefill latency and prepares for multi-turn conversations
  • Enabled 21 new ops
    • adaptive_avg_pool2d, amax, bitwise_and, arange.start_step, argmin, cumsum, elu, eq, exp, full_like, full, ge, gt, instance_norm, logical_not, lt, bitwise_or, scalar_tensor, stack, unbind, where
  • Enabled 3 new models
  • Added support for 6 new SoCs
    • SM8750 (Snapdragon 8 Elite)
    • SSG2115P
    • SSG2125P
    • SXR1230P
    • SXR2230P
    • SXR2330P
  • Updated documentation
  • Added CI for multiple models (llama, mobilebert and W2L)
  • Added support for dynamic shapes in a limited set of ops
  • Added support for block-wise quantization
  • Added a context dump utility

MediaTek

  • Added support for dim_order

MPS

Vulkan

  • Added the ability to use push constants in Vulkan compute shaders (#7317). Passing tensor metadata via push constants instead of uniform buffer objects yields minor performance improvements in several operators (e.g. view, binary ops, permute, convolution).
  • Various performance improvements in convolution compute shaders (e.g. #7973, #7499, #7503, #7504, #7505, #7506)
  • Added support for leaky ReLU activation operator (#7975)
  • Updated several ops (slice, permute, cat, split) to support all memory layouts when using texture storage
  • Developer experience
    • Log generated GLSL file path when SPIR-V compilation fails (#9064)
    • Cache compiled SPIR-V shaders and only recompile modified shader templates (#9701)

XNNPACK

  • Reduced duplicate tensor weights in the .pte file, saving significant memory when different methods from the same model use the same weights (#9153)
  • Added ENABLE_XNNPACK_WEIGHTS_CACHE, which reduces runtime memory for weights shared across methods (#9155)
  • Resolved issues around fp16 inference; linear layers now serialize weights in fp16 when doing fp16 computation (#9753)
  • Added preferred quantization options for XNNPACK models delegated with aot_compiler.py (#9634); see the sketch after this list
  • The rsqrt operator is now supported (#7992)
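
As context for the quantization options above, a minimal PT2E quantize-then-delegate sketch along the lines of what aot_compiler.py drives. The quantizer import path is an assumption (it has lived both here and under torch.ao.quantization.quantizer in recent releases):

```python
import torch
from executorch.exir import to_edge_transform_and_lower
from executorch.backends.xnnpack.partition.xnnpack_partitioner import XnnpackPartitioner
# Quantizer import path is an assumption; verify against your install.
from executorch.backends.xnnpack.quantizer.xnnpack_quantizer import (
    XNNPACKQuantizer,
    get_symmetric_quantization_config,
)
from torch.ao.quantization.quantize_pt2e import convert_pt2e, prepare_pt2e

model = torch.nn.Sequential(torch.nn.Linear(16, 16), torch.nn.ReLU()).eval()
example_inputs = (torch.randn(1, 16),)

# PT2E: annotate, calibrate, convert.
quantizer = XNNPACKQuantizer().set_global(
    get_symmetric_quantization_config(is_per_channel=True)
)
training_gm = torch.export.export_for_training(model, example_inputs).module()
prepared = prepare_pt2e(training_gm, quantizer)
prepared(*example_inputs)  # calibration pass
quantized = convert_pt2e(prepared)

# Re-export the quantized graph and delegate it to XNNPACK.
exported = torch.export.export(quantized, example_inputs)
et_program = to_edge_transform_and_lower(
    exported, partitioner=[XnnpackPartitioner()]
).to_executorch()
```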

Devtools

  • executor_runner now works with models whose inputs are reused by memory planning. (#9340)
  • executor_runner now prints total execution time. (#9342)

Llama Model Support

  • Low-bit kernel improvements
    • With low-bit kernels, users can quantize linear and embedding layers in LLMs to 1-8 bits. This release adds dynamic shape support for low-bit kernels; previously they required exporting LLMs with static shapes (#9555). See the low-bit sketch after this list
    • Added new shared embedding kernels (1-8 bit) for sharing weights between quantized embedding and unembedding ops. In models like Llama1B/3B, where the embedding and unembedding weights are shared, this can reduce the quantized model size significantly (#9548)
  • Added an attention interface and a simple registry (#8039), with a single location to update optional args for all attention variants (#8128). Added static attention (#8310) with an IO manager (#8486)
  • export_llama now accepts a Hugging Face model card name (#9538)
  • Allow weights to be quantized in their original checkpoint dtype
  • Enabled Phi4, Qwen 2.5, and SmolLm2 with high out-of-the-box performance
  • Tokenizer
    • HuggingFace tokenizer support included (#11)
    • tiktoken tokenizer memory reduced from 19.41 MiB to 3.23 MiB on Android (#37)
    • tiktoken tokenizer lookup speed improved (#37)
      • Integer to string: 4997 us per 100 iterations, compared with 6029 us using std::unordered_map
      • String to integer: 300 us per 100 iterations, compared with 557 us using std::unordered_map
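
A hedged sketch of the low-bit quantization flow via torchao, which backs these kernels. The config names and import paths below are assumptions that have moved between torchao releases, so treat this as the shape of the API rather than a definitive recipe:

```python
import torch
# Paths and config names are assumptions; verify against your torchao version.
from torchao.quantization.granularity import PerAxis, PerGroup
from torchao.quantization.quant_api import (
    Int8DynamicActivationIntxWeightConfig,
    IntxWeightOnlyConfig,
    quantize_,
)

# Stand-in for an LLM with linear and (un)embedding layers.
model = torch.nn.Sequential(
    torch.nn.Embedding(1000, 64),
    torch.nn.Linear(64, 64),
).eval()

# Linear layers: 8-bit dynamic activations with 4-bit grouped weights.
quantize_(
    model,
    Int8DynamicActivationIntxWeightConfig(
        weight_dtype=torch.int4, granularity=PerGroup(32)
    ),
    lambda m, fqn: isinstance(m, torch.nn.Linear),
)

# Embedding layers: 4-bit weight-only quantization.
quantize_(
    model,
    IntxWeightOnlyConfig(weight_dtype=torch.int4, granularity=PerAxis(0)),
    lambda m, fqn: isinstance(m, torch.nn.Embedding),
)
```

With the dynamic shape support added in #9555, graphs containing these quantized layers can now be exported with variable sequence lengths rather than static shapes.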

Ops and kernels

  • The vast majority of portable ops now support BFloat16 and Half (#7748)
  • The _fft_r2c core ATen op is now supported as an optimized op (#8277)
  • There is now an optimized op for where (#8866)
  • Many portable op implementations (including but not necessarily limited to those using elementwise_util and apply_unary_map_fun) will now be parallelized if optimized kernels are enabled (#8932)
  • There is now a portable kernel for ELU (#9520)
  • There is now a portable kernel for unfold_copy (#8952)

First Time Contributors

Thanks to the following contributors for making their first commit for this release!
@Jayantparashar10, @tonykao8080, @daniil-lyakhov, @wesleyer, @cptspacemanspiff, @izaitsevfb, @billmguo, @michaelpaskett-meta, @emmanuel-ferdman, @pncosta22, @cmt0, @orionr, @zhxchen17, @arkylin, @r4ghu, @luyich, @tirwu01, @daseyb, @JakeStevens, @sskarz, @sabarishsnk, @annakukliansky, @SamGondelman, @madhu-fb, @wahmed991, @redmercury, @yu-frank, @bsoyluoglu, @ChristianWLang, @Juanfi8, @Megan0704-1, @Lucaskabela, @codereba, @yifanjiang2

Full Changelog: v0.5.0...v0.6.0