Full Changelog: v0.1.0...v0.2.0
Foundational Improvements
Large generative AI model support
- Support generative AI models like Meta Llama 3 8B and Llama 2 7B on Android and iOS phones
- 4-bit group-wise weight quantization
- XNNPACK delegate and kernels for best performance on CPU (work in progress on other backends)
- KV cache support through PyTorch mutable buffers (see the sketch after this list)
- Custom ops for SDPA, with KV cache and multi-query attention
- ExecuTorch Runtime + tokenizer and sampler
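A minimal sketch of the mutable-buffer KV cache mechanism, using a toy module rather than the actual Llama implementation; the `to_edge` location assumes the v0.2 Python package layout:

```python
import torch
from executorch.exir import to_edge

class ToyKVCache(torch.nn.Module):
    """Toy cache: registered buffers are mutated in place on each decode step."""

    def __init__(self, max_seq_len: int = 16, dim: int = 8):
        super().__init__()
        self.register_buffer("k_cache", torch.zeros(max_seq_len, dim))
        self.register_buffer("v_cache", torch.zeros(max_seq_len, dim))

    def forward(self, pos: torch.Tensor, k: torch.Tensor, v: torch.Tensor):
        # In-place buffer updates are captured by torch.export and carried
        # through to the runtime by ExecuTorch's mutable-buffer support.
        self.k_cache.index_put_((pos,), k)
        self.v_cache.index_put_((pos,), v)
        # Clone so the graph outputs do not alias the mutated buffers.
        return self.k_cache.clone(), self.v_cache.clone()

ep = torch.export.export(
    ToyKVCache(), (torch.tensor([0]), torch.randn(1, 8), torch.randn(1, 8))
)
with open("toy_kv_cache.pte", "wb") as f:
    f.write(to_edge(ep).to_executorch().buffer)
```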
Core ExecuTorch improvements
- Simplified setup experience
- Support for PyTorch mutable buffers
- Support for multi-gigabyte models
- Constant data moved to its own .pte segment for more efficient serialization
- Better kernel coverage in portable lib, XNNPACK, Arm, Core ML, MPS, and HTP delegates
- SDK: better profiling and debugging within delegates
- API improvements/simplification
- Dozens of fixes to fuzzer-identified .pte file-parsing issues
- Vulkan delegate for mobile GPU
- Data-type based selective build for optimizing binary size
- Compatibility with torchtune
- More models supported across different backends
- Python code now available as the "executorch" pip package on PyPI (see the sketch after this list)
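The pip package plus the XNNPACK delegate give a short path from `nn.Module` to a `.pte` file; a minimal sketch, assuming the v0.2-era module path for `XnnpackPartitioner`:

```python
# pip install executorch
import torch
from executorch.exir import to_edge
from executorch.backends.xnnpack.partition.xnnpack_partitioner import (
    XnnpackPartitioner,
)

class MLP(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.fc = torch.nn.Linear(16, 4)

    def forward(self, x):
        return torch.relu(self.fc(x))

ep = torch.export.export(MLP().eval(), (torch.randn(1, 16),))
edge = to_edge(ep)
# Hand the supported subgraphs to the XNNPACK delegate for CPU performance.
edge = edge.to_backend(XnnpackPartitioner())
with open("mlp_xnnpack.pte", "wb") as f:
    f.write(edge.to_executorch().buffer)
```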
Hardware Acceleration Improvements
Arm
- Significant boost in operator test coverage through the use of the TOSA reference model, as well as improved CI coverage
- Added support for quantization with the ArmQuantizer (see the sketch after this list)
- Added support for MobileNet v2 TOSA generation
- Working towards MobileNet v2 execution on Ethos-U
- Added support for multiple new operators in the Ethos-U compiler
- Added NCHW/NHWC conversion for Ethos-U targets until NHWC is supported by ExecuTorch
- Arm backend example now works on macOS
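A sketch of the PT2E quantization flow with the ArmQuantizer; the import path and the `get_symmetric_quantization_config` helper are assumptions based on `backends/arm`, and the capture step has moved between releases (older ones used `capture_pre_autograd_graph`):

```python
import torch
from torch.ao.quantization.quantize_pt2e import convert_pt2e, prepare_pt2e
# Import path is an assumption; check backends/arm in your checkout.
from executorch.backends.arm.quantizer.arm_quantizer import (
    ArmQuantizer,
    get_symmetric_quantization_config,
)

model = torch.nn.Sequential(torch.nn.Conv2d(3, 8, 3), torch.nn.ReLU()).eval()
example_inputs = (torch.randn(1, 3, 32, 32),)

# The standard PT2E flow: capture, annotate, calibrate, convert.
graph = torch.export.export_for_training(model, example_inputs).module()
quantizer = ArmQuantizer().set_global(get_symmetric_quantization_config())
prepared = prepare_pt2e(graph, quantizer)
prepared(*example_inputs)  # calibrate with representative data
quantized = convert_pt2e(prepared)
```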
Apple Core ML
- [SDK] ExecuTorch SDK integration for a better debugging and profiling experience, built on the new MLComputePlan API released in iOS 17.4 and macOS 14.4.
- [SDK] A model lowered to the Core ML backend can be profiled using the ExecuTorch Inspector without additional setup.
- [SDK] Profiling surfaces Core ML-specific information for each operation in the model, including the supported compute devices, the preferred compute device, and the estimated cost on each compute device.
- [SDK] The Core ML delegate backend also supports logging intermediate tensors for model debugging.
- [Partitioner] Enables a developer to lower a model even if Core ML doesn’t support all the operations in the model.
- [Partitioner] Developers can now specify the operations that the Core ML backend should skip when lowering the model (see the sketch after this list).
- [Quantizer] Leverages PyTorch 2.0 export-based quantization APIs.
- [Quantizer] Encodes specific quantization rules in order to optimize the model for execution on Apple silicon.
- [Quantizer] Integrated with the ExecuTorch Core ML delegate conversion pipeline.
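A sketch of skipping selected ops during Core ML lowering; the partitioner import path and the `skip_ops_for_coreml_delegation` argument name are assumptions based on `backends/apple/coreml`:

```python
import torch
from executorch.exir import to_edge
# Import path and argument name are assumptions; verify against your checkout.
from executorch.backends.apple.coreml.partition.coreml_partitioner import (
    CoreMLPartitioner,
)

class Net(torch.nn.Module):
    def forward(self, x):
        return torch.mul(torch.add(x, x), x)

edge = to_edge(torch.export.export(Net(), (torch.randn(4),)))
# Ops named here fall back to the portable CPU path instead of Core ML.
edge = edge.to_backend(
    CoreMLPartitioner(skip_ops_for_coreml_delegation=["aten.mul.Tensor"])
)
program = edge.to_executorch()
```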
Apple MPS
- Support for over 100 ops (parity with the ops supported by the PyTorch MPS backend)
- Support for iOS/iPadOS 14.4+ and macOS 12.4+
- Support for the MPSPartitioner
- Support for the following dtypes: fp16, fp32, bfloat16, int8, int16, int32, int64, uint8, bool
- Support for profiling (ETRecord, ETDump) through the Inspector API (see the sketch after this list)
- Full unit testing coverage for AOT and runtime for all supported operators
- Enabled storiesllama (floating point) on MPS
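A sketch of the Inspector flow referenced above, assuming the v0.2 `executorch.sdk` module path; the file names are placeholders for an ETRecord produced at export time and an ETDump produced by an instrumented runtime run:

```python
# Module path reflects the v0.2-era SDK; later releases relocated it.
from executorch.sdk import Inspector

# Placeholder paths: the ETDump comes from an instrumented runtime run,
# the ETRecord from the export pipeline.
inspector = Inspector(etdump_path="etdump.etdp", etrecord="etrecord.bin")
inspector.print_data_tabular()  # per-event timings, including delegate events
```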
Qualcomm
- Added support for Snapdragon 8 Gen 3
- Enabled on-device compilation (a.k.a. QNN online-prepare); see the sketch after this list
- Enabled 4-bit and 16-bit quantization
- Integrated Qualcomm AI Studio QNN profiling into the ExecuTorch flow
- Enabled storiesllama in fp16 on HTP (thanks to Chen Lai from Meta, the main contributor to this effort)
- Added support for more operators
- Additional models validated since v0.1.0:
  - FbNet
  - W2l (Wav2LetterModel)
  - SSD300_VGG16
  - ViT
  - Quantized MobileBert (contribution submitted before the v0.1.0 cutoff, but merged afterwards)
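A heavily hedged sketch of lowering to HTP with online-prepare enabled; every import path, helper name, and flag below is an assumption based on `backends/qualcomm` and may differ between releases:

```python
import torch
from executorch.exir import to_edge
# All names below are assumptions; verify against backends/qualcomm.
from executorch.backends.qualcomm.partition.qnn_partitioner import QnnPartitioner
from executorch.backends.qualcomm.serialization.qnn_compile_spec_schema import (
    QcomChipset,
)
from executorch.backends.qualcomm.utils.utils import (
    generate_htp_compiler_spec,
    generate_qnn_executorch_compiler_spec,
)

edge = to_edge(torch.export.export(
    torch.nn.Sequential(torch.nn.Linear(8, 8), torch.nn.ReLU()).eval(),
    (torch.randn(1, 8),),
))
compiler_specs = generate_qnn_executorch_compiler_spec(
    soc_model=QcomChipset.SM8650,  # Snapdragon 8 Gen 3
    backend_options=generate_htp_compiler_spec(use_fp16=True),
    online_prepare=True,  # compile on device (QNN online-prepare)
)
edge = edge.to_backend(QnnPartitioner(compiler_specs))
```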
Cadence HiFi
- Expanded operator support for Cadence HiFi targets
- Added first small model (RNNT-emformer predictor) to the Cadence HiFi examples
Model Support
Validated with one or more delegates
|  |  |  |
| --- | --- | --- |
| Meta Llama 2 7B | LearningToPaint | resnet50 |
| Meta Llama 3 8B | lennard_jones | shufflenet_v2_x1_0 |
| Conformer | LSTM | squeezenet1_1 |
| dcgan | maml_omniglot | SqueezeSAM |
| Deeplab_v3 | mnasnet1_0 | timm_efficientnet |
| Edsr | Mobilebert | Torchvision_vit |
| Emformer_rnnt | Mobilenet_v2 | Wav2letter |
| functorch_dp_cifar10 | Mobilenet_v3 | Yolo v5 |
| Inception_v3 | phlippe_resnet |  |
| Inception_v4 | resnet18 |  |
Tested with torch.export but not optimized for performance

|  |  |  |
| --- | --- | --- |
| Aquila 1 7B | GPT-2 | PLaMo 13B |
| Aquila 2 7B | GPT-J 6B | Qwen 1.5 7B |
| Baichuan 1 7B | InternLM2 7B | Refact |
| BioGPT | Koala | RWKV 5 world 1B5 |
| BLOOM 7B1 | MiniCPM 2B sft | Stable LM 2 1.6B |
| Chinese Alpaca 2 7B | Mistral 7B | Stable LM 3B |
| Chinese LLaMA 2 7B | Mixtral 8x7B MoE | Starcoder |
| CodeShell | Persimmon 8B chat | Starcoder 2 |
| Deepseek | Phi 1 | Vigogne (French) |
| GPT Neo 1.3B | Phi 1.5 | Yi 6B |
| GPT NeoX 20B | Phi 2 |  |