Move backend docs to new locations #8413

Merged
@@ -1,5 +1,5 @@
<!---- Name is a WIP - this reflects better what it can do today ----->
-# Building and Running ExecuTorch with ARM Ethos-U Backend
+# ARM Ethos-U Backend

<!----This will show a grid card on the page----->
::::{grid} 2
@@ -1,4 +1,4 @@
-# Building and Running ExecuTorch on Xtensa HiFi4 DSP
+# Cadence Xtensa Backend


In this tutorial we will walk you through the process of getting set up to build ExecuTorch for an Xtensa HiFi4 DSP and run a simple model on it.
@@ -1,4 +1,4 @@
-# Building and Running ExecuTorch with Core ML Backend
+# Core ML Backend

The Core ML delegate uses Core ML APIs to run neural networks with Apple's hardware acceleration. You can read more about Core ML [here](https://developer.apple.com/documentation/coreml). In this tutorial, we will walk through the steps of lowering a PyTorch model to the Core ML delegate.

@@ -1,4 +1,4 @@
-# Building and Running ExecuTorch with MediaTek Backend
+# MediaTek Backend

The MediaTek backend enables ExecuTorch to speed up PyTorch models on edge devices equipped with a MediaTek Neuron Processing Unit (NPU). This document offers a step-by-step guide to setting up the build environment for the MediaTek ExecuTorch libraries.

157 changes: 157 additions & 0 deletions docs/source/backends-mps.md
@@ -0,0 +1,157 @@
# MPS Backend

In this tutorial we will walk you through the process of getting set up to build the MPS backend for ExecuTorch and run a simple model on it.

The MPS backend maps machine learning computational graphs and primitives onto the [MPS Graph](https://developer.apple.com/documentation/metalperformanceshadersgraph/mpsgraph?language=objc) framework and the tuned kernels provided by [MPS](https://developer.apple.com/documentation/metalperformanceshaders?language=objc).

::::{grid} 2
:::{grid-item-card} What you will learn in this tutorial:
:class-card: card-prerequisites
* In this tutorial you will learn how to export the [MobileNet V3](https://pytorch.org/vision/main/models/mobilenetv3.html) model to the MPS delegate.
* You will also learn how to compile and deploy the ExecuTorch runtime with the MPS delegate on macOS and iOS.
:::
:::{grid-item-card} Tutorials we recommend you complete before this:
:class-card: card-prerequisites
* [Introduction to ExecuTorch](intro-how-it-works.md)
* [Setting up ExecuTorch](getting-started-setup.md)
* [Building ExecuTorch with CMake](runtime-build-and-cross-compilation.md)
* [ExecuTorch iOS Demo App](demo-apps-ios.md)
* [ExecuTorch iOS LLaMA Demo App](llm/llama-demo-ios.md)
:::
::::


## Prerequisites (Hardware and Software)

To successfully build and run a model using the MPS backend for ExecuTorch, you'll need the following hardware and software components:

### Hardware:
- A [Mac](https://www.apple.com/mac/) for tracing the model

### Software:

- **Ahead of time** tracing:
- [macOS](https://www.apple.com/macos/) 12

- **Runtime**:
- [macOS](https://www.apple.com/macos/) >= 12.4
- [iOS](https://www.apple.com/ios) >= 15.4
- [Xcode](https://developer.apple.com/xcode/) >= 14.1

## Setting up Developer Environment

***Step 1.*** Please finish the [Setting up ExecuTorch](https://pytorch.org/executorch/stable/getting-started-setup) tutorial.

***Step 2.*** Install the dependencies needed to lower to the MPS delegate:

```bash
./backends/apple/mps/install_requirements.sh
```

## Build

### AOT (Ahead-of-time) Components

**Compiling model for MPS delegate**:
- In this step, you will generate a simple ExecuTorch program that lowers the MobileNetV3 model to the MPS delegate. You'll then pass this Program (the `.pte` file) to the runtime to run it using the MPS backend. A Python sketch of the underlying lowering flow follows the commands below.

```bash
cd executorch
# Note: by default, the `mps_example` script uses the MPSPartitioner to handle ops that are not yet supported by the MPS delegate. To turn it off, pass `--no-use_partitioner`.
python3 -m examples.apple.mps.scripts.mps_example --model_name="mv3" --bundled --use_fp16

# To see all options, run the following command:
python3 -m examples.apple.mps.scripts.mps_example --help
```
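
For reference, the `mps_example` script follows the standard ExecuTorch export-and-lower flow. The sketch below outlines that flow only at a conceptual level; the `MPSPartitioner` import path and its compile-spec argument are assumptions here, so treat `examples/apple/mps/scripts/mps_example.py` as the authoritative reference.

```python
# Rough sketch of the lowering flow performed by the mps_example script.
import torch
from torch.export import export
from torchvision.models import mobilenet_v3_small

from executorch.exir import to_edge
# NOTE: assumed import path; see examples/apple/mps/scripts/mps_example.py for the real one.
from executorch.backends.apple.mps.partition import MPSPartitioner

model = mobilenet_v3_small(weights=None).eval()
example_inputs = (torch.randn(1, 3, 224, 224),)

# Export to the ATen dialect, convert to the Edge dialect, then delegate
# supported subgraphs to the MPS backend.
aten_dialect = export(model, example_inputs)
edge_program = to_edge(aten_dialect)
edge_program = edge_program.to_backend(MPSPartitioner([]))  # compile specs assumed empty
executorch_program = edge_program.to_executorch()

# Save the resulting program; the file name here is illustrative.
with open("mv3_mps.pte", "wb") as f:
    f.write(executorch_program.buffer)
```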

### Runtime

**Building the MPS executor runner:**
```bash
# In this step, you'll build the `mps_executor_runner`, which is able to run MPS lowered modules:
cd executorch
./examples/apple/mps/scripts/build_mps_executor_runner.sh
```

## Run the generated mv3 model using the mps_executor_runner

```bash
./cmake-out/examples/apple/mps/mps_executor_runner --model_path mv3_mps_bundled_fp16.pte --bundled_program
```

- You should see output similar to the following. Note that no output file will be generated in this example:
```
I 00:00:00.003290 executorch:mps_executor_runner.mm:286] Model file mv3_mps_bundled_fp16.pte is loaded.
I 00:00:00.003306 executorch:mps_executor_runner.mm:292] Program methods: 1
I 00:00:00.003308 executorch:mps_executor_runner.mm:294] Running method forward
I 00:00:00.003311 executorch:mps_executor_runner.mm:349] Setting up non-const buffer 1, size 606112.
I 00:00:00.003374 executorch:mps_executor_runner.mm:376] Setting up memory manager
I 00:00:00.003376 executorch:mps_executor_runner.mm:392] Loading method name from plan
I 00:00:00.018942 executorch:mps_executor_runner.mm:399] Method loaded.
I 00:00:00.018944 executorch:mps_executor_runner.mm:404] Loading bundled program...
I 00:00:00.018980 executorch:mps_executor_runner.mm:421] Inputs prepared.
I 00:00:00.118731 executorch:mps_executor_runner.mm:438] Model executed successfully.
I 00:00:00.122615 executorch:mps_executor_runner.mm:501] Model verified successfully.
```

### [Optional] Run the generated model directly using pybind
1. Make sure ExecuTorch was installed with `pybind` MPS support:
```bash
./install_executorch.sh --pybind mps
```
2. Run the `mps_example` script to trace the model and run it directly from Python (a sketch of this kind of correctness check follows this list):
```bash
cd executorch
# Check correctness between PyTorch eager forward pass and ExecuTorch MPS delegate forward pass
python3 -m examples.apple.mps.scripts.mps_example --model_name="mv3" --no-use_fp16 --check_correctness
# You should see the following output: `Results between ExecuTorch forward pass with MPS backend and PyTorch forward pass for mv3_mps are matching!`

# Check performance between PyTorch MPS forward pass and ExecuTorch MPS forward pass
python3 -m examples.apple.mps.scripts.mps_example --model_name="mv3" --no-use_fp16 --bench_pytorch
```
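
Conceptually, the correctness check compares the eager PyTorch output against the output produced by the ExecuTorch module loaded through the pybind bindings. The sketch below illustrates that comparison; for brevity it lowers nothing (portable kernels only), whereas the `mps_example` script performs the same comparison with the model lowered to the MPS delegate.

```python
# Minimal sketch of an eager-vs-ExecuTorch comparison via the pybind module.
import torch
from torch.export import export
from executorch.exir import to_edge
from executorch.extension.pybindings.portable_lib import _load_for_executorch_from_buffer

class Add(torch.nn.Module):
    def forward(self, x: torch.Tensor, y: torch.Tensor):
        return x + y

example_inputs = (torch.randn(4), torch.randn(4))
executorch_program = to_edge(export(Add(), example_inputs)).to_executorch()

# Load the program and run it through the ExecuTorch runtime.
et_module = _load_for_executorch_from_buffer(executorch_program.buffer)
et_out = et_module.forward(example_inputs)[0]

# Reference eager-mode forward pass.
ref_out = Add()(*example_inputs)

print("matching:", torch.allclose(et_out, ref_out, atol=1e-3))
```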

### Profiling:
1. [Optional] Generate an [ETRecord](./etrecord.rst) while you're exporting your model.
```bash
cd executorch
python3 -m examples.apple.mps.scripts.mps_example --model_name="mv3" --generate_etrecord -b
```
2. Run your Program on the ExecuTorch runtime and generate an [ETDump](./etdump.md).
```bash
./cmake-out/examples/apple/mps/mps_executor_runner --model_path mv3_mps_bundled_fp16.pte --bundled_program --dump-outputs
```
3. Create an instance of the Inspector API by passing in the ETDump you sourced from the runtime, along with the ETRecord optionally generated in step 1 (a Python sketch of this flow follows this list).
```bash
python3 -m sdk.inspector.inspector_cli --etdump_path etdump.etdp --etrecord_path etrecord.bin
```
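
The same inspection can be done programmatically. A minimal sketch follows; note that the `Inspector` import path has moved between releases (`executorch.sdk` in older versions, `executorch.devtools` in newer ones), so adjust it to your installed version.

```python
# Minimal sketch of using the Inspector API directly (import path may vary by release).
from executorch.sdk import Inspector  # or: from executorch.devtools import Inspector

inspector = Inspector(etdump_path="etdump.etdp", etrecord="etrecord.bin")
# Print the per-operator runtime statistics collected in the ETDump.
inspector.print_data_tabular()
```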

## Deploying and Running on Device

***Step 1***. Create the ExecuTorch core and MPS delegate frameworks to link against on iOS:
```bash
cd executorch
./build/build_apple_frameworks.sh --mps
```

`mps_delegate.xcframework` will be in the `cmake-out` folder, along with `executorch.xcframework` and `portable_delegate.xcframework`:
```bash
cd cmake-out && ls
```

***Step 2***. Link the frameworks into your Xcode project:
Go to the project target's `Build Phases` - `Link Binaries With Libraries`, click the **+** sign, and add the following frameworks from the `Release` folder:
- `executorch.xcframework`
- `portable_delegate.xcframework`
- `mps_delegate.xcframework`

From the same page, include the needed libraries for the MPS delegate:
- `MetalPerformanceShaders.framework`
- `MetalPerformanceShadersGraph.framework`
- `Metal.framework`

In this tutorial, you have learned how to lower a model to the MPS delegate, build the `mps_executor_runner`, and run a lowered model through the MPS delegate, either on macOS or directly on device using the MPS delegate static library.


## Frequently Encountered Errors and Resolutions

If you encounter any bugs or issues while following this tutorial, please file an issue on the [ExecuTorch repository](https://github.com/pytorch/executorch/issues) with the hashtag **#mps**.
@@ -1,4 +1,4 @@
-# Building and Running ExecuTorch with Qualcomm AI Engine Direct Backend
+# Qualcomm AI Engine Backend

In this tutorial we will walk you through the process of getting started with
building ExecuTorch for Qualcomm AI Engine Direct and running a model on it.
205 changes: 205 additions & 0 deletions docs/source/backends-vulkan.md
@@ -0,0 +1,205 @@
# Vulkan Backend

The ExecuTorch Vulkan delegate is a native GPU delegate for ExecuTorch that is
built on top of the cross-platform Vulkan GPU API standard. It is primarily
designed to leverage the GPU to accelerate model inference on Android devices,
but can be used on any platform that supports an implementation of Vulkan:
laptops, servers, and edge devices.

::::{note}
The Vulkan delegate is currently under active development, and its components
are subject to change.
::::

## What is Vulkan?

Vulkan is a low-level GPU API specification developed as a successor to OpenGL.
It is designed to offer developers more explicit control over GPUs compared to
previous specifications in order to reduce overhead and maximize the
capabilities of modern graphics hardware.

Vulkan has been widely adopted among GPU vendors, and most modern GPUs (both
desktop and mobile) in the market support Vulkan. Vulkan is also included in
Android from Android 7.0 onwards.

**Note that Vulkan is a GPU API, not a GPU Math Library**. That is to say it
provides a way to execute compute and graphics operations on a GPU, but does not
come with a built-in library of performant compute kernels.

## The Vulkan Compute Library

The ExecuTorch Vulkan Delegate is a wrapper around a standalone runtime known as
the **Vulkan Compute Library**. The aim of the Vulkan Compute Library is to
provide GPU implementations for PyTorch operators via GLSL compute shaders.

The Vulkan Compute Library is a fork/iteration of the [PyTorch Vulkan Backend](https://pytorch.org/tutorials/prototype/vulkan_workflow.html).
The core components of the PyTorch Vulkan backend were forked into ExecuTorch
and adapted for an AOT graph-mode style of model inference (as opposed to
PyTorch, which uses an eager execution style of model inference).

The components of the Vulkan Compute Library are contained in the
`executorch/backends/vulkan/runtime/` directory. The core components are listed
and described below:

```
runtime/
├── api/ .................... Wrapper API around Vulkan to manage Vulkan objects
└── graph/ .................. ComputeGraph class which implements graph mode inference
    └── ops/ ................ Base directory for operator implementations
        ├── glsl/ ........... GLSL compute shaders
        │   ├── *.glsl
        │   └── conv2d.glsl
        └── impl/ ........... C++ code to dispatch GPU compute shaders
            ├── *.cpp
            └── Conv2d.cpp
```

## Features

The Vulkan delegate currently supports the following features:

* **Memory Planning**
  * Intermediate tensors whose lifetimes do not overlap will share memory allocations. This reduces the peak memory usage of model inference.
* **Capability Based Partitioning**:
  * A graph can be partially lowered to the Vulkan delegate via a partitioner, which will identify nodes (i.e. operators) that are supported by the Vulkan delegate and lower only supported subgraphs.
* **Support for upper-bound dynamic shapes**:
  * Tensors can change shape between inferences as long as their current shape is smaller than the bounds specified during lowering (see the sketch after this list).
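
A minimal sketch of how such an upper-bound dynamic dimension is typically declared at export time is shown below. It reuses the same kind of two-input addition module as the end-to-end example further down; the bound of 1024 is illustrative.

```python
# Sketch: declaring an upper-bound dynamic batch dimension at export time.
import torch
from torch.export import Dim, export

class Add(torch.nn.Module):
    def forward(self, x: torch.Tensor, y: torch.Tensor):
        return x + y

# The leading dimension may vary between inferences, up to 1024 (illustrative bound).
batch = Dim("batch", min=1, max=1024)
aten_dialect = export(
    Add(),
    (torch.ones(8, 32), torch.ones(8, 32)),
    dynamic_shapes={"x": {0: batch}, "y": {0: batch}},
)
# From here, lowering proceeds exactly as in the end-to-end example below.
```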

In addition to increasing operator coverage, the following features are
currently in development:

* **Quantization Support**
  * We are currently working on support for 8-bit dynamic quantization, with plans to extend to other quantization schemes in the future.
* **Memory Layout Management**
  * Memory layout is an important factor in optimizing performance. We plan to introduce graph passes that insert memory layout transitions throughout a graph to optimize memory-layout sensitive operators such as Convolution and Matrix Multiplication.
* **Selective Build**
  * We plan to make it possible to control build size by selecting which operators/shaders to include in the build.

## End to End Example

To further understand the features of the Vulkan Delegate and how to use it,
consider the following end-to-end example with a simple single-operator model.

### Compile and lower a model to the Vulkan Delegate

Once ExecuTorch has been set up and installed, the following script can be used
to generate a simple model, lower it to the Vulkan delegate, and save the result
as `vk_add.pte`.

```python
# Note: this script is the same as the script from the "Setting up ExecuTorch"
# page, with one minor addition to lower to the Vulkan backend.
import torch
from torch.export import export
from executorch.exir import to_edge

from executorch.backends.vulkan.partitioner.vulkan_partitioner import VulkanPartitioner

# Start with a PyTorch model that adds two input tensors (matrices)
class Add(torch.nn.Module):
    def __init__(self):
        super(Add, self).__init__()

    def forward(self, x: torch.Tensor, y: torch.Tensor):
        return x + y

# 1. torch.export: Defines the program with the ATen operator set.
aten_dialect = export(Add(), (torch.ones(1), torch.ones(1)))

# 2. to_edge: Make optimizations for Edge devices
edge_program = to_edge(aten_dialect)
# 2.1 Lower to the Vulkan backend
edge_program = edge_program.to_backend(VulkanPartitioner())

# 3. to_executorch: Convert the graph to an ExecuTorch program
executorch_program = edge_program.to_executorch()

# 4. Save the compiled .pte program
with open("vk_add.pte", "wb") as file:
    file.write(executorch_program.buffer)
```

Like other ExecuTorch delegates, a model can be lowered to the Vulkan Delegate
using the `to_backend()` API. The Vulkan Delegate implements the
`VulkanPartitioner` class which identifies nodes (i.e. operators) in the graph
that are supported by the Vulkan delegate, and separates compatible sections of
the model to be executed on the GPU.

This means that a model can be lowered to the Vulkan delegate even if it contains
some unsupported operators; in that case, only the supported parts of the graph
will be executed on the GPU.
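
One way to check what was delegated is to print the graph after partitioning. Continuing from the export script above, delegated subgraphs appear as `executorch_call_delegate` calls, while any remaining nodes stay as regular Edge-dialect operators that will run on the portable CPU kernels:

```python
# Continuing from the export script above: print the partitioned graph.
# Subgraphs handled by the Vulkan delegate appear as executorch_call_delegate calls.
print(edge_program.exported_program().graph_module.code)
```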


::::{note}
The [supported ops list](https://github.com/pytorch/executorch/blob/main/backends/vulkan/partitioner/supported_ops.py)
in the Vulkan partitioner code can be inspected to examine which ops are currently
implemented in the Vulkan delegate.
::::

### Build Vulkan Delegate libraries

The easiest way to build and test the Vulkan Delegate is to build for Android
and test on a local Android device. Android devices have built-in support for
Vulkan, and the Android NDK ships with a GLSL compiler which is needed to
compile the Vulkan Compute Library's GLSL compute shaders.

The Vulkan Delegate libraries can be built by setting `-DEXECUTORCH_BUILD_VULKAN=ON`
when building with CMake.

First, make sure that you have the Android NDK installed; any NDK version past
NDK r19c should work. Note that the examples in this doc have been validated with
NDK r27b. The Android SDK should also be installed so that you have access to `adb`.

The instructions on this page assume that the following environment variables
are set.

```shell
export ANDROID_NDK=<path_to_ndk>
# Select the appropriate Android ABI for your device
export ANDROID_ABI=arm64-v8a
# All subsequent commands should be performed from ExecuTorch repo root
cd <path_to_executorch_root>
# Make sure adb works
adb --version
```

To build and install ExecuTorch libraries (for Android) with the Vulkan
Delegate:

```shell
# From executorch root directory
(rm -rf cmake-android-out && \
cmake . -DCMAKE_INSTALL_PREFIX=cmake-android-out \
-DCMAKE_TOOLCHAIN_FILE=$ANDROID_NDK/build/cmake/android.toolchain.cmake \
-DANDROID_ABI=$ANDROID_ABI \
-DEXECUTORCH_BUILD_VULKAN=ON \
-DPYTHON_EXECUTABLE=python \
-Bcmake-android-out && \
cmake --build cmake-android-out -j16 --target install)
```

### Run the Vulkan model on device

::::{note}
Since operator support is currently limited, only binary arithmetic operators
will run on the GPU. Expect inference to be slow as the majority of operators
are being executed via Portable operators.
::::

Now, the partially delegated model can be executed on your device's GPU!

```shell
# Build a model runner binary linked with the Vulkan delegate libs
cmake --build cmake-android-out --target vulkan_executor_runner -j32

# Push model to device
adb push vk_add.pte /data/local/tmp/vk_add.pte
# Push binary to device
adb push cmake-android-out/backends/vulkan/vulkan_executor_runner /data/local/tmp/runner_bin

# Run the model
adb shell /data/local/tmp/runner_bin --model_path /data/local/tmp/vk_add.pte
```
1 change: 0 additions & 1 deletion docs/source/build-run-mps.md

This file was deleted.

1 change: 0 additions & 1 deletion docs/source/build-run-vulkan.md

This file was deleted.
