Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Update docs et remove loop from executor #828

Closed
wants to merge 1 commit into from
Closed
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
39 changes: 24 additions & 15 deletions docs/source/build-run-xtensa.md
Original file line number Diff line number Diff line change
Expand Up @@ -32,17 +32,18 @@
# Building and Running ExecuTorch on Xtensa HiFi4 DSP


In this tutorial we will walk you through the process of getting setup to
build ExecuTorch for an Xtensa Hifi4 DSP and running a simple model on it.
In this tutorial we will walk you through the process of getting setup to build ExecuTorch for an Xtensa HiFi4 DSP and running a simple model on it.

The [Xtensa HiFi4 DSP](https://www.cadence.com/en_US/home/tools/ip/tensilica-ip/hifi-dsps/hifi-4.html) is a DSP built by Cadence that is optimized for running audio based Neural networks such as wake word detection, Automatic speech recognition (ASR) etc. The HiFi NN library offers an optimized set of library functions commonly used in NN processing that we utilize in this example to demonstrate how these operations can be accelerated.
[Cadence](https://www.cadence.com/en_US/home.html) is both a hardware and software vendor, providing solutions for many computational workloads, including to run on power-limited embedded devices. The [Xtensa HiFi4 DSP](https://www.cadence.com/en_US/home/tools/ip/tensilica-ip/hifi-dsps/hifi-4.html) is a Digital Signal Processor (DSP) that is optimized for running audio based neural networks such as wake word detection, Automatic Speech Recognition (ASR), etc.

On top of being able to run on the Xtensa HiFi4 DSP another goal of this tutorial is to demonstrate how portable ExecuTorch is and its ability to run on a low-power embedded device such as the Xtensa HiFi4 DSP. This workflow does not require any delegates, it uses custom operators and compiler passes to enhance the model and make it more suitable to running on Xtensa HiFi4 DSPs. Finally, custom kernels optimized with Xtensa intrinsics provide runtime acceleration.
In addition to the chip, the HiFi4 Neural Network Library ([nnlib](https://github.com/foss-xtensa/nnlib-hifi4)) offers an optimized set of library functions commonly used in NN processing that we utilize in this example to demonstrate how common operations can be accelerated.

On top of being able to run on the Xtensa HiFi4 DSP, another goal of this tutorial is to demonstrate how portable ExecuTorch is and its ability to run on a low-power embedded device such as the Xtensa HiFi4 DSP. This workflow does not require any delegates, it uses custom operators and compiler passes to enhance the model and make it more suitable to running on Xtensa HiFi4 DSPs. A custom [quantizer](https://pytorch.org/tutorials/prototype/quantization_in_pytorch_2_0_export_tutorial.html) is used to represent activations and weights as `uint8` instead of `float`, and call appropriate operators. Finally, custom kernels optimized with Xtensa intrinsics provide runtime acceleration.

::::{grid} 2
:::{grid-item-card} What you will learn in this tutorial:
:class-card: card-prerequisites
* In this tutorial you will learn how to export a quantized model with linear and batch norm ops targeted for the Xtensa HiFi4 DSP.
:class-card: card-learn
* In this tutorial you will learn how to export a quantized model with a linear operation targeted for the Xtensa HiFi4 DSP.
* You will also learn how to compile and deploy the ExecuTorch runtime with the kernels required for running the quantized model generated in the previous step on the Xtensa HiFi4 DSP.
:::
:::{grid-item-card} Tutorials we recommend you complete before this:
Expand All @@ -53,6 +54,10 @@ On top of being able to run on the Xtensa HiFi4 DSP another goal of this tutoria
:::
::::

```{note}
The linux part of this tutorial has been designed and tested on Ubuntu 22.04 LTS, and requires glibc 2.34. Workarounds are available for other distributions, but will not be covered in this tutorial.
```

## Prerequisites (Hardware and Software)

In order to be able to succesfully build and run ExecuTorch on a Xtensa HiFi4 DSP you'll need the following hardware and software components.
Expand All @@ -65,11 +70,13 @@ In order to be able to succesfully build and run ExecuTorch on a Xtensa HiFi4 DS
- [MCUXpresso IDE](https://www.nxp.com/design/software/development-software/mcuxpresso-software-and-tools-/mcuxpresso-integrated-development-environment-ide:MCUXpresso-IDE)
- This IDE is supported on multiple platforms including MacOS. You can use it on any of the supported platforms as you'll only be using this to flash the board with the DSP images that you'll be building later on in this tutorial.
- [J-Link](https://www.segger.com/downloads/jlink/)
- Needed to flash the board with the firmaware images. You can install this on the same platform that you installed the MCUXpresso IDE on.
- Needed to flash the board with the firmware images. You can install this on the same platform that you installed the MCUXpresso IDE on.
- Note: depending on the version of the NXP board, another probe than JLink might be installed. In any case, flashing is done using the MCUXpresso IDE in a similar way.
- [MCUXpresso SDK](https://mcuxpresso.nxp.com/en/select?device=EVK-MIMXRT685)
- Download this SDK to your Linux machine, extract it and take a note of the path where you store it. You'll need this later.
- [Xtensa compiler](https://tensilicatools.com/platform/i-mx-rt600/)
- Download this to your Linux machine. This is needed to build ExecuTorch for the HiFi4 DSP.
- For cases with optimized kernels, the [nnlib repo](https://github.com/foss-xtensa/nnlib-hifi4).

## Setting up Developer Environment

Expand Down Expand Up @@ -97,11 +104,11 @@ examples

***AoT (Ahead-of-Time) Components***:

The AoT folder contains all of the python scripts and functions needed to export the model to an executorch `.pte` file. In our case, [export_example.py](https://github.com/pytorch/executorch/blob/main/examples/xtensa/aot/export_example.py) defines a model and some example inputs (set to a vector of ones), and runs it through the quantizer (from [quantizer.py](https://github.com/pytorch/executorch/blob/main/examples/xtensa/aot/quantizer.py)). Then a few compiler passes, also defined in [quantizer.py](https://github.com/pytorch/executorch/blob/main/examples/xtensa/aot/quantizer.py), will replace operators with custom ones that are supported and optimized on the chip. Any operator needed to compute things should be defined in [meta_registrations.py](https://github.com/pytorch/executorch/blob/main/examples/xtensa/aot/meta_registrations.py) and have corresponding implemetations in the other folders.
The AoT folder contains all of the python scripts and functions needed to export the model to an ExecuTorch `.pte` file. In our case, [export_example.py](https://github.com/pytorch/executorch/blob/main/examples/xtensa/aot/export_example.py) defines a model and some example inputs (set to a vector of ones), and runs it through the quantizer (from [quantizer.py](https://github.com/pytorch/executorch/blob/main/examples/xtensa/aot/quantizer.py)). Then a few compiler passes, also defined in [quantizer.py](https://github.com/pytorch/executorch/blob/main/examples/xtensa/aot/quantizer.py), will replace operators with custom ones that are supported and optimized on the chip. Any operator needed to compute things should be defined in [meta_registrations.py](https://github.com/pytorch/executorch/blob/main/examples/xtensa/aot/meta_registrations.py) and have corresponding implemetations in the other folders.

***Operators***:

The operators folder contains two kinds of operators: existing operators from the [executorch portable library](https://github.com/pytorch/executorch/tree/main/kernels/portable/cpu) and new operators that define custom computations. The former is simply dispatching the operator to the relevant executorch implementation, while the latter acts as an interface, setting up everything needed for the custom kernels to compute the outputs.
The operators folder contains two kinds of operators: existing operators from the [ExecuTorch portable library](https://github.com/pytorch/executorch/tree/main/kernels/portable/cpu) and new operators that define custom computations. The former is simply dispatching the operator to the relevant ExecuTorch implementation, while the latter acts as an interface, setting up everything needed for the custom kernels to compute the outputs.

***Kernels***:

Expand Down Expand Up @@ -147,7 +154,7 @@ export TOOLCHAIN_VER=RI-2021.8-linux
export XTENSA_CORE=nxp_rt600_RI2021_8_newlib
```

***Step 2***. Clone the [nnlib repo](https://github.com/foss-xtensa/nnlib-hifi4)
***Step 2***. Clone the [nnlib repo](https://github.com/foss-xtensa/nnlib-hifi4), which contains optimized kernels and primitives for HiFi4 DSPs, with `git clone git@github.com:foss-xtensa/nnlib-hifi4.git`.

***Step 3***. Run the CMake build.
In order to run the CMake build, you need the path to the following:
Expand Down Expand Up @@ -175,6 +182,10 @@ cmake-xt/dsp_data_release.bin cmake-xt/dsp_text_release.bin

<img src="_static/img/dsp_binary.png" alt="MCUXpresso IDE" /><br>

```{note}
As long as binaries have been built using the Xtensa toolchain on Linux, flashing the board and running on the chip can be done only with the MCUXpresso IDE, which is available on all platforms (Linux, MacOS, Windows).
```

***Step 2***. Clean your work space

***Step 3***. Click **Debug your Project** which will flash the board with your binaries.
Expand All @@ -183,12 +194,10 @@ On the UART console connected to your board (at a default baud rate of 115200),

```bash
> screen /dev/tty.usbmodem0007288234991 115200
Booted up in DSP.
ET: Model buffer loaded, has 1 methods
ET: Running method forward
Method loaded.
Starting the model execution...
Executed model
Model executed successfully.
First 20 elements of output 0
0.165528 0.331055 ...
```

## Conclusion and Future Work
Expand Down
42 changes: 20 additions & 22 deletions examples/xtensa/executor_runner.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -180,31 +180,29 @@ int main(int argc, char** argv) {
torch::executor::util::PrepareInputTensors(*method);
ET_LOG(Info, "Starting the model execution...");

while (1) {
Error status = method->execute();
ET_LOG(Info, "Executed model");
if (status != Error::Ok) {
ET_LOG(
Error,
"Execution of method %s failed with status 0x%" PRIx32,
method_name,
status);
} else {
ET_LOG(Info, "Model executed successfully.");
}
Error status = method->execute();
ET_LOG(Info, "Executed model");
if (status != Error::Ok) {
ET_LOG(
Error,
"Execution of method %s failed with status 0x%" PRIx32,
method_name,
status);
} else {
ET_LOG(Info, "Model executed successfully.");
}

std::vector<EValue> outputs(method->outputs_size());
status = method->get_outputs(outputs.data(), outputs.size());
for (int i = 0; i < outputs.size(); ++i) {
float* out_data_ptr = (float*)outputs[i].toTensor().const_data_ptr();
ET_LOG(Info, "First 20 elements of output %d", i);
for (size_t j = 0; j < 20; j++) {
PRINTF("%f \t", out_data_ptr[j]);
}
std::vector<EValue> outputs(method->outputs_size());
status = method->get_outputs(outputs.data(), outputs.size());
for (int i = 0; i < outputs.size(); ++i) {
float* out_data_ptr = (float*)outputs[i].toTensor().const_data_ptr();
ET_LOG(Info, "First 20 elements of output %d", i);
for (size_t j = 0; j < 20; j++) {
ET_LOG(Info, "%f \n", out_data_ptr[j]);
}
delay();
LED_TOGGLE();
}
delay();
LED_TOGGLE();

return 0;
}