Description
🐛 Describe the bug
I attempted (with mixed success) to deploy the ARM examples to a regular MCU without any special DSP or NN accelerator. I chose the RP2040 because its build system is centered around CMake, which made it easier to modify the existing example.
I uploaded my code to https://github.com/AIWintermuteAI/executorch/tree/port-to-rp2040; it should be easy enough to reproduce by following the instructions.
For convenience, I'm also copying the results and the issues encountered here.
Softmax builds and runs normally
cd examples/arm
./run.sh --build_only --scratch-dir=build-dir --model_name=softmax --aot_arm_compiler_flags=""
I [executorch:arm_executor_runner.cpp:325] BLINK
I [executorch:arm_executor_runner.cpp:325] BLINK
I [executorch:arm_executor_runner.cpp:325] BLINK
I [executorch:arm_executor_runner.cpp:386] Model in 200014B0
I [executorch:arm_executor_runner.cpp:388] Model PTE file loaded. Size: 960 bytes.
I [executorch:arm_executor_runner.cpp:398] Model buffer loaded, has 1 methods
I [executorch:arm_executor_runner.cpp:406] Running method forward
I [executorch:arm_executor_runner.cpp:417] Setup Method allocator pool. Size: 1024 bytes.
I [executorch:arm_executor_runner.cpp:434] Setting up planned buffer 0, size 32.
I [executorch:arm_executor_runner.cpp:467] Method loaded.
I [executorch:arm_executor_runner.cpp:469] Preparing inputs...
I [executorch:arm_executor_runner.cpp:483] Input prepared.
I [executorch:arm_executor_runner.cpp:485] Starting the model execution...
I [executorch:arm_executor_runner.cpp:492] model_pte_loaded_size: 960 bytes.
I [executorch:arm_executor_runner.cpp:506] method_allocator_used: 342 / 1024 free: 682 ( used: 33 % )
I [executorch:arm_executor_runner.cpp:513] method_allocator_planned: 32 bytes
I [executorch:arm_executor_runner.cpp:515] method_allocator_loaded: 290 bytes
I [executorch:arm_executor_runner.cpp:516] method_allocator_input: 20 bytes
I [executorch:arm_executor_runner.cpp:517] method_allocator_executor: 0 bytes
I [executorch:arm_executor_runner.cpp:520] temp_allocator_used: 0 / 1024 free: 1024 ( used: 0 % )
I [executorch:arm_executor_runner.cpp:536] Model executed successfully.
I [executorch:arm_executor_runner.cpp:540] 1 outputs:
Output[0][0]: 0.500000
Output[0][1]: 0.500000
Output[0][2]: 0.500000
Output[0][3]: 0.500000
I [executorch:arm_executor_runner.cpp:577] Program complete, exiting.
I [executorch:arm_executor_runner.cpp:581]
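As a sanity check on the softmax run, the allocator figures in the log can be cross-checked against each other. The following is a minimal sketch, not part of the runner: it assumes the log lines are pasted in as text (prefixes trimmed), and `allocator_breakdown` is a hypothetical helper name.

```python
import re

# Allocator lines copied from the softmax log above (log prefixes trimmed).
LOG = """\
method_allocator_used: 342 / 1024 free: 682 ( used: 33 % )
method_allocator_planned: 32 bytes
method_allocator_loaded: 290 bytes
method_allocator_input: 20 bytes
method_allocator_executor: 0 bytes
"""


def allocator_breakdown(log: str) -> dict:
    """Parse the used/pool totals and the per-stage byte counts from the log text."""
    used, pool = map(
        int, re.search(r"method_allocator_used:\s*(\d+)\s*/\s*(\d+)", log).groups()
    )
    parts = {
        name: int(size)
        for name, size in re.findall(
            r"method_allocator_(planned|loaded|input|executor):\s*(\d+) bytes", log
        )
    }
    return {"used": used, "pool": pool, "parts": parts}


stats = allocator_breakdown(LOG)
total = sum(stats["parts"].values())
print(f"stages sum to {total} B; runner reports {stats['used']} B of {stats['pool']} B")
```

The per-stage values (planned, loaded, input, executor) add up to the reported `method_allocator_used`, so for softmax the 1024-byte pool is comfortably large enough.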
Linear and add hang at "Starting the model execution..."
cd examples/arm
./run.sh --build_only --scratch-dir=build-dir --model_name=linear --aot_arm_compiler_flags=""
I [executorch:arm_executor_runner.cpp:325] BLINK
I [executorch:arm_executor_runner.cpp:325] BLINK
I [executorch:arm_executor_runner.cpp:325] BLINK
I [executorch:arm_executor_runner.cpp:325] BLINK
I [executorch:arm_executor_runner.cpp:325] BLINK
I [executorch:arm_executor_runner.cpp:386] Model in 200014B0 <
I [executorch:arm_executor_runner.cpp:388] Model PTE file loaded. Size: 1596 bytes.
I [executorch:arm_executor_runner.cpp:398] Model buffer loaded, has 1 methods
I [executorch:arm_executor_runner.cpp:406] Running method forward
I [executorch:arm_executor_runner.cpp:417] Setup Method allocator pool. Size: 1024 bytes.
I [executorch:arm_executor_runner.cpp:434] Setting up planned buffer 0, size 144.
I [executorch:arm_executor_runner.cpp:467] Method loaded.
I [executorch:arm_executor_runner.cpp:469] Preparing inputs...
I [executorch:arm_executor_runner.cpp:483] Input prepared.
I [executorch:arm_executor_runner.cpp:485] Starting the model execution...
Quantized MobileNetV2 (alpha 0.05, 96x96x3) requires allocation of 1.45 MB of RAM.
cd examples/arm
./run.sh --build_only --scratch-dir=build-dir --model_name=mv2_untrained --aot_arm_compiler_flags="--quantize"
I [executorch:arm_executor_runner.cpp:325] BLINK
I [executorch:arm_executor_runner.cpp:386] Model in 200014B0 <
I [executorch:arm_executor_runner.cpp:388] Model PTE file loaded. Size: 175008 bytes.
I [executorch:arm_executor_runner.cpp:398] Model buffer loaded, has 1 methods
I [executorch:arm_executor_runner.cpp:406] Running method forward
I [executorch:arm_executor_runner.cpp:417] Setup Method allocator pool. Size: 1024 bytes.
I [executorch:arm_executor_runner.cpp:434] Setting up planned buffer 0, size 1785600.
E [executorch:memory_allocator.h:88] Memory allocation failed: 1785600B requested (adjusted for alignment), 1024B available
E [executorch:memory_allocator.h:88] Memory allocation failed: 68208B requested (adjusted for alignment), 1024B available
I [executorch:arm_executor_runner.cpp:459] Loading of method forward failed with status 0x21
I [executorch:arm_executor_runner.cpp:467] Method loaded.
I [executorch:arm_executor_runner.cpp:469] Preparing inputs...
F [executorch:result.h:165] In function CheckOk(), assert failed: hasValue_
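In the log above, both allocation requests fail against the 1024-byte pool, after which the runner still prints "Method loaded." and then hits the `CheckOk()` assert. To make the shortfall explicit, here is a minimal sketch that extracts the failed requests from the log text. It is only an illustration: `failed_allocations` is a hypothetical helper, and the regular expression assumes the exact message format shown in this log.

```python
import re

# Error lines copied verbatim from the MobileNetV2 log above.
LOG = """\
E [executorch:memory_allocator.h:88] Memory allocation failed: 1785600B requested (adjusted for alignment), 1024B available
E [executorch:memory_allocator.h:88] Memory allocation failed: 68208B requested (adjusted for alignment), 1024B available
"""

# Captures the requested size and the size that was available in the pool.
FAILURE = re.compile(r"Memory allocation failed: (\d+)B requested .*?, (\d+)B available")


def failed_allocations(log: str) -> list[tuple[int, int]]:
    """Return (requested_bytes, available_bytes) for every failed allocation."""
    return [(int(req), int(avail)) for req, avail in FAILURE.findall(log)]


failures = failed_allocations(LOG)
for requested, available in failures:
    print(f"requested {requested} B, pool had {available} B, "
          f"short by {requested - available} B")
```

Both requests exceed the method allocator pool configured in the runner, so the load fails before any inference is attempted.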
- I do not understand why the linear and add models fail to run on the hardware while softmax succeeds.
- Also, the 1.45 MB allocation for quantized MobileNetV2 (alpha 0.05, 96x96x3) seems excessive. Is that indeed a current limitation due to ExecuTorch engine overhead, or have I made a mistake?
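To put the memory question in numbers, the sketch below converts the planned-buffer size reported in the MobileNetV2 log to MiB and compares it with the RP2040's on-chip SRAM. The 264 KiB SRAM figure is taken from the RP2040 datasheet rather than from the log, and the constant and function names are only illustrative.

```python
# On-chip SRAM of the RP2040 (264 KiB per the datasheet; not taken from the log).
RP2040_SRAM_BYTES = 264 * 1024

# From the log line "Setting up planned buffer 0, size 1785600."
PLANNED_BUFFER_BYTES = 1_785_600


def to_mib(num_bytes: int) -> float:
    """Convert a byte count to mebibytes (1 MiB = 1024 * 1024 bytes)."""
    return num_bytes / (1024 * 1024)


planned_mib = to_mib(PLANNED_BUFFER_BYTES)
ratio = PLANNED_BUFFER_BYTES / RP2040_SRAM_BYTES

print(f"planned buffer: {planned_mib:.2f} MiB")
print(f"that is {ratio:.1f}x the RP2040's total SRAM")
```

Under these assumptions, the planned buffer alone is several times larger than all of the SRAM on the chip, so the model cannot fit regardless of how the allocator pools are sized.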
Related issue:
#3585
Some work is being done here (thanks for the support, @ChristophKarlHeck!):
https://github.com/ChristophKarlHeck/mbed-torch-fusion-os/tree/main
But I'm also seeing only the softmax example there. @ChristophKarlHeck, were you able to make other models work on an M4?
CC @zingo, as I think you also worked on the ARM example?
Versions
executorch % python collect_env.py
Collecting environment information...
PyTorch version: 2.6.0.dev20241112
Is debug build: False
CUDA used to build PyTorch: None
ROCM used to build PyTorch: N/A
OS: macOS 14.6.1 (arm64)
GCC version: Could not collect
Clang version: 15.0.0 (clang-1500.1.0.2.5)
CMake version: version 3.30.5
Libc version: N/A
Python version: 3.10.15 (main, Oct 3 2024, 02:24:49) [Clang 14.0.6 ] (64-bit runtime)
Python platform: macOS-14.6.1-arm64-arm-64bit
Is CUDA available: False
CUDA runtime version: No CUDA
CUDA_MODULE_LOADING set to: N/A
GPU models and configuration: No CUDA
Nvidia driver version: No CUDA
cuDNN version: No CUDA
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True
CPU:
Apple M1 Pro
Versions of relevant libraries:
[pip3] executorch==0.5.0a0+d243ffe
[pip3] numpy==1.21.3
[pip3] torch==2.6.0.dev20241112
[pip3] torchaudio==2.5.0.dev20241112
[pip3] torchsr==1.0.4
[pip3] torchvision==0.20.0.dev20241112
[conda] executorch 0.5.0a0+d243ffe pypi_0 pypi
[conda] numpy 1.21.3 pypi_0 pypi
[conda] torch 2.6.0.dev20241112 pypi_0 pypi
[conda] torchaudio 2.5.0.dev20241112 pypi_0 pypi
[conda] torchsr 1.0.4 pypi_0 pypi
[conda] torchvision 0.20.0.dev20241112 pypi_0 pypi
cc @larryliu0820 @lucylq @digantdesai @freddan80 @per @zingo @oscarandersson8218