Description
🐛 Describe the bug
We are running a performance test of static llama (https://github.com/pytorch/executorch/pull/8436/files) on an iPhone 14 Pro. Profiling the extracted .mlpackage with the Core ML profile tool shows an average inference time of 38.11 ms, of which ~12 ms is CPU overhead after ANE execution finishes. Could you please help us understand what is causing this CPU overhead and why it takes so long?
The per-operation breakdown revealed the following execution steps:
- 1 op (152) runs on CPU.
- Several ops run on the ANE.
- 9 ops (315-336) run on CPU.
- The remaining ops run on the ANE.
- The final linear op (2504) runs on CPU (the profile tool reports it takes 80 µs).
At the last stage of execution we observe ~12 ms spent on CPU (even though the final CPU op takes only 80 µs to run), as shown in the Instruments trace model-iPhone14pro.mlperf.zip:
|<--2.8ms-->|<---2ms--->|<---2ms--->|<-------------18ms------------>|<-------12ms------>|
|<---CPU--->|<---ANE--->|<---CPU--->|<-------------ANE------------->|<-------CPU------->|
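Summing the segments in the timeline above shows how large the trailing CPU phase is relative to the whole inference (a quick arithmetic check; the durations are the approximate values read from the Instruments trace):

```python
# Approximate segment durations (ms) read from the Instruments trace above.
segments = {
    "CPU (initial)": 2.8,
    "ANE (first block)": 2.0,
    "CPU (ops 315-336)": 2.0,
    "ANE (main blocks)": 18.0,
    "CPU (trailing overhead)": 12.0,
}

total = sum(segments.values())  # ~36.8 ms, close to the 38.11 ms average
trailing = segments["CPU (trailing overhead)"]
print(f"total ≈ {total:.1f} ms, trailing CPU ≈ {trailing / total:.0%} of inference")
```

So roughly a third of the end-to-end latency sits in the trailing CPU phase, while the only profiled CPU op there (the final linear, op 2504) accounts for just 80 µs of it.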


Repro
To export the model, check out https://github.com/pytorch/executorch/pull/8436/files and run the following from executorch/examples/apple/coreml/llama (static_seq_length=1 and max_seq_length=1024 for repro):

python export.py -n /path/to/output.pte -p /path/to/params.json -c /path/to/model.pth --static_seq_length 1 --max_seq_length 1024 -E "4,32" --coreml-quantize "c4w"
After the .pte file is generated, extract the .mlpackage with executorch/examples/apple/coreml/scripts/extract_coreml_models.py
and profile it in the Core ML profile tool (the package contains only one Core ML model).
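As a coarse wall-clock cross-check independent of the profile tool, per-call latency can also be timed from Python. The harness below is a generic sketch: the placeholder workload stands in for the model call, and on macOS you would pass e.g. `lambda: mlmodel.predict(example_inputs)` after loading the extracted package with coremltools (`coremltools.models.MLModel(...)`); the input names depend on the extracted model.

```python
import statistics
import time

def time_calls(fn, warmup=3, iters=20):
    """Return the median wall-clock latency of fn() in milliseconds."""
    for _ in range(warmup):  # discard warm-up runs (first call pays model/ANE load cost)
        fn()
    samples = []
    for _ in range(iters):
        t0 = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - t0) * 1e3)
    return statistics.median(samples)

# Placeholder workload standing in for the model's predict call.
latency_ms = time_calls(lambda: sum(range(10_000)))
print(f"median latency: {latency_ms:.3f} ms")
```

Warm-up runs matter here: the first prediction includes compilation and ANE load time and would otherwise skew the average.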
Versions
PyTorch version: 2.7.0.dev20250131
Is debug build: False
CUDA used to build PyTorch: None
ROCM used to build PyTorch: N/A
OS: macOS 15.2 (arm64)
GCC version: Could not collect
Clang version: 16.0.0 (clang-1600.0.26.6)
CMake version: version 3.31.2
Libc version: N/A
Python version: 3.10.0 (default, Mar 3 2022, 03:54:28) [Clang 12.0.0 ] (64-bit runtime)
Python platform: macOS-15.2-arm64-arm-64bit
Is CUDA available: False
CUDA runtime version: No CUDA
CUDA_MODULE_LOADING set to: N/A
GPU models and configuration: No CUDA
Nvidia driver version: No CUDA
cuDNN version: No CUDA
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True
CPU:
Apple M1 Pro
Versions of relevant libraries:
[pip3] executorch==0.6.0a0+775c394
[pip3] executorchcoreml==0.0.1
[pip3] numpy==2.2.2
[pip3] torch==2.7.0.dev20250131
[pip3] torchao==0.8.0+git11333ba
[pip3] torchaudio==2.6.0.dev20250131
[pip3] torchsr==1.0.4
[pip3] torchvision==0.22.0.dev20250131
[conda] executorch 0.6.0a0+775c394 pypi_0 pypi
[conda] executorchcoreml 0.0.1 pypi_0 pypi
[conda] numpy 2.2.2 pypi_0 pypi
[conda] torch 2.7.0.dev20250131 pypi_0 pypi
[conda] torchao 0.8.0+git11333ba pypi_0 pypi
[conda] torchaudio 2.6.0.dev20250131 pypi_0 pypi
[conda] torchsr 1.0.4 pypi_0 pypi
[conda] torchvision 0.22.0.dev20250131 pypi_0 pypi