Description
🐛 Describe the bug
We are running a performance test of static llama (https://github.com/pytorch/executorch/pull/8436/files) on an iPhone 14 Pro. Profiling the extracted .mlpackage with the Core ML profile tool shows an average inference time of 38.11 ms, of which ~12 ms is CPU overhead after ANE execution finishes. Could you please help us understand what is causing this CPU overhead and why it takes so long?
The per-operation breakdown revealed the following execution steps:
- 1 op (152) runs on CPU.
- Several ops run on the ANE.
- 9 ops (315-336) run on CPU.
- The remaining ops run on the ANE.
- The final linear op (2504) runs on CPU (the profile tool reports it takes 80 µs).
At the last stage of execution we observe ~12 ms spent on CPU (even though the final CPU op takes only 80 µs to run), as shown in the Instruments trace model-iPhone14pro.mlperf.zip:
|<--2.8ms-->|<---2ms--->|<---2ms--->|<-------------18ms------------>|<-------12ms------>|
|<---CPU--->|<---ANE--->|<---CPU--->|<-------------ANE------------->|<-------CPU------->|
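Summing the segments in the timeline above shows how large the trailing CPU phase is relative to the whole inference (a quick arithmetic check; the durations are the approximate values read from the Instruments trace):

```python
# Approximate segment durations (ms) read from the Instruments trace above.
segments = {
    "CPU (initial)": 2.8,
    "ANE (first block)": 2.0,
    "CPU (ops 315-336)": 2.0,
    "ANE (main blocks)": 18.0,
    "CPU (trailing overhead)": 12.0,
}

total = sum(segments.values())  # ~36.8 ms, close to the 38.11 ms average
trailing = segments["CPU (trailing overhead)"]
print(f"total ≈ {total:.1f} ms, trailing CPU ≈ {trailing / total:.0%} of inference")
```

So roughly a third of the end-to-end latency sits in the trailing CPU phase, while the only profiled CPU op there (the final linear, op 2504) accounts for just 80 µs of it.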


Repro
To export the model, check out https://github.com/pytorch/executorch/pull/8436/files and run the following from executorch/examples/apple/coreml/llama (static_seq_length=1 and max_seq_length=1024 for repro):

python export.py -n /path/to/output.pte -p /path/to/params.json -c /path/to/model.pth --static_seq_length 1 --max_seq_length 1024 -E "4,32" --coreml-quantize "c4w"
After the .pte file is generated, extract the .mlpackage with executorch/examples/apple/coreml/scripts/extract_coreml_models.py
and profile it in the Core ML profile tool (the package contains only one Core ML model).
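As a coarse wall-clock cross-check independent of the profile tool, per-call latency can also be timed from Python. The harness below is a generic sketch: the placeholder workload stands in for the model call, and on macOS you would pass e.g. `lambda: mlmodel.predict(example_inputs)` after loading the extracted package with coremltools (`coremltools.models.MLModel(...)`); the input names depend on the extracted model.

```python
import statistics
import time

def time_calls(fn, warmup=3, iters=20):
    """Return the median wall-clock latency of fn() in milliseconds."""
    for _ in range(warmup):  # discard warm-up runs (first call pays model/ANE load cost)
        fn()
    samples = []
    for _ in range(iters):
        t0 = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - t0) * 1e3)
    return statistics.median(samples)

# Placeholder workload standing in for the model's predict call.
latency_ms = time_calls(lambda: sum(range(10_000)))
print(f"median latency: {latency_ms:.3f} ms")
```

Warm-up runs matter here: the first prediction includes compilation and ANE load time and would otherwise skew the average.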
Versions
PyTorch version: 2.7.0.dev20250131
Is debug build: False
CUDA used to build PyTorch: None
ROCM used to build PyTorch: N/A
OS: macOS 15.2 (arm64)
GCC version: Could not collect
Clang version: 16.0.0 (clang-1600.0.26.6)
CMake version: version 3.31.2
Libc version: N/A
Python version: 3.10.0 (default, Mar 3 2022, 03:54:28) [Clang 12.0.0 ] (64-bit runtime)
Python platform: macOS-15.2-arm64-arm-64bit
Is CUDA available: False
CUDA runtime version: No CUDA
CUDA_MODULE_LOADING set to: N/A
GPU models and configuration: No CUDA
Nvidia driver version: No CUDA
cuDNN version: No CUDA
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True
CPU:
Apple M1 Pro
Versions of relevant libraries:
[pip3] executorch==0.6.0a0+775c394
[pip3] executorchcoreml==0.0.1
[pip3] numpy==2.2.2
[pip3] torch==2.7.0.dev20250131
[pip3] torchao==0.8.0+git11333ba
[pip3] torchaudio==2.6.0.dev20250131
[pip3] torchsr==1.0.4
[pip3] torchvision==0.22.0.dev20250131
[conda] executorch 0.6.0a0+775c394 pypi_0 pypi
[conda] executorchcoreml 0.0.1 pypi_0 pypi
[conda] numpy 2.2.2 pypi_0 pypi
[conda] torch 2.7.0.dev20250131 pypi_0 pypi
[conda] torchao 0.8.0+git11333ba pypi_0 pypi
[conda] torchaudio 2.6.0.dev20250131 pypi_0 pypi
[conda] torchsr 1.0.4 pypi_0 pypi
[conda] torchvision 0.22.0.dev20250131 pypi_0 pypi