
CPU Overhead After ANE Execution #8445

@YIWENX14

Description


🐛 Describe the bug

We are running a performance test for static llama (https://github.com/pytorch/executorch/pull/8436/files) on an iPhone 14 Pro. Profiling the extracted mlpackage with the Core ML profile tool, we observed an average inference time of 38.11 ms, of which ~12 ms is CPU overhead after ANE execution. Could you please help us understand what is causing this CPU overhead and why it takes so long?

The per-operation breakdown revealed the following execution steps:

  1. 1 op (152) runs on CPU.
  2. Several ops run on the ANE.
  3. 9 ops (315-336) run on CPU.
  4. The remaining ops run on the ANE.
  5. The final linear op (2504) runs on CPU (the profile tool reports it takes 80 µs).

At the last stage of execution, we observed ~12 ms spent on CPU (even though the final CPU op takes only 80 µs to run), as shown in the Instruments trace model-iPhone14pro.mlperf.zip:

|<--2.8ms-->|<---2ms--->|<---2ms--->|<-------------18ms------------>|<-------12ms------>|
|<---CPU--->|<---ANE--->|<---CPU--->|<-------------ANE------------->|<-------CPU------->|
[timeline screenshot]

Flame graph of the 12 ms region: [image]

Repro

To export the model, check out https://github.com/pytorch/executorch/pull/8436/files and run the following from executorch/examples/apple/coreml/llama (use static_seq_length=1 and max_seq_length=1024 for the repro):

python export.py -n /path/to/output.pte -p /path/to/params.json -c /path/to/model.pth --static_seq_length 1 --max_seq_length 1024 -E"4,32" --coreml-quantize "c4w"

After the .pte file is generated, extract the mlpackage using executorch/examples/apple/coreml/scripts/extract_coreml_models.py and profile it in the Core ML profile tool (there is only one Core ML model in the package).
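As a sanity check on the discrepancy between the profiler's 80 µs per-op estimate and the ~12 ms CPU region, one can compare the profiled op time against wall-clock time measured around the predict call. The sketch below is hypothetical: `fake_predict` is a stand-in for the actual Core ML `predict` call, and the 1 ms sleep is an illustrative placeholder for framework/dispatch/sync overhead, not a measured value.

```python
import time

def unaccounted_time(predict, profiled_op_time_s, n_iters=50):
    """Average wall-clock time per call minus the profiler's per-op
    estimate; the remainder is overhead outside the op itself."""
    start = time.perf_counter()
    for _ in range(n_iters):
        predict()
    wall = (time.perf_counter() - start) / n_iters
    return wall - profiled_op_time_s

# Hypothetical stand-in for mlmodel.predict(inputs); on device this
# would be the call whose final linear op the profiler reports at ~80 us.
def fake_predict():
    time.sleep(0.001)  # simulate ~1 ms of overhead outside the op

overhead = unaccounted_time(fake_predict, profiled_op_time_s=80e-6)
print(f"unaccounted overhead per call: {overhead * 1e3:.2f} ms")
```

If the same measurement on device shows wall-clock time far exceeding the summed per-op times, the gap points at runtime overhead (e.g. synchronization or data movement between ANE and CPU) rather than the ops themselves.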

Versions

PyTorch version: 2.7.0.dev20250131
Is debug build: False
CUDA used to build PyTorch: None
ROCM used to build PyTorch: N/A

OS: macOS 15.2 (arm64)
GCC version: Could not collect
Clang version: 16.0.0 (clang-1600.0.26.6)
CMake version: version 3.31.2
Libc version: N/A

Python version: 3.10.0 (default, Mar 3 2022, 03:54:28) [Clang 12.0.0 ] (64-bit runtime)
Python platform: macOS-15.2-arm64-arm-64bit
Is CUDA available: False
CUDA runtime version: No CUDA
CUDA_MODULE_LOADING set to: N/A
GPU models and configuration: No CUDA
Nvidia driver version: No CUDA
cuDNN version: No CUDA
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Apple M1 Pro

Versions of relevant libraries:
[pip3] executorch==0.6.0a0+775c394
[pip3] executorchcoreml==0.0.1
[pip3] numpy==2.2.2
[pip3] torch==2.7.0.dev20250131
[pip3] torchao==0.8.0+git11333ba
[pip3] torchaudio==2.6.0.dev20250131
[pip3] torchsr==1.0.4
[pip3] torchvision==0.22.0.dev20250131
[conda] executorch 0.6.0a0+775c394 pypi_0 pypi
[conda] executorchcoreml 0.0.1 pypi_0 pypi
[conda] numpy 2.2.2 pypi_0 pypi
[conda] torch 2.7.0.dev20250131 pypi_0 pypi
[conda] torchao 0.8.0+git11333ba pypi_0 pypi
[conda] torchaudio 2.6.0.dev20250131 pypi_0 pypi
[conda] torchsr 1.0.4 pypi_0 pypi
[conda] torchvision 0.22.0.dev20250131 pypi_0 pypi

cc @kimishpatel @YifanShenSZ @cymbalrush


Labels: module: coreml, triaged
