Develop an efficient inference runtime for a desktop device, choosing:
- Device: MacBook with an M1 Pro processor (an edge device with CPU/GPU/NPU units)
- Engine(s): PyTorch, Core ML, ONNX Runtime (Core ML recommended)
- Optimization function: Minimize latency and memory usage while preserving model accuracy
- Approach: Baseline measurements + model/runtime modifications + benchmarking
I implemented a full pipeline of PyTorch → ONNX → Core ML conversions (including custom pass-pipeline modifications) and measured accuracy, inference time, and memory usage across variants.
- Model Used: ResNet50 (ImageNet-pretrained, from `torchvision.models`)
- Method: PyTorch (MPS GPU)
- Script: `base.py` (a baseline benchmarking sketch is shown below)
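A minimal sketch of what the MPS baseline loop in `base.py` could look like; the random input tensor, warm-up count, and run count are illustrative assumptions, not the project's exact code:

```python
import time
import torch
import torchvision

device = torch.device("mps")  # Apple-silicon GPU backend
model = torchvision.models.resnet50(weights="IMAGENET1K_V1").eval().to(device)
x = torch.rand(1, 3, 224, 224, device=device)  # stand-in for one preprocessed image

with torch.no_grad():
    for _ in range(10):          # warm-up so one-time setup cost is excluded
        model(x)
    torch.mps.synchronize()

    start = time.perf_counter()
    for _ in range(100):         # 100 single-image inferences, as in the benchmark
        model(x)
        torch.mps.synchronize()  # wait for the GPU before reading the clock
    avg_ms = (time.perf_counter() - start) / 100 * 1000

print(f"avg latency: {avg_ms:.2f} ms")
```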
- Script: `prepare.py`, `coreml_optimized.py`
- Conversion from PyTorch: via a traced TorchScript model
- Variants using quantization and a customized pass pipeline:
  - CoreML FP16 (default pipeline)
  - CoreML FP32 (default pipeline)
  - CoreML FP16 (custom pipeline)
  - CoreML FP32 (custom pipeline)
- Custom Pipeline Passes Removed (see the conversion sketch after this list):

  ```python
  pipeline.remove_passes({
      "common::merge_consecutive_transposes",
      "common::cast_optimization",  # only for the FLOAT32 precision variants
      "common::add_int16_cast",
  })
  ```

  Rationale: these passes were deemed unnecessary for a standard CNN like ResNet50.
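A sketch of how the TorchScript trace and the Core ML conversion with a customized pass pipeline can be wired together in coremltools; the input name, output file name, and exact pipeline handling are assumptions rather than the actual contents of `prepare.py`:

```python
import torch
import torchvision
import coremltools as ct

model = torchvision.models.resnet50(weights="IMAGENET1K_V1").eval()
example = torch.rand(1, 3, 224, 224)
traced = torch.jit.trace(model, example)  # traced TorchScript model

# Start from the default pipeline and drop the passes listed above.
pipeline = ct.PassPipeline.DEFAULT
pipeline.remove_passes({
    "common::merge_consecutive_transposes",
    "common::add_int16_cast",
})

mlmodel = ct.convert(
    traced,
    inputs=[ct.TensorType(name="input", shape=example.shape)],
    convert_to="mlprogram",                  # ML Program format
    compute_precision=ct.precision.FLOAT16,  # FP16 variant; FLOAT32 for the others
    compute_units=ct.ComputeUnit.ALL,        # ANE/GPU/CPU
    pass_pipeline=pipeline,
)
mlmodel.save("resnet50_fp16_custom.mlpackage")
```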
- Script: `onnx_optimized.py`
- Conversion from PyTorch: via `torch.onnx.export`
- Variants (an export/inference sketch is shown below):
  - ONNX FP16 (using CoreMLExecutionProvider)
  - ONNX FP32 (using CoreMLExecutionProvider)
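A sketch of the ONNX path: export with `torch.onnx.export`, then run through ONNX Runtime with the CoreMLExecutionProvider. File names and the opset version are assumptions, and the step that produces the FP16 variant is not shown here:

```python
import numpy as np
import torch
import torchvision
import onnxruntime as ort

model = torchvision.models.resnet50(weights="IMAGENET1K_V1").eval()
dummy = torch.rand(1, 3, 224, 224)
torch.onnx.export(
    model, dummy, "resnet50_fp32.onnx",
    input_names=["input"], output_names=["logits"], opset_version=17,
)

# The CoreMLExecutionProvider handles supported ops; CPU is the fallback.
session = ort.InferenceSession(
    "resnet50_fp32.onnx",
    providers=["CoreMLExecutionProvider", "CPUExecutionProvider"],
)
logits = session.run(None, {"input": dummy.numpy().astype(np.float32)})[0]
print(logits.shape)  # (1, 1000)
```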
- Metrics Recorded:
  - Inference Latency (ms)
  - Accuracy (%) using a real dataset (a subset of Imagenette)
  - Memory Usage (MB), measured via `psutil` (see the sketch below)
- Batch Size: 1 (single image)
- Runs: 100 inferences
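Memory was tracked with `psutil`; the following sketch shows the kind of measurement this implies, under the assumption that the process RSS is read before and after the 100-inference loop:

```python
import psutil

proc = psutil.Process()

def rss_mb() -> float:
    """Resident set size of the current process in MB, as reported by psutil."""
    return proc.memory_info().rss / (1024 * 1024)

mem_before = rss_mb()
# ... run the 100 single-image inferences for one variant here ...
mem_after = rss_mb()
print(f"memory usage: {mem_before:.1f} MB -> {mem_after:.1f} MB")
```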
| Engine / Variant | Accuracy | Avg Latency (ms) | Memory Usage (MB) |
|---|---|---|---|
| PyTorch (MPS) | 98% | 28 | 30-43 |
| CoreML FP16 (default) | 98% | 7-8 | 27-29 |
| CoreML FP16 (custom) | 98% | 7-8 | 27-29 |
| CoreML FP32 (default) | 98% | 12-14 | 27.5-33 |
| CoreML FP32 (custom) | 98% | 12-14 | 27-31 |
| ONNX FP16 (CoreML backend) | 98% | 78-80 | 35-50 |
| ONNX FP32 (CoreML backend) | 98% | 11-13 | 10-14 |
- Best Performing Variant: CoreML FP16 (default pipeline) offered the lowest latency with minimal memory use (roughly 40% lower latency than the CoreML FP32 variants, and well below the PyTorch MPS baseline).
- Custom Pipeline: Removing passes had no measurable positive effect on runtime or memory; it may have reduced compile time, but that was not measured.
- ONNX with CoreML backend: Performed well with FP32 and achieved the lowest memory usage, but suffered high latency with FP16 due to precision-conversion overhead.
- Memory Tradeoffs: PyTorch MPS showed higher variance in memory usage; ONNX FP16 carried extra overhead from its conversion steps.
For macOS M1 Pro edge devices, CoreML with:
- Precision: FLOAT16
- Engine: ML Program format
- Compute Units: ALL (to use ANE/GPU/CPU)
provides the best balance of speed and efficiency for ResNet50.
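As a sketch, this recommended configuration corresponds to loading the FP16 ML Program with all compute units enabled; the input name and file name below are illustrative:

```python
import numpy as np
import coremltools as ct

# Load the FP16 ML Program and let Core ML place ops on the ANE/GPU/CPU.
mlmodel = ct.models.MLModel(
    "resnet50_fp16.mlpackage",
    compute_units=ct.ComputeUnit.ALL,
)
image_array = np.random.rand(1, 3, 224, 224).astype(np.float32)  # stand-in for a preprocessed image
out = mlmodel.predict({"input": image_array})
```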
- `base.py` - PyTorch baseline benchmark
- `prepare.py` - Export to ONNX and CoreML
- `coreml_optimized.py` - CoreML model evaluation
- `onnx_optimized.py` - ONNX model evaluation (using the CoreML backend)
- `README.md` - Overview
- Explore the operations and fuse missed fusion opportunities
- Simplify the graph by:
  - Constant folding
  - Dead-code elimination
  - Removing redundant operations
- Model-level optimizations:
  - Pruning
- Profiling and detecting bottlenecks (see the sketch after this list)
- The ONNX engine seems more open and flexible for these optimizations
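As one possible direction for the graph-simplification and profiling items above, ONNX Runtime exposes both offline graph optimization (constant folding, redundant-node removal, fusions) and a built-in per-op profiler. A sketch under the assumption that the FP32 ONNX model from above is reused; file names are illustrative:

```python
import onnxruntime as ort

so = ort.SessionOptions()
# Apply all graph-level optimizations and save the optimized graph for inspection.
so.graph_optimization_level = ort.GraphOptimizationLevel.ORT_ENABLE_ALL
so.optimized_model_filepath = "resnet50_fp32.opt.onnx"
so.enable_profiling = True  # per-op timing, useful for spotting bottlenecks

session = ort.InferenceSession(
    "resnet50_fp32.onnx", so,
    providers=["CoreMLExecutionProvider", "CPUExecutionProvider"],
)
# ... run the benchmark inferences here ...
trace_file = session.end_profiling()  # JSON trace, viewable in chrome://tracing
print(trace_file)
```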