Efficient ResNet50 Inference on Apple M1 Pro (macOS)

Assignment Overview

Develop an efficient inference runtime for a desktop device, choosing:

  • Device: MacBook with an M1 Pro processor (an edge device with CPU/GPU/NPU units)
  • Engine(s): PyTorch, Core ML, ONNX Runtime (Core ML recommended)
  • Optimization function: Minimize latency and memory usage while preserving model accuracy
  • Approach: Baseline measurements + model/runtime modifications + benchmarking

I implemented a full pipeline covering PyTorch, ONNX, and Core ML conversions (including custom pass-pipeline modifications), and measured accuracy, inference latency, and memory usage across all variants.


Baseline Model

  • Model Used: ResNet50 (ImageNet-pretrained, from torchvision.models)
  • Method: PyTorch (MPS GPU)
  • Script: base.py (a minimal measurement sketch follows)
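
As a rough illustration of the baseline measurement (this is a sketch, not the exact contents of base.py; the weight enum and the synchronization handling are assumptions):

import time
import torch
from torchvision import models

# Hypothetical sketch of the PyTorch/MPS baseline: ResNet50, batch size 1, 100 timed runs
device = torch.device("mps" if torch.backends.mps.is_available() else "cpu")
model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1).eval().to(device)

x = torch.randn(1, 3, 224, 224, device=device)
latencies = []
with torch.no_grad():
    for _ in range(100):
        start = time.perf_counter()
        _ = model(x)
        if device.type == "mps":
            torch.mps.synchronize()  # wait for GPU work before stopping the timer
        latencies.append((time.perf_counter() - start) * 1000)

print(f"avg latency: {sum(latencies) / len(latencies):.1f} ms")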

Implemented Optimizations

A. CoreML Optimized Versions

  • Scripts: prepare.py, coreml_optimized.py

  • Conversion from PyTorch: via a traced TorchScript model

  • Variants, combining precision (FP16/FP32) with pass-pipeline customization:

    • CoreML FP16 (default pipeline)
    • CoreML FP32 (default pipeline)
    • CoreML FP16 (custom pipeline)
    • CoreML FP32 (custom pipeline)
  • Custom Pipeline Passes Removed:

import coremltools as ct

# Start from the default conversion pass pipeline and drop passes that
# add no value for a plain CNN such as ResNet50
pipeline = ct.PassPipeline.DEFAULT
pipeline.remove_passes({
    "common::merge_consecutive_transposes",
    "common::cast_optimization",  # removed only for the FP32 variants
    "common::add_int16_cast",
})

Rationale: These were deemed unnecessary for a standard CNN like ResNet50.
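
For reference, the Core ML conversion in prepare.py likely follows this general shape; it is a sketch rather than the exact script, and the input name, shape, output file name, and the pass_pipeline argument (which requires a recent coremltools release) are assumptions:

import torch
import coremltools as ct
from torchvision import models

model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1).eval()
example = torch.randn(1, 3, 224, 224)
traced = torch.jit.trace(model, example)  # TorchScript trace used as the conversion source

# FP16 variant converted with the customized pass pipeline defined above
mlmodel = ct.convert(
    traced,
    inputs=[ct.TensorType(name="input", shape=example.shape)],
    convert_to="mlprogram",
    compute_precision=ct.precision.FLOAT16,  # ct.precision.FLOAT32 for the FP32 variants
    pass_pipeline=pipeline,                  # omit for the default-pipeline variants
)
mlmodel.save("resnet50_fp16_custom.mlpackage")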

B. ONNX Optimized Versions

  • Script: onnx_optimized.py

  • Conversion from PyTorch: via torch.onnx.export

  • Variants:

    • ONNX FP16 (using CoreMLExecutionProvider)
    • ONNX FP32 (using CoreMLExecutionProvider)
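
A sketch of how the ONNX variants can be exported and run (file names, opset version, and input/output names are assumptions; onnx_optimized.py may differ):

import torch
import onnxruntime as ort
from torchvision import models

# Export the FP32 model to ONNX
model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1).eval()
dummy = torch.randn(1, 3, 224, 224)
torch.onnx.export(model, dummy, "resnet50_fp32.onnx",
                  input_names=["input"], output_names=["logits"], opset_version=17)

# Run it through ONNX Runtime with the Core ML execution provider,
# falling back to the CPU provider for unsupported operators
session = ort.InferenceSession(
    "resnet50_fp32.onnx",
    providers=["CoreMLExecutionProvider", "CPUExecutionProvider"],
)
logits = session.run(None, {"input": dummy.numpy()})[0]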

Benchmark Methodology

  • Metrics Recorded:

    • Inference Latency (ms)
    • Accuracy (%) using a real dataset (subset of Imagenette)
    • Memory Usage (MB), measured via psutil
  • Batch Size: 1 (single image)

  • Runs: 100 inferences per variant (see the measurement sketch below)
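
The measurement loop could look roughly like the following (the predict callable and the way memory is sampled are assumptions; the scripts use psutil as noted above):

import time
import psutil

def benchmark(predict, images, runs=100):
    """Return average latency in ms and resident-memory growth in MB over `runs` inferences."""
    process = psutil.Process()
    rss_before = process.memory_info().rss
    latencies = []
    for i in range(runs):
        x = images[i % len(images)]  # batch size 1: one image per call
        start = time.perf_counter()
        predict(x)
        latencies.append((time.perf_counter() - start) * 1000)
    rss_after = process.memory_info().rss
    return sum(latencies) / len(latencies), (rss_after - rss_before) / 1e6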


Results Summary

Engine / Variant            Accuracy  Avg Latency (ms)  Memory Usage (MB)
--------------------------------------------------------------------------
PyTorch (MPS)               98%       28                30-43
CoreML FP16 (default)       98%       7-8               27-29
CoreML FP16 (custom)        98%       7-8               27-29
CoreML FP32 (default)       98%       12-14             27.5-33
CoreML FP32 (custom)        98%       12-14             27-31
ONNX FP16 (CoreML backend)  98%       78-80             35-50
ONNX FP32 (CoreML backend)  98%       11-13             10-14

Analysis and Takeaways

  • Best Performing Variant: CoreML FP16 (default pipeline) offered the lowest latency with minimal memory use (about 40% faster than the CoreML FP32 variants and roughly 4x faster than the PyTorch MPS baseline).
  • Custom Pipeline: Removing passes had no measurable effect on runtime or memory; it may have reduced compile time, but that was not measured.
  • ONNX with CoreML backend: Performed well with FP32 and achieved the lowest memory usage, but suffered high latency with FP16, likely due to precision-conversion overhead.
  • Memory Tradeoffs: PyTorch MPS showed the highest variance in memory usage; ONNX FP16 carried extra overhead from its conversion steps.

Final Recommendation

For macOS edge devices with an M1 Pro, Core ML with:

  • Precision: FLOAT16
  • Model format: ML Program (mlprogram)
  • Compute Units: ALL (to use ANE/GPU/CPU)

provides the best balance of speed and efficiency for ResNet50.
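
In coremltools terms, loading the recommended variant for inference would look something like this (the package name is a placeholder for the hypothetical output of prepare.py):

import numpy as np
import coremltools as ct

# Load the FP16 ML Program and let Core ML schedule work across ANE, GPU, and CPU
mlmodel = ct.models.MLModel(
    "resnet50_fp16.mlpackage",
    compute_units=ct.ComputeUnit.ALL,
)

image_array = np.random.rand(1, 3, 224, 224).astype(np.float32)  # stand-in for a preprocessed image
prediction = mlmodel.predict({"input": image_array})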


Files Included

  • base.py - PyTorch baseline benchmark
  • prepare.py - Export to ONNX and CoreML
  • coreml_optimized.py - CoreML model evaluation
  • onnx_optimized.py - ONNX model evaluation (using CoreML backend)
  • README.md - Overview

Further Optimizations

  • Explore the operator graph for missed fusion opportunities

  • Simplify the graph via:

    • Constant folding
    • Dead code elimination
    • Removing redundant operations
  • Model-level optimizations

    • Pruning
  • Profile the runtime to detect bottlenecks

  • The ONNX engine appears more open and flexible to further optimization (see the sketch below)
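
For the graph-level items (constant folding, dead code elimination, removal of redundant operations), one low-effort starting point is ONNX Runtime's built-in offline optimizer; a sketch, with the file names assumed:

import onnxruntime as ort

# Apply ONNX Runtime's graph optimizations (constant folding, redundant-node
# elimination, operator fusion) and write the optimized graph back to disk
opts = ort.SessionOptions()
opts.graph_optimization_level = ort.GraphOptimizationLevel.ORT_ENABLE_ALL
opts.optimized_model_filepath = "resnet50_fp32_optimized.onnx"
ort.InferenceSession("resnet50_fp32.onnx", sess_options=opts,
                     providers=["CPUExecutionProvider"])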
