
Espresso: production ANE inference framework + new MIL gotchas to contribute #44

@christopherkarani

Description

Hi @hollance,

Your "Everything we actually know about the Apple Neural Engine" repo has been an invaluable reference — thank you for documenting all of that. It's been essential to our research.

I'm building Espresso (https://github.com/christopherkarani/Espresso), a pure-Swift inference framework for Apple Silicon that uses the private ANE APIs you've documented to achieve 4.76x faster inference than CoreML (519 tok/s on M3 Max).

We've validated much of what you've documented and discovered a few additional gotchas we'd love to contribute back:

  • softmax on non-power-of-2 dimensions → InvalidMILProgram at compile time
  • slice_by_index on function inputs combined with RMSNorm+convs → InvalidMILProgram (workaround: prepare data in the right layout before passing to the function)
  • Lane-packed attention kernels (spatial=32) are necessary for stable ANE evaluation across M1–M4
  • reduce_mean does NOT exist in raw MIL text format — use reduce_sum + mul by 1/N
  • ANE eval unstable on some hosts even for single-input identity kernels
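
For the softmax restriction above, one host-side mitigation is to pad the softmax axis up to the next power of 2 before building the MIL program. This is a minimal numeric sketch of that idea (the function name `softmax_pow2` and the -inf padding strategy are my own illustration, not from Espresso's code):

```python
import numpy as np

def softmax_pow2(x):
    """Pad the last axis to the next power of 2 with -inf so the padded
    lanes contribute zero probability, then slice the result back.
    Sketch of a host-side workaround for non-power-of-2 softmax dims."""
    n = x.shape[-1]
    n_pad = 1 << (n - 1).bit_length()  # next power of 2 >= n
    pad = np.full(x.shape[:-1] + (n_pad - n,), -np.inf)
    xp = np.concatenate([x, pad], axis=-1)
    e = np.exp(xp - xp.max(axis=-1, keepdims=True))  # exp(-inf) == 0
    return (e / e.sum(axis=-1, keepdims=True))[..., :n]
```

Since exp(-inf) is exactly 0, the padded lanes never affect the normalization, so the sliced output matches a plain softmax over the original dimension.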

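The reduce_mean substitution is just the identity mean(x) = sum(x) * (1/N); a quick numeric check of the decomposed form:

```python
import numpy as np

# reduce_mean is absent from the raw MIL text format, so the
# decomposition is: reduce_sum followed by mul with the constant 1/N.
x = np.arange(12, dtype=np.float32).reshape(3, 4)
n = x.shape[1]
mean_via_sum = x.sum(axis=1) * (1.0 / n)  # reduce_sum + mul by 1/N
assert np.allclose(mean_via_sum, x.mean(axis=1))
```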
Would you be open to:

  1. Adding a reference to Espresso in the README as a production inference usage example?
  2. A PR contributing these findings to your docs?

Happy to submit the PR regardless of the README link. This community benefits from shared knowledge, and your docs deserve to stay current.

— Chris
https://github.com/christopherkarani/Espresso
