
Espresso: production ANE inference framework + new MIL gotchas to contribute #44

@christopherkarani

Description

Hi @hollance,

Your "Everything we actually know about the Apple Neural Engine" repo has been an invaluable reference — thank you for documenting all of that. It's been essential to our research.

I'm building Espresso (https://github.com/christopherkarani/Espresso), a pure-Swift inference framework for Apple Silicon that uses the private ANE APIs you've documented to achieve 4.76x faster inference than CoreML (519 tok/s on M3 Max).

We've validated much of what you've documented and discovered a few additional gotchas we'd love to contribute back:

  • softmax on non-power-of-2 dimensions → InvalidMILProgram at compile time
  • slice_by_index on function inputs combined with RMSNorm+convs → InvalidMILProgram (workaround: prepare data in the right layout before passing to the function)
  • Lane-packed attention kernels (spatial=32) are necessary for stable ANE evaluation across M1–M4
  • reduce_mean does NOT exist in raw MIL text format — use reduce_sum + mul by 1/N
  • ANE eval unstable on some hosts even for single-input identity kernels
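
For the softmax restriction above, one host-side mitigation is to pad the softmax axis up to the next power of 2 before building the MIL program. This is a minimal numeric sketch of that idea (the function name `softmax_pow2` and the -inf padding strategy are my own illustration, not from Espresso's code):

```python
import numpy as np

def softmax_pow2(x):
    """Pad the last axis to the next power of 2 with -inf so the padded
    lanes contribute zero probability, then slice the result back.
    Sketch of a host-side workaround for non-power-of-2 softmax dims."""
    n = x.shape[-1]
    n_pad = 1 << (n - 1).bit_length()  # next power of 2 >= n
    pad = np.full(x.shape[:-1] + (n_pad - n,), -np.inf)
    xp = np.concatenate([x, pad], axis=-1)
    e = np.exp(xp - xp.max(axis=-1, keepdims=True))  # exp(-inf) == 0
    return (e / e.sum(axis=-1, keepdims=True))[..., :n]
```

Since exp(-inf) is exactly 0, the padded lanes never affect the normalization, so the sliced output matches a plain softmax over the original dimension.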

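The reduce_mean substitution is just the identity mean(x) = sum(x) * (1/N); a quick numeric check of the decomposed form:

```python
import numpy as np

# reduce_mean is absent from the raw MIL text format, so the
# decomposition is: reduce_sum followed by mul with the constant 1/N.
x = np.arange(12, dtype=np.float32).reshape(3, 4)
n = x.shape[1]
mean_via_sum = x.sum(axis=1) * (1.0 / n)  # reduce_sum + mul by 1/N
assert np.allclose(mean_via_sum, x.mean(axis=1))
```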
Would you be open to:

  1. Adding a reference to Espresso in the README as a production inference usage example?
  2. A PR contributing these findings to your docs?

Happy to submit the PR regardless of the README link. This community benefits from shared knowledge, and your docs deserve to stay current.

— Chris
https://github.com/christopherkarani/Espresso
