[RFC] Winograd NNPACK for the ARM_CPU backend #2692

@hlu1

Description

(with @ajtulloch)

We're working on shipping TVM on Android for some internal products that use 3x3 convolutions heavily. We found the winograd_nnpack + TVM approach to be the best for shipping across a broad variety of Android devices. Here are the reasons:

  • We're restricted to AOT compilation and can only ship one model (packaged with TVM-generated code) to all Android devices.
  • We did autotuning with the direct and Winograd implementations on a Raspberry Pi (Cortex-A53) and found that NNPACK actually outperforms the best autotuned schedules for most of the layers.
  • Even when AutoTVM does find better schedules on Cortex-A53, the performance does not necessarily transfer to other microarchitectures. Autotuning works best for a fixed CPU microarchitecture, but when we ship a model, we care about its performance on a wide variety of devices (Cortex-A7, A9, A35, A53, A57, A72, A73, A75, Qualcomm Kryo, Samsung Mongoose M1, M2, and Meerkat M3, to name a few). It is very hard to get a single schedule that performs well across the board.

In the end, we decided to use NNPACK Winograd for all 3x3 convolutions and use TVM for the rest of the layers (so we can fuse and parallelize them). This gives us the best overall performance. At the same time, we're building up the infrastructure to ship models targeted to the CPU microarchitecture, so we can leverage AutoTVM to get the best performance.
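For context on why Winograd pays off specifically for 3x3 convolutions: the F(2,3) variant computes two outputs of a 3-tap correlation with 4 multiplications instead of 6, and NNPACK applies the 2-D analogue, F(2x2,3x3), as its fast 3x3 kernel. A minimal numpy sketch of the 1-D F(2,3) transform (using the standard transform matrices; this is an illustration, not NNPACK's actual implementation):

```python
import numpy as np

# Standard Winograd F(2,3) transform matrices.
BT = np.array([[1, 0, -1, 0],
               [0, 1,  1, 0],
               [0, -1, 1, 0],
               [0, 1,  0, -1]], dtype=np.float64)   # input transform
G = np.array([[1,   0,    0],
              [0.5, 0.5,  0.5],
              [0.5, -0.5, 0.5],
              [0,   0,    1]], dtype=np.float64)    # filter transform
AT = np.array([[1, 1,  1,  0],
               [0, 1, -1, -1]], dtype=np.float64)   # output transform

def winograd_f23(g, d):
    """Compute 2 outputs of correlating 3-tap filter g over 4-element tile d,
    using 4 elementwise multiplies instead of 6."""
    return AT @ ((G @ g) * (BT @ d))

# The direct method for comparison: 2 outputs x 3 taps = 6 multiplies.
def direct(g, d):
    return np.array([d[0:3] @ g, d[1:4] @ g])
```

In practice the filter transform `G @ g` is precomputed once per filter, so at inference time only the input and output transforms (which are additions) plus the 4 multiplies remain, which is where the speedup on 3x3 layers comes from.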

Here is our work in progress: https://github.com/hlu1/tvm/tree/winograd-nnpack-ARM. We would like to contribute back if there's interest from the community.
