[RFC] Winograd NNPACK for the ARM_CPU backend #2692

@hlu1

Description

(with @ajtulloch)

We're working on shipping TVM on Android for some internal products that use 3x3 convolutions heavily. We found the winograd_nnpack + TVM approach to be the best for shipping across a broad variety of Android devices. Here are the reasons:

  • We're restricted to AOT compilation and can only ship one model (packaged with TVM-generated code) to all Android devices.
  • We did autotuning with the direct and Winograd implementations on a Raspberry Pi (Cortex-A53) and found that NNPACK actually outperforms the best autotuned schedules for most of the layers.
  • Even when AutoTVM does find better schedules on Cortex-A53, the performance does not necessarily transfer to other microarchitectures. Autotuning works best for a fixed CPU microarchitecture, but when we ship a model, we care about its performance on a wide variety of devices (Cortex-A7, A9, A35, A53, A57, A72, A73, A75, Qualcomm Kryo, Samsung Mongoose M1, M2, and Meerkat M3, to name a few). It is very hard to get a single schedule that performs well across the board.

In the end, we decided to use NNPACK Winograd for all 3x3 convolutions and use TVM for the rest of the layers (so we can fuse and parallelize them). This gives us the best overall performance. At the same time, we're building up the infrastructure to ship models targeted to the CPU microarchitecture, so we can leverage AutoTVM to get the best performance.
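For context on why Winograd pays off specifically for 3x3 convolutions: the F(2,3) variant computes two outputs of a 3-tap correlation with 4 multiplications instead of 6, and NNPACK applies the 2-D analogue, F(2x2,3x3), as its fast 3x3 kernel. A minimal numpy sketch of the 1-D F(2,3) transform (using the standard transform matrices; this is an illustration, not NNPACK's actual implementation):

```python
import numpy as np

# Standard Winograd F(2,3) transform matrices.
BT = np.array([[1, 0, -1, 0],
               [0, 1,  1, 0],
               [0, -1, 1, 0],
               [0, 1,  0, -1]], dtype=np.float64)   # input transform
G = np.array([[1,   0,    0],
              [0.5, 0.5,  0.5],
              [0.5, -0.5, 0.5],
              [0,   0,    1]], dtype=np.float64)    # filter transform
AT = np.array([[1, 1,  1,  0],
               [0, 1, -1, -1]], dtype=np.float64)   # output transform

def winograd_f23(g, d):
    """Compute 2 outputs of correlating 3-tap filter g over 4-element tile d,
    using 4 elementwise multiplies instead of 6."""
    return AT @ ((G @ g) * (BT @ d))

# The direct method for comparison: 2 outputs x 3 taps = 6 multiplies.
def direct(g, d):
    return np.array([d[0:3] @ g, d[1:4] @ g])
```

In practice the filter transform `G @ g` is precomputed once per filter, so at inference time only the input and output transforms (which are additions) plus the 4 multiplies remain, which is where the speedup on 3x3 layers comes from.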

Here is our work in progress: https://github.com/hlu1/tvm/tree/winograd-nnpack-ARM. We would like to contribute back if there's interest from the community.
