Description
Weight quantization enables low-power edge devices to run machine-learning models by trading a few percentage points of prediction accuracy for a dramatic reduction in computation time. A minimal sketch of the core idea appears below.
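The following is an illustrative sketch only, assuming a symmetric per-tensor scheme; the function names and the 8-bit default are placeholders, not this project's API:

```python
import numpy as np

def quantize_weights(w: np.ndarray, bits: int = 8):
    """Map float weights onto signed integers using one scale per tensor."""
    qmax = 2 ** (bits - 1) - 1               # e.g. 127 for 8 bits
    scale = np.max(np.abs(w)) / qmax         # symmetric per-tensor scale
    q = np.clip(np.round(w / scale), -qmax, qmax).astype(np.int8)
    return q, scale

def dequantize_weights(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover an approximation of the original float weights."""
    return q.astype(np.float32) * scale

w = np.random.randn(4, 4).astype(np.float32)
q, scale = quantize_weights(w)
w_hat = dequantize_weights(q, scale)
print("max abs reconstruction error:", np.max(np.abs(w - w_hat)))
```

The integer weights are what the edge device stores and computes with; the scale factor is applied once at the end, which is where the speed and memory savings come from.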
Mobile devices will motivate aggressive optimizations and code changes to enable cross-platform SIMD across very different hardware. An alternative is to create an implementation that is fast (and simple) enough, and correct enough, that most users never need to care about numerical performance.
The approach was initially popularized by XNOR-Net for real-time edge vision classification:
https://arxiv.org/pdf/1603.05279.pdf
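In the XNOR-Net paper, each weight tensor W is approximated as alpha * sign(W), where the optimal scale alpha is the mean absolute value of W. A small sketch of that binarization step (the function name is an assumption, not from the paper's code):

```python
import numpy as np

def binarize(w: np.ndarray):
    """XNOR-Net-style weight binarization: W ~= alpha * sign(W)."""
    alpha = np.mean(np.abs(w))                    # optimal scale: L1 norm / n
    b = np.where(w >= 0, 1.0, -1.0).astype(np.float32)  # weights in {-1, +1}
    return b, alpha

w = np.random.randn(3, 3).astype(np.float32)
b, alpha = binarize(w)
print("binary approximation:\n", alpha * b)
```

With weights restricted to {-1, +1}, multiplications collapse to sign flips (and, with binary activations, to XNOR and popcount operations), which is what makes the scheme attractive on low-power hardware.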
But this is a generalized approach, and there have been deeper analyses of the best way to negotiate the bitwidth-versus-accuracy trade-off. The training side can keep compressing until accuracy drops below an acceptable threshold, as part of automatic hyperparameter tuning. This trades a slight increase in training time for low-power, high-speed streaming inference on edge processors.
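One way such a tuning loop could look is sketched below; `quantize_model` and `evaluate` are hypothetical stand-ins for whatever quantization and validation routines the training pipeline provides, and the 2% tolerance is an arbitrary example:

```python
def pick_bitwidth(model, val_data, evaluate, quantize_model,
                  max_drop: float = 0.02, bitwidths=(8, 6, 4, 2, 1)):
    """Lower the weight bitwidth until validation accuracy drops too far.

    Returns the narrowest (bits, quantized_model, accuracy) still within
    `max_drop` of the full-precision baseline, or None if even the widest
    setting fails.
    """
    baseline = evaluate(model, val_data)      # full-precision accuracy
    best = None
    for bits in bitwidths:                    # widest to narrowest
        quantized = quantize_model(model, bits)
        acc = evaluate(quantized, val_data)
        if baseline - acc <= max_drop:        # still within tolerance
            best = (bits, quantized, acc)
        else:
            break                             # accuracy dropped too low
    return best
```

Because each candidate bitwidth requires a validation pass (and possibly fine-tuning), this is the source of the extra training time mentioned above; the payoff is paid once, while the inference savings recur on every deployed device.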