Model Optimisation Techniques: Quantisation, Compression, and Knowledge Distillation
This repository contains a curated list of optimization and compression techniques for machine learning models.
It is an IIIT-Hyderabad college project exploring these techniques.
Model Compression and Architecture Optimization
- Compression Techniques
  - Pruning
  - Quantization
  - Hashing
  - Knowledge Distillation
  - Low-Rank Approximation
  - Precision Reduction [cost metrics: FLOPs (floating-point operations), FLOPS (floating-point operations per second), MACs (multiply-accumulate computations); 1 MAC = 2 FLOPs] (a worked example follows this list)
- Architecture Optimization
  - Architecture Changes
  - Neural Architecture Search
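
To make the MAC/FLOP relationship above concrete, here is a small worked example for a fully connected layer; the layer shape is an arbitrary assumption:

```python
# Rough per-sample cost of a single fully connected (Linear) layer, bias ignored.
# Each output neuron computes a dot product over all inputs: one multiply and one
# add per (input, output) pair, i.e. one MAC, and 1 MAC = 2 FLOPs.
in_features, out_features = 512, 256   # arbitrary example shape

macs = in_features * out_features      # multiply-accumulate operations
flops = 2 * macs                       # floating-point operations

print(f"MACs per sample:  {macs:,}")   # 131,072
print(f"FLOPs per sample: {flops:,}")  # 262,144
```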
List of optimization techniques
- Pruning: removing redundant connections from the network by cutting out unimportant weights, usually those with the smallest absolute values (a minimal sketch follows the sub-list below).
  - Unstructured Pruning
  - Structured Pruning
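
A minimal sketch of both pruning flavours, assuming PyTorch and its built-in `torch.nn.utils.prune` utilities; the toy model and pruning amounts are arbitrary choices for illustration:

```python
import torch.nn as nn
import torch.nn.utils.prune as prune

# Toy model standing in for a trained network.
model = nn.Sequential(nn.Linear(784, 256), nn.ReLU(), nn.Linear(256, 10))

# Unstructured pruning: zero out the 30% of individual weights with the
# smallest absolute value (L1 criterion) in the first Linear layer.
prune.l1_unstructured(model[0], name="weight", amount=0.3)

# Structured pruning: remove whole output rows of the last Linear layer,
# dropping the 50% of rows with the smallest L2 norm.
prune.ln_structured(model[2], name="weight", amount=0.5, n=2, dim=0)

# Fold the pruning masks into the weight tensors permanently.
for module in (model[0], model[2]):
    prune.remove(module, "weight")

sparsity = (model[0].weight == 0).float().mean().item()
print(f"Sparsity of first layer: {sparsity:.1%}")
```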
- Quantization: bundling weights together by clustering or rounding them so that the same set of connections can be represented in less memory (a sketch follows the sub-list below).
  - Dynamic Quantization
  - Static Quantization
  - Quantization-Aware Training
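
As an illustration of the first variant, post-training dynamic quantization in PyTorch stores the weights of selected layer types as int8 and quantizes activations on the fly at inference time. A minimal sketch, assuming a small toy model:

```python
import torch
import torch.nn as nn

# A small float32 model standing in for a trained network.
model = nn.Sequential(nn.Linear(784, 256), nn.ReLU(), nn.Linear(256, 10))
model.eval()

# Dynamic quantization: nn.Linear weights are stored as int8, and activations
# are quantized on the fly at inference time. No calibration data is needed.
quantized = torch.quantization.quantize_dynamic(model, {nn.Linear}, dtype=torch.qint8)

x = torch.randn(1, 784)
with torch.no_grad():
    print(quantized(x).shape)  # torch.Size([1, 10])
```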
- ONNX conversion and ONNX Runtime
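
A rough sketch of the export-and-run workflow, assuming PyTorch plus the `onnx` and `onnxruntime` packages; the model, file name, and tensor shapes are arbitrary examples:

```python
import numpy as np
import torch
import torch.nn as nn
import onnxruntime as ort

model = nn.Sequential(nn.Linear(784, 256), nn.ReLU(), nn.Linear(256, 10))
model.eval()

# Export the PyTorch model to the ONNX format.
dummy_input = torch.randn(1, 784)
torch.onnx.export(
    model, dummy_input, "model.onnx",
    input_names=["input"], output_names=["output"],
    dynamic_axes={"input": {0: "batch"}, "output": {0: "batch"}},
)

# Run the exported model with ONNX Runtime.
session = ort.InferenceSession("model.onnx", providers=["CPUExecutionProvider"])
(outputs,) = session.run(None, {"input": np.random.randn(4, 784).astype(np.float32)})
print(outputs.shape)  # (4, 10)
```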
- Knowledge Distillation
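
A minimal sketch of the classic soft-target distillation loss: the student is trained to match the teacher's softened output distribution via KL divergence, blended with the usual cross-entropy on hard labels. The temperature, mixing weight, and toy models below are arbitrary example choices, assuming PyTorch:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.7):
    """Soft-target KL loss at temperature T, blended with hard-label cross-entropy."""
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=1),
        F.softmax(teacher_logits / T, dim=1),
        reduction="batchmean",
    ) * (T * T)                                  # rescale for the temperature
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard

# Toy setup: a large "teacher" guiding a much smaller "student".
teacher = nn.Sequential(nn.Linear(784, 512), nn.ReLU(), nn.Linear(512, 10))
student = nn.Sequential(nn.Linear(784, 64), nn.ReLU(), nn.Linear(64, 10))

x, y = torch.randn(32, 784), torch.randint(0, 10, (32,))
with torch.no_grad():                            # teacher stays frozen
    teacher_logits = teacher(x)
loss = distillation_loss(student(x), teacher_logits, y)
loss.backward()                                  # gradients flow only into the student
print(loss.item())
```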
- Core ML for mobile devices
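
A sketch of converting a traced PyTorch model with `coremltools` for on-device deployment on Apple hardware; the toy model, input shape, and output file name are arbitrary assumptions:

```python
import torch
import torch.nn as nn
import coremltools as ct

model = nn.Sequential(nn.Linear(784, 256), nn.ReLU(), nn.Linear(256, 10))
model.eval()

# Core ML conversion starts from a traced (TorchScript) model.
example_input = torch.randn(1, 784)
traced = torch.jit.trace(model, example_input)

# Convert to an ML Program and save it for use on iOS/macOS devices.
mlmodel = ct.convert(
    traced,
    inputs=[ct.TensorType(name="input", shape=(1, 784))],
    convert_to="mlprogram",
)
mlmodel.save("model.mlpackage")
```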
- Neural Architecture Search (NAS)
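
NAS automates the choice of architecture by searching a space of candidate designs under an evaluation budget. Below is a deliberately tiny sketch of that idea in PyTorch, using exhaustive search over a toy search space with a cheap proxy evaluation; the synthetic data, search space, and training budget are all arbitrary assumptions, not a real NAS method:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
# Tiny synthetic classification task so the sketch is self-contained.
X, y = torch.randn(512, 20), torch.randint(0, 2, (512,))

def build(depth, width):
    """Build a candidate MLP described by (depth, width)."""
    layers, d = [], 20
    for _ in range(depth):
        layers += [nn.Linear(d, width), nn.ReLU()]
        d = width
    layers.append(nn.Linear(d, 2))
    return nn.Sequential(*layers)

def evaluate(model, steps=30):
    """Cheap proxy evaluation: a short training run, scored by training accuracy."""
    opt = torch.optim.Adam(model.parameters(), lr=1e-2)
    for _ in range(steps):
        opt.zero_grad()
        F.cross_entropy(model(X), y).backward()
        opt.step()
    with torch.no_grad():
        return (model(X).argmax(dim=1) == y).float().mean().item()

# Exhaustive search over a tiny space of depths and widths; real NAS methods
# explore much larger spaces with smarter strategies (RL, evolution,
# gradient-based relaxations) and score candidates on held-out data.
search_space = [(d, w) for d in (1, 2, 3) for w in (16, 32, 64)]
best_cfg, best_acc = max(
    ((cfg, evaluate(build(*cfg))) for cfg in search_space), key=lambda t: t[1]
)
print(f"best architecture (depth, width): {best_cfg}, proxy accuracy: {best_acc:.2f}")
```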
- Low-Rank Approximation
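
A sketch of low-rank approximation applied to a single fully connected layer: the weight matrix is factored with a truncated SVD into two thinner layers whose product approximates the original. The layer size and rank below are arbitrary example values, assuming PyTorch:

```python
import torch
import torch.nn as nn

def factorize_linear(layer: nn.Linear, rank: int) -> nn.Sequential:
    """Replace a Linear layer with two thinner ones via truncated SVD of its weight."""
    W = layer.weight.data                        # shape (out_features, in_features)
    U, S, Vh = torch.linalg.svd(W, full_matrices=False)
    U, S, Vh = U[:, :rank], S[:rank], Vh[:rank, :]

    first = nn.Linear(layer.in_features, rank, bias=False)
    second = nn.Linear(rank, layer.out_features, bias=layer.bias is not None)
    first.weight.data = Vh                       # (rank, in_features)
    second.weight.data = U * S                   # (out_features, rank), i.e. U @ diag(S)
    if layer.bias is not None:
        second.bias.data = layer.bias.data
    return nn.Sequential(first, second)

layer = nn.Linear(512, 512)
compressed = factorize_linear(layer, rank=64)

x = torch.randn(8, 512)
max_err = (layer(x) - compressed(x)).abs().max().item()
print(f"weights: {512 * 512} -> {512 * 64 + 64 * 512}, max abs error: {max_err:.4f}")
```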
