
Commit acdd4ca

add self distillation and example for resnet (#1473)
1 parent ffa9a39 commit acdd4ca

File tree

15 files changed: +1685 -0 lines changed


docs/distillation.md

Lines changed: 13 additions & 0 deletions
@@ -7,6 +7,8 @@ Distillation

    1.2. [Intermediate Layer Knowledge Distillation](#intermediate-layer-knowledge-distillation)

    1.3. [Self Distillation](#self-distillation)

2. [Distillation Support Matrix](#distillation-support-matrix)

3. [Get Started with Distillation API](#get-started-with-distillation-api)

4. [Examples](#examples)
@@ -35,12 +37,23 @@ $$L_{KD} = \sum\limits_i D(T_t^{n_i}(F_t^{n_i}), T_s^{m_i}(F_s^{m_i}))$$

Where $D$ is a distance measurement as before, $F_t^{n_i}$ is the output feature of the $n_i$-th layer of the teacher model, and $F_s^{m_i}$ is the output feature of the $m_i$-th layer of the student model. Since the dimensions of $F_t^{n_i}$ and $F_s^{m_i}$ are usually different, the transformations $T_t^{n_i}$ and $T_s^{m_i}$ are needed to match the dimensions of the two features. Specifically, a transformation can take forms such as the identity, a linear transformation, or a 1x1 convolution.
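To make the formula concrete, here is a minimal PyTorch sketch of one layer-pair term, not the Intel® Neural Compressor API: it assumes $D$ is mean squared error and $T_s^{m_i}$ is a 1x1 convolution that projects the student feature to the teacher's channel count, with $T_t^{n_i}$ the identity; the channel counts and feature-map shapes are made up for the example.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Minimal sketch of one term of the intermediate-layer loss above:
# D is mean squared error, T_s is a 1x1 convolution that projects the
# student feature to the teacher's channel dimension, T_t is the identity.
# Channel counts and feature-map sizes are assumptions for illustration.

class FeatureKDLoss(nn.Module):
    def __init__(self, student_channels: int, teacher_channels: int):
        super().__init__()
        # T_s: match the student feature's dimensions to the teacher's
        self.transform = nn.Conv2d(student_channels, teacher_channels, kernel_size=1)

    def forward(self, student_feat: torch.Tensor, teacher_feat: torch.Tensor) -> torch.Tensor:
        # D(T_t(F_t), T_s(F_s)) with D = MSE; the teacher is frozen, so detach it
        return F.mse_loss(self.transform(student_feat), teacher_feat.detach())

# Example usage with made-up shapes:
criterion = FeatureKDLoss(student_channels=64, teacher_channels=256)
f_s = torch.randn(8, 64, 28, 28)   # student feature F_s^{m_i}
f_t = torch.randn(8, 256, 28, 28)  # teacher feature F_t^{n_i}
loss = criterion(f_s, f_t)         # one term of the sum over matched layer pairs
```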

### Self Distillation

Self-distillation is a one-stage training method in which the teacher and student models are trained together. It attaches several attention modules and shallow classifiers at different depths of the neural network and distills knowledge from the deepest classifier into the shallower classifiers. Unlike conventional knowledge distillation, where the knowledge of a teacher model is transferred to a separate student model, self-distillation can be viewed as knowledge transfer within the same model, from the deeper layers to the shallower layers.

The additional classifiers also allow the neural network to work in a dynamic manner (for example, exiting from a shallower classifier at inference time), which leads to significant acceleration.
<br>

<img src="./imgs/self-distillation.png" alt="Architecture" width=800 height=350>

Architecture from the paper [Self-Distillation: Towards Efficient and Compact Neural Networks](https://ieeexplore.ieee.org/document/9381661)
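To make the training objective concrete, the following is a minimal PyTorch sketch of a self-distillation loss rather than the Intel® Neural Compressor implementation: the deepest classifier is trained with ordinary cross entropy and its softened predictions supervise the shallower classifiers; the feature-level hint losses described in the papers are omitted, and the temperature, weighting, and number of exits are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

# Sketch of the self-distillation objective: the deepest classifier acts as
# the in-model teacher for the shallower auxiliary classifiers. The feature
# hint losses from the papers are omitted; temperature and weights are
# assumptions for illustration, not the values used by the example recipe.

def self_distillation_loss(logits_per_exit, labels, temperature=3.0, alpha=0.3):
    """logits_per_exit: logits from each attached classifier, deepest last."""
    deepest = logits_per_exit[-1]
    # The deepest classifier is trained with ordinary cross entropy
    loss = F.cross_entropy(deepest, labels)
    # Its softened predictions become the targets for the shallower exits
    soft_targets = F.softmax(deepest.detach() / temperature, dim=1)
    for logits in logits_per_exit[:-1]:
        ce = F.cross_entropy(logits, labels)
        kd = F.kl_div(F.log_softmax(logits / temperature, dim=1),
                      soft_targets, reduction="batchmean") * temperature ** 2
        loss = loss + (1.0 - alpha) * ce + alpha * kd
    return loss

# Example with three attached classifiers (two shallow exits + the deepest):
exits = [torch.randn(8, 100) for _ in range(3)]   # batch of 8, 100 classes
labels = torch.randint(0, 100, (8,))
total_loss = self_distillation_loss(exits, labels)
```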
## Distillation Support Matrix

|Distillation Algorithm                           |PyTorch   |TensorFlow |
|------------------------------------------------|:--------:|:---------:|
|Knowledge Distillation                           |&#10004;  |&#10004;   |
|Intermediate Layer Knowledge Distillation        |&#10004;  |Will be supported|
|Self Distillation                                |&#10004;  |&#10006;   |

## Get Started with Distillation API

docs/imgs/self-distillation.png

311 KB

examples/README.md

Lines changed: 20 additions & 0 deletions
In the Intel® Neural Compressor validated example tables, an "Approach" column is added to the TensorFlow and PyTorch distillation sections, and a row is added for the new ResNet self-distillation example.

TensorFlow distillation examples:

| Student Model | Teacher Model | Domain | Approach | Examples |
|---------------|---------------|--------|----------|----------|
| MobileNet | DenseNet201 | Image Recognition | Knowledge Distillation | [pb](./tensorflow/image_recognition/tensorflow_models/distillation) |

PyTorch distillation examples:

| Student Model | Teacher Model | Domain | Approach | Examples |
|---------------|---------------|--------|----------|----------|
| CNN-2 | CNN-10 | Image Recognition | Knowledge Distillation | [eager](./pytorch/image_recognition/CNN-2/distillation/eager) |
| MobileNet V2-0.35 | WideResNet40-2 | Image Recognition | Knowledge Distillation | [eager](./pytorch/image_recognition/MobileNetV2-0.35/distillation/eager) |
| ResNet18\|ResNet34\|ResNet50\|ResNet101 | ResNet18\|ResNet34\|ResNet50\|ResNet101 | Image Recognition | Knowledge Distillation | [eager](./pytorch/image_recognition/torchvision_models/distillation/eager) |
| ResNet18\|ResNet34\|ResNet50\|ResNet101 | ResNet18\|ResNet34\|ResNet50\|ResNet101 | Image Recognition | Self Distillation | [eager](./pytorch/image_recognition/torchvision_models/self_distillation/eager) |
| VGG-8 | VGG-13 | Image Recognition | Knowledge Distillation | [eager](./pytorch/image_recognition/VGG-8/distillation/eager) |
| BlendCNN | BERT-Base | Natural Language Processing | Knowledge Distillation | [eager](./pytorch/nlp/blendcnn/distillation/eager) |
| DistilBERT | BERT-Base | Natural Language Processing | Knowledge Distillation | [eager](./pytorch/nlp/huggingface_models/question-answering/distillation/eager) |
| BiLSTM | RoBERTa-Base | Natural Language Processing | Knowledge Distillation | [eager](./pytorch/nlp/huggingface_models/text-classification/distillation/eager) |
| TinyBERT | BERT-Base | Natural Language Processing | Knowledge Distillation | [eager](./pytorch/nlp/huggingface_models/text-classification/distillation/eager) |
| BERT-3 | BERT-Base | Natural Language Processing | Knowledge Distillation | [eager](./pytorch/nlp/huggingface_models/text-classification/distillation/eager) |
| DistilRoBERTa | RoBERTa-Large | Natural Language Processing | Knowledge Distillation | [eager](./pytorch/nlp/huggingface_models/text-classification/distillation/eager) |
New file (README for the ResNet self-distillation example)

Lines changed: 22 additions & 0 deletions
@@ -0,0 +1,22 @@

Details **TBD**

### Prepare requirements
```shell
pip install -r requirements.txt
```

### Run self distillation
```shell
bash run_distillation.sh --topology=(resnet18|resnet34|resnet50|resnet101) --config=conf.yaml --output_model=path/to/output_model --dataset_location=path/to/dataset --use_cpu=(0|1)
```

### CIFAR100 benchmark
https://github.com/weiaicunzai/pytorch-cifar100

### Papers
[Be Your Own Teacher: Improve the Performance of Convolutional Neural Networks via Self Distillation](https://openaccess.thecvf.com/content_ICCV_2019/html/Zhang_Be_Your_Own_Teacher_Improve_the_Performance_of_Convolutional_Neural_ICCV_2019_paper.html)

[Self-Distillation: Towards Efficient and Compact Neural Networks](https://ieeexplore.ieee.org/document/9381661)

### Our results on CIFAR100
Top-1 accuracy (%):

| Model    | Baseline | Classifier1 | Classifier2 | Classifier3 | Classifier4 | Ensemble |
| :------: | :-------:| :---------: | :---------: | :---------: | :---------: | :------: |
| ResNet50 | 80.88    | 82.06       | 83.64       | 83.85       | 83.41       | 85.10    |
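The commit does not show how the Ensemble column is computed; a common choice in self-distillation work is to combine the outputs of all attached classifiers. The sketch below averages their softmax outputs, which is an assumption for illustration rather than the exact recipe behind these numbers.

```python
import torch
import torch.nn.functional as F

# Hypothetical ensemble: average the softmax outputs of all attached
# classifiers and take the arg-max. This is one common scheme, assumed
# here for illustration; it is not necessarily the recipe used for the
# numbers in the table above.

def ensemble_accuracy(logits_per_exit, labels):
    probs = torch.stack([F.softmax(l, dim=1) for l in logits_per_exit]).mean(dim=0)
    return (probs.argmax(dim=1) == labels).float().mean().item()

# Example with four classifiers, batch of 8, 100 classes:
exits = [torch.randn(8, 100) for _ in range(4)]
labels = torch.randint(0, 100, (8,))
top1 = ensemble_accuracy(exits, labels)
```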
