Curriculum learning #1307

Merged · 5 commits · Aug 16, 2021
Changes from 1 commit
add doc/tutorial/assertion
conglongli committed Aug 16, 2021
commit 7c8c93438db04458d3bb51a497c831cf9dfcf86a
8 changes: 7 additions & 1 deletion README.md
@@ -33,6 +33,7 @@ information [here](https://innovation.microsoft.com/en-us/exploring-ai-at-scale)


# News
* [2021/08/16] [Curriculum learning: a regularization method for stable and 2.6x faster GPT-2 pre-training with 8x/4x larger batch size/learning rate](https://www.deepspeed.ai/tutorials/curriculum-learning/)
* [2021/05/24] [DeepSpeed: Accelerating large-scale model inference and training via system optimizations and compression](https://www.microsoft.com/en-us/research/blog/deepspeed-accelerating-large-scale-model-inference-and-training-via-system-optimizations-and-compression/)
* [2021/04/20] [1-bit LAMB: up to 4.6x less communication and 2.8x faster training, together with LAMB's convergence speed at large batch sizes](https://www.deepspeed.ai/tutorials/onebit-lamb/)
* [2021/04/19] [ZeRO-Infinity unlocks unprecedented model scale for deep learning training](https://www.microsoft.com/en-us/research/blog/zero-infinity-and-deepspeed-unlocking-unprecedented-model-scale-for-deep-learning-training/)
@@ -148,6 +149,10 @@ overview](https://www.deepspeed.ai/features/) for descriptions and usage.
* Learning Rate Range Test
* 1Cycle Learning Rate Schedule
* [Simplified Data Loader](https://www.deepspeed.ai/features/#simplified-data-loader)
* [Curriculum Learning](https://www.deepspeed.ai/tutorials/curriculum-learning/)
* A curriculum learning-based data pipeline that presents easier or simpler examples earlier during training
* Stable and 2.6x faster GPT-2 pre-training with 8x/4x larger batch size/learning rate while maintaining token-wise convergence speed
* Complementary to many other DeepSpeed features
* [Performance Analysis and Debugging](https://www.deepspeed.ai/features/#performance-analysis-and-debugging)


@@ -198,9 +203,10 @@ Conduct](https://opensource.microsoft.com/codeofconduct/). For more information
2. Jeff Rasley, Samyam Rajbhandari, Olatunji Ruwase, and Yuxiong He. (2020) DeepSpeed: System Optimizations Enable Training Deep Learning Models with Over 100 Billion Parameters. [In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (KDD '20, Tutorial)](https://dl.acm.org/doi/10.1145/3394486.3406703).
3. Minjia Zhang, Yuxiong He. (2020) Accelerating Training of Transformer-Based Language Models with Progressive Layer Dropping. [arXiv:2010.13369](https://arxiv.org/abs/2010.13369) and [NeurIPS 2020](https://proceedings.neurips.cc/paper/2020/hash/a1140a3d0df1c81e24ae954d935e8926-Abstract.html).
4. Jie Ren, Samyam Rajbhandari, Reza Yazdani Aminabadi, Olatunji Ruwase, Shuangyan Yang, Minjia Zhang, Dong Li, Yuxiong He. (2021) ZeRO-Offload: Democratizing Billion-Scale Model Training. [arXiv:2101.06840](https://arxiv.org/abs/2101.06840).
5. Hanlin Tang, Shaoduo Gan, Ammar Ahmad Awan, Samyam Rajbhandari, Conglong Li, Xiangru Lian, Ji Liu, Ce Zhang, Yuxiong He. (2021) 1-bit Adam: Communication Efficient Large-Scale Training with Adam's Convergence Speed. [arXiv:2102.02888](https://arxiv.org/abs/2102.02888).
5. Hanlin Tang, Shaoduo Gan, Ammar Ahmad Awan, Samyam Rajbhandari, Conglong Li, Xiangru Lian, Ji Liu, Ce Zhang, Yuxiong He. (2021) 1-bit Adam: Communication Efficient Large-Scale Training with Adam's Convergence Speed. [arXiv:2102.02888](https://arxiv.org/abs/2102.02888) and [ICML 2021](http://proceedings.mlr.press/v139/tang21a.html).
6. Samyam Rajbhandari, Olatunji Ruwase, Jeff Rasley, Shaden Smith, Yuxiong He. (2021) ZeRO-Infinity: Breaking the GPU Memory Wall for Extreme Scale Deep Learning. [arXiv:2104.07857](https://arxiv.org/abs/2104.07857).
7. Conglong Li, Ammar Ahmad Awan, Hanlin Tang, Samyam Rajbhandari, Yuxiong He. (2021) 1-bit LAMB: Communication Efficient Large-Scale Large-Batch Training with LAMB's Convergence Speed. [arXiv:2104.06069](https://arxiv.org/abs/2104.06069).
8. Conglong Li, Minjia Zhang, Yuxiong He. (2021) Curriculum Learning: A Regularization Method for Efficient and Stable Billion-Scale GPT Model Pre-Training. [arXiv:2108.06084](https://arxiv.org/abs/2108.06084).

# Videos
1. DeepSpeed KDD 2020 Tutorial
15 changes: 15 additions & 0 deletions deepspeed/runtime/data_pipeline/curriculum_scheduler.py
@@ -8,6 +8,10 @@ class CurriculumScheduler(object):
def __init__(self, config):
super().__init__()
self.state = {}
assert "curriculum_type" in config, "Curriculum learning requires the config 'curriculum_type'"
assert "min_difficulty" in config, "Curriculum learning requires the config 'min_difficulty'"
assert "max_difficulty" in config, "Curriculum learning requires the config 'max_difficulty'"
assert "schedule_type" in config, "Curriculum learning requires the config 'schedule_type'"
self.state['min_difficulty'] = config['min_difficulty']
self.state['max_difficulty'] = config['max_difficulty']
self.state['current_difficulty'] = config['min_difficulty']
@@ -25,6 +29,12 @@ def __init__(self, config):
The self.state['schedule'] is a dictionary of
difficulty : [max step for this difficulty, next difficulty].
"""
assert "difficulty" in config['schedule_config'], "Curriculum learning with fixed_discrete schedule requires the schedule_config 'difficulty'"
assert "max_step" in config['schedule_config'], "Curriculum learning with fixed_discrete schedule requires the schedule_config 'max_step'"
assert len(config['schedule_config']['max_step']) > 0
assert len(config['schedule_config']['difficulty']) > 0
assert len(config['schedule_config']['difficulty']) == len(
config['schedule_config']['max_step']) + 1
self.state['schedule'] = {}
for i in range(len(config['schedule_config']['max_step'])):
self.state['schedule'][config['schedule_config']['difficulty'][i]] = \
@@ -49,6 +59,9 @@ def __init__(self, config):
"root_degree": 2
}
"""
assert "total_step" in config['schedule_config'], "Curriculum learning with fixed_root schedule requires the schedule_config 'total_step'"
assert "difficulty_step" in config['schedule_config'], "Curriculum learning with fixed_root schedule requires the schedule_config 'difficulty_step'"
assert "root_degree" in config['schedule_config'], "Curriculum learning with fixed_root schedule requires the schedule_config 'root_degree'"
self.state['schedule'] = config['schedule_config']
elif config['schedule_type'] == 'fixed_linear':
"""
@@ -59,6 +72,8 @@ def __init__(self, config):
"difficulty_step": 8
}
"""
assert "total_step" in config['schedule_config'], "Curriculum learning with fixed_linear schedule requires the schedule_config 'total_step'"
assert "difficulty_step" in config['schedule_config'], "Curriculum learning with fixed_linear schedule requires the schedule_config 'difficulty_step'"
self.state['schedule'] = config['schedule_config']
else:
raise RuntimeError('Unsupported curriculum schedule type')
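As a reading aid for the diff above, here is a minimal, illustrative sketch of what the `fixed_linear` and `fixed_root` schedules amount to. It assumes the linear/root interpolation described in the configuration documentation added later in this PR and rounds down to a multiple of `difficulty_step`; it is not a copy of the DeepSpeed implementation, whose rounding details may differ.

```python
def sketch_difficulty(step, min_difficulty, max_difficulty, schedule_type, schedule_config):
    """Illustrative only: map a global training step to a curriculum difficulty
    for the fixed_linear / fixed_root schedule types."""
    if schedule_type == 'fixed_linear':
        root_degree = 1
    elif schedule_type == 'fixed_root':
        root_degree = schedule_config['root_degree']
    else:
        raise RuntimeError('Unsupported curriculum schedule type')
    # Fraction of the schedule completed; root_degree=1 recovers the linear case,
    # while root_degree>1 raises the difficulty faster early in training.
    progress = min(1.0, step / schedule_config['total_step']) ** (1.0 / root_degree)
    difficulty = min_difficulty + progress * (max_difficulty - min_difficulty)
    # Round down to a multiple of difficulty_step (e.g. 8 for FP16 Tensor Cores).
    difficulty = int(difficulty) // schedule_config['difficulty_step'] * schedule_config['difficulty_step']
    return max(min_difficulty, min(difficulty, max_difficulty))

# The fixed_discrete type is a plain lookup instead: e.g. difficulty [64, 128, 256]
# with max_step [1000, 2000] means difficulty 64 until step 1000, 128 until step 2000,
# then 256 afterwards -- hence len(difficulty) == len(max_step) + 1 in the asserts above.
```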
1 change: 1 addition & 0 deletions docs/_config.yml
@@ -36,6 +36,7 @@ collections:
- bert-finetuning.md
- bert-pretraining.md
- cifar-10.md
- curriculum-learning.md
- flops-profiler.md
- gan.md
- lrrt.md
2 changes: 2 additions & 0 deletions docs/_data/navigation.yml
@@ -70,6 +70,8 @@ lnav:
url: /tutorials/bert-pretraining/
- title: "CIFAR-10"
url: /tutorials/cifar-10/
- title: "Curriculum Learning"
url: /tutorials/curriculum-learning/
- title: "Flops Profiler"
url: /tutorials/flops-profiler/
- title: "GAN"
76 changes: 76 additions & 0 deletions docs/_pages/config-json.md
@@ -716,3 +716,79 @@ Configuring the asynchronous I/O module for offloading parameter and optimizer s
"num_sliding_window_blocks": 3
}
```

### Curriculum Learning
```json
"curriculum_learning": {
"enabled": true,
"curriculum_type": "seqlen",
"min_difficulty": 8,
"max_difficulty": 1024,
"schedule_type": "fixed_linear",
"schedule_config": {
"total_step": 40000,
"difficulty_step": 8
}
}
```
<i>**enabled**</i>: [boolean]

| Description | Default |
| ----------------------------------------- | ------- |
| Set to true to enable curriculum learning | `false` |

<i>**curriculum_type**</i>: [string]

| Description | Default |
| ----------------------------------------------------------------- | ------- |
| Type of curriculum difficulty metric. Currently supports `seqlen`. | N/A |


<i>**min_difficulty**</i>: [integer]

| Description | Default |
| ----------------------------- | ------- |
| The starting difficulty level | N/A |

<i>**max_difficulty**</i>: [integer]

| Description | Default |
| --------------------------- | ------- |
| The ending difficulty level | N/A |

<i>**schedule_type**</i>: [string]

| Description | Default |
| -------------------------------------------------------------------------------------------------- | ------- |
| Type of curriculum schedule. Currently supports `fixed_linear`, `fixed_root`, and `fixed_discrete`. | N/A |


<i>**total_step**</i>: [integer]

| Description | Default |
| --------------------------------------------------------------- | ------- |
| Total number of steps of the curriculum schedule. Part of `schedule_config` when the `fixed_linear` or `fixed_root` schedule_type is used. | N/A |

<i>**difficulty_step**</i>: [integer]

| Description | Default |
| --------------------------------------------------------------- | ------- |
| At any time, the curriculum learning difficulty must be a multiple of this `difficulty_step`. Set this to a multiple of 8 (for FP16 data) or 16 (for INT8 data) to enable NVIDIA Tensor Core acceleration. Part of `schedule_config` when the `fixed_linear` or `fixed_root` schedule_type is used. | N/A |

<i>**root_degree**</i>: [integer]

| Description | Default |
| --------------------------------------------------------------- | ------- |
| Root degree of the curriculum schedule function. Part of `schedule_config` when the `fixed_root` schedule_type is used. | N/A |

<i>**difficulty**</i>: [list of integer]

| Description | Default |
| --------------------------------------------------------------- | ------- |
| List of difficulty levels to be used during the schedule. Part of `schedule_config` when the `fixed_discrete` schedule_type is used. | N/A |

<i>**max_step**</i>: [list of integer]

| Description | Default |
| --------------------------------------------------------------- | ------- |
| List of steps at which to change the difficulty level. Part of `schedule_config` when the `fixed_discrete` schedule_type is used. | N/A |
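
As a sanity check of the sample `fixed_linear` configuration at the top of this section (`min_difficulty` 8, `max_difficulty` 1024, `total_step` 40000, `difficulty_step` 8), a plain linear interpolation rounded down to a multiple of `difficulty_step` evolves roughly as below; the exact values produced by DeepSpeed may differ slightly depending on rounding.

```python
min_d, max_d, total_step, diff_step = 8, 1024, 40000, 8

for step in (0, 10000, 20000, 40000):
    progress = min(1.0, step / total_step)
    d = min_d + progress * (max_d - min_d)            # linear interpolation
    d = max(min_d, int(d) // diff_step * diff_step)   # round down to a multiple of 8
    print(step, d)  # 0 -> 8, 10000 -> 256, 20000 -> 512, 40000 -> 1024
```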
3 changes: 3 additions & 0 deletions docs/_pages/features.md
@@ -241,6 +241,9 @@ DeepSpeed abstracts away data parallelism and model parallelism from the user wh
comes to data loading. Users simply provide a PyTorch dataset, and the DeepSpeed data loader
can automatically handle batch creation appropriately.

## Curriculum Learning
Please refer to the [Curriculum Learning](/tutorials/curriculum-learning/) tutorial.
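
As a rough, framework-agnostic sketch of the idea behind a `seqlen` curriculum (not DeepSpeed's actual data pipeline), each batch can simply be truncated to the current curriculum difficulty before the forward pass, so that early steps train on shorter sequences:

```python
def apply_seqlen_curriculum(batch_tokens, current_seqlen):
    """Illustrative only: truncate a [batch, seq_len] token tensor to the
    current curriculum sequence length."""
    return batch_tokens[:, :current_seqlen]
```

The tutorial linked above describes how this is integrated into GPT-2 pre-training in practice.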

## Performance Analysis and Debugging

DeepSpeed provides a set of tools for performance analysis and debugging.